Amazon EC2 A1 Arm Instances Deliver up to 45% Cost Savings over x86 Instances

Just a couple of days ago, Amazon introduced EC2 A1 Arm instances based on custom-designed AWS Graviton processors featuring 16 Arm Cortex-A72 cores. Commenters started a discussion about price and the real usefulness of Arm cores compared to x86 cores, since software is likely to be better optimized for the latter, and Amazon Web Services (AWS) pricing for EC2 A1 instances did not seem that attractive to some.

The question of whether it makes sense will obviously depend on the workload, and on metrics like performance per dollar and performance per watt. AWS re:Invent 2018 is taking place right now, and we are starting to get some answers, with Amazon claiming up to a 45% reduction in costs.

Amazon EC2 A1 Cost Savings
Image Source: William Lam

It sounds good, except there’s not much information about the type of workload here, so it would be good to have an example of a company leveraging this type of savings with their actual products or services. It turns out the SmugMug photo sharing website has migrated to Amazon EC2 A1 Arm instances. Their servers run Ubuntu 18.04 on 64-bit Arm with PHP, Nginx, HAProxy, Puppet, etc…, and it allegedly only took a few minutes to compile some of the required packages for Arm.

Amazon Arm SmugMug
Image Source: Andrea Pellegrini

So at least they managed to migrate from Intel to Arm with everything running reliably on the SmugMug website. But how much did they manage to save? Based on the slide below, costs went down by 40% per core for their use case. Impressive, and they also claim running Arm instances feels the same as running Intel instances. Having said that, I find it somewhat odd to quote “per core” cost savings: if, for example, they went from 16-core Intel instances to 32-core Arm instances selling for the same price, the Arm instances would be 50% cheaper per core while the overall bill stayed the same, assuming similar performance.
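
To make the “per core” distinction concrete, here is a quick back-of-the-envelope calculation with purely illustrative numbers (not SmugMug’s actual pricing): dividing a hypothetical $0.40/hour instance price by the core count shows the per-core cost halving when the core count doubles, even though the hourly bill stays the same.

$ echo "scale=4; 0.40/16" | bc   # hypothetical 16-core Intel instance at $0.40/hour
.0250
$ echo "scale=4; 0.40/32" | bc   # hypothetical 32-core Arm instance at the same price
.0125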

SmugMug amazon arm cost savings

Phoronix also ran benchmarks on Amazon EC2 A1 instances, and there the results are quite different. As expected, Intel or AMD based systems are still much faster in terms of raw performance, but if you expected a performance per dollar advantage for Arm instances, it’s not there for most workloads.

Amazon EC2 A1 PHP Benchmarks

PHP runs on many servers, and you’d expect Arm to perform reasonably well in terms of performance per dollar, but some Intel instances come out nearly three times cheaper per unit of performance here.

7-zip Compression Benchmark Amazon Arm

The 7-zip compression benchmark is one of the rare benchmarks where an Arm instance delivers a better performance/cost ratio than competing offerings. Nevertheless, Michael Larabel concluded that “at this stage, the Amazon EC2 ARM instances don’t make a lot of sense”.

My conclusion is that whether Arm instances make sense or not highly depends on your workload.


74 Replies to “Amazon EC2 A1 Arm Instances Deliver up to 45% Cost Savings over x86 Instances”

  1. Details for the 45% saving are a must if you want to take it seriously.
    My guess looking at the phoronix benchmarks is that they had many many x86 instances that were far bigger than they needed and they were pouring money down the drain for the sake of having each layer of their stack running in a different instance.
    Or Amazon gave them a massive amount of coupons for being an early adopter. 😉

    1. I’ve just realized the 40% cost savings for SmugMug is per core, not overall cost savings. So the actual savings are hard to predict/guess. I’ve updated the post to reflect that.

  2. We still have no info about the core’s frequency, memory bandwidth nor I/O bandwidth. I wouldn’t be surprised at all if these instances were really interesting for network-bound workloads. Today it’s normal to saturate a 10 Gbps link with TCP traffic using a single A72 core, and for this type of workloads you don’t need to pay more than the smallest cores they can provide you, but ideally you’d want real cores and not something that gives you 1 millisecond of CPU every once in a while. I know people who have left AWS because of extreme costs for the low CPU performance they needed. There could typically be an improvement here depending on the hardware capabilities.

    1. > We still have no info about the core’s frequency, memory bandwidth nor I/O bandwidth

      At least memory latency isn’t bad, see the Phoronix 7-zip numbers (maybe the only number generated that’s not totally meaningless for this type of system comparison).

      1. Take it with some salt, particularly the latency times, as these are runs from an a1.medium single-vCPU instance:

        $ ./tinymembench
        tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

        ==========================================================================
        == Memory bandwidth tests ==
        == ==
        == Note 1: 1MB = 1000000 bytes ==
        == Note 2: Results for 'copy' tests show how many bytes can be ==
        == copied per second (adding together read and writen ==
        == bytes would have provided twice higher numbers) ==
        == Note 3: 2-pass copy means that we are using a small temporary buffer ==
        == to first fetch data into it, and only then write it to the ==
        == destination (source -> L1 cache, L1 cache -> destination) ==
        == Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
        == brackets ==
        ==========================================================================

        C copy backwards : 3878.8 MB/s
        C copy backwards (32 byte blocks) : 3869.6 MB/s (0.2%)
        C copy backwards (64 byte blocks) : 3859.0 MB/s
        C copy : 3841.2 MB/s
        C copy prefetched (32 bytes step) : 3675.0 MB/s (0.1%)
        C copy prefetched (64 bytes step) : 3663.2 MB/s (0.1%)
        C 2-pass copy : 4081.5 MB/s
        C 2-pass copy prefetched (32 bytes step) : 4123.1 MB/s
        C 2-pass copy prefetched (64 bytes step) : 4140.0 MB/s
        C fill : 11993.3 MB/s (0.1%)
        C fill (shuffle within 16 byte blocks) : 11990.3 MB/s
        C fill (shuffle within 32 byte blocks) : 11972.4 MB/s
        C fill (shuffle within 64 byte blocks) : 11977.0 MB/s
        ---
        standard memcpy : 3846.7 MB/s (0.1%)
        standard memset : 11977.1 MB/s
        ---
        NEON LDP/STP copy : 3835.0 MB/s (0.1%)
        NEON LDP/STP copy pldl2strm (32 bytes step) : 3376.9 MB/s
        NEON LDP/STP copy pldl2strm (64 bytes step) : 3382.5 MB/s
        NEON LDP/STP copy pldl1keep (32 bytes step) : 3670.6 MB/s (0.1%)
        NEON LDP/STP copy pldl1keep (64 bytes step) : 3647.7 MB/s
        NEON LD1/ST1 copy : 3847.9 MB/s (0.6%)
        NEON STP fill : 11998.0 MB/s (0.2%)
        NEON STNP fill : 12031.9 MB/s (0.9%)
        ARM LDP/STP copy : 3829.7 MB/s
        ARM STP fill : 11973.7 MB/s
        ARM STNP fill : 11956.6 MB/s

        ==========================================================================
        == Memory latency test ==
        == ==
        == Average time is measured for random memory accesses in the buffers ==
        == of different sizes. The larger is the buffer, the more significant ==
        == are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
        == accesses. For extremely large buffer sizes we are expecting to see ==
        == page table walk with several requests to SDRAM for almost every ==
        == memory access (though 64MiB is not nearly large enough to experience ==
        == this effect to its fullest). ==
        == ==
        == Note 1: All the numbers are representing extra time, which needs to ==
        == be added to L1 cache latency. The cycle timings for L1 cache ==
        == latency can be usually found in the processor documentation. ==
        == Note 2: Dual random read means that we are simultaneously performing ==
        == two independent memory accesses at a time. In the case if ==
        == the memory subsystem can't handle multiple outstanding ==
        == requests, dual random read has the same timings as two ==
        == single reads performed one after another. ==
        ==========================================================================

        block size : single random read / dual random read, [MADV_NOHUGEPAGE]
        1024 : 0.0 ns / 0.0 ns
        2048 : 0.0 ns / 0.0 ns
        4096 : 0.0 ns / 0.0 ns
        8192 : 0.0 ns / 0.0 ns
        16384 : 0.0 ns / 0.0 ns
        32768 : 0.0 ns / 0.0 ns
        65536 : 3.9 ns / 6.1 ns
        131072 : 5.9 ns / 8.1 ns
        262144 : 8.4 ns / 10.7 ns
        524288 : 9.9 ns / 12.5 ns
        1048576 : 13.1 ns / 16.8 ns
        2097152 : 39.5 ns / 55.8 ns
        4194304 : 72.3 ns / 92.8 ns
        8388608 : 93.5 ns / 113.9 ns
        16777216 : 103.0 ns / 120.3 ns
        33554432 : 107.4 ns / 129.1 ns
        67108864 : 127.8 ns / 163.2 ns

        block size : single random read / dual random read, [MADV_HUGEPAGE]
        1024 : 0.0 ns / 0.0 ns
        2048 : 0.0 ns / 0.0 ns
        4096 : 0.0 ns / 0.0 ns
        8192 : 0.0 ns / 0.0 ns
        16384 : 0.0 ns / 0.0 ns
        32768 : 0.0 ns / 0.0 ns
        65536 : 3.9 ns / 6.1 ns
        131072 : 5.9 ns / 8.1 ns
        262144 : 6.9 ns / 8.8 ns
        524288 : 7.6 ns / 9.3 ns
        1048576 : 10.1 ns / 12.2 ns
        2097152 : 36.1 ns / 50.8 ns
        4194304 : 69.1 ns / 88.5 ns
        8388608 : 82.7 ns / 97.6 ns
        16777216 : 88.7 ns / 99.9 ns
        33554432 : 91.4 ns / 100.7 ns
        67108864 : 92.1 ns / 101.6 ns

      2. And the clocks from the same instance:

        $ ./mhz 16
        count=1008816 us50=22074 us250=110548 diff=88474 cpu_MHz=2280.480
        count=1008816 us50=22064 us250=110458 diff=88394 cpu_MHz=2282.544
        count=1008816 us50=22161 us250=110518 diff=88357 cpu_MHz=2283.500
        count=1008816 us50=22166 us250=110488 diff=88322 cpu_MHz=2284.405
        count=1008816 us50=22102 us250=110558 diff=88456 cpu_MHz=2280.944
        count=1008816 us50=22138 us250=110595 diff=88457 cpu_MHz=2280.918
        count=1008816 us50=22087 us250=110558 diff=88471 cpu_MHz=2280.557
        count=1008816 us50=22109 us250=110535 diff=88426 cpu_MHz=2281.718
        count=1008816 us50=22151 us250=110420 diff=88269 cpu_MHz=2285.776
        count=1008816 us50=22138 us250=110518 diff=88380 cpu_MHz=2282.906
        count=1008816 us50=22156 us250=110444 diff=88288 cpu_MHz=2285.285
        count=1008816 us50=22153 us250=110487 diff=88334 cpu_MHz=2284.094
        count=1008816 us50=22162 us250=110477 diff=88315 cpu_MHz=2284.586
        count=1008816 us50=22169 us250=110525 diff=88356 cpu_MHz=2283.526
        count=1008816 us50=22158 us250=110457 diff=88299 cpu_MHz=2285.000
        count=1008816 us50=22121 us250=110537 diff=88416 cpu_MHz=2281.976

  3. > My conclusion is that whether Arm instances make sense or not highly depends on your workload.

    And that’s the only reasonable conclusion.

    Fun anecdote: yesterday at work I did a quick-n-dirty port of a compute unittest to my CA72 chromebook – at first it started at a significantly lower normalized per-core perf compared to my desktop xeon, but 10 minutes of fixes later it was matching the per-clock per-core performance of my workstation.

    The reason I’m sharing this is to demonstrate that it takes a bit more than a successful compile to get the most out of another architecture (and generic benchmarks of the phoronix kind often miss that).

    1. > generic benchmarks of the phoronix kind often miss that

      And that’s just one problem with this Phoronix stuff (generating random numbers due to different compiler flags and (missing) optimizations chosen on some platforms vs. another).

      The more severe problem with Phoronix is generating wrong conclusions all the time. Just take his first ‘benchmark’ using PHPBench for example. This is a single threaded set of 56 different tests checking some PHP related performance aspects (some influenced by CPU performance, some more by memory performance — no idea whether IO also interferes). The only insight this benchmark will provide is whether PHP in general is running faster here or there (so as developer you can start your journey to explore the why).

      It does not tell anything about whether server A vs. B is performing better with PHP workloads since this PHP stuff usually scales pretty well with count of CPU cores and cpufreq scaling behavior is another thing to consider (certain CPUs clock pretty high with single-threaded workloads but per-core performance drops a lot if all cores are busy at the same time which pretty much describes ‘server situation’ in general).

      So while PHPBench is a valuable tool for developers to fine-tune a server (the software stack), it’s not able to provide any meaningful numbers to judge ‘server performance of PHP based workloads’ in general since it does not take multi-threading (AKA reality) into account at all. You need to test your own workload on this type of server to get an idea of how well it will run.

      But it gets even worse. These 100% meaningless numbers (single-threaded execution) are now even used by Michael Larabel to generate weird ‘Performance / Cost’ comparisons. Which cloud customer would be that stupid to move workloads that do not scale whatsoever into the cloud? What’s the purpose of pulling in such single-threaded numbers into ‘Performance / Cost’ comparisons for multi-core systems at all?

      1. Your post made me actually go carefully through the phoronix article, and this is .. benchmarking gone terribly wrong.

          1. Benchmarking done right is actually helpful, but that normally takes way more effort and knowledge than what was put in the majority of benchmark figures out there.

          2. The only way to do benchmarking right is to openly announce what you’re measuring (and why you believe your method is reasonably relevant). I do benchmarking all the time on haproxy, at least to measure the impacts of certain code changes. It definitely is a requirement. But I’m well placed to know that it’s possible to make your benchmarks say what you want them to say, even by accident, using your own bias, because the same method cannot be used all the time and needs to evolve to take new possibilities into consideration. In my opinion, the goal of the benchmark is precisely what is missing there: why exactly are we focusing on these test results instead of any other ones. Note that “no time to run more, please come back later” is a perfectly valid reason, it just needs to be told so that readers don’t jump on numbers out of any context.

          3. I agree. I profile and benchmark the wazoo out of everything that I find remotely interesting, not simply because it’s my job, but because I find performance awareness fundamental to most things we do in computing. That said, I find most benchmarks that yield a number without looking at the reasons to that number, well, not that useful.

      2. >blu

        Hand optimising stuff is impractical for most people. If you have time to mess around optimising your whole stack like that you might as well fire a dev and hire a sysadmin to deploy your stuff on dedicated servers that are about a fifth of the price of AWS.

        >tkaiser

        Whether the numbers are very good or not doesn’t really matter here as long as the tests are like-for-like. The claim from Amazon is that it’s 45% more cost efficient. If that were true/generally applicable even bad benchmarks would reflect that in some way, but they don’t.

        The benchmarks show the ARM instances being significantly slower out of the box so the only way you can explain this 45% number is if they went from lots of under utilised instances to cheaper/fewer instances that are doing more real work.

        1. > The claim from Amazon is that it’s 45% more cost efficient. If that were true/generally applicable even bad benchmarks would reflect that in some way, but they don’t.

          While I really don’t care about Amazon’s claims I think by carefully looking at what Michael Larabel provided it’s obvious that these ARM instances perform pretty well if it’s about a ‘performance / cost’ ratio. Of course this requires dropping the 100% BS ‘benchmarks’ (like Apache Benchmark executed in the most stupid ‘fire and forget’ mode possible) and the cases where his whole methodology is completely flawed (using single-threaded PHPBench and PyBench numbers for multi-core server instance ‘performance / cost’ comparisons).

          1. >these ARM instances perform pretty well if it’s about a ‘performance / cost’ ratio

            How? You pay for time on AWS. The ARM instance used is $0.40 an hour and an X86 one with half the cores is $0.33 an hour. Even if your application scales perfectly and can utilise every core on the ARM instance, the X86 one is still producing more work in an hour, for $0.07 less, with less latency.
            A dedicated server that does the same amount of work costs $0.10 an hour, so either AWS instance is actually expensive for the work produced.

            >using single-threaded PHPBench and PyBench numbers
            >multi-core server instance ‘performance / cost’ comparisons).

            If your application scales perfectly you can just multiply by the number of cores and the ARM instance still comes out worse. (Of course no application scales perfectly). There’s no way of benchmarking how well every application will scale, so the best you can do is take the single threaded number, multiply it by the number of cores and then by some fraction to factor in that it’ll never fully utilise all of the cores. The chances are that your application doesn’t actually scale all that well and you’re better off having fewer faster cores.
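
            As a rough sketch of that estimate, with purely made-up numbers (a hypothetical single-thread score of 1000, 16 vCPUs, and an assumed 70% scaling efficiency):

            $ echo "1000 * 16 * 0.7" | bc
            11200.0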

            TL;DR;

            The only way this works out better value for money is if you have low utilisation and the ARM plans are at a sweet spot where they perform well enough but don’t cost as much as the minimum usable X86 instance but none of the AWS pricing indicates that’s the case.

          2. > There’s no way of benchmarking how well every application will scale so the best you can do is take the single threaded number, multiply it by the number of cores and then by some fraction to factor in that it’ll never fully utilise all of the cores

            And now look at the Phoronix numbers that simply forgot to ‘multiply it by the number of cores’. That’s my whole point. Michael Larabel generating fancy graphs based on numbers without meaning and people looking at them and thinking ‘damn right, this other platform sucks’.

            The most simple performance metric I use for ‘general server workloads’ is simply 7-zip since this workload depends on CPU and memory performance in reasonable ways. By looking at this multi-threaded workload the Phoronix numbers can be taken for a ‘performance / cost’ ratio and here even according to Larabel and based on Amazon’s pricing the ARM instances easily outperform Intel and AMD. (whether this has anything to do with the behavior of a real workload is a totally different story)

            Please have in mind that this ‘performance / cost’ ratio calculation as done by the Phoronix Test Suite couldn’t be more moronic. He calculates with an accuracy of $0.01, so his ‘performance / cost’ ratios for different setups whose real costs vary by a factor of almost 2 look identical, since both end up being multiplied by the same rounded $0.01 or $0.02 in his list. And if the result of his ‘calculation’ drops below $0.01 then whole setups disappear from the graphs (that’s why you can’t find the smallest ARM, AMD and Intel instances in his PHPBench ‘performance / cost’ comparison for single-threaded PHPBench). This flaw is so obvious…

            I mean the only reasonable action when running such a benchmark on these platforms is to throw the results away and rethink some fundamentals of the used ‘benchmarking’ toolset. At the very least, never ever use single-threaded numbers in this flawed way for ‘performance / cost’ comparisons, and add at least 3 more decimal places to such calculations. None of this happens in Phoronix land — instead this totally misleading BS gets published…

          3. >The most simple performance metric I use for ‘general server workloads’
            >is simply 7-zip since this workload depends on CPU and memory performance
            > in reasonable ways.

            I don’t know many people running 7-zip-as-a-Service… but anyhow, presumably the performance there is good because there is an optimised filter for armv8. That’s the only benchmark that shows the ARM instances could get better. There is hope at least.
            Whether or not that optimisation happens for whatever you’re using is another question.

            >None of this happens in Phoronix land — instead this totally misleading BS gets published…

            The benchmarks are mostly crap but it’s something we can do some comparison with.
            If the vendor is saying you can save almost half of your tens or hundreds of thousand dollar server bill each year you would jump at it. But would you believe some figure the vendor pulled out of their ass like that?
            There would be a lot of work in doing your own benchmarks to work out if this would work for your application. You would essentially have to deploy your stack to it and put your users on it.
            The numbers from phoronix may not be ideal but for me at least they are a signal to maybe take a look next year. Maybe Amazon won’t like the benchmarks either and will come up with some real numbers.

          4. > I don’t know many people running 7-zip-as-a-Service

            None I would assume. But that’s not the point. It’s about how to get a quick idea of the performance to be expected for a certain workload. And for ‘server tasks’, without taking IO into account, I’m not the only one who has found that 7-zip works fairly well to get a rough estimate. It’s also not that dependent on compiler versions unlike a lot of other popular benchmarks, and it provides a routine calculating CPU clockspeeds comparable to Willy’s mhz tool. See the ‘CPU Freq’ line below:
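
            For reference, output like the one below comes from p7zip’s built-in benchmark mode; assuming the 7za binary from the p7zip package is installed, it can presumably be reproduced with:

            $ 7za b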

            7-Zip (a) [32] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
            p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,32 bits,4 CPUs LE)

            LE
            CPU Freq: 1034 1093 1093 1093 1092 1088 1093 1093 1093

            RAM size: 1000 MB, # CPU hardware threads: 4
            RAM usage: 882 MB, # Benchmark threads: 4

            Compressing | Decompressing
            Dict Speed Usage R/U Rating | Speed Usage R/U Rating
            KiB/s % MIPS MIPS | KiB/s % MIPS MIPS

            22: 1500 313 467 1460 | 41820 393 908 3568
            23: 1467 316 472 1495 | 41104 393 905 3557
            24: 1458 324 484 1569 | 40288 394 899 3537
            25: 1444 333 495 1649 | 38477 391 877 3424
            ---------------------------------- | ------------------------------
            Avr: 322 479 1543 | 392 897 3521
            Tot: 357 688 2532

            This slow ARM thingy scores 2,500 and my Core i7 laptop with HT enabled scores 18,000 (a 1:7.2 ratio). When testing with some real workloads that do not depend on IO we see a performance difference of between 1:5.6 and 1:8 between those two machines, so the 1:7.2 ratio above fits well enough.

            When relying on the stuff Phoronix suggests the slow ARM thingy is almost 12 times slower than the Core i7:

            System : Linux monit 4.14.78-sunxi #412 SMP Fri Oct 26 11:37:04 CEST 2018 armv7l
            PHP version: 7.2.10-0ubuntu0.18.04.1
            PHPBench : 0.8.1
            Date : November 29, 2018, 9:11 am
            Tests : 56
            Iterations : 100000
            Total time : 31 seconds
            Score : 32512 (higher is better)

            Versus

            System : Darwin mac-tk-2018 17.7.0 Darwin Kernel Version 17.7.0: Wed Oct 10 23:06:14 PDT 2018; root:xnu-4570.71.13~1/RELEASE_X86_64 x86_64
            PHP version: 7.1.16
            PHPBench : 0.8.1
            Date : November 29, 2018, 8:07 am
            Tests : 56
            Iterations : 100000
            Total time : 3 seconds
            Score : 377366 (higher is better)

            It should be obvious that using this PHPBench score is 100% wrong to compare these two systems even with heavy PHP use in mind. But the opposite happens: Phoronix suggests these single-threaded numbers would be relevant and even adds it in a totally bogus way to some ‘performance / cost’ comparisons. And people obviously rely on this and start to draw conclusions.

          5. I think the 7-zip result being so good is highly dependent on it being optimised for ARMv8, so I’m not sure if that’s a good benchmark for general performance, i.e. for any old package I would install in Ubuntu.
            If PHPbench is throwing random bits of PHP at the interpreter and working through a lot of its code then that would be a better representation of overall performance. Benchmarks need to be taken with a pinch of salt of course. I’m sure the difference isn’t actually that bad but based on what I’ve seen I’m not convinced it’s anywhere near head to head.

            Either way; it doesn’t matter. Unless the ARM instances are faster per core than the X86 instances they are no better value for money for the user. They might be cheaper for Amazon to run but their pricing doesn’t reflect that.

            So unless you need to run ARMv8 code what’s the point?

          6. > So unless you need to run ARMv8 code what’s the point?

            That I’ve not yet seen any such ‘performance / cost’ comparison of different AWS platforms. The Phoronix stuff at least is fundamentally broken (especially the ‘performance / cost’ calculations that generate pure nonsense numbers since they use single-threaded performance scores on multi-core instances without taking exactly this into account; the other flaw is missing accuracy when calculating costs due to working with only two decimal places).

            BTW: 7-zip seems not to be optimized in any special way. At least this is one of the reasons why I use it as representation of ‘server workloads’ in general.

        2. > Hand optimising stuff is impractical for most people.

          Most people don’t have a use case for running anything in the cloud, and yet here we are discussing clouds. And I surely have time ‘to mess around optimising the whole stack’ (hint: the whole stack does not need optimisation — bottlenecks do) — that’s what I’m being paid to do at my job. Likewise with the Cloudflare usecase which got publicity recently — they surely had time to optimise their stack, and oh, miracle! — they had material benefits to reap.

          > If you have time to mess around optimising your whole stack like that you might as well fire a dev and hire a sysadmin to deploy your stuff on dedicated servers that are about a fifth of the price of AWS.

          You’re missing a set of scenarios. Let me give you an example: we cannot just ‘buy dedicated servers’ — that’s not up to us but to our clients, and our clients have use cases that scale from several workstations to massive farms, based on individual project. So they may need several thousand vCPUs on a given project, and then ten times as many on the next project — are you suggesting they should maintain the max dedicated server base, just in case? This before we make the basic observation that if a stack provider has optimised the stack for more than x86, that gives users the opportunity to optimise their dedicated servers expenditures, if they found viable workloads.

          1. >they surely had time to optimise their stack, and oh, miracle!

            A big industry player that could save millions of dollars from saving fractions of a cent has the time to do it. Amazing. Who would have thunk it. It’s almost like you couldn’t come up with an example for an average AWS user so had to come up with something that isn’t even in the same ball park.

          2. @dgp

            If you’re not going to bother reading past the first paragraph then I’ll not bother responding to you. As per usual, enjoy your day.

            I read your whole post. You keep bringing up stuff that makes no sense here; this is being pushed as being cost effective, i.e. not for the people that need hundreds of cores and have a whole team of ARMv8 architecture experts on $300K or more a year. Those people don’t care about what Cloudflare are doing for ARMv8. Those people use Cloudflare etc. because they don’t want to need to care.

            You do a lot of hand-wringing to explain why ARM is so great like it’s a personal friend you need to protect and at the same time a bunch of whining about X86 like it slept with your wife or something.
            It’s weird.

            This is a purely technical argument: Amazon says it’s up to 40-something-% cheaper but the pricing says no, the current performance says no. Unless you show something running for 40% less on one of these instances then everything else is moot and you need to concede the only reason for wanting one of these ARM instances would be a weird hate boner for X86.

      3. > Which cloud customer would be that stupid to move workloads that do not scale whatsoever into the cloud? What’s the purpose of pulling in such single-threaded numbers into ‘Performance / Cost’ comparisons for multi-core systems at all?

        While I would generally agree, the problem with clouds is that the guy who decides on the migration decides to do it because he heard they’re using PHP or whatever that scales well, but that there are a number of satellite components around that do not scale this way at all that he hasn’t heard of but that are mandatory. Then it’s up to the people dealing with the day-to-day stuff to figure the bottlenecks and address them. So it definitely is important to know how it works for single-threaded workloads because low performance can create a bottleneck for a theoretically scalable multi-threaded one, by saturating the non-scalable common component first.

        1. > it definitely is important to know how it works for single-threaded workloads

          But neither is PHPBench telling you about overall performance for single-threaded workloads nor should such a single-threaded ‘score’ be used for ‘performance / cost’ ratios when judging about multi-core instances. It’s simply the wrong tool used in the wrong mode(s).

          This kitchen-sink benchmarking the Phoronix style is terribly misleading and there’s no alternative other than testing your own workload after switching on your own brain to generate insights and not just funny numbers and colored graphs.

    2. But does it not also depend on which bucket (container) and instance type is chosen when using Amazon Web Services? There are three types, each with different aims.

  4. The term “up to” is always misleading ….. as in “up to 65% of women saw an improvement with L’Oreal Paris Age Perfect”

  5. Thanks for coming back with this piece. It’s a shame that right now it seems these instances don’t offer more bang for your buck, but I wonder how much of that can be changed with compiler optimisations. I work mainly in Golang these days, which works on ARMv8 but, if this blog post from cloudflare (https://blog.cloudflare.com/arm-takes-wing/ ) is still accurate, doesn’t perform well due to missing optimised assembly that supposedly doesn’t take much effort to add. It’d be interesting to spend a weekend hacking on it and see how far performance can be pushed.

    1. > It’s a shame that right now it seems these instances don’t offer more bang for your buck

      On which numbers is this based?

      1. >On which numbers is this based?

        Cheapest ARM instance:

        a1.medium 1 NA 2 GiB EBS Only $0.0255 per Hour

        Comparable X86 instances:

        t3.small 2 Variable 2 GiB EBS Only $0.0208 per Hour
        t2.small 1 Variable 2 GiB EBS Only $0.023 per Hour

        a1.medium has to perform better than t2.small to be any more cost effective.
        a1.medium has to perform twice as well as t3.small on a theoretical load that scales perfectly to multiple cores.

        The *bad* phoronix benchmarks say that’s only the case for 7zip.
        Some better benchmarks like those here: https://www.servethehome.com/putting-aws-graviton-its-arm-cpu-performance-in-context/ make it look less hopeless but not much better.

        1. > a1.medium … t3.small … t2.small

          Hmm… I’ve not seen any benchmarks for a1.medium so far so no idea how to compare exactly.

          > The *bad* phoronix benchmarks say that’s only the case for 7zip.

          And for ‘Rust Mandelbrot’ (or more generally: when multi-threaded results have been used). But the whole ‘performance / cost’ methodology Phoronix uses is so weird that I really don’t understand why it’s mentioned anywhere else.

          You can’t use single-threaded scores for server instances with multiple vCPUs without somehow factoring in the number of cores. ‘Forgetting’ this renders the whole ‘performance / cost’ ratio 100% useless. The other funny thing Phoronix does is using at one time the benchmark score a specific test generates, and at another the whole execution time of the benchmark (including stuff like ‘pre-heating’ or initialization that should not count towards benchmark execution). The latter, the whole execution time, is then used to calculate the costs afterwards. And then, due to missing accuracy (decimal places), the already wrong calculations get way off again.

          Seriously: why did we learn to pay attention to these Phoronix graphs? The methodology to generate them is so flawed that I can’t believe people don’t spot it when carefully looking at this stuff.

          Let’s take the 7-zip numbers again and look at the hourly price of the ARMv8 and Intel instances:

          Per platform the formula is simple: twice the vCPUs –> twice the price. It’s rather easy to spot that as usual this benchmark does not scale linearly with count of added CPU cores (thereby reflecting reality) and if the price difference is taken into account ARM here has a slight ‘performance / cost’ benefit.

          Since if we want to calculate ‘MIPS Per Dollar’ we simply need to divide the MIPS score by the hourly price in dollars.
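
          As a minimal example of that division, using Amazon’s published $0.408/hour price for a1.4xlarge and a purely hypothetical score of 20000 MIPS (the real score would come from an actual benchmark run):

          $ echo "scale=0; 20000 / 0.408" | bc
          49019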

          What Phoronix presents as ‘MIPS Per Dollar’ is something else entirely. Michael Larabel takes the benchmark execution time, which is wrong since the performance score of such a 7-zip benchmark run does not correlate with benchmark execution time whatsoever; this benchmark is designed to take around 2 minutes on every system. He then takes what Amazon charges you per hour and prorates it by the execution time: ‘ARMv8 a1.4xlarge: $0.4080 reported cost per hour, test consumed 2 Minutes, 10 Seconds: cost approximately 0.01 dollar.’
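
          Checking that arithmetic with the quoted numbers (2 minutes 10 seconds is 130 seconds):

          $ echo "scale=4; 0.408 * 130 / 3600" | bc
          .0147

          Rounding that to $0.01 already throws away roughly a third of the value before it even enters a ratio.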

          The result is not related to any performance number at all but simply how much ~2 minutes cost when using a specific Amazon instance. This number is 100% meaningless with this specific benchmark. But of course it’s used since we’re at Phoronix.

          Seriously: everybody equipped with a brain looking at the above Phoronix numbers (Intel or ARM individually) immediately spots the mistake with these numbers: they’re close to random, since why/how should the m5.2xlarge Intel instance be almost twice as cost efficient compared to the smaller m4.xlarge instance (which provides half as many CPU cores for half the price)?

          1. Yes, the phoronix stuff is garbage. I get that. I intentionally linked a whole new set of benchmarks to get you off of the phoronix stuff.

            Take a look at the pricing: https://aws.amazon.com/ec2/pricing/on-demand/

            These can only be cost effective if you need one of the core count/ram amounts that they don’t have an X86 configuration of. So going full cycle again: Any cost savings you have will not be because of magic ARM powers but because you can move to an instance that fits your work load better (number of cores/ram) and/or you had too many under used instances in the first place.

          2. > I intentionally linked a whole new set of benchmarks to get you off of the phoronix stuff

            Yeah, but how should the servethehome.com link help with getting a comparison with cheap Amazon Intel instances? I’ve no idea how those behave wrt performance and how much this might change with further Meltdown/Spectre mitigations.

            Anyway: let’s stop here. My whole point here was the shocking experience that people for whatever reasons believe into the Phoronix garbage that is fundamentally broken in so many ways… Time will tell how performance/cost ratio with these Amazon A1 thingies will look like.

            And while I personally won’t use them I still hope people get attracted to the platform to start with necessary optimizations on ARM so that we all can benefit from this stuff on embedded systems, SBC and ARM servers in general.

          3. >Yeah, but how should the servethehome.com link help
            >with getting a comparison with cheap Amazon Intel instances?

            The benchmarks show these ARM cores are not as fast per core (across the board on synthetic benchmarks) as the Xeon etc. cores used for the X86 instances. They aren’t as fast as other ARMv8 cores either. I suspect these are considerably cheaper for Amazon to run (maybe not for a few years until the R&D spend is cancelled out) and that’s great for them, but their customers are paying in units of time and not work performed, so the amount of work done in X amount of time is the only thing they care about.

            So.. again.. even if the price of like-for-like (num cores, amount of memory, storage) instances was cheaper for the ARM instances, which it isn’t, the predicted amount of work you would be getting from the ARM instance would be less unless it’s one of the very specific places you might be able to find where these cores can outpace a Xeon.

            So the only reasons to use these things would be: Your app runs in one of the configurations there isn’t an x86 version of and that reduces your under-utilisation, you have something that has to run on ARM, or you’re acting like a child in an Xbox vs Playstation school yard fight and you think ARM is somehow a nice company compared to Intel or that instruction sets matter in this day and age.

            >how much this might change with further Meltdown/Spectre mitigations.

            ARM will have these exact same issues. Look at all of the errata fixes in the kernel config for ARM.

            > I still hope people get attracted to the platform to start with necessary
            >optimizations on ARM so that we all can benefit from this stuff on
            >embedded systems, SBC and ARM servers in general.

            Love it or hate it the Raspberry Pi would have probably been the best trigger for some mass outpouring of optimisations.

          4. > These can only be cost effective if you need one of the core count/ram amounts that they don’t have an X86 configuration of

            This is a pretty valid point. Their pricing *seems* to be done to protect their x86 investment, since an equivalent-class (but slightly faster) ARM systematically is slightly more expensive than the x86. Thus unless you’re certain to use it at 100%, you can cut costs by using the cheaper x86 at a higher %CPU. In short, CPU-wise, their offering is mostly interesting for batched processing where you’re certain to use your ARM at full load. Maybe they’re doing this for now because they don’t have enough of them and don’t want all their x86 users to migrate too fast.

          5. > Since if we want to calculate ‘MIPS Per Dollar’
            Please note that one important metric often is “peak MIPS for a given dollar”: the smallest amount of time it can take to complete a task. That’s where x86 instances usually are interesting because they offer higher peak MIPS than anything else, at 1.5-3 times the price of the slightly smaller option. But some people are willing to pay the price to get this and it’s fine.

    2. >these instances don’t offer more bang for your buck

      At least there is one apparently sane person here.

      >doesn’t work well due to missing optimised assembly that doesn’t take much effort to add.

      I wouldn’t underestimate it. The fact that the optimisation is missing points to it being less than trivial to implement. If Amazon are really serious about this stuff they should already be working on making the common stuff like Python, PHP, Java, Go etc work as well as possible on their platform.

      Maybe in a year or so they’ll fix the pricing or considerably improve the performance, or maybe in two years they’ll mark all of the A1 instance types as deprecated. Interesting times either way.

  6. Quote from Phoronix.com
    ” At this stage, the Amazon EC2 ARM instances don’t make a lot of sense… Well, barely any sense unless you want scalable, on-demand access to ARMv8 computing resources for a build farm, ARM software debugging/testing, and related purposes. The performance of the Graviton processors powering the A1 instances came up well short of the comparable M5 general instance types with either AMD EPYC or Intel Xeon processors. Even with the cheaper pricing, the performance-per-dollar was still generally just on-par with the equivalent or slightly better than the Intel/AMD offerings.

    Only in the few threaded workloads where Graviton was performing well did it offer potentially compelling cost savings compared to the other tested instances. Part of the advantage of ARM processors is also better power efficiency, but well, that doesn’t really translate into much direct value for cloud customers. A1 makes sense though for the niche of developers wanting easy access to higher core count 64-bit ARM for software testing or build farms, as these instances would offer much better capacity than Raspberry Pi type build farms. But for other cloud customers this is certainly a case of first needing to see how well your particular workloads will perform with Graviton for making a proper decision, but most users will likely be best off with the existing Intel Xeon and AMD EPYC instance types.

    I’m still running more tests of A1/Graviton over the coming days to get a better idea for its performance, but at this point there isn’t much to get excited about for these Graviton processors on EC2 as we approach the end of 2018.

    ” end quote

    1. His benchmark is crude and again he’s saying “cheaper cheaper cheaper” when he’s comparing an X86 instance with 32GB of ram to one with 16GB of ram.

    2. That’s a perfectly normal result for properly optimised (integer) code not giving precedence to one arch or another. Keep in mind small code is more easily brought to that state, compared to arbitrary collections of 3rd party libraries and frameworks that usually constitute software packages. But this is where library (and full stack) providers come in.

      1. >compared to arbitrary collections of 3rd party libraries and frameworks

        You seem to have poor reading comprehension. This is exactly why I said it’s crude. But hey, it’s a single benchmark that shows that these things aren’t total lemons. That’s a massive win for you. You can sleep well tonight.
        Shame the pricing is still garbage and makes no sense.

        1. And FYI he’s benchmarking 8 threads on the X86 vs 8 real arm cores and the ARM just barely wins.
          So on raw performance these things are lemons.

          1. It looked nested to me so I apologise. Other than that what I wrote still stands and no one has come up with any numbers to prove these are more cost efficient.

          2. > no one has come up with any numbers to prove these are more cost efficient

            Maybe no one here works for Amazon’s marketing department (only they are interested in proving exactly this). As already said: time will tell how this will look like in the wild.

            I think we’re talking here in one comment thread about a bunch of totally different things at the same time.

            * benchmarking gone wrong (especially some totally flawed methodologies –> Phoronix)
            * platform/code optimizations (blu)
            * performance/cost ratio (you)

            At least I’m only interested in these ARM AWS instances since I still have some hope that with this marketing hype around ARM as a server platform some urgently needed optimizations will arrive (sooner) in kernel and lib code so we can all benefit from them with our embedded ARM devices and physical servers.

            One IMO obvious example: when you’re running low on RAM but have plenty of CPU power left you might want to switch on zram and run with some nice memory overcommitment. On x86 the compression algos are SIMD accelerated but not (yet) on ARM.
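
            As a rough sketch of such a zram setup (assuming a kernel with the zram module and util-linux’s zramctl; the 512M size and lz4 algorithm are just examples):

            $ sudo modprobe zram
            $ sudo zramctl --find --size 512M --algorithm lz4   # prints the allocated device, e.g. /dev/zram0
            $ sudo mkswap /dev/zram0
            $ sudo swapon --priority 100 /dev/zram0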

          3. >* benchmarking gone wrong (especially some totally flawed methodologies –> Phoronix)
            >* platform/code optimizations (blu)
            >* performance/cost ratio (you)

            These are all related. You need benchmarking to see if this is worth bothering with. Optimisations might help close the gap if it’s close, and the actual cost is dependent on both of those points.
            >I still have some hope that with this marketing hype around ARM as server platform

            Why does it have to be ARM though? I like ARM stuff, not because it’s ARM, but because you can buy chips that have good enough performance for most jobs and they are dirt cheap.
            Even if something turns up that’s clock for clock better than an i9, and you can get it to run at frequencies/core configurations that mean it’ll actually outperform the i9, but it costs 10-30% more, would you pay that premium just because it’s ARM? That seems bonkers to me. Some sort of ISA tribalism.

          4. > Why does it have to be ARM though?

            Since there are no competitive SBC and embedded devices with other architectures around. I said it already multiple times: I’m not interested in running anything on these ARM or other AWS instances (quite the opposite). My only interest in these server instances and the fresh hype around them is the potential for software optimizations that might now happen (sooner) so I can benefit from them on all my other ARM devices.

            I don’t give a sh*t about the performance / cost ratio of any of these AWS offers since I do not want to run anything there or in some other ‘cloud’. When participating in these discussions here about the issue it was more of a general interest in benchmarking to probably get new insights or drop own methodologies. Didn’t happen though…

          5. >potential for software optimizations that might now happen (sooner)
            >so I can benefit from on all my other ARM devices.

            >I don’t give a sh*t about the performance / cost ratio of any of these AWS offers
            >since I do not want to run anything there or in some other ‘cloud’.

            These two statements are contradictory. You might not want AWS instances personally but if they are not cost effective no one will use them, they’ll die on the vine like all of the other efforts to get ARM on the server to be anything but very niche custom hardware. That’s why I keep saying this.
            Amazon says these are dramatically cheaper. If they really are then you will have people leaping on them and you’ll get your optimisations. If they aren’t and the 40% number is BS then you’ll have a few months of blog posts about how amazing this is because “it’s ARM and ARM is RISC, not smelly old CISC” and that’ll be it.

            For you to get the downstream benefit of this it needs to be a success and that depends entirely on these things actually offering better value than the status-quo.

    3. This is exactly the same BS as using sysbench --test=cpu to test for CPU performance. And it gets totally weird when bringing in the RPi at https://youtu.be/KLz8gC235i8?t=378 — the reason why the RPi can’t compete here is that the Cortex-A53 cores on the RPi are brought up in 32-bit state and this prime number calculation stuff then uses other instructions. Use a 64-bit kernel/userland on the RPi (using pi64 https://github.com/Crazyhead90/pi64/releases ) and it will run 15 times faster with this terribly useless benchmark. But that’s not due to architectural differences between the ARM cores on the RPi and those on these Alpine SoCs used by Amazon. It’s once again a compiler benchmark and ‘benchmarking gone wrong’ as it almost always happens.

      At least the comparison between the two Amazon instances is not totally flawed since using same compiler versions and 64-bit settings. But this prime number stuff is useless if it’s about getting an idea how a ‘system as a whole’ performs since the whole calculation happens inside CPU caches so DRAM access (memory bandwidth and latency) is not part of the picture.

      1. The guy is trying to show A1 to the uninitiated (youtube) masses — I think he’s far from any claims of usefulness or cloud usage practicality, which is further backed by his seemingly random inclusion of the RPi, as you note.

        ps: it’s Annapurna ; )

        1. > The guy is trying to show A1 to the uninitiated (youtube) masses

          But his conclusions are totally wrong and simply demonstrate the mess. People not understanding that benchmarks always test software. The primitive benchmark he chose does not perform better due to being run on a ‘proper server chip running in a huge configuration with lots of memory and lots of caches and all the stuff they need’. In fact his flawed benchmark does not depend on ‘lots of memory and lots of caches and all the stuff they need’ at all but runs entirely inside the CPU and is only a great example for a compiler benchmark or ‘how software and settings matter’. Just switch from Raspbian to pi64 and this single benchmark will execute 15 times faster.

          OTOH I do understand that such comparisons are necessary since this is also one of the few reasons I almost hate the Raspberry Pi. In the past, all attempts to pitch energy efficient computing with ARM were useless once the customer already had access to a Raspberry Pi. This lousy design has unfortunately become a synonym for ‘ARM device’ and people won’t believe that ARM thingies exist that do not suck totally.

          Wrt Annapurna Labs: all their SoCs so far are called Alpine so I assumed this hasn’t changed.

          1. >But his conclusions are totally wrong and simply demonstrate the mess.

            This and the way he has explained it seems to be him saying that these cores are finally the nail in the coffin for Intel when they aren’t even as good as other ARMv8s.

            >This lousy design unfortunately is synonym for ‘ARM device’
            >and people won’t believe that ARM thingies exist that do not suck totally.

            I don’t think that’s all the RPi’s fault. ARM SoCs have, up until recently, mainly been an ARM core with an assortment of junk IP blocks attached, especially in the tens-of-dollars bracket. ARM’s own GPUs don’t have *proper* drivers. Even if the main processors are fast, the surrounding hardware and software is mostly a mess and only works for baked platforms like Android.

          2. @tkaiser,

            I agree that what the guy demonstrates has much more to do with the quality of compilers, more specifically codegen, and much less to do with A1 as a viable cloud service. But at least he gives a reference point*, in contrast to what phoronix did with many tests in their ‘proper’ review.

            BTW, not having seen his test code, but guessing off the top of my hat what it might be doing, a (multi-threaded or otherwise) Sieve of Eratosthenes uses 1 bit per number in the sought range of numbers, so to get 12.5M primes, he’d need to allocate a bitmap of 28MB (12.5M / 8 = 1,562,500 bytes, factored by 18 to account for the approx. average density of primes in that range, which is ~1:18). So caches should and do affect such benchmarks, and the 2MB of L2 in the server cores would help here, vis-a-vis some 1/4 or 1/2 MB of L2 in a phone part.
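
            A quick check of that arithmetic (12.5M bits expressed in bytes, scaled by the ~1:18 prime density, giving the bitmap size in bytes):

            $ echo "12500000 / 8 * 18" | bc
            28125000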

            And thanks for the Annapurna heads-up — I had forgotten about their lineup branding.

            * point being 8x CA72 @ 2.3GHz perform equally to 4x BWL @ 3.3GHz (4 cores turbo) w/ SMT at this particular workload.

          3. >BTW, not having seen his test code,
            https://github.com/garyexplains/examples.git

            >* point being 8x CA72 @ 2.3GHz perform equally to 4x BWL @ 3.3GHz
            >(4 cores turbo) w/ SMT at this particular workload.

            That might depend on the time of day, the direction of the wind etc as you have other instances running vCores(threads) on those physical cores. You are also limited on how much you can schedule on the core and have a limited number of “credits” that allow you to burst.

          4. Thanks for the test url. Based on that:

            1. Test from the repo is not using the Sieve of Eratosthenes — it’s using a naive test for primes, similarly to what I did recently in the prime factorizer article. So my bitmap usage hypothesis above is void, and L2 caches don’t affect this test one bit.
            2. Test from the repo computes the primes *among the first 12.5M natural numbers*, not the first 12.5M primes per se! The test is replicated across all cores — it’s more of a multi-instanced workload, rather than a distributed multi-threaded workload.
            3. gcc as old as 4.8 produces perfectly good code for this test on x86 — i.e. there are no codegen impediments as in my article.

            Re the indicativeness of the test running on the AWS xeons in the video:

            On my xeon here the test took 10.661s for the same core/threads ratio as in the AWS run (i.e. 1:2), for the same amount of work per thread. Given my xeon runs at 3.1GHz:

            $ echo "scale=4; 10.661 * 3.1 / 3.4" | bc
            9.7203

            According to that time, the AWS xeons from the video doing the test for 9.845s must have run at 3.3-3.4GHz. Let’s assume for the sake of simplicity they did ~3.3GHz. Original estimate was right.

          5. > Re the indicativness of the test running on the AWS xeons in the video

            Update:

            I just realized the AWS xeons are BWL, but my workstation is SNB, which would not pose much of a difference in other scenarios, but in this particular test DIV throughput is essential for the multi-threaded times, and DIV throughput on BWL is 1.2x-2x the DIV throughput on SNB. So it’s possible the clock of the AWS xeons might have been as low as ~2.3GHz.

          6. Ok, I had to get to the bottom of this Gary Explains video.

            Here is the exact same test on my CA72 @ 2.1GHz chromebook — remember, the number of threads barely matters for the duration of this test on non-SMT machines, the two threads specified below are just because.

            $ gcc-7.3 -Ofast -fstrict-aliasing threadtesttool.c -mcpu=cortex-a57 -mtune=cortex-a57 -lpthread
            $ time taskset 0xc ./a.out 2 12500000
            Threading test tool V1.0. (C) Gary Sims 2018
            Threads: 2. Primes to find: 12500000

            real 0m5.292s
            user 0m10.336s
            sys 0m0.000s

            Vs

            real 0m8.993s in the video. On a 2.3GHz CA72..

            But what might be wrong on that AWS machine?

            (I quote from the 3bd7645afe commit message):

            ‘Comile with:
            gcc -lpthread -o threadtesttool threadtesttool.c’

            If that’s how the test was compiled in the video, I have just one thing to say: Why, Gary, why?

            So this benchmark might be a case of ‘platform-impartial’ zero-compiler-optimisation code /sigh

          7. Ok, enough tomfoolery.

            a1.medium:

            $ uname -a
            Linux ip-172-31-2-119 4.15.0-1028-aws #29+nutmeg8-Ubuntu SMP Tue Nov 20 02:59:41 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux
            $ cat /proc/cpuinfo
            processor : 0
            BogoMIPS : 166.66
            Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
            CPU implementer : 0x41
            CPU architecture: 8
            CPU variant : 0x0
            CPU part : 0xd08
            CPU revision : 3
            $ gcc-7.3 -Ofast -fstrict-aliasing threadtesttool.c -mcpu=cortex-a57 -mtune=cortex-a57 -lpthread
            $ time ./a.out 1 12500000
            Threading test tool V1.0. (C) Gary Sims 2018
            Threads: 1. Primes to find: 12500000

            real 0m4.766s
            user 0m4.765s
            sys 0m0.000s

            $ echo "scale=4; 4.766 * 2.3 / 2.1" | bc
            5.2199

            # hey look, this vm has the proper per-clock performance of an actual CA72!
            # times would be identical-within-a-margin for 2 threads on an a1.large

            m5.large:

            $ uname -a
            Linux ip-172-31-41-35 4.15.0-1021-aws #21-Ubuntu SMP Tue Aug 28 10:23:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
            $ cat /proc/cpuinfo
            processor : 0
            vendor_id : GenuineIntel
            cpu family : 6
            model : 79
            model name : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
            stepping : 1
            microcode : 0xb00002a
            cpu MHz : 2300.162
            cache size : 46080 KB
            physical id : 0
            siblings : 2
            core id : 0
            cpu cores : 1
            apicid : 0
            initial apicid : 0
            fpu : yes
            fpu_exception : yes
            cpuid level : 13
            wp : yes
            flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
            bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
            bogomips : 4600.17
            clflush size : 64
            cache_alignment : 64
            address sizes : 46 bits physical, 48 bits virtual
            power management:

            processor : 1
            vendor_id : GenuineIntel
            cpu family : 6
            model : 79
            model name : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
            stepping : 1
            microcode : 0xb00002a
            cpu MHz : 2300.162
            cache size : 46080 KB
            physical id : 0
            siblings : 2
            core id : 0
            cpu cores : 1
            apicid : 1
            initial apicid : 1
            fpu : yes
            fpu_exception : yes
            cpuid level : 13
            wp : yes
            flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
            bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
            bogomips : 4600.17
            clflush size : 64
            cache_alignment : 64
            address sizes : 46 bits physical, 48 bits virtual
            power management:

            $ gcc-7.3 -Ofast -fstrict-aliasing threadtesttool.c -march=broadwell -mtune=broadwell -lpthread
            $ time ./a.out 2 12500000
            Threading test tool V1.0. (C) Gary Sims 2018
            Threads: 2. Primes to find: 12500000

            real 0m7.358s
            user 0m14.715s
            sys 0m0.000s

            # not a particularly high-clock this one

            a1.large 2 NA 4 GiB EBS Only $0.051 per Hour
            m5.large 2 8 8 GiB EBS Only $0.096 per Hour

            # You're welcome, Gary.

          8. By the way, gcc-7.3 supports -mtune=cortex-a72 in case you’re interested in testing for any difference.

          9. I know, but the uarch scheduling differences between CA57 and CA72 are close to nil for scalar integer code (for comparisons, they are smaller than those between SNB and IVB) and, for some arcane compiler reasons, I’ve seen native CA72 scheduling produce worse times on some tests, so my modus operandi is to go with CA57.

            Thanks for bringing this up for the general reader, though — I usually skip these details.

      2. >And it gets totally weird when bringing in the RPi

        I found that strange too. If you’re going to do the comparison between an AWS ARM instance and an SBC, why not use an SBC that isn’t totally crippled like the RPi?

        >two Amazon instances is not totally flawed since using same compiler versions and 64-bit settings.

        The two instances aren’t like for like. They have the same “core” count but a core in the X86 instances is a thread on a core. The results of his benchmark aren’t terribly different, so the only real win for the ARM instance was being cheaper because it only had half the amount of memory.

  7. The really important bit here is context, and having read numerous articles on these AWS instances, the savings are in compute TCO for Amazon.
    Amazon are just using other people’s workloads for real-life testing, as a proof of concept. The savings in power bills and hardware costs are for Amazon.

  8. Really? People don’t notice that Android emulation will be WAY FASTER than on an x86 CPU? For the company I work for this is great news, since we virtualize everything through remote servers, but workers need Android devices in their daily tasks. If Amazon is starting to offer this, it means VMware will soon release ESXi for ARM, and there we go emulating hundreds of devices instead of having those devices physically, or shitty emulators slowing everything down.

    1. BTW, I can attest to the bare-metal-like performance mentioned on slide 11 — I was really surprised to see vCPUs doing so close to metal on tasks with some io (for reference — better than the VM in chromeos)

  9. Looking at the AWS instance pricing I am not seeing a 45% cost savings on the a1 instances. In fact it looks like in many cases you can get more performance at a lower cost using the t3 instances.
