AWS Spark matches Edison supercomputer for Big Data HPC

I've been blogging about the overlap between HPC and Big Data for a couple of years now. I found out today that IDC has even given it an acronym: HPDA (High Performance Data Analytics). HPDA applications go beyond the simulation tasks traditionally associated with HPC. They work with the prodigious volumes of data usually associated with business (click streams, social media interactions, etc.), volumes that are increasingly common in scientific domains as well (DNA sequencing, sensor data that is no longer thrown away, imaging, etc.).

Last week, UC Berkeley authors published a paper, "Scientific Computing Meets Big Data Technology: An Astronomy Use Case", comparing Spark on AWS EC2 with a supercomputer on the task of astronomy image processing. Among other things, they found that a 64-node/512-core Spark cluster on EC2 performed equivalently to a 21-node/504-core subset of Edison, ranked the 34th fastest supercomputer on the TOP500 list as of June 2015. Granted, that was a small subset of Edison, about 0.4%, as Edison has a total of 133,824 cores. But it shows that, core for core, Spark on EC2 can be as fast as a supercomputer at HPDA tasks.
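For anyone who wants to check the arithmetic behind "core for core" and "0.4%", here is a minimal Python sketch. It uses only the node and core counts quoted above; the function and variable names are my own, made up for illustration, and nothing here comes from the paper's code.

```python
# Back-of-the-envelope check of the figures quoted in this post.
# All names are hypothetical; only the numbers come from the post.

EC2_NODES, EC2_CORES = 64, 512            # Spark cluster on EC2
EDISON_NODES, EDISON_CORES = 21, 504      # subset of Edison used in the comparison
EDISON_TOTAL_CORES = 133_824              # all of Edison


def cores_per_node(cores: int, nodes: int) -> float:
    """Average number of cores on each node."""
    return cores / nodes


def percent_of_machine(subset_cores: int, total_cores: int) -> float:
    """Share of the full machine used by the subset, in percent."""
    return 100.0 * subset_cores / total_cores


ec2_cpn = cores_per_node(EC2_CORES, EC2_NODES)
edison_cpn = cores_per_node(EDISON_CORES, EDISON_NODES)
subset_pct = percent_of_machine(EDISON_CORES, EDISON_TOTAL_CORES)
core_ratio = EC2_CORES / EDISON_CORES

print(f"EC2 cores per node:    {ec2_cpn:.0f}")
print(f"Edison cores per node: {edison_cpn:.0f}")
print(f"Edison subset share:   {subset_pct:.2f}%")
print(f"EC2/Edison core ratio: {core_ratio:.3f}")
```

The two core counts are within about 2% of each other, which is what makes a core-for-core comparison reasonable, even though EC2 needed roughly three times as many nodes to get there.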

What does this mean? It means that data locality with commodity hardware can compensate for not having the higher-performance storage and networking found in supercomputers.

It also means that the military and universities may start looking to Spark instead of big HPC purchases. I personally know of one university already in the midst of such an evaluation for processing bioinformatics data. It may even mean that universities forgo purchases altogether and just rent AWS time. The biggest hurdle for that at this point may simply be political -- it seems to be politically easier for a university to approve a $20M purchase once every three years than to give each of its researchers a $2,000 AWS credit.

IBM and other supercomputer manufacturers may no longer be Cray's biggest competitors -- that may now be Amazon.