SoylentNews is people

Submission Preview

Can Apache Spark Process 100 Terabytes of Data in Interactive Mode?

Accepted submission by Phoenix666 at 2015-10-21 11:52:18
Code

Apache Spark innovates a lot in the in-memory data processing area. With this framework, you can load data into a cluster's memory and work with it extremely fast in interactive mode (interactive mode is another important Spark feature, by the way). A year ago (10/10/2014), Databricks announced that Apache Spark was able to sort 100 terabytes of data in 23 minutes.

Here is an interesting question: what is the limit on the amount of data you can process interactively in a cluster [fullstackml.com]? What if you had 100 terabytes of memory in your cluster? Memory is so fast, you would think! Intuition tells you that you can use this memory to interactively process 100 terabytes of input data, or at least half that amount. However, as usual in the distributed systems world, our intuition is wrong!

