
posted by cmn32480 on Friday October 23 2015, @09:53PM   Printer-friendly
from the that's-a-lot-of-stuff dept.

Apache Spark innovates a lot in the area of in-memory data processing. With this framework you can load data into cluster memory and work with it extremely quickly in interactive mode (interactive mode is another important Spark feature, by the way). A year ago (October 10, 2014), Databricks announced that Apache Spark was able to sort 100 terabytes of data in 23 minutes.
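
[Ed. note: as a minimal, hypothetical sketch of what "interactive, in-memory" means here, the Scala snippet below caches a dataset in cluster memory and then queries it repeatedly. The file path and column name are invented for illustration; in the Spark shell (Spark's interactive mode) the spark session object is already provided.]

    import org.apache.spark.sql.SparkSession

    object InteractiveCacheDemo {
      def main(args: Array[String]): Unit = {
        // In the Spark shell this session already exists as `spark`.
        val spark = SparkSession.builder()
          .appName("interactive-cache-demo")
          .getOrCreate()

        // Hypothetical input; any large CSV or Parquet dataset would do.
        val events = spark.read
          .option("header", "true")
          .csv("hdfs:///data/events.csv")

        // cache() pins the dataset in cluster memory, so repeated
        // interactive queries avoid re-reading it from disk.
        events.cache()

        // The first action materializes the cache; later queries hit memory.
        println(events.count())
        events.groupBy("country").count().show()

        spark.stop()
      }
    }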

Here is an interesting question: what is the limit on the amount of data you can process interactively in a cluster? What if you had 100 terabytes of memory in your cluster? Memory is fast, you would think, so intuition says you could use it to interactively process 100 terabytes of input data, or at least half that. However, as usual in the world of distributed systems, our intuition is wrong!

[Ed. addition.] The Databricks announcement has hard data on how they went about improving the Apache Spark code and on the performance improvements they saw. It summarizes the results as follows:

Databricks, with the help of the Spark community, has contributed many improvements to Apache Spark to improve its performance, stability, and scalability. This enabled Databricks to use Apache Spark to sort 100 TB of data on 206 machines in 23 minutes, which is 3X faster than the previous Hadoop 100TB result on 2100 machines. Similarly, Databricks sorted 1 PB of data on 190 machines in less than 4 hours, which is over 4X faster than the previous Hadoop 1PB result on 3800 machines.
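
[Ed. note: a rough back-of-the-envelope check on those numbers, assuming decimal terabytes and ignoring the fact that a sort has to read, shuffle, and write the data, so the real I/O volume is higher:]

    object SortThroughput {
      def main(args: Array[String]): Unit = {
        // Figures from the Databricks announcement quoted above.
        val bytes    = 100e12        // 100 TB
        val seconds  = 23 * 60.0     // 23 minutes
        val machines = 206

        val aggregateGBps = bytes / seconds / 1e9
        val perNodeMBps   = bytes / seconds / machines / 1e6

        // Prints roughly "aggregate: 72.5 GB/s, per node: 352 MB/s".
        println(f"aggregate: $aggregateGBps%.1f GB/s, per node: $perNodeMBps%.0f MB/s")
      }
    }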

Does anyone here work with data on this scale? What do you use for your database? Would you consider using Apache Spark?


Original Submission

 
  • (Score: 2) by darkfeline (1030) on Saturday October 24 2015, @12:13AM (#253855)

    A lot of Big Data work runs up against fundamental limits of physics: you can only move so much data, over so much distance, in a given time frame.

    For example, no matter how much memory you have, you're limited by the bandwidth between memory and the processor (how much data can be moved in a given time frame) and by latency (how long it takes between a request for data and the data actually arriving, a round trip from the processor to memory, if you will).

    In Big Data, one bottleneck is simply reading all of that data in once. How long do you think it would take your computer to cat a 1 PB file? [Ed. note: a rough estimate follows this comment.]

    There is a lot of optimization and clever engineering, but there are still limits that simply cannot be worked around.

    --
    Join the SDF Public Access UNIX System today!
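
[Ed. note: to put the parent's 1 PB question in numbers, here is a rough estimate under assumed single-machine read rates; the rates are ballpark figures, not measurements.]

    object CatOnePetabyte {
      def main(args: Array[String]): Unit = {
        val bytes = 1e15    // 1 PB

        // Assumed sequential read rates; real hardware varies widely.
        val rates = Seq(
          "spinning disk (~150 MB/s)" -> 150e6,
          "fast SSD (~500 MB/s)"      -> 500e6,
          "DRAM (~20 GB/s)"           -> 20e9
        )

        for ((label, bytesPerSec) <- rates) {
          val days = bytes / bytesPerSec / 86400
          println(f"$label%-27s -> about $days%.1f days just to stream the data once")
        }
      }
    }
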
  • (Score: 1, Informative) by Anonymous Coward on Saturday October 24 2015, @02:19AM (#253873)

    The idea is to shard the data onto thousands of rack-mounted servers (x86_64 running Linux, for now), maybe more than 10,000 in the case of Google or FB. As for the JVM penalty that an AC mentioned, it doesn't matter much when the bottlenecks are disk I/O and network bandwidth.

    But if it does matter, you can go with MapR (they rewrote HDFS in C++).
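
[Ed. note: a minimal illustration of the sharding idea described above, i.e. spreading records across many servers by hashing a key. Real systems such as HDFS place fixed-size blocks rather than hashing keys, so this is only a sketch of the concept; the shard count and keys are made up.]

    object HashSharding {
      // Map a record key to one of numShards servers.
      def shardFor(key: String, numShards: Int): Int =
        Math.floorMod(key.hashCode, numShards)   // floorMod keeps the result non-negative

      def main(args: Array[String]): Unit = {
        val numShards  = 10000   // "maybe more than 10,000 servers"
        val sampleKeys = Seq("user:42", "user:43", "order:9000")

        for (k <- sampleKeys)
          println(s"$k -> shard ${shardFor(k, numShards)}")
      }
    }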