posted by cmn32480 on Friday October 23 2015, @09:53PM   Printer-friendly
from the that's-a-lot-of-stuff dept.

Apache Spark has brought a lot of innovation to the in-memory data processing area. With this framework, you can load data into cluster memory and work with it extremely fast in interactive mode (interactive mode is another important Spark feature, by the way). One year back (10/10/2014), Databricks announced that Apache Spark was able to sort 100 terabytes of data in 23 minutes.
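
For readers who have not used Spark, here is a minimal sketch of the interactive, in-memory workflow described above, as it might look in the Spark shell. The HDFS path, the tab-separated record layout, and the column positions are hypothetical, chosen purely for illustration:

    // Run inside `spark-shell`, which provides the SparkContext as `sc`.
    // The HDFS path and tab-separated record layout are hypothetical examples.
    val events = sc.textFile("hdfs:///data/events/*.log")
                   .map(_.split('\t'))
                   .cache()   // keep the parsed records in cluster memory

    // The first action materializes the cache; later queries hit memory
    // instead of re-reading the files from disk.
    println(events.count())
    println(events.filter(fields => fields(2) == "error").count())

With the default storage level, cached partitions that do not fit in memory are recomputed when needed rather than spilled, so the interactive feel depends on the working set (roughly) fitting in the cluster's RAM.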

Here is an interesting question: what is the limit on the amount of data you can process interactively in a cluster? What if you had 100 terabytes of memory in your cluster? Memory is fast, so intuition says you could use it to interactively process 100 terabytes of input data, or at least half that. However, as usual in the world of distributed systems, our intuition is wrong!

[Ed. addition.] The Databricks announcement has hard data on how they went about improving the Apache Spark code and on the performance improvements they saw. That story summarizes its results as follows:

Databricks, with the help of the Spark community, has contributed many improvements to Apache Spark to improve its performance, stability, and scalability. This enabled Databricks to use Apache Spark to sort 100 TB of data on 206 machines in 23 minutes, which is 3X faster than the previous Hadoop 100TB result on 2100 machines. Similarly, Databricks sorted 1 PB of data on 190 machines in less than 4 hours, which is over 4X faster than the previous Hadoop 1PB result on 3800 machines.
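
The record run itself used specialized benchmark code, but for a sense of how a large distributed sort is normally expressed in Spark, here is a rough sketch; the paths, the 10-byte key split, and the partition count are assumptions for illustration only:

    // Not the benchmark code -- just how a big distributed sort is
    // ordinarily written in Spark. Paths, key width, and partition
    // count below are made up.
    val records = sc.textFile("hdfs:///bench/input")
                    .map(line => (line.take(10), line.drop(10)))   // (key, value)

    val sorted = records.sortByKey(ascending = true, numPartitions = 4000)

    sorted.map { case (k, v) => k + v }
          .saveAsTextFile("hdfs:///bench/output")

The sortByKey call triggers a shuffle that range-partitions the keys across the cluster; at this scale it is that shuffle, with its disk and network I/O, rather than raw CPU, that usually dominates the job.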

Does anyone here work with data on this scale? What do you use for your database? Would you consider using Apache Spark?


Original Submission

 
  • (Score: 3, Informative) by TK 420 (2525) on Saturday October 24 2015, @01:19AM (#253861)

    Yes, I work with data on this scale, and we use Cassandra for the database with Spark. Started with Hadoop, then moved to Spark. Used Hive, then Shark, for SQL back to Cassandra. Used Oozie for scheduling, but ripped that out and went custom. (A rough sketch of this kind of Spark-over-Cassandra setup is included after this comment.)

    Can Spark process 100 terabytes of data? Sure, why not? The tech has gotten to the point where, if you have enough money and time, you can figure it out and at least get close. It's just that every project feels like a research project that may never come together, and expectations get thrown off by the idea that all data is the same, so if one project can do X in Y time then so should the next one.

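    For readers who have not seen the Cassandra-plus-Spark combination described above, here is a rough sketch using the DataStax spark-cassandra-connector. The keyspace, table, and column names are invented, and details such as connector version and connection settings will differ per cluster:

        // Assumes the DataStax spark-cassandra-connector is on the classpath and
        // spark.cassandra.connection.host is set in the Spark configuration.
        // Keyspace, table, and column names are invented for illustration.
        import com.datastax.spark.connector._

        val readings = sc.cassandraTable("sensors", "readings")   // RDD[CassandraRow]
                         .cache()

        val errorCount = readings
          .filter(row => row.getString("status") == "error")
          .count()

        println(s"error readings: $errorCount")

    Reading the table straight into a cached RDD is what lets you run ad-hoc queries over the Cassandra data without hitting the database again for every question.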
  • (Score: 3, Informative) by martyb (76) on Saturday October 24 2015, @01:49AM (#253867) Journal

    Thanks for that! Times sure have changed. About 20 years ago, I worked at a Major Retirement Fund company on a data warehouse application that was fed from all of their mutual fund databases and contained a whopping 150 GB of data. It ran on a Teradata system which, IIRC, had 100 processors and a huge number of disk drives. Now I could probably buy a workstation and load up enough RAM to do all that processing in memory. It would be much faster and at a fraction of the cost.

    The advance of technology's capabilities never ceases to amaze me.

    On the other hand, in light of the story we recently covered: Vanishing Haystacks: MIT's Data Science Machine & Deep Feature Synthesis [soylentnews.org], the ability of these systems to extract targeted information about people is getting to be scary. Just think of all of the data that has been captured and never really been put to use. I suspect that within a year or two, these data will be mined and there will be a new burst of spam, both paper and e-mail, trying to monetize that data.

    --
    Wit is intellect, dancing.
  • (Score: 2) by goodie (1877) on Saturday October 24 2015, @01:53AM (#253868) Journal

    Exactly... The headline reads like GPU benchmarking, data warehouse loads (MS used to claim that SSIS could load a huge amount of data, but their tests involved a pre-processed binary file, no transformations, and a direct output...), and other cases where things can work that way under ideal conditions. But with your own data, needs, hardware, etc., it's not that obvious.

    Now I wonder, though, what kind of companies use this much data. I can see insurance, finance, and the big tech companies (Google etc.), but are the technology, and the things you can do with this amount of data and processing power, available to other, unexpected players? I'm really curious about this :)