posted by martyb on Monday January 08 2018, @05:07PM
from the YAA!-Yet-Another-Algorithm! dept.

From Quanta Magazine:

To efficiently analyze a firehose of data, scientists first have to break big numbers into bits.

Computer programs that perform these kinds of on-the-go calculations are called streaming algorithms. Because data comes at them continuously, and in such volume, they try to record the essence of what they've seen while strategically forgetting the rest. For more than 30 years computer scientists have worked to build a better streaming algorithm. Last fall a team of researchers invented one that is just about perfect.

"We developed a new algorithm that is simultaneously the best" on every performance dimension, said Jelani Nelson, a computer scientist at Harvard University and a co-author of the work with Kasper Green Larsen of Aarhus University in Denmark, Huy Nguyen of Northeastern University and Mikkel Thorup of the University of Copenhagen.

This best-in-class streaming algorithm works by remembering just enough of what it's seen to tell you what it's seen most frequently. It suggests that compromises that seemed intrinsic to the analysis of streaming data are not actually necessary. It also points the way forward to a new era of strategic forgetting.

Small numbers are easier to keep track of than big numbers.

Imagine, for example, that you're monitoring a stream of numbers between zero and 50,000,000 (a task similar to logging internet users by their IP addresses). You could keep track of the numbers using a 50,000,000-term index, but it's hard to work with an index that size. A better way is to think of each eight-digit number as four two-digit numbers linked together.

Say you see the number 12,345,678. One memory-efficient way to remember it is to break it into four two-digit blocks: 12, 34, 56, 78. Then you can send each block to a sub-algorithm that calculates item frequencies: 12 goes to copy one of the algorithm, 34 goes to copy two, 56 goes to copy three, and 78 goes to copy four. Each sub-algorithm maintains its own index of what it's seen, but since each version never sees anything bigger than a two-digit number, each index only runs from 0 to 99.

An important feature of this splitting is that if the big number — 12,345,678 — appears frequently in your overall data stream, so will its two-digit components. When you ask each sub-algorithm to identify the numbers it has seen the most, copy one will spit out 12, copy two will spit out 34, and so on. You'll be able to find the most frequent members of a huge list just by looking for the frequent items in four much shorter lists.
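For concreteness, here's a toy Python sketch of the splitting described above (not the researchers' actual algorithm; plain Counters stand in for whatever compact frequency structure they use):

    from collections import Counter

    def split_blocks(n):
        # Break an 8-digit number into four 2-digit blocks,
        # e.g. 12,345,678 -> [12, 34, 56, 78].
        s = f"{n:08d}"
        return [int(s[i:i+2]) for i in range(0, 8, 2)]

    def most_frequent(stream):
        # One sub-counter per block position, each over the range 0-99 only.
        counters = [Counter() for _ in range(4)]
        for n in stream:
            for i, block in enumerate(split_blocks(n)):
                counters[i][block] += 1
        # Ask each sub-counter for its most frequent block, then glue them back.
        blocks = [c.most_common(1)[0][0] for c in counters]
        return int("".join(f"{b:02d}" for b in blocks))

    stream = [12_345_678] * 5 + [98_765_432] * 2 + [11_223_344]
    print(most_frequent(stream))  # -> 12345678

Four 100-entry tables replace one 50,000,000-entry index, which is the whole point of the split.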

I wonder if any Soylenters have heard of similar solutions.

Full Article
The paper at arxiv.org


Original Submission

 
  • (Score: 1, Informative) by Anonymous Coward on Monday January 08 2018, @05:10PM (#619584) (2 children)

    What's new here?

  • (Score: 2) by tibman (134) Subscriber Badge on Monday January 08 2018, @09:48PM (#619732)

    Scientist discovers Computer Science?

    --
    SN won't survive on lurkers alone. Write comments.
  • (Score: 3, Informative) by FatPhil (863) on Monday January 08 2018, @09:50PM (#619735) Homepage

    No, they've repurposed Bloom filters.

    There's no tree here, at least from what's described in TFS. They've collapsed ("projected" to be more mathematical, but I guess you could also say "hashed", for some embarrassingly trivial hash functions) a large index into 4 smaller flat indexes. No root, no internal nodes, no hierarchy, so no tree.

    Sounds shit, rather than "best ever" - if the two biggest hits are 01234567 and 89ABCDEF, equally frequent, this algorithm will be unable to tell which two of 01234567, 012345EF, 0123CD67, 0123CDEF, ... 8 more ..., 89AB4567, 89AB45EF, 89ABCD67, and 89ABCDEF are the real ones.
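    A quick demo of that ambiguity, with hex strings standing in for the numbers (the per-position Counters are purely illustrative):

        from collections import Counter

        def blocks(s):
            # Split an 8-character string into four 2-character blocks.
            return [s[i:i+2] for i in range(0, 8, 2)]

        counters = [Counter() for _ in range(4)]
        for x in ["01234567", "89ABCDEF"] * 10:
            for i, b in enumerate(blocks(x)):
                counters[i][b] += 1

        # Every position has two blocks tied at count 10, so all
        # 2**4 = 16 recombinations look like equally good heavy hitters.
        for i, c in enumerate(counters):
            print(i, c.most_common())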

    One way of improving things is to have 7 indexes of overlapping digit pairs, filing 01234567 under <01, 12, 23, 34, 45, 56, 67> and 89ABCDEF under <89, 9A, AB, BC, CD, DE, EF>. By following the peaks you can't "hop" from one number to the other, so that identifies just the two peaks, not any mix-and-match numbers.
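    A toy sketch of that 7-index scheme (the threshold and the Counters are just illustrative):

        from collections import Counter

        def pairs(s):
            # All 7 overlapping digit pairs of an 8-character string.
            return [s[i:i+2] for i in range(len(s) - 1)]

        counters = [Counter() for _ in range(7)]
        for x in ["01234567", "89ABCDEF"] * 10:
            for i, p in enumerate(pairs(x)):
                counters[i][p] += 1

        def follow_peaks(threshold=5):
            # Start from the frequent pairs in position 0 and extend one digit
            # at a time; each extension must overlap a frequent pair in the
            # next position, so you can't hop from one number to the other.
            cands = [p for p, n in counters[0].items() if n >= threshold]
            for i in range(1, 7):
                frequent = [p for p, n in counters[i].items() if n >= threshold]
                cands = [c + p[1] for c in cands for p in frequent if p[0] == c[-1]]
            return cands

        print(follow_peaks())  # -> ['01234567', '89ABCDEF'], no mix-and-match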

    Which is rather similar to how Google etc. do full-text search, except they'll index trigrams or tetragrams. Documents containing "excr", "xcre", "cret", and "reta" almost certainly contain "excreta" (they do something more fancy, but requiring way more space, thus not particularly solving the "big" part of the "big data" problem).
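    A toy trigram index in the same spirit (real engines' posting lists are far more elaborate):

        from collections import defaultdict

        docs = {1: "rats deposit excreta", 2: "a secret excretion", 3: "concrete"}
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for i in range(len(text) - 2):
                index[text[i:i+3]].add(doc_id)

        def candidates(query):
            # Intersect the posting lists of the query's trigrams; survivors
            # still need a verification pass against the actual text.
            result = set(docs)
            for i in range(len(query) - 2):
                result &= index[query[i:i+3]]
            return result

        print(candidates("excreta"))  # -> {1}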
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves