Best-Ever Algorithm Found for Huge Streams of Data

posted by martyb on Monday January 08 2018, @05:07PM   Printer-friendly
from the YAA!-Yet-Another-Algorithm! dept.

From Quanta Magazine:

To efficiently analyze a firehose of data, scientists first have to break big numbers into bits.

Computer programs that perform these kinds of on-the-go calculations are called streaming algorithms. Because data comes at them continuously, and in such volume, they try to record the essence of what they've seen while strategically forgetting the rest. For more than 30 years computer scientists have worked to build a better streaming algorithm. Last fall a team of researchers invented one that is just about perfect.

"We developed a new algorithm that is simultaneously the best" on every performance dimension, said Jelani Nelson, a computer scientist at Harvard University and a co-author of the work with Kasper Green Larsen of Aarhus University in Denmark, Huy Nguyen of Northeastern University and Mikkel Thorup of the University of Copenhagen.

This best-in-class streaming algorithm works by remembering just enough of what it's seen to tell you what it's seen most frequently. It suggests that compromises that seemed intrinsic to the analysis of streaming data are not actually necessary. It also points the way forward to a new era of strategic forgetting.

Small numbers are easier to keep track of than big numbers.

Imagine, for example, that you're monitoring a stream of numbers between zero and 50,000,000 (a task similar to logging internet users by their IP addresses). You could keep track of the numbers using a 50,000,000-term index, but it's hard to work with an index that size. A better way is to think of each eight-digit number as four two-digit numbers linked together.

Say you see the number 12,345,678. One memory-efficient way to remember it is to break it into four two-digit blocks: 12, 34, 56, 78. Then you can send each block to a sub-algorithm that calculates item frequencies: 12 goes to copy one of the algorithm, 34 goes to copy two, 56 goes to copy three, and 78 goes to copy four. Each sub-algorithm maintains its own index of what it's seen, but since each version never sees anything bigger than a two-digit number, each index only runs from 0 to 99.

An important feature of this splitting is that if the big number — 12,345,678 — appears frequently in your overall data stream, so will its two-digit components. When you ask each sub-algorithm to identify the numbers it has seen the most, copy one will spit out 12, copy two will spit out 34, and so on. You'll be able to find the most frequent members of a huge list just by looking for the frequent items in four much shorter lists.
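For the curious, the splitting scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea from the summary: the plain `Counter` objects here stand in for the paper's memory-bounded frequency sub-algorithms, which are far more sophisticated.

```python
from collections import Counter

def split_blocks(n):
    """Split an 8-digit number into four 2-digit blocks."""
    s = f"{n:08d}"
    return [int(s[i:i + 2]) for i in range(0, 8, 2)]

def most_frequent_blocks(stream):
    """One Counter per block position stands in for each sub-algorithm;
    each one only ever sees values from 0 to 99."""
    counters = [Counter() for _ in range(4)]
    for n in stream:
        for pos, block in enumerate(split_blocks(n)):
            counters[pos][block] += 1
    # Ask each sub-algorithm for the item it has seen the most.
    return [c.most_common(1)[0][0] for c in counters]

stream = [12345678] * 5 + [99999999] * 2
print(most_frequent_blocks(stream))  # [12, 34, 56, 78]
```

Each counter's keyspace is only 0–99 regardless of how large the original numbers are, which is the memory win the summary describes.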

I wonder if any Soylenters have heard of similar solutions.

Full Article
The paper at arxiv.org


Original Submission

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 1, Informative) by Anonymous Coward on Monday January 08 2018, @05:10PM (2 children)

    by Anonymous Coward on Monday January 08 2018, @05:10PM (#619584)

    What's new here?

    • (Score: 2) by tibman on Monday January 08 2018, @09:48PM

      by tibman (134) Subscriber Badge on Monday January 08 2018, @09:48PM (#619732)

      Scientist discovers Computer Science?

      --
      SN won't survive on lurkers alone. Write comments.
    • (Score: 3, Informative) by FatPhil on Monday January 08 2018, @09:50PM

      by FatPhil (863) <reversethis-{if.fdsa} {ta} {tnelyos-cp}> on Monday January 08 2018, @09:50PM (#619735) Homepage
      No, they've repurposed bloom filters.

      There's no tree here, at least from what's described in TFS. They've collapsed ("projected", to be more mathematical, but I guess you could also say "hashed", for some embarrassingly trivial hash functions) a large index into 4 smaller flat indexes. No root, no internal nodes, no hierarchy, so no tree.

      Sounds shit, rather than "best ever" - if the two biggest hits are for 01234567 and 89ABCDEF equally, this algorithm will be unable to tell which two of 01234567, 012345EF, 0123CD67, 0123CDEF, ... 8 more ..., 89AB4567, 89AB45EF, 89ABCD67, and 89ABCDEF are the real hits.

      One way of improving things is to have 7 indexes, and file 01234567 under respectively <01, 12, 23, 34, 45, 56, 67>, and 89ABCDEF under <89, 9A, AB, BC, CD, DE, EF>. By following the peaks, you can't "hop" from one number to the other, so that will identify just the two peaks, not any mix-and-match numbers.

      Which is rather similar to how Google etc. do full-text search, except they'll index trigrams or tetragrams. Documents containing "excr", "xcre", "cret", and "reta" almost certainly contain "excreta". (They do something more fancy, but it requires way more space, thus not particularly solving the "big" part of the "big data" problem.)
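      The overlapping-pairs scheme above can be sketched as follows. This is a toy illustration, not anything from the paper: the greedy chain-following and the choice of hex strings are my own assumptions.

```python
from collections import Counter

def overlapping_pairs(s):
    """Every adjacent digit pair: 7 pairs for an 8-character string."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def reconstruct_top(stream):
    """Count pairs per position, then greedily follow the most frequent
    pair whose first digit overlaps the previous pair's last digit."""
    n = len(stream[0])
    counters = [Counter() for _ in range(n - 1)]
    for s in stream:
        for i, pair in enumerate(overlapping_pairs(s)):
            counters[i][pair] += 1
    # Start from the most frequent first pair, extend one digit at a time.
    result = counters[0].most_common(1)[0][0]
    for i in range(1, n - 1):
        candidates = [(cnt, p) for p, cnt in counters[i].items()
                      if p[0] == result[-1]]
        result += max(candidates)[1][1]
    return result

stream = ["01234567"] * 4 + ["89ABCDEF"] * 3
print(reconstruct_top(stream))  # 01234567
```

      Because every pair shares a digit with its neighbour, a chain of frequent pairs cannot "hop" between the two real numbers, which is the point the parent comment makes.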
      --
      Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
  • (Score: 0) by Anonymous Coward on Monday January 08 2018, @05:28PM (1 child)

    by Anonymous Coward on Monday January 08 2018, @05:28PM (#619588)

    I checked the paper to see how it performed and found nothing. Without looking closely, I bet this is like how it was mathematically proven that a single-layer neural network was all you ever needed, because it could approximate any function. That proof turned out to be irrelevant to any real-world use case, since it ignored time/energy/etc.

  • (Score: 4, Interesting) by bob_super on Monday January 08 2018, @06:06PM (3 children)

    by bob_super (1357) on Monday January 08 2018, @06:06PM (#619604)

    That second paragraph ignores boundary effects. If you cut 1299 into 12 and 99, and your input regularly produces numbers around 1300, you're going to miss that: one part registers 12 and 13, neither necessarily high enough to get tagged as significant, and the other algo cannot see that the 95-to-99 and 00-to-05 hits are actually linked.

    • (Score: 1, Insightful) by Anonymous Coward on Monday January 08 2018, @07:18PM (2 children)

      by Anonymous Coward on Monday January 08 2018, @07:18PM (#619648)

      I was thinking the same: this approach loses any and all correlation between these arbitrarily chosen buckets. To keep in the spirit of IP addresses, if your hits are coming uniformly from all RFC1918 private IP addresses, this algorithm will probably identify 10.168.0.0/16 as the most prevalent source (or worse, if the 24-bit block is underrepresented, it will identify the non-existent source 172.168.x.y).

      • (Score: 1) by rylyeh on Tuesday January 09 2018, @05:10AM (1 child)

        by rylyeh (6726) <kadathNO@SPAMgmail.com> on Tuesday January 09 2018, @05:10AM (#619867)
        From the TFA:

        The main problem with this divide-and-conquer strategy is that while it’s easy to split a big number into small numbers, the reverse is trickier — it’s hard to fish out the right small numbers to recombine to give you the right big number.

        Imagine, for example, that your data stream frequently includes two numbers that have some digits in common: 12,345,678 and 12,999,999. Both start with 12. Your algorithm splits each number into four smaller numbers, then sends each to a sub-algorithm. Later, you ask each sub-algorithm, “Which numbers have you seen most frequently?” Copy one is going to say, “I’ve seen a lot of the number 12.” An algorithm that’s trying to identify which eight-digit numbers it’s seen most frequently can’t tell if all these 12s belong to one eight-digit number or, as in this case, to two different numbers.

        “The challenge is to figure out which two-digit blocks to concatenate with which other two-digit blocks,” Nelson said.

        The authors of the new work solve this dilemma by packaging each two-digit block with a little tag that doesn’t take up much memory but still allows the algorithm to put the two-digit pieces back together in the right way.

        To see one simple approach to how the tagging might work, start with 12,345,678 and split it into two-digit blocks. But this time, before you send each block to its respective sub-algorithm, package the block with a pair of unique identifying numbers that can be used to put the blocks back together. The first of these tags serves as the block's name, the second as a link. In this way, 12,345,678 becomes:

        (12, 0, 1), (34, 1, 2), (56, 2, 3), (78, 3, 4)

        Here the number 12 has the name “0” and gets linked to the number named “1.” The number 34 has the name “1” and gets linked to the number named “2.” And so on.

        Now when the sub-algorithms return the two-digit blocks they’ve seen most frequently, 12 goes looking for a number tagged with “1” and finds 34, then 34 goes looking for a number tagged with “2” and finds 56, and 56 goes looking for a number tagged with “3” and finds 78.

        In this way, you can think of the two-digit blocks as links in a chain, with the links held together by these extra tagging numbers.
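        A minimal sketch of the (name, link) tagging just described, assuming the simple sequential tags from the excerpt (the real algorithm's tags are more compact and more robust):

```python
def tag_blocks(n):
    """Split an 8-digit number into tagged blocks: (block, name, link)."""
    s = f"{n:08d}"
    return [(s[2 * i:2 * i + 2], i, i + 1) for i in range(4)]

def reassemble(tagged):
    """Follow the chain: start at name 0, repeatedly find the block
    whose name matches the previous block's link."""
    by_name = {name: (block, link) for block, name, link in tagged}
    result, name = "", 0
    while name in by_name:
        block, name = by_name[name]
        result += block
    return int(result)

tags = tag_blocks(12345678)
print(tags)              # [('12', 0, 1), ('34', 1, 2), ('56', 2, 3), ('78', 3, 4)]
print(reassemble(tags))  # 12345678
```

        As the article notes next, a single corrupted tag breaks the whole chain, which is what motivates the expander-graph step.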

        The problem with chains, of course, is that they're only as strong as their weakest link. And these chains are almost guaranteed to break.

        Building Blocks

        No algorithm works perfectly every time you run it — even the best ones misfire some small percentage of the time. In the example we’ve been using, a misfire could mean that the second two-digit block, 34, gets assigned an incorrect tag, and as a result, when it goes looking for the block it’s supposed to be joined to, it doesn’t have the information it needs to find 56. And once one link in the chain fails, the entire effort falls apart.

        Mikkel Thorup, a computer scientist at the University of Copenhagen, helped develop an error-resistant way of remembering data.

        To avoid this problem, the researchers use what’s called an “expander graph.” In an expander graph, each two-digit block forms a point. Points get connected by lines (according to the tagging process described above) to form a cluster. The important feature of an expander graph is that instead of merely connecting each point with its adjoining blocks, you connect each two-digit block with multiple other blocks. For example, with 12,345,678, you connect 12 with 34 but also with 56, so that you can still tell that 12 and 56 belong in the same number even if the link between 12 and 34 fails.

        An expander graph doesn’t always come out perfectly. Sometimes it’ll fail to link two blocks that should be linked. Or it’ll link two blocks that don’t belong together. To counteract this tendency, the researchers developed the final step of their algorithm: a “cluster-preserving” sub-algorithm that can survey an expander graph and accurately determine which points are meant to be clustered together and which aren’t, even when some lines are missing and false ones have been added.

        “This guarantees I can recover something that looks like the original clusters,” Thorup said.

        And while Twitter isn’t going to plug in the expander sketch tomorrow, the techniques underlying it are applicable to a far wider range of computer science problems than tallying tweets. The algorithm also proves that certain sacrifices that previously seemed necessary to answer the frequent-items problem don’t need to be made.

        I'm hearing that Bloom Clusters [sagepub.com] are the same, but this seems like a more standard tree approach to me. Of course, I didn't READ all of it.

        --
        "a vast crenulate shell wherein rode the grey and awful form of primal Nodens, Lord of the Great Abyss."
        • (Score: 1) by rylyeh on Tuesday January 09 2018, @05:18AM

          by rylyeh (6726) <kadathNO@SPAMgmail.com> on Tuesday January 09 2018, @05:18AM (#619871)

          OK - I read the details on the bloom filters and agree, on the surface it seems similar to me as well.
          My original thought was that this 'best' algorithm reminded me of pointers and bytes.

          --
          "a vast crenulate shell wherein rode the grey and awful form of primal Nodens, Lord of the Great Abyss."
  • (Score: 0) by Anonymous Coward on Monday January 08 2018, @06:18PM (1 child)

    by Anonymous Coward on Monday January 08 2018, @06:18PM (#619613)

    I wonder if any Soylenters have heard of similar solutions.

    From a cursory look, what about a bloom filter?
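    For comparison, a minimal Bloom filter sketch (the standard textbook construction, not anything from TFA): a bit array plus k hash functions gives set membership with false positives but no false negatives.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 hashes of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("12345678")
print("12345678" in bf)  # True (a Bloom filter never gives false negatives)
```

    Note that a plain Bloom filter only answers membership queries; it does not count frequencies or identify the heaviest hitters, which is the problem TFA's algorithm targets.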

    • (Score: 2) by frojack on Monday January 08 2018, @07:06PM

      by frojack (1554) Subscriber Badge on Monday January 08 2018, @07:06PM (#619644) Journal

      Isn't this exactly how any full text indexing system works, with numbers instead of words?

      Wiki even explains some of the problems [wikipedia.org] in retrieval from such systems, as mentioned in other comments above.

      --
      No, you are mistaken. I've always had this sig.
  • (Score: 2, Insightful) by Anonymous Coward on Monday January 08 2018, @06:29PM (3 children)

    by Anonymous Coward on Monday January 08 2018, @06:29PM (#619621)

    Shouldn't this be titled "Another Algorithm Found for Huge Streams of Data" instead of "best ever", given that time will continue for a while and they may find something better tonight?

    • (Score: 2) by maxwell demon on Monday January 08 2018, @08:23PM

      by maxwell demon (1608) Subscriber Badge on Monday January 08 2018, @08:23PM (#619674) Journal

      Well, assuming this algorithm is really better than anything that existed before it, then "best so far" would be appropriate. "Another" is certainly too weak, as it says nothing at all about its quality. For example, if someone finds an unknown sort algorithm that is worse than bubble sort, that also qualifies as "another sort algorithm found", but probably would not even be newsworthy.

      --
      The Tao of math: The numbers you can count are not the real numbers.
    • (Score: 1) by nitehawk214 on Monday January 08 2018, @08:48PM (1 child)

      by nitehawk214 (1304) on Monday January 08 2018, @08:48PM (#619691)

      Agreed. "Best Ever" makes it sound like they proved P = NP, in such that there is mathematically no better algorithm to be found.

      --
      "Don't you ever miss the days when you used to be nostalgic?" -Loiosh
      • (Score: 1, Funny) by Anonymous Coward on Monday January 08 2018, @09:50PM

        by Anonymous Coward on Monday January 08 2018, @09:50PM (#619734)

        then "best so far" would be appropriate.

        Agreed. "Best Ever" makes it sound like they proved P = NP,

        The author speaks Trumpese.

  • (Score: 2) by MichaelDavidCrawford on Monday January 08 2018, @08:34PM

    by MichaelDavidCrawford (2339) Subscriber Badge <mdcrawford@gmail.com> on Monday January 08 2018, @08:34PM (#619682) Homepage Journal

    Apache's Xerces-C (really C++) provides APIs for DOM and SAX.

    Some cluebot posted to its user list that Xerces-C was very slow because it required an hour to read a 20 MB XML file.

    Xerces-C's DOM implementation creates an object for each individual item in the input. SAX calls event handlers when an item is opened and when it is closed.

    --
    Yes I Have No Bananas. [gofundme.com]
  • (Score: 0) by Anonymous Coward on Monday January 08 2018, @09:14PM

    by Anonymous Coward on Monday January 08 2018, @09:14PM (#619706)

    Each guy on the assembly line only knows about the part he is making. He doesn't see the final product. But he knows every detail of that bracket he just made down to the millimeter. Maybe the word is specialization. Delegation? But where context matters, you can't lose sight of the peripheral outside your cubicle. Computers can handle an indefinite number of inputs at once, and filter out a lot less than the brain has to. Inside, this is how the brain functions anyway. Unfortunately it does too good a job, so we have to build these infernal contraptions (frequency shifters, compressor/expanders is all they are) to see and hear all the "invisible/silent" shit! We wouldn't need them if we would just open our minds to the cacophony around us, the beeping and flashing lights, beeping, and flashing, I can't take it anymore!

  • (Score: 0) by Anonymous Coward on Monday January 08 2018, @10:43PM

    by Anonymous Coward on Monday January 08 2018, @10:43PM (#619752)

    Where have they been for the last 40+ years?? Sorting algorithms handling tree-structured processing, God! Break work up to get more done in less space.
