
SoylentNews is people

posted by Fnord666 on Tuesday December 13 2016, @07:46AM
from the come-on-feel-the-noize dept.

This insight into the information that can be gleaned from data is cool and worrisome in equal measure.

Early in his talk, computer scientist John Hopcroft noted a funny fact about clustering algorithms: they work better on synthetic data than real data. But this is more than an odd tidbit about software.

[...] When we invent our own synthetic data, we try to mimic real data by mixing true information with random distraction–combining "signal" with "noise." But in real data, the divide isn't so clear. What often looks like noise turns out to be the deep structure we haven't grasped yet.

Hopcroft's insight: data doesn't just have one structure. It has many. If I scanned notebooks from a hundred people, and made a database of all the individual letters, I could sort them lots of ways. Alphabetically. Capital/lowercase. Size. Darkness. Handwriting. Each of these is a different layer of structure.

And to understand data–and the world–you've got to reckon with all those layers.
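
A minimal sketch of the "many structures" point, assuming a tiny made-up table of scanned letters (the records, field names, and the `group_by` helper are hypothetical, not from the talk). The same four records fall into different groups depending on which attribute you sort on:

```python
from collections import defaultdict

# Hypothetical records: one entry per scanned letter.
letters = [
    {"char": "a", "writer": "alice"},
    {"char": "B", "writer": "alice"},
    {"char": "b", "writer": "bob"},
    {"char": "A", "writer": "bob"},
]

def group_by(records, key):
    """Partition records into groups according to key(record)."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec["char"])
    return dict(groups)

# Three different "layers of structure" over the same data.
by_letter = group_by(letters, lambda r: r["char"].lower())
by_case = group_by(letters, lambda r: "upper" if r["char"].isupper() else "lower")
by_writer = group_by(letters, lambda r: r["writer"])

print(by_letter)  # {'a': ['a', 'A'], 'b': ['B', 'b']}
print(by_case)    # {'lower': ['a', 'b'], 'upper': ['B', 'A']}
print(by_writer)  # {'alice': ['a', 'B'], 'bob': ['b', 'A']}
```

None of the three groupings is the "right" one; an algorithm that finds only one of them would treat the other two as noise.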

The part of the video which discusses the above starts around 5:45.



This discussion has been archived. No new comments can be posted.
  • (Score: 0) by Anonymous Coward on Tuesday December 13 2016, @08:39AM

    by Anonymous Coward on Tuesday December 13 2016, @08:39AM (#440723)

    For example, Hopcroft and his colleagues ran their algorithm on Facebook data from Rice University. They had sparse information: no names, no profiles, just who was friends with whom—a skeleton network of connections. Based on this, their algorithm quickly sorted the students into nine clusters.

    I can't help but wonder if there was a Facebook profile for each dorm that you would only friend to get news if you were part of that dorm... and similarly for the year level.

    I'd also expect that they looked for "communities" within each cluster. In the case of the dorms, you would expect to find 1-4 communities within each of the 9, proportional to the distribution of student years. I think there is something wrong about treating these as parallel rather than hierarchical clusters. Maybe not, since I can't put my finger on it at the moment... but I feel like it is throwing out information.
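
One way to pin down the "throwing out information" worry, as a sketch with made-up students (the data and the `marginals`/`nested` helpers are hypothetical, and this is not the algorithm from the talk). Two populations can look identical when dorm and year are counted as parallel clusterings, yet differ once year is nested inside dorm:

```python
from collections import Counter

# Hypothetical (dorm, year) labels for two small student populations.
campus_a = [("north", 1), ("north", 1), ("north", 2),
            ("south", 1), ("south", 2), ("south", 2)]
campus_b = [("north", 1), ("north", 2), ("north", 2),
            ("south", 1), ("south", 1), ("south", 2)]

def marginals(students):
    """Parallel view: count dorms and years independently."""
    dorms = Counter(dorm for dorm, _ in students)
    years = Counter(year for _, year in students)
    return dorms, years

def nested(students):
    """Hierarchical view: count years within each dorm."""
    return Counter(students)

# The parallel view cannot tell the two campuses apart...
print(marginals(campus_a) == marginals(campus_b))  # True
# ...but the nested view can.
print(nested(campus_a) == nested(campus_b))        # False
```

The nested counts determine the parallel ones but not the other way around, which is the sense in which the parallel view discards information.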

    • (Score: -1, Offtopic) by Anonymous Coward on Tuesday December 13 2016, @12:31PM

      by Anonymous Coward on Tuesday December 13 2016, @12:31PM (#440757)

      Maybe he was trying to be politically correct?

  • (Score: 3, Insightful) by Anonymous Coward on Tuesday December 13 2016, @08:45AM

    by Anonymous Coward on Tuesday December 13 2016, @08:45AM (#440725)

    Was going to read, but video? Ugh.

  • (Score: 2) by wonkey_monkey on Tuesday December 13 2016, @08:50AM

    by wonkey_monkey (279) on Tuesday December 13 2016, @08:50AM (#440726)

    Sometimes the Noise is Signals, Too

    Why's it so smokey in here?

    --
    systemd is Roko's Basilisk
  • (Score: 2) by VLM on Tuesday December 13 2016, @12:48PM

    by VLM (445) on Tuesday December 13 2016, @12:48PM (#440766)

    I'm not watching the video, too slow.

    John Hopcroft

    Ah ha, that name sounded familiar: I took an automata theory class with Ullman, and Ullman and Hopcroft coauthored the (currently dominant?) textbook on automata theory. I always figured Hopcroft was an automata guy, but most CS folks are pretty wide ranging, so I guess it's no great surprise he's talking randomness and stuff.

    At least from the summary, look on the bright side: you're a lot better off if your clustering algo proves your measured randomness isn't random than if your clustering algo proves your artificial randomness source isn't random.

    • (Score: 1) by charon on Tuesday December 13 2016, @06:21PM

      by charon (5660) on Tuesday December 13 2016, @06:21PM (#440901)
      I included the video link for completeness' sake, but the interesting portion of the talk is the part written about by the article author.
      • (Score: 2) by VLM on Tuesday December 13 2016, @06:44PM

        by VLM (445) on Tuesday December 13 2016, @06:44PM (#440920)

        I was thinking more about the topic, and it's interesting that he specifically mentioned clustering algos as being kinda unreliable, whereas stuff like telecom/RF data transfer proves out deep into the decimal places.

        They should be very similar. After all, here's an analog stream of QAM where you cluster the instantaneous phase and amplitude into certain bins, where each bin passes multiple bits of data, and it works very well in practice. So you'd expect that a more generic clustering algo fed something other than a telecom RF data feed would work as well or better, but he implies there's a lot of fuzzy going on such that it doesn't.

        Can't even use the excuse of low SNR, thanks to the physicists and EEs you can do crazy low SNR telecom stuff with space probes that works out to the theoretical limit pretty well indeed... unless you believe the whole space program / moon landing stuff is a hoax. So crazy low SNR census data correlations or whatever should cluster well, like, hey why not?
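
A rough sketch of the QAM comparison above, assuming the simplest case (4-QAM with unit-spaced points and Gaussian noise; the constellation mapping and function names are made up for illustration). The receiver's "clustering" is just nearest-point assignment, and one plausible reason it behaves so well is that the cluster centers are fixed by design rather than discovered from the data:

```python
import random

# 4-QAM constellation: each point carries two bits. The mapping is an
# illustrative assumption, not taken from any particular standard.
CONSTELLATION = {
    (0, 0): complex(1, 1),
    (0, 1): complex(-1, 1),
    (1, 1): complex(-1, -1),
    (1, 0): complex(1, -1),
}

def demodulate(sample):
    """Assign a received sample to the nearest known constellation point."""
    return min(CONSTELLATION, key=lambda bits: abs(sample - CONSTELLATION[bits]))

def symbol_error_rate(noise_std, n_symbols=2000, seed=0):
    """Send random symbols through additive Gaussian noise and count misses."""
    rng = random.Random(seed)
    labels = list(CONSTELLATION)
    errors = 0
    for _ in range(n_symbols):
        sent = rng.choice(labels)
        noise = complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
        if demodulate(CONSTELLATION[sent] + noise) != sent:
            errors += 1
    return errors / n_symbols

print(demodulate(complex(0.9, 1.2)))  # (0, 0)
print(symbol_error_rate(0.1))         # low noise
print(symbol_error_rate(1.0))         # heavy noise
```

A generic clustering algorithm gets none of this for free: it has to guess how many clusters exist, where their centers are, and what counts as noise, which may be part of why real data is so much fuzzier than an RF link.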

  • (Score: -1, Offtopic) by Anonymous Coward on Tuesday December 13 2016, @02:06PM

    by Anonymous Coward on Tuesday December 13 2016, @02:06PM (#440781)

    Would you pay a few thousand bucks for a vacation to Germany if it also included this? [heidelberg-laureate-forum.org]

  • (Score: 0) by Anonymous Coward on Tuesday December 13 2016, @07:28PM

    by Anonymous Coward on Tuesday December 13 2016, @07:28PM (#440947)

    Let this algorithm figure out the unhealthy tissue in a scan of your (child's) brain. Go ahead with the surgery.

    "2 mysterious layers" lol... operate on them or no? Come on, decide - it may be CANCER.

    More likely: There are 2 mysterious layers, Sir. Send in the drones, Sir? Obviously yes, but how many drones? A squadron or the entire battle fleet?

  • (Score: 0) by Anonymous Coward on Tuesday December 13 2016, @11:37PM

    by Anonymous Coward on Tuesday December 13 2016, @11:37PM (#441073)

    two mystery layers?
    maybe male and female and straight and gay?