Slash Boxes

SoylentNews is people

posted by n1 on Saturday June 28 2014, @09:51AM   Printer-friendly
from the /dev/null-grouping dept.

A new algorithm has been published that simplifies grouping data sets together according to their similarity, sometimes referred to as Cluster Analysis [CA].

Data sets can be imagined as "clouds" of data points in a multidimensional space. These points are generally differently distributed: more widely scattered in one area and denser in another. CA is used to identify the denser areas efficiently, grouping the data in a certain number of significant subsets on the basis of this criterion. Each subset corresponds to a category.

"Think of a database of facial photographs ", explains Alessandro Laio, professor of Statistical and Biological Physics at SISSA. "The database may contain more than one photo of the same person, so CA us used to group all the pictures of the same individual. This type of analysis is carried out by automatic facial recognition systems, for example".

"We tried to devise a more efficient algorithm than those currently used, and one capable of solving some of the classic problems of CA", continues Laio.

"Our approach is based on a new way of identifying the centre of the cluster, i.e., the subsets", explains Alex Rodrigez, co-author of the paper. "Imagine having to identify all the cities in the world, without having access to a map. A huge task", says Rodriguez. "We therefore identified a heuristic, that is, a simple rule or a sort of shortcut to achieve the result".

To find out if a place is a city, we can ask each inhabitant to count his "neighbours", in other words, how many people live within 100 metres from his house. Once we have this number, we then go on to find, for each inhabitant, the shortest distance at which another inhabitant with a greater number of neighbours lives. "Together, these two data", explains Laio, "tell us how densely populated is the area where an individual lives and the distance between individuals who have the most neighbours. By automatically cross-checking these data, for the entire world population, we can identify the individuals who represent the centres of the clusters, which correspond to the various cities". "Our algorithm performs precisely this kind of calculation, and it can be applied to many different settings", adds Rodriguez.


This discussion has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 0) by Anonymous Coward on Saturday June 28 2014, @10:18AM

    by Anonymous Coward on Saturday June 28 2014, @10:18AM (#61306)

    for data! Free the innocent bits!

    • (Score: 4, Insightful) by c0lo on Saturday June 28 2014, @10:39AM

      by c0lo (156) Subscriber Badge on Saturday June 28 2014, @10:39AM (#61311) Journal

      Free the innocent bits!

      Start with TFA (fscking journals. $20 for accessing one paper and this even I before know if it worths it! Why, that the subscription fee for a whole year on SN!).

      • (Score: 1, Informative) by Anonymous Coward on Saturday June 28 2014, @01:46PM

        by Anonymous Coward on Saturday June 28 2014, @01:46PM (#61329)

        The online science/technical journals are doing a bang-up job
        locking up scientific progress they bought from reasearchers
        (sometimes at taxpayer expense) behind a paywall.

        I couldn't find a free source of a particular piece of information I found I could use.

        All I could find were abstracts and paywalls for this information.

        Oh well....

        Scientists gotta eat too....

        It's ironic that I could go on YouTube and find just about any
        old/obscure/out of print entertainment media I wanted to see
        and listen to.

        Aaron Swartz tried buck the system and 'free' scientific research
        and commited suicide for his troubles as a result of UNBEARABLE
        legal pressure and the prospect spending MOST of his life behind
        bars in prison for systematically liberating large(huge?) amounts
        of taxpayer-funded research that was being collected and put
        behind a wall requiring exclusive membership or the payment
        of money in order to access it.... []

        • (Score: 3, Insightful) by opinionated_science on Saturday June 28 2014, @03:36PM

          by opinionated_science (4031) on Saturday June 28 2014, @03:36PM (#61351)

          yes, and we in the sciences are well aware of this. We can try and publish in open-access but it costs $$ too. In effect, all publicly funded research (any pecentage), should be open access, perhaps after 6 months? Only privately funded research should be in paid for journals, since ultimately it helps their bottom line.

      • (Score: 2) by Geotti on Saturday June 28 2014, @09:48PM

        by Geotti (1146) on Saturday June 28 2014, @09:48PM (#61412) Journal

        The abstract is here: [] (Clustering by fast search and find of density peaks)

        I do have access, but the paper is only ~5 pages long without the supplementary material.

        If you don't have access, you can usually ask the author for a personal copy, their emails are: (Alex Rodriguez) and (Alessandro Laio).

        But yeah, I was crossing my fingers that we'd have a subscription, when I hit refresh after connecting to my institute's VPN.

  • (Score: 2) by meisterister on Saturday June 28 2014, @03:41PM

    by meisterister (949) on Saturday June 28 2014, @03:41PM (#61353) Journal

    My main concern: how fast is this algorithm? I'm petty sure that you can do this with an SVM (support vector machine), but, if I'm not mistaken, it takes too long to get very fine separation on large datasets. Also, if you define the different regions too tightly, then the algorithm doesn't work well on anything but the training dataset.

    (May or may not have been) Posted from my K6-2, Athlon XP, or Pentium I/II/III.
    • (Score: 3, Interesting) by c0lo on Sunday June 29 2014, @12:14AM

      by c0lo (156) Subscriber Badge on Sunday June 29 2014, @12:14AM (#61455) Journal

      My main concern: how fast is this algorithm?

      Can't be lower than N*(N-1)/2, because the "distances" between the all the samples need to be computed at least once.

      I'm petty sure that you can do this with an SVM (support vector machine)

      The way I know, SVM is in the "supervised learning" category - meaning the classification will require an a-priory knowledge about the number of classes to be recognized and, for each class, a rich enough number of samples that the "supervisor" tagged in advance.
      This algo seems to figure by itself which/how many are classes that need recognized (self-organized learning) - probably the "neighborhood threshold-distance" determines the number of classes detected.

    • (Score: 2, Interesting) by TGV on Sunday June 29 2014, @05:19AM

      by TGV (2838) on Sunday June 29 2014, @05:19AM (#61521)

      It seems to me SVMs work differently from the description in the article, which sounds more like k-clustering. It also sounds as if it could be implemented reasonably efficiently, possibly faster than SVM (which is quite difficult to implement efficiently for large sets), but the devil is as always in the details. This might be a heuristic that works in certain cases, but perhaps these cases happen to be of practical interest.

  • (Score: 0) by Anonymous Coward on Saturday June 28 2014, @03:58PM

    by Anonymous Coward on Saturday June 28 2014, @03:58PM (#61356)

    It seems like everyone is trying to take something old and give it a new spin. Given how advanced we are in our understanding of mathematics and statistics I find it very hard to believe something like this is that novel. The next step ... get a patent!!!

  • (Score: 3, Interesting) by TheLink on Saturday June 28 2014, @06:18PM

    by TheLink (332) on Saturday June 28 2014, @06:18PM (#61375) Journal
    Say you have a site where everyone can "rate" anything and maybe put a short review. Then you run some algos to go through all the ratings and try to find groups of similar people. Then you can allow people to see things through other people's perspectives, not merely "others who bought X also bought Y". This could make it easier to buy gifts for other people, without hopefully the silliness of Google/Amazon/etc recommending My Little Pony to you forever just because you bought your niece a present.

    To me Amazon or even Facebook are in a good position to do something like this.

    But what algorithm would be good?