A new algorithm has been published that simplifies grouping data according to their similarity, a task sometimes referred to as cluster analysis (CA).
Data sets can be imagined as "clouds" of data points in a multidimensional space. These points are generally unevenly distributed: more widely scattered in one area and denser in another. CA is used to identify the denser areas efficiently, grouping the data into a certain number of significant subsets on the basis of this criterion. Each subset corresponds to a category.
"Think of a database of facial photographs", explains Alessandro Laio, professor of Statistical and Biological Physics at SISSA. "The database may contain more than one photo of the same person, so CA is used to group all the pictures of the same individual. This type of analysis is carried out by automatic facial recognition systems, for example".
"We tried to devise a more efficient algorithm than those currently used, and one capable of solving some of the classic problems of CA", continues Laio.
"Our approach is based on a new way of identifying the centres of the clusters, i.e., of the subsets", explains Alex Rodriguez, co-author of the paper. "Imagine having to identify all the cities in the world, without having access to a map. A huge task", says Rodriguez. "We therefore identified a heuristic, that is, a simple rule or a sort of shortcut to achieve the result".
To find out whether a place is a city, we can ask each inhabitant to count his "neighbours", in other words, how many people live within 100 metres of his house. Once we have this number, we then go on to find, for each inhabitant, the shortest distance at which another inhabitant with a greater number of neighbours lives. "Together, these two figures", explains Laio, "tell us how densely populated the area where an individual lives is, and how far each individual is from the people with more neighbours. By automatically cross-checking these data for the entire world population, we can identify the individuals who represent the centres of the clusters, which correspond to the various cities". "Our algorithm performs precisely this kind of calculation, and it can be applied to many different settings", adds Rodriguez.
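The two quantities in Laio's analogy map directly onto a short computation: a local density rho (neighbours within a cutoff) and, for each point, the distance delta to the nearest point of higher density. Below is a minimal Python/NumPy sketch of that idea; the cutoff value, the synthetic two-blob data, and the rho*delta ranking used to shortlist candidate centres are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def density_peaks(points, d_c):
    """Compute the two quantities from the article:
    rho_i   = number of points within cutoff d_c ("neighbours"),
    delta_i = distance to the nearest point of higher density."""
    n = len(points)
    # Full pairwise Euclidean distance matrix (O(N^2), as noted in the thread)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    rho = (dist < d_c).sum(axis=1) - 1  # subtract 1 to exclude the point itself
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if len(higher) == 0:
            # Densest point: conventionally assigned the maximum distance
            delta[i] = dist[i].max()
        else:
            delta[i] = dist[i, higher].min()
    return rho, delta

# Two well-separated Gaussian blobs; their centres should stand out
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
rho, delta = density_peaks(pts, d_c=0.5)
candidates = np.argsort(rho * delta)[-2:]  # high on BOTH rho and delta
```

Points that score high on both rho and delta are the cluster-centre candidates; every other point can then be assigned to the same cluster as its nearest higher-density neighbour.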
(Score: 0) by Anonymous Coward on Saturday June 28 2014, @10:18AM
for data! Free the innocent bits!
(Score: 4, Insightful) by c0lo on Saturday June 28 2014, @10:39AM
Start with TFA (fscking journals: $20 for accessing one paper, and that's before I even know if it's worth it! Why, that's the subscription fee for a whole year on SN!).
https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
(Score: 1, Informative) by Anonymous Coward on Saturday June 28 2014, @01:46PM
The online science/technical journals are doing a bang-up job
locking up scientific progress they bought from researchers
(sometimes at taxpayer expense) behind a paywall.
I couldn't find a free source of a particular piece of information I found I could use.
All I could find were abstracts and paywalls for this information.
Oh well....
Scientists gotta eat too....
It's ironic that I could go on YouTube and find just about any
old/obscure/out of print entertainment media I wanted to see
and listen to.
Aaron Swartz tried to buck the system and 'free' scientific research,
and committed suicide as a result of UNBEARABLE legal pressure
and the prospect of spending MOST of his life behind bars
for systematically liberating large (huge?) amounts
of taxpayer-funded research that was being collected and put
behind a wall requiring exclusive membership or the payment
of money in order to access it....
http://en.wikipedia.org/wiki/Aaron_Swartz [wikipedia.org]
(Score: 3, Insightful) by opinionated_science on Saturday June 28 2014, @03:36PM
Yes, and we in the sciences are well aware of this. We can try to publish in open-access journals, but that costs $$ too. In effect, all publicly funded research (any percentage) should be open access, perhaps after 6 months? Only privately funded research should be in paid-for journals, since ultimately it helps their bottom line.
(Score: 2) by Geotti on Saturday June 28 2014, @09:48PM
The abstract is here: http://www.sciencemag.org/content/344/6191/1492 [sciencemag.org] (Clustering by fast search and find of density peaks)
I do have access, but the paper is only ~5 pages long without the supplementary material.
If you don't have access, you can usually ask the author for a personal copy, their emails are: alexrod@sissa.it (Alex Rodriguez) and laio@sissa.it (Alessandro Laio).
But yeah, I was crossing my fingers that we'd have a subscription, when I hit refresh after connecting to my institute's VPN.
(Score: 2) by meisterister on Saturday June 28 2014, @03:41PM
My main concern: how fast is this algorithm? I'm pretty sure that you can do this with an SVM (support vector machine), but, if I'm not mistaken, it takes too long to get very fine separation on large datasets. Also, if you define the different regions too tightly, then the algorithm doesn't work well on anything but the training dataset.
(May or may not have been) Posted from my K6-2, Athlon XP, or Pentium I/II/III.
(Score: 3, Interesting) by c0lo on Sunday June 29 2014, @12:14AM
Can't be lower than N*(N-1)/2, because the "distances" between all the samples need to be computed at least once.
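That lower bound is just the number of unordered pairs of samples, which is easy to sanity-check in plain Python (N = 10 and the value 45 are arbitrary illustration):

```python
from itertools import combinations

n = 10
# Every unordered pair of samples needs its distance computed once
pairs = list(combinations(range(n), 2))
print(len(pairs), n * (n - 1) // 2)  # both are 45
```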
As far as I know, SVM is in the "supervised learning" category - meaning the classification requires a priori knowledge of the number of classes to be recognized and, for each class, a rich enough set of samples that the "supervisor" tagged in advance.
This algo seems to figure out by itself which, and how many, classes need to be recognized (self-organized learning) - probably the "neighbourhood threshold-distance" determines the number of classes detected.
(Score: 2, Interesting) by TGV on Sunday June 29 2014, @05:19AM
It seems to me SVMs work differently from the description in the article, which sounds more like k-clustering. It also sounds as if it could be implemented reasonably efficiently, possibly faster than SVM (which is quite difficult to implement efficiently for large sets), but the devil is as always in the details. This might be a heuristic that works in certain cases, but perhaps these cases happen to be of practical interest.
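For contrast with the density-peak idea, here is a bare-bones Lloyd's k-means sketch in Python/NumPy: note that k must be supplied up front, whereas the density-peak method reads the number of centres off its decision graph. The two-blob test data, seeds, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: unlike the density-peak approach
    in the article, the number of clusters k is chosen in advance."""
    rng = np.random.default_rng(seed)
    # Initialise centres from k distinct random points
    centres = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centre
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centres[None, :], axis=-1), axis=1)
        # Move each centre to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centres[j] = points[labels == j].mean(axis=0)
    return labels, centres

# Two well-separated blobs; k-means with k=2 should recover them
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
labels, centres = kmeans(pts, k=2)
```

With well-separated blobs this converges quickly; the hard part in practice, as the comment says, is that k-means needs k while SVMs need labelled training data, and the article's method claims to need neither.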
(Score: 2) by opinionated_science on Sunday June 29 2014, @02:36PM
it would be nice if they provided some code....!
(Score: 0) by Anonymous Coward on Saturday June 28 2014, @03:58PM
It seems like everyone is trying to take something old and give it a new spin. Given how advanced we are in our understanding of mathematics and statistics I find it very hard to believe something like this is that novel. The next step ... get a patent!!!
(Score: 3, Interesting) by TheLink on Saturday June 28 2014, @06:18PM
To me, Amazon or even Facebook would be in a good position to do something like this.
But what algorithm would be good?