A new algorithm has been published that simplifies grouping data according to their similarity, a task sometimes referred to as cluster analysis (CA).
Data sets can be imagined as "clouds" of data points in a multidimensional space. These points are generally unevenly distributed: more widely scattered in one area and denser in another. CA is used to identify the denser areas efficiently, grouping the data into a certain number of significant subsets on the basis of this criterion. Each subset corresponds to a category.
"Think of a database of facial photographs", explains Alessandro Laio, professor of Statistical and Biological Physics at SISSA. "The database may contain more than one photo of the same person, so CA is used to group all the pictures of the same individual. This type of analysis is carried out by automatic facial recognition systems, for example".
"We tried to devise a more efficient algorithm than those currently used, and one capable of solving some of the classic problems of CA", continues Laio.
"Our approach is based on a new way of identifying the centres of the clusters, i.e., of the subsets", explains Alex Rodriguez, co-author of the paper. "Imagine having to identify all the cities in the world, without having access to a map. A huge task", says Rodriguez. "We therefore identified a heuristic, that is, a simple rule or a sort of shortcut to achieve the result".
To find out whether a place is a city, we can ask each inhabitant to count his "neighbours", in other words, how many people live within 100 metres of his house. Once we have this number, we then go on to find, for each inhabitant, the shortest distance at which another inhabitant with a greater number of neighbours lives. "Together, these two figures", explains Laio, "tell us how densely populated the area where an individual lives is, and how far each individual is from the people with more neighbours. By automatically cross-checking these data for the entire world population, we can identify the individuals who represent the centres of the clusters, which correspond to the various cities". "Our algorithm performs precisely this kind of calculation, and it can be applied to many different settings", adds Rodriguez.
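The two quantities in Laio's analogy map directly onto a short computation: a local density rho (neighbours within a cutoff) and, for each point, the distance delta to the nearest point of higher density. Below is a minimal Python/NumPy sketch of that idea; the cutoff value, the synthetic two-blob data, and the rho*delta ranking used to shortlist candidate centres are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def density_peaks(points, d_c):
    """Compute the two quantities from the article:
    rho_i   = number of points within cutoff d_c ("neighbours"),
    delta_i = distance to the nearest point of higher density."""
    n = len(points)
    # Full pairwise Euclidean distance matrix (O(N^2), as noted in the thread)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    rho = (dist < d_c).sum(axis=1) - 1  # subtract 1 to exclude the point itself
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if len(higher) == 0:
            # Densest point: conventionally assigned the maximum distance
            delta[i] = dist[i].max()
        else:
            delta[i] = dist[i, higher].min()
    return rho, delta

# Two well-separated Gaussian blobs; their centres should stand out
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
rho, delta = density_peaks(pts, d_c=0.5)
candidates = np.argsort(rho * delta)[-2:]  # high on BOTH rho and delta
```

Points that score high on both rho and delta are the cluster-centre candidates; every other point can then be assigned to the same cluster as its nearest higher-density neighbour.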
(Score: 0) by Anonymous Coward on Saturday June 28 2014, @10:18AM
for data! Free the innocent bits!
(Score: 4, Insightful) by c0lo on Saturday June 28 2014, @10:39AM
Start with TFA (fscking journals: $20 for accessing one paper, and that's before I even know if it's worth it! Why, that's the subscription fee for a whole year on SN!).
https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
(Score: 1, Informative) by Anonymous Coward on Saturday June 28 2014, @01:46PM
The online science/technical journals are doing a bang-up job
locking up scientific progress they bought from researchers
(sometimes at taxpayer expense) behind a paywall.
I couldn't find a free source of a particular piece of information I found I could use.
All I could find were abstracts and paywalls for this information.
Oh well....
Scientists gotta eat too....
It's ironic that I could go on YouTube and find just about any
old/obscure/out of print entertainment media I wanted to see
and listen to.
Aaron Swartz tried to buck the system and 'free' scientific research,
and committed suicide as a result of UNBEARABLE legal pressure
and the prospect of spending MOST of his life behind bars
for systematically liberating large (huge?) amounts
of taxpayer-funded research that was being collected and put
behind a wall requiring exclusive membership or the payment
of money in order to access it....
http://en.wikipedia.org/wiki/Aaron_Swartz [wikipedia.org]
(Score: 3, Insightful) by opinionated_science on Saturday June 28 2014, @03:36PM
Yes, and we in the sciences are well aware of this. We can try to publish in open-access journals, but that costs $$ too. In effect, all publicly funded research (any percentage) should be open access, perhaps after 6 months? Only privately funded research should be in paid-for journals, since ultimately it helps their bottom line.
(Score: 2) by Geotti on Saturday June 28 2014, @09:48PM
The abstract is here: http://www.sciencemag.org/content/344/6191/1492 [sciencemag.org] (Clustering by fast search and find of density peaks)
I do have access, but the paper is only ~5 pages long without the supplementary material.
If you don't have access, you can usually ask the author for a personal copy, their emails are: alexrod@sissa.it (Alex Rodriguez) and laio@sissa.it (Alessandro Laio).
But yeah, I was crossing my fingers that we'd have a subscription, when I hit refresh after connecting to my institute's VPN.
(Score: 2) by meisterister on Saturday June 28 2014, @03:41PM
My main concern: how fast is this algorithm? I'm pretty sure that you can do this with an SVM (support vector machine), but, if I'm not mistaken, it takes too long to get very fine separation on large datasets. Also, if you define the different regions too tightly, then the algorithm doesn't work well on anything but the training dataset.
(May or may not have been) Posted from my K6-2, Athlon XP, or Pentium I/II/III.
(Score: 3, Interesting) by c0lo on Sunday June 29 2014, @12:14AM
Can't be lower than N*(N-1)/2, because the "distances" between all the samples need to be computed at least once.
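That lower bound is just the number of unordered pairs of samples, which is easy to sanity-check in plain Python (N = 10 and the value 45 are arbitrary illustration):

```python
from itertools import combinations

n = 10
# Every unordered pair of samples needs its distance computed once
pairs = list(combinations(range(n), 2))
print(len(pairs), n * (n - 1) // 2)  # both are 45
```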
As far as I know, SVM is in the "supervised learning" category - meaning the classification requires a priori knowledge of the number of classes to be recognized and, for each class, a rich enough set of samples that the "supervisor" tagged in advance.
This algo seems to figure out by itself which, and how many, classes need to be recognized (self-organized learning) - probably the "neighbourhood threshold-distance" determines the number of classes detected.
(Score: 2, Interesting) by TGV on Sunday June 29 2014, @05:19AM
It seems to me SVMs work differently from the description in the article, which sounds more like k-clustering. It also sounds as if it could be implemented reasonably efficiently, possibly faster than SVM (which is quite difficult to implement efficiently for large sets), but the devil is as always in the details. This might be a heuristic that works in certain cases, but perhaps these cases happen to be of practical interest.
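For contrast with the density-peak idea, here is a bare-bones Lloyd's k-means sketch in Python/NumPy: note that k must be supplied up front, whereas the density-peak method reads the number of centres off its decision graph. The two-blob test data, seeds, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: unlike the density-peak approach
    in the article, the number of clusters k is chosen in advance."""
    rng = np.random.default_rng(seed)
    # Initialise centres from k distinct random points
    centres = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centre
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centres[None, :], axis=-1), axis=1)
        # Move each centre to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centres[j] = points[labels == j].mean(axis=0)
    return labels, centres

# Two well-separated blobs; k-means with k=2 should recover them
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
labels, centres = kmeans(pts, k=2)
```

With well-separated blobs this converges quickly; the hard part in practice, as the comment says, is that k-means needs k while SVMs need labelled training data, and the article's method claims to need neither.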
(Score: 2) by opinionated_science on Sunday June 29 2014, @02:36PM
it would be nice if they provided some code....!
(Score: 0) by Anonymous Coward on Saturday June 28 2014, @03:58PM
It seems like everyone is trying to take something old and give it a new spin. Given how advanced we are in our understanding of mathematics and statistics I find it very hard to believe something like this is that novel. The next step ... get a patent!!!
(Score: 3, Interesting) by TheLink on Saturday June 28 2014, @06:18PM
To me, Amazon or even Facebook would be in a good position to do something like this.
But what algorithm would be good?