SoylentNews is people

posted by n1 on Saturday June 28 2014, @09:51AM
from the /dev/null-grouping dept.

A new algorithm has been published that simplifies grouping the elements of a data set according to their similarity, a task known as Cluster Analysis [CA].

Data sets can be imagined as "clouds" of data points in a multidimensional space. These points are generally unevenly distributed: more widely scattered in one area and denser in another. CA is used to identify the denser areas efficiently, grouping the data into a number of significant subsets on the basis of this criterion. Each subset corresponds to a category.

"Think of a database of facial photographs", explains Alessandro Laio, professor of Statistical and Biological Physics at SISSA. "The database may contain more than one photo of the same person, so CA is used to group all the pictures of the same individual. This type of analysis is carried out by automatic facial recognition systems, for example".

"We tried to devise a more efficient algorithm than those currently used, and one capable of solving some of the classic problems of CA", continues Laio.

"Our approach is based on a new way of identifying the centres of the clusters, i.e., the subsets", explains Alex Rodriguez, co-author of the paper. "Imagine having to identify all the cities in the world, without having access to a map. A huge task", says Rodriguez. "We therefore identified a heuristic, that is, a simple rule or a sort of shortcut to achieve the result".

To find out if a place is a city, we can ask each inhabitant to count his "neighbours", in other words, how many people live within 100 metres of his house. Once we have this number, we then find, for each inhabitant, the shortest distance to another inhabitant who has a greater number of neighbours. "Together, these two data", explains Laio, "tell us how densely populated is the area where an individual lives and the distance between individuals who have the most neighbours. By automatically cross-checking these data, for the entire world population, we can identify the individuals who represent the centres of the clusters, which correspond to the various cities". "Our algorithm performs precisely this kind of calculation, and it can be applied to many different settings", adds Rodriguez.
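The two quantities described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `density_peaks`, the `cutoff` parameter (standing in for the "100 metres"), the tie-breaking by index, and the use of the product of the two quantities to rank candidate centres are all choices made here for the example.

```python
import math

def density_peaks(points, cutoff, n_centers):
    """Return indices of the n_centers points with the largest rho * delta.

    Hypothetical helper for illustration; ranking by the product is an assumption.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # rho[i]: how many "neighbours" point i has within the cutoff distance.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]

    # Visit points from densest to sparsest; ties broken by index (assumption).
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta[i]: shortest distance to a point with more neighbours.
    delta = [0.0] * n
    # The densest point has no denser point, so give it its largest distance.
    delta[order[0]] = max(dist[order[0]])
    for rank in range(1, n):
        i = order[rank]
        delta[i] = min(dist[i][j] for j in order[:rank])

    # Centres are points that are both dense and far from any denser point.
    score = [rho[i] * delta[i] for i in range(n)]
    return sorted(range(n), key=lambda i: score[i], reverse=True)[:n_centers]

# Two small "cities": one around (0, 0) and one around (5, 5).
points = [
    (0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1),
    (5.0, 5.0), (5.1, 5.0), (4.9, 5.0), (5.0, 5.1),
]
print(density_peaks(points, cutoff=0.12, n_centers=2))  # prints [0, 5]
```

In this toy data the middle point of each group has the most neighbours and lies far from any denser point, so indices 0 and 5 come out as the two centres. The sketch computes all pairwise distances, so it is only suitable for small inputs.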

Abstract: http://www.sciencemag.org/content/344/6191/1492

 
  • (Score: 2) by Geotti (1146) on Saturday June 28 2014, @09:48PM (#61412)

    The abstract is here: http://www.sciencemag.org/content/344/6191/1492 (Clustering by fast search and find of density peaks)

    I do have access, but the paper is only ~5 pages long without the supplementary material.

    If you don't have access, you can usually ask the authors for a personal copy. Their emails are: alexrod@sissa.it (Alex Rodriguez) and laio@sissa.it (Alessandro Laio).

    But yeah, I was crossing my fingers that we'd have a subscription, when I hit refresh after connecting to my institute's VPN.
