

posted by n1 on Saturday August 09 2014, @04:09AM   Printer-friendly
from the substandard-quality-assurance dept.

Computer Scientists in China have developed an algorithm that can automatically rank Wikipedia articles on quality using Bayesian statistics.

The notion of inferring evidence from an analysis of probabilities was first described by the 18th-century mathematician and theologian Thomas Bayes. Pierre-Simon Laplace later built on Bayesian probabilities to pioneer a new statistical method. Today, Bayesian analysis is commonly used to assess the content of emails, estimate the probability that a message is spam or junk mail, and filter it from the user's inbox when that probability is high.
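The spam-filtering application mentioned above can be sketched in a few lines. This is an illustrative toy, not the paper's method: the word likelihoods and the naive independence assumption are invented for the example, combined with Bayes' rule in log space.

```python
# Toy Bayesian spam scoring: combine per-word likelihoods with Bayes' rule
# to get P(spam | message). All numbers below are assumed for illustration.
from math import exp, log

# Assumed per-word likelihoods: (P(word | spam), P(word | ham))
LIKELIHOODS = {
    "offer":   (0.30, 0.02),
    "meeting": (0.01, 0.20),
    "free":    (0.25, 0.05),
}
P_SPAM = 0.5  # prior probability that any given message is spam

def spam_probability(words):
    """Return P(spam | words) under a naive word-independence assumption."""
    log_spam = log(P_SPAM)
    log_ham = log(1.0 - P_SPAM)
    for w in words:
        if w in LIKELIHOODS:
            p_w_spam, p_w_ham = LIKELIHOODS[w]
            log_spam += log(p_w_spam)
            log_ham += log(p_w_ham)
    # Normalise in log space to avoid underflow on long messages.
    return 1.0 / (1.0 + exp(log_ham - log_spam))

print(spam_probability(["free", "offer"]))  # high: spam-like words dominate
print(spam_probability(["meeting"]))        # low: ham-like word
```

A real filter would learn the likelihood table from labelled mail and smooth unseen words, but the decision rule — flag the message when the posterior probability is high — is the same one the summary describes.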

Han and Chen have now used a dynamic Bayesian network (DBN) to analyze the content of Wikipedia entries in a similar manner. They apply multivariate Gaussian distribution modeling to the DBN analysis, which gives them a distribution over the quality of each article so that entries can be ranked. Very low-ranking entries might be flagged for editorial attention to raise their quality. By contrast, high-ranking entries could be marked in some way as definitive, so that they are not subsequently overwritten with lower-quality information.
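The multivariate-Gaussian ranking idea can be sketched as follows. Everything here is an assumption for illustration — the feature names, the numbers, and the scoring rule are not taken from the paper: fit a Gaussian to the features of known-good articles, then score candidates by how close they fall to that distribution.

```python
# Hedged sketch of Gaussian-based article scoring (illustrative features,
# not the paper's model): fit mean and covariance on known-good articles,
# then rank candidates by (negative) squared Mahalanobis distance.
import numpy as np

# Assumed features per article: [log(word count), references per section]
good_articles = np.array([
    [8.5, 3.0],
    [8.0, 2.5],
    [7.8, 2.8],
])

mean = good_articles.mean(axis=0)
# Small ridge term keeps the covariance invertible with few samples.
cov = np.cov(good_articles, rowvar=False) + 1e-3 * np.eye(2)
inv_cov = np.linalg.inv(cov)

def quality_score(x):
    """Higher score = features closer to the 'good article' distribution."""
    d = np.asarray(x, dtype=float) - mean
    return -float(d @ inv_cov @ d)

candidates = {"solid entry": [8.2, 2.7], "stub": [3.0, 0.1]}
for name, feats in candidates.items():
    print(name, quality_score(feats))
```

Sorting candidates by this score gives exactly the kind of ranking the summary describes: low scorers get flagged for editorial attention, high scorers could be marked as definitive.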

The team has tested its algorithm on sets of several hundred articles, comparing the automated quality assessment with assessment by a human user. The algorithm outperforms a human user by up to 23 percent in correctly classifying the quality rank of a given article in the set, the team reports. A computerized quality standard for Wikipedia entries would avoid the need to have people subjectively classify each entry. It could thus improve the standard as well as provide a basis for an improved reputation for the online encyclopedia.

Abstract: http://www.inderscience.com/offer.php?id=64056

  • (Score: 4, Interesting) by elf on Saturday August 09 2014, @07:58AM

    by elf (64) on Saturday August 09 2014, @07:58AM (#79255)

    Are they measuring the quality of the information? The quality of the language used? The readability? I have no idea!

    The definition of Bayesian probability says the measure is either the plausibility of propositions (the objective approach) or a 'personal belief' (the subjective approach; from Wikipedia). If you are still confused as to what they measured, then join the club :)

    The article says they took several hundred articles, ranked their quality (again, what quality?) and then pitted humans against the algorithm to see who got the most "right" answers. To summarise: some humans (presumably the testers) arbitrarily ranked the articles, and then more humans did the same thing but ranked them differently... what a surprise they were different!

    • (Score: 3, Insightful) by maxwell demon on Saturday August 09 2014, @12:09PM

      by maxwell demon (1608) Subscriber Badge on Saturday August 09 2014, @12:09PM (#79292) Journal

      Nice quote:

      "Their algorithm out-performs a human user by up to 23 percent in correctly classifying the quality rank of a given article in the set, the team reports."

      So they say the software assessed quality better than a human. Quite obviously "quality" here means something different than what we usually understand it to mean, because the usual notion of quality is based on human perception of quality.

      --
      The Tao of math: The numbers you can count are not the real numbers.
    • (Score: 2) by Fnord666 on Saturday August 09 2014, @08:25PM

      by Fnord666 (652) Subscriber Badge on Saturday August 09 2014, @08:25PM (#79432) Homepage

      Are they measuring the quality of the information? The quality of the language used? The readability? I have no idea!

      Based on the following:

      Computer Scientists in China

      My guess would be how well the article conforms to the state sponsored view of the topic at hand.

  • (Score: 2) by kaszz on Saturday August 09 2014, @11:38AM

    by kaszz (4211) on Saturday August 09 2014, @11:38AM (#79282) Journal

    Now the state may rank which articles are most likely to upset people about being screwed all their lives. And subsequently censor and auto-edit any entry. Progress!

    Though it is interesting work that may have its uses, like finding articles in need of improvement and comparing versions of more mature articles. But the possibility of abuse is there too.