
SoylentNews is people

posted by Fnord666 on Wednesday May 01 2019, @10:21PM
from the it's-all-greek-to-me dept.

Submitted via IRC for Bytram

OCR4all: Modern tool for old texts

Historians and other humanities scholars often have to deal with difficult research objects: centuries-old printed works that are hard to decipher and often in an unsatisfactory state of conservation. Many of these documents have now been digitized—usually photographed or scanned—and are available online worldwide. For research purposes, this is already a step forward.

However, there is still a challenge to overcome: using text recognition software to bring the digitized old fonts into a modern form that is readable by non-specialists as well as by computers. Scientists at the Center for Philology and Digitality at Julius-Maximilians-Universität Würzburg (JMU) in Bavaria, Germany, have made a significant contribution to further development in this field.

With OCR4all, the JMU research team is making a new tool available to the scientific community. It converts digitized historical prints with an error rate of less than one percent into computer-readable texts. And it offers a graphical user interface that requires no IT expertise. With previous tools of this kind, user-friendliness was not always a given, as the users mostly had to work with programming commands.

[...] In developing OCR4all, computer scientists have collaborated with the humanities at JMU—including German and Romance studies and literature studies in the project "Narragonien digital." The aim was to digitize the "Narrenschiff," a moral satire by Sebastian Brant, a bestseller of the 15th century that was translated into many languages. Furthermore, OCR4all has been frequently used in the JMU's Kolleg "Medieval and Early Modern Times."

OCR4all is freely available to the public on the GitHub platform (with instructions and examples): https://github.com/OCR4all



 
This discussion has been archived. No new comments can be posted.
  • (Score: 3, Interesting) by ewk on Thursday May 02 2019, @01:23PM (3 children)

    by ewk (5923) on Thursday May 02 2019, @01:23PM (#837841)

    "It converts digitized historical prints with an error rate of less than one percent into computer-readable texts."

    Error rate... per what? Per character? Per line? Per page? Per text?

    With less than one percent per character (let's assume 0.5%), we're still at approximately 2 errors per line.
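A rough sketch of this arithmetic, assuming errors are independent and uniformly spread over characters. The function names and the line length of 80 characters are illustrative assumptions, not figures from the article. Under those assumptions the expected count depends heavily on line length: at 0.5% per character, about 2 errors per line corresponds to lines of roughly 400 characters, while an 80-character line averages about 0.4.

```python
def expected_errors_per_line(char_error_rate: float, chars_per_line: int) -> float:
    """Expected number of wrong characters on one line (independent errors assumed)."""
    return char_error_rate * chars_per_line


def prob_line_has_error(char_error_rate: float, chars_per_line: int) -> float:
    """Probability that a line contains at least one wrong character."""
    return 1.0 - (1.0 - char_error_rate) ** chars_per_line


def chars_per_line_for(errors_per_line: float, char_error_rate: float) -> float:
    """Line length at which the expected error count reaches errors_per_line."""
    return errors_per_line / char_error_rate


if __name__ == "__main__":
    rate = 0.005          # the 0.5% per character assumed in the comment
    line_length = 80      # hypothetical line length, not taken from the article
    print(expected_errors_per_line(rate, line_length))
    print(prob_line_has_error(rate, line_length))
    print(chars_per_line_for(2, rate))
```

Real OCR errors tend to cluster on damaged or unusual passages, so the independence assumption is only a first approximation.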

    --
    I don't always react, but when I do, I do it on SoylentNews
  • (Score: 2) by dw861 on Thursday May 02 2019, @06:36PM

    by dw861 (1561) Subscriber Badge on Thursday May 02 2019, @06:36PM (#838034) Journal

    ewk makes a good point. That said, even if there are two errors per line, it is still a vast improvement over sitting down and spending months at a time deciphering these texts while searching for a particular term or phrase of interest. Nothing is more depressing than reading an entire text only to decide that the document holds nothing of value for your current research project, and then going on to the next doc.

    There is also a similar historical OCR project that does the same for handwritten archival documents.
    https://read.transkribus.eu/transkribus/

    An experiment with that from about a year ago indicated around 1/4 of all words had errors.
    https://blog.nationalarchives.gov.uk/blog/machines-reading-the-archive-handwritten-text-recognition-software/

    Even with that error rate, I find this very helpful, and exciting.

  • (Score: 2) by danmars on Thursday May 02 2019, @07:50PM

    by danmars (3662) on Thursday May 02 2019, @07:50PM (#838077)

    Based on the numbers quoted, it's almost certainly the character rate.
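To make the difference between a per-character and a per-word rate concrete, here is a minimal sketch using the common edit-distance definition of both. It is an assumption that this definition applies here, since the article does not say how the OCR4all figure was measured. The function names and the sample strings are made up for illustration.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over characters, divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over whitespace-separated words, divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "das narrenschiff von brant"   # hypothetical ground truth
    ocr = "das narrenschifl von brant"     # hypothetical OCR output, one wrong letter
    print(character_error_rate(truth, ocr))
    print(word_error_rate(truth, ocr))
```

In this made-up sample, a single wrong letter is 1 of 26 characters but 1 of 4 words, which is why the same output can look far worse when scored per word than per character.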

  • (Score: 0) by Anonymous Coward on Thursday May 02 2019, @08:51PM

    by Anonymous Coward on Thursday May 02 2019, @08:51PM (#838113)

    OCR is step one; the next step is manual hard labor, à la Project Gutenberg Distributed Proofreaders [pgdp.net] (also free software, like this OCR and many others).