Submitted via IRC for Bytram
OCR4all: Modern tool for old texts
Historians and other humanities scholars often have to deal with difficult research objects: centuries-old printed works that are hard to decipher and often in an unsatisfactory state of conservation. Many of these documents have now been digitized—usually photographed or scanned—and are available online worldwide. For research purposes, this is already a step forward.
However, there is still a challenge to overcome: converting the digitized old typefaces into a modern form with text-recognition software, so that the texts are readable for non-specialists as well as for computers. Scientists at the Center for Philology and Digitality at Julius-Maximilians-Universität Würzburg (JMU) in Bavaria, Germany, have made a significant contribution to further development in this field.
With OCR4all, the JMU research team is making a new tool available to the scientific community. It converts digitized historical prints into computer-readable texts with an error rate of less than one percent. And it offers a graphical user interface that requires no IT expertise. With previous tools of this kind, user-friendliness was not always a given, as users mostly had to work with programming commands.
[...] In developing OCR4all, computer scientists have collaborated with the humanities at JMU—including German and Romance studies and literature studies in the project "Narragonien digital." The aim was to digitize the "Narrenschiff," a moral satire by Sebastian Brant, a bestseller of the 15th century that was translated into many languages. Furthermore, OCR4all has been frequently used in the JMU's Kolleg "Medieval and Early Modern Times."
OCR4all is freely available to the public on the GitHub platform (with instructions and examples): https://github.com/OCR4all
(Score: 3, Interesting) by ewk on Thursday May 02 2019, @01:23PM (3 children)
"It converts digitized historical prints with an error rate of less than one percent into computer-readable texts."
Error rate... per what? Per character? Per line? Per page? Per text?
With less than one percent per character (let's assume 0.5%), we're still at approx. 2 errors per line.
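The answer depends heavily on what the rate is measured against. A quick back-of-envelope check, assuming a per-character rate of 0.5% and a hypothetical line length of ~70 characters and page length of ~40 lines (all three figures are assumptions; TFA doesn't specify any of them):

```python
# Expected OCR errors per line and per page, under an ASSUMED
# per-character error rate (the article does not state the unit).
char_error_rate = 0.005   # "less than one percent", taken as 0.5% (assumption)
chars_per_line = 70       # assumed typical line length for a printed page
lines_per_page = 40       # assumed typical number of lines per page

errors_per_line = char_error_rate * chars_per_line
errors_per_page = errors_per_line * lines_per_page

print(f"~{errors_per_line:.2f} errors per line")  # ~0.35 errors per line
print(f"~{errors_per_page:.0f} errors per page")  # ~14 errors per page
```

So under a per-character reading, it's closer to one error every few lines than two per line; the per-line figure only climbs that high if the rate is per word, or if lines are much longer.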
I don't always react, but when I do, I do it on SoylentNews
(Score: 2) by dw861 on Thursday May 02 2019, @06:36PM
ewk makes a good point. That acknowledged, even if there are two errors per line, it is still a vast improvement over sitting down and spending months at a time deciphering these texts while searching for a particular term or phrase of interest. There's nothing more depressing than reading an entire text only to decide that the document holds nothing of value for your current research project. And then going on to the next doc.
There is also a similar historical OCR project that does the same for handwritten archival documents.
https://read.transkribus.eu/transkribus/ [transkribus.eu]
An experiment with that from about a year ago indicated around 1/4 of all words had errors.
https://blog.nationalarchives.gov.uk/blog/machines-reading-the-archive-handwritten-text-recognition-software/ [nationalarchives.gov.uk]
Even with that error rate, I find this very helpful, and exciting.
(Score: 2) by danmars on Thursday May 02 2019, @07:50PM
Based on the numbers quoted, it's almost certainly a per-character error rate.
(Score: 0) by Anonymous Coward on Thursday May 02 2019, @08:51PM
OCR is step one; the next step is manual hard labor, à la Project Gutenberg's Distributed Proofreaders [pgdp.net] (also free software, like this OCR tool and many others).