OCR, Our Friend…Most of the Time
Big news today in the world of historical and genealogical research: The National Archives and Records Administration (NARA, archives.gov) has added Optical Character Recognition (OCR) to its search engine. According to NARA, OCR will apply to JPG and PDF records added to the NARA Catalog since June 2019. NARA is still determining how to retroactively process records digitized before that date.
NARA is using Tesseract, one of the best available open-source options. Tesseract was created by Hewlett-Packard in 1985 and was continually updated by HP until the company chose to get out of the OCR market. In 2006, Google began sponsoring its continued development.
OCR is exciting technology that is extremely helpful for searching records. In short, OCR is a computer's reading of a document or photo. That reading allows the contents to be indexed, which in turn lets us search them on sites such as Ancestry.com and FamilySearch.org. There are, however, limitations to the technology. Understanding these limitations can help you refine your search and (hopefully) not miss your genealogical gems!
Even a computer brain has to be able to READ the document. Just as it is challenging for the human eye to read an out-of-focus or dark image, it is challenging for OCR. These readability issues are not always visible to the naked eye, so a document that seems perfectly clear to you may still lead to OCR mistakes. Here is an example of an article on Newspapers.com where a search missed the first usage of the name “Crebo.” Newspapers.com allows the reader to check the OCR text at the bottom of the image. In checking for the first instance of Crebo, I found that the OCR read “Creb0i,” with a zero in place of the “o” and an “i” at the end. The text looks clear to the human eye, but for some reason OCR saw it differently.
Check for common mistakes. Typical replacements include l for i, m for n, or a number for a letter (8 for B, 5 for S, 0 for O). The best method for combating these errors is to work out where they could occur in the surnames you research. View it like a puzzle. Write the name correctly, then guess at the errors. I find it easiest to create a chart for each name I research. If you are not sure, search Google for the most common OCR errors, then assess where they could show up in your names.
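If you are comfortable with a little Python, the chart can be roughed out automatically. This is only a sketch: the function name ocr_variants and the CONFUSIONS table are my own illustration, and the table holds just the look-alike pairs mentioned above (I am assuming each swap can go in either direction). Add your own pairs as you find them.

```python
from itertools import product

# Assumed look-alike pairs, based on the examples in this post.
# This is not a complete list of OCR confusions; extend it as needed.
CONFUSIONS = {
    "l": ["i"],
    "i": ["l"],
    "m": ["n"],
    "n": ["m"],
    "o": ["0"],
    "O": ["0"],
    "B": ["8"],
    "S": ["5"],
}

def ocr_variants(name, max_swaps=1):
    """Return spellings of `name` with 1..max_swaps look-alike swaps."""
    # For each character, list the original plus any look-alikes.
    options = [[ch] + CONFUSIONS.get(ch, []) for ch in name]
    variants = set()
    for combo in product(*options):
        swaps = sum(1 for new, old in zip(combo, name) if new != old)
        if 1 <= swaps <= max_swaps:
            variants.add("".join(combo))
    return sorted(variants)

if __name__ == "__main__":
    for variant in ocr_variants("Smith"):
        print(variant)
```

Each spelling it prints is a candidate search term. Note that this only swaps characters one for one, so it would not have produced “Creb0i,” where OCR also added an extra letter. You still need to eyeball the results.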
Use spell check on OCR documents. If you find a document that SHOULD have your ancestor’s name but the name does not come up in a search, try spell-checking the document’s OCR text. Many OCR errors substitute numbers or symbols for letters. Spell check will help you find those quickly.
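If you can copy the OCR text out of the page, a few lines of Python can do a similar job to spell check for the number-for-letter errors. This is a rough sketch under my own assumptions: suspicious_tokens is a made-up helper name, and it treats any word that mixes letters and digits as worth a second look.

```python
import re

# A "word" here is any unbroken run of letters and digits.
TOKEN = re.compile(r"[A-Za-z0-9]+")

def suspicious_tokens(text):
    """Return words that contain both letters and digits, in order."""
    hits = []
    for match in TOKEN.finditer(text):
        word = match.group()
        has_letter = any(c.isalpha() for c in word)
        has_digit = any(c.isdigit() for c in word)
        if has_letter and has_digit:
            hits.append(word)
    return hits

if __name__ == "__main__":
    sample = "Mr. Creb0i arrived in 1902 with 5mith"
    print(suspicious_tokens(sample))
```

Expect some false alarms, since perfectly good text such as “2nd” also mixes letters and digits. Plain years like 1902 are left alone because they contain no letters.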
READ the text. Given the nature of many OCR mistakes, scanning a document for weird words that seem out of place may help. OCR will often make the same mistake more than once in a single document (e.g., reading Tom as Ton), so take a look. Are there words that seem incorrect but are not misspelled?
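Reading is still the best test, but for long documents a “close enough” comparison can point you to words worth reading twice. The sketch below uses Python’s built-in difflib module. The helper name near_matches and the 0.65 similarity cutoff are my own assumptions; a lower cutoff finds more candidates and more noise.

```python
import difflib
import re

def near_matches(name, text, cutoff=0.65):
    """Return words in `text` that resemble `name` but are not identical."""
    words = sorted(set(re.findall(r"[A-Za-z0-9]+", text)))
    # Compare in lowercase, but remember the spelling as it appeared.
    by_lower = {}
    for word in words:
        by_lower.setdefault(word.lower(), word)
    target = name.lower()
    close = difflib.get_close_matches(target, list(by_lower), n=10, cutoff=cutoff)
    # Drop exact hits; a normal search already finds those.
    return [by_lower[w] for w in close if w != target]

if __name__ == "__main__":
    sample = "Mr. Creb0i and Mrs. Crebo visited Crete"
    print(near_matches("Crebo", sample))
```

Short names are the hard case here: “Ton” only just clears the cutoff as a match for “Tom,” and so will other three-letter words that share two letters. Treat the output as a reading list, not an answer.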
Even with mistakes, OCR is a great tool for genealogists. I can’t wait to start using it on items in the NARA catalog!