Did you know that UC Libraries’ Digital Collections & Preservation Librarian, James Van Mil, and Digital Imaging Coordinator, Sidney Gao, have recently created a website and blog to share their digital collections documentation? No? Well, head on over and check it out: https://uclibs.github.io/digitization-workflow/ It covers all their hard work establishing UCL’s digital collection strategy, selection guidelines, accessibility standards, and so much more. As they continue to work to create a more robust and thoughtful digitization and digital preservation program for UCL, this site will continue to evolve and grow, and they will share their progress along the way via the blog.
In their very first blog post, they tackled the important subject of OCR (Optical Character Recognition) and accessibility. Sidney shared their results from a recent experiment to see which OCR software performed best across six test documents. They tested six OCR programs, some proprietary and some open source: ABBYY FineReader for Mac, Google Cloud Vision, Transkribus, Equidox, Adobe Acrobat Pro, and Tesseract.
Here is a preview of the six documents that Sidney and James tested:
To see how these six OCR programs performed and how they stack up against their competitors, head on over to their blog and check out their results: https://uclibs.github.io/digitization-workflow/2020/08/07/ocr-comparison.html James and Sidney do plan to conduct further OCR tests in the future, so make sure to subscribe to their site in order to receive a notification when they share the results from the next round.
If you have any direct questions for Sidney or James, you can find their contact information here.
Jessica Ebert [UCL] – Photographic Documentation Specialist (working with Sidney Gao [UCL] – Digital Imaging Coordinator)