Archivists Who Code


PDF/OCR script 

06-10-2022 03:21 PM

This is as much a test as anything else, to see whether uploading a Python file will work.
I created this tool for a specific use case: receiving a batch of PDFs from a state agency that were not 100% OCR'd. Timestamps weren't a concern, since the files had already been digitized from paper originals. The script scans each document for existing OCR. If the OCR is missing or poor, it tries to re-run the OCR. I never got that to work 100% right, so I added a fallback that takes an existing PDF, breaks it into individual page images (JPGs), OCRs each image to its own PDF, and then combines the individual PDFs into a single larger file.

I haven't cleaned it up for generic use yet, and some parts are specific to the original project, so you will have to adapt it for re-use.
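
The attached pdfOCR.py is not reproduced here. As a rough illustration of the workflow described above (check for an existing text layer, then fall back to page-by-page OCR and merge), a minimal sketch might look like the following. Everything in it is an assumption rather than the attached script: the function names are hypothetical, the "is the OCR good enough" heuristic (a minimum count of letters and digits on most pages) is invented for illustration, and it assumes the third-party packages pypdf, pdf2image (which needs Poppler) and pytesseract (which needs Tesseract) are installed.

```python
import io
from pathlib import Path


def text_layer_looks_ok(page_texts, min_chars=25, min_good_fraction=0.9):
    """Assumed heuristic: most pages must carry a reasonable amount of text."""
    if not page_texts:
        return False
    good_pages = sum(
        1
        for text in page_texts
        if sum(ch.isalnum() for ch in (text or "")) >= min_chars
    )
    return good_pages / len(page_texts) >= min_good_fraction


def extract_page_texts(pdf_path):
    """Return the existing text layer of each page (empty string if none)."""
    from pypdf import PdfReader  # third-party, imported lazily

    reader = PdfReader(str(pdf_path))
    return [page.extract_text() or "" for page in reader.pages]


def reocr_page_by_page(pdf_path, out_path, dpi=300):
    """Rasterize each page, OCR it to a one-page PDF, and merge the results."""
    from pdf2image import convert_from_path  # third-party, requires Poppler
    import pytesseract  # third-party, requires Tesseract
    from pypdf import PdfWriter

    writer = PdfWriter()
    for image in convert_from_path(str(pdf_path), dpi=dpi, fmt="jpeg"):
        # image_to_pdf_or_hocr returns the searchable one-page PDF as bytes
        page_pdf = pytesseract.image_to_pdf_or_hocr(image, extension="pdf")
        writer.append(io.BytesIO(page_pdf))
    with open(out_path, "wb") as handle:
        writer.write(handle)


def check_folder(folder, out_folder):
    """Report each PDF in a folder and re-OCR those with a weak text layer."""
    out_folder = Path(out_folder)
    out_folder.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        if text_layer_looks_ok(extract_page_texts(pdf_path)):
            print(f"OK      {pdf_path.name}")
        else:
            print(f"RE-OCR  {pdf_path.name}")
            reocr_page_by_page(pdf_path, out_folder / pdf_path.name)
```

Calling check_folder("incoming", "reocr_output") with two hypothetical folder names would leave the originals untouched and write rebuilt copies only for files that fail the check. The thresholds are guesses and would need tuning against real documents, since blank or image-only pages are legitimate in many records.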

Attachment(s)
pdfOCR.py   2 KB   1 version
Uploaded - 06-10-2022
Checks that everything in a folder is actually OCR'd rather than just claimed to be.


The inclusion of any resource in the CoSA Resource Center does not imply a recommendation or endorsement of the resource by CoSA. Any views or opinions presented in the resource comments are solely those of the author and do not represent the views and opinions of CoSA