Archivists Who Code

PDF/OCR script

06-10-2022 03:21 PM

Brian Thomas

This is as much a test as anything else, to see whether uploading a Python file will work.
I created this tool for a specific use case: receiving a batch of PDFs from a state agency that were not 100% OCR'd. Timestamps weren't a concern, since the files had already been digitized from paper originals. The script scans each document for existing OCR. If the OCR is missing or poor, it tries to re-run the OCR. I never got that to work 100% reliably, so I added a fallback that takes an existing PDF, breaks it into individual page images (JPGs), OCRs each image to its own PDF, and then combines the individual PDFs into a single larger file.
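
For anyone adapting the idea, the two steps described above (check whether a page's existing text layer looks usable, then fall back to page-by-page OCR) can be sketched roughly as follows. This is not the code in pdfOCR.py. The function names, the thresholds, and the three callables passed to `rebuild_by_page` are all hypothetical placeholders, and the "is the OCR good" test is only one simple heuristic, assumed here for illustration.

```python
from typing import Callable, List


def text_looks_ocred(text: str, min_chars: int = 20,
                     min_alnum_ratio: float = 0.6) -> bool:
    """Heuristic (an assumption, not the original script's test): a page
    counts as OCR'd if its extracted text is long enough and mostly made
    of letters and digits. Tune both thresholds for your own material."""
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return False
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) >= min_alnum_ratio


def pages_needing_ocr(page_texts: List[str]) -> List[int]:
    """Return zero-based indexes of pages whose text layer fails the check."""
    return [i for i, text in enumerate(page_texts)
            if not text_looks_ocred(text)]


def rebuild_by_page(
    pdf_path: str,
    split_to_images: Callable[[str], List[str]],
    ocr_image_to_pdf: Callable[[str], str],
    merge_pdfs: Callable[[List[str], str], None],
    out_path: str,
) -> List[str]:
    """Fallback: split the PDF into page images, OCR each image into a
    single-page PDF, then merge those PDFs in page order. The three
    callables are hypothetical hooks for whatever tools you use."""
    page_pdfs = [ocr_image_to_pdf(image) for image in split_to_images(pdf_path)]
    merge_pdfs(page_pdfs, out_path)
    return page_pdfs
```

Passing the split, OCR, and merge steps in as functions keeps the sketch independent of any particular PDF or OCR library, so the project-specific pieces can be swapped without touching the control flow.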

I haven't cleaned it up for generic use yet, and some parts are specific to the original project, so you will need to adapt it for re-use.

Attachment(s)

pdfOCR.py 2 KB 1 version
Uploaded - 06-10-2022
to make sure everything in a folder is actually OCR'd instead of just claimed to be
