pd3f-core

pd3f-core is a Python package that reconstructs the original continuous text from PDFs with the help of language models. It assumes your PDF is either text-based or has already been OCR'd. pd3f-core is at the heart of pd3f, a full Docker-based text extraction pipeline (including OCR).

pd3f-core first uses Parsr to chunk PDFs into lines and paragraphs. Then it uses the Python package dehyphen to reconstruct the paragraphs in the most probable way. The probability is derived by calculating the perplexity with Flair’s character-based language models. Unnecessary hyphens are removed, and spaces or newlines are kept or dropped depending on the surrounding words.
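
The selection step described above can be illustrated with a small, self-contained sketch. It is not the dehyphen or Flair API: the function names (candidate_joins, pick_most_probable, toy_score) and the tiny word list are hypothetical, and the toy scorer only stands in for a real language-model perplexity. The idea it shows is the same, though: generate the possible ways to join two line fragments, score each one, and keep the candidate with the lowest score.

```python
from typing import Callable, List

# Hypothetical vocabulary, used only by the toy scorer below.
KNOWN_WORDS = {"reconstruct", "well-known", "the", "text"}


def candidate_joins(line_end: str, next_line_start: str) -> List[str]:
    """Return the possible ways to join the end of one line with the next."""
    # Always consider keeping the fragments as separate words.
    candidates = [line_end + " " + next_line_start]
    if line_end.endswith("-"):
        # The hyphen was only a line-break artifact: drop it.
        candidates.append(line_end[:-1] + next_line_start)
        # The hyphen belongs to the word: keep it, but without a space.
        candidates.append(line_end + next_line_start)
    return candidates


def toy_score(text: str) -> float:
    """Stand-in for perplexity (an assumption): share of unknown words.

    Lower is better, as with perplexity from a real language model.
    """
    words = text.lower().split()
    unknown = sum(1 for word in words if word not in KNOWN_WORDS)
    return unknown / max(len(words), 1)


def pick_most_probable(candidates: List[str], score: Callable[[str], float]) -> str:
    """Pick the candidate with the lowest score."""
    return min(candidates, key=score)


if __name__ == "__main__":
    for end, start in [("recon-", "struct"), ("well-", "known"), ("the", "text")]:
        print(pick_most_probable(candidate_joins(end, start), toy_score))
```

In a real setup, toy_score would be replaced by a function that returns the perplexity of the candidate string under a character-based language model, which is what lets the approach work on words that are not in any fixed word list.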

It’s mainly developed for German but should work with other languages as well. The project is still in an early stage. Expect rough edges and rapid changes.

API Documentation

API Documentation of pd3f-core: https://pd3f.github.io/pd3f-core/index.html

For more information, see the repository on GitHub: https://github.com/pd3f/pd3f-core

Last updated on Apr 4, 2021