pd3f-core is a Python package that reconstructs the original continuous text from PDFs with the help of language models.
pd3f-core assumes your PDF is either text-based or has already been OCR'd.
pd3f-core is at the heart of pd3f, a full Docker-based text extraction pipeline (including OCR).
pd3f-core first uses Parsr to chunk PDFs into lines and paragraphs.
Then, it uses the Python package dehyphen to reconstruct the paragraphs in the most probable way.
The probability is derived by calculating the
Unnecessary hyphens are removed, and spaces or newlines are kept or dropped depending on the surrounding words.
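To illustrate the idea of picking the most probable way to join two lines, here is a minimal sketch. It is not the dehyphen API: the function names (`join_candidates`, `best_join`, `toy_score`) and the tiny word list are hypothetical, and the toy scorer merely stands in for a real language model score where lower means more probable.

```python
from typing import Callable, List

def join_candidates(left: str, right: str) -> List[str]:
    """Return the possible ways to join two consecutive lines."""
    candidates = [left + " " + right]         # keep both lines, separated by a space
    if left.endswith("-"):
        candidates.append(left + right)       # keep the hyphen, drop the line break
        candidates.append(left[:-1] + right)  # drop the hyphen and the line break
    return candidates

def best_join(left: str, right: str, score: Callable[[str], float]) -> str:
    """Pick the candidate with the lowest score (lower = more probable)."""
    return min(join_candidates(left, right), key=score)

# Hypothetical stand-in for a language model: counts words that are not
# in a tiny known vocabulary. A real scorer would rate the whole text.
KNOWN_WORDS = {"die", "bundesregierung", "e-mail", "schreibt", "eine"}

def toy_score(text: str) -> float:
    words = text.lower().split()
    return sum(word not in KNOWN_WORDS for word in words)

print(best_join("Die Bundes-", "regierung", toy_score))
print(best_join("eine E-", "Mail", toy_score))
```

With this toy scorer, the hyphen in "Bundes-" is removed because the joined word is known, while the hyphen in "E-Mail" is kept because it belongs to the word itself.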
It’s mainly developed for German but should work with other languages as well. The project is still in an early stage. Expect rough edges and rapid changes.
API Documentation of pd3f-core: https://pd3f.github.io/pd3f-core/index.html
And check out the repository on GitHub for more information: https://github.com/pd3f/pd3f-core