pd3f-core
pd3f-core is a Python package to reconstruct the original continuous text from PDFs with language models.
pd3f-core assumes your PDF is either text-based or already OCR'd.
pd3f-core is at the heart of pd3f: a full Docker-based text extraction pipeline (including OCR).
pd3f-core first uses Parsr to chunk PDFs into lines and paragraphs. Then, it uses the Python package dehyphen to reconstruct the paragraphs in the most probable way. The probability is derived by calculating the perplexity with Flair’s character-based language models. Unnecessary hyphens are removed, and spaces or line breaks are kept or dropped depending on the surrounding words.
pd3f-core is mainly developed for German but should work with other languages as well. The project is still at an early stage. Expect rough edges and rapid changes.
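To make the perplexity idea more concrete, here is a minimal sketch of how two lines could be joined by scoring each candidate and keeping the most probable one. It is not the dehyphen or Flair API: the tiny character bigram model and the helper names `train_char_bigram`, `perplexity` and `join_lines` are made up for this illustration and only stand in for Flair's much stronger character-based language models.

```python
import math
from collections import Counter


def train_char_bigram(corpus):
    """Count character bigrams for a toy language model (stand-in for Flair)."""
    padded = " " + corpus  # leading space marks the start of the text
    bigrams = Counter(zip(padded, padded[1:]))
    unigrams = Counter(padded[:-1])  # how often each character has a successor
    return bigrams, unigrams, len(set(padded))


def perplexity(text, model):
    """Per-character perplexity of `text`; lower means more probable."""
    bigrams, unigrams, vocab_size = model
    padded = " " + text
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Add-one smoothing so unseen pairs (e.g. with "-") get a small, non-zero score.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(text))


def join_lines(first, second, model):
    """Join two lines using the candidate with the lowest perplexity."""
    candidates = [first + " " + second]  # plain line break: keep a space
    if first.endswith("-"):
        candidates.append(first[:-1] + second)  # hyphen came from the line break
        candidates.append(first + second)  # hyphen belongs to the word
    return min(candidates, key=lambda c: perplexity(c, model))


# Hypothetical toy training text; a real model is trained on far more data.
model = train_char_bigram("the document is long and the document is text based")
print(join_lines("the docu-", "ment is long", model))  # the document is long
```

The toy model prefers "document" because it has seen "um" before, but never "u-" or "-m". With a real language model the same comparison also decides when a hyphen should stay, as in "text-based".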
API Documentation
API Documentation of pd3f-core: https://pd3f.github.io/pd3f-core/index.html
And check out the repository on GitHub for more information: https://github.com/pd3f/pd3f-core