pd3f – Beyond PDF

pd3f is an open-source PDF text extraction pipeline that is self-hosted, local-first and Docker-based.

pd3f reconstructs the original continuous text with the help of machine learning.

pd3f is still in an experimental stage, so please use it with caution.


Features

Some reasons why pd3f is for you

Complete pipeline

Scanned PDFs are automatically OCRed

Local or remote

Run it on your computer or on a server

Easy to set up

Docker makes installation easy

Overview

A longer introduction to pd3f is available in German on the site of the Prototype Fund.

pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extract tables with Camelot and Tabula. It is built upon the output of Parsr, which detects hierarchies of text and splits the text into words, lines and paragraphs.

Even though Parsr brings some structure to the PDF, the text is still scrambled, e.g., due to hyphens. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, new lines and/or spaces. It uses language models to guess what the original text looked like.
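To make the idea of reconstruction more concrete, here is a minimal sketch in Python. It is not the pd3f-core implementation or its API: the function names (`join_candidates`, `vocabulary_score`, `reconstruct`) are hypothetical, and a toy vocabulary lookup stands in for the language model. For every line break, the sketch builds the possible ways of joining two lines and keeps the one that scores best.

```python
from typing import Callable, Iterable, List, Set


def join_candidates(left: str, right: str) -> List[str]:
    """Return the possible ways to join two consecutive lines."""
    if left.endswith("-"):
        return [
            left[:-1] + right,   # hyphen was only a line-break artifact
            left + right,        # hyphen belongs to the word, e.g. "E-Mail"
            left + " " + right,  # hyphen ends a token, a new word follows
        ]
    # Without a trailing hyphen, the line break is treated as a space.
    return [left + " " + right]


def vocabulary_score(vocab: Set[str]) -> Callable[[str], float]:
    """Toy stand-in for a language model: the share of known words in a text."""
    def score(text: str) -> float:
        words = [w.strip(".,;:!?").lower() for w in text.split()]
        return sum(1 for w in words if w in vocab) / max(len(words), 1)
    return score


def reconstruct(lines: Iterable[str], score: Callable[[str], float]) -> str:
    """Join lines one by one, keeping the best-scoring candidate each time."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if not cleaned:
        return ""
    text = cleaned[0]
    for next_line in cleaned[1:]:
        # On a tie, max() keeps the first candidate (hyphen removed).
        text = max(join_candidates(text, next_line), key=score)
    return text


# Hypothetical input: one sentence broken across three lines.
example_lines = ["Das Bundes-", "ministerium schickt eine E-", "Mail an alle."]
toy_vocab = {"das", "bundesministerium", "schickt", "eine", "e-mail", "an", "alle"}
print(reconstruct(example_lines, vocabulary_score(toy_vocab)))
```

In this sketch the first hyphen is dropped ("Bundesministerium") while the second is kept ("E-Mail"), because those variants contain more known words. A real language model would score candidates by how likely they are rather than by a fixed word list, which helps with words the list has never seen.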

pd3f is especially useful for languages with long words, such as German. It was mainly developed to parse German letters and official documents. Besides German, pd3f supports English, Spanish, French and Italian. More languages will be added later.

pd3f includes a web-based GUI and a Flask-based microservice (API). You can find a demo at demo.pd3f.com.

pd3f is licensed under the GNU General Public License v3.0 and is looking for contributors. Write an email to hi@jfilter.de if you want to cooperate.

Recorded Online Talk about pd3f

About

pd3f was created by Johannes Filter and financed by the German Federal Ministry of Education and Research as part of its funding for open-source software through the Prototype Fund.

Johannes Filter is an independent researcher, software developer and data analyst. His work is focused on human-computer interaction, machine learning and natural-language processing. Website, Twitter


The project was funded by the German Ministry of Education and Research under the funding reference number 01IS19S18. The author is responsible for the content of this publication.