pd3f

Installation

You need to set up Docker.

You need the docker-compose.yml file from this repository. You can download it separately or just fetch the whole repository:

git clone https://github.com/pd3f/pd3f

Then change into the repository folder and run:

docker-compose up

The first time pd3f starts, it will download the Docker images. You need ~8 GB of free disk space to store all the software / data needed to run this.
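Once docker-compose up is running, you can check from Python that the frontend answers. This is only a minimal sketch; it assumes nothing beyond the frontend listening on port 1616, as described in the next section.

import requests

# The frontend should respond once all containers are up.
r = requests.get('http://localhost:1616')
print(r.status_code)  # expect 200 when the service is ready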

Using the GUI

Access the web-based GUI at http://localhost:1616. The first time you upload a PDF, pd3f will download some large language models.

After uploading a PDF, you will be redirected to a web page displaying the progress and results of the job.

Using the API

import time

import requests

# Upload a German-language PDF; the JSON response contains the job id.
with open('/dir/test.pdf', 'rb') as f:
    files = {'pdf': ('test.pdf', f)}
    response = requests.post('http://localhost:1616', files=files, data={'lang': 'de'})
job_id = response.json()['id']

# Poll the update endpoint until the extracted text shows up.
while True:
    r = requests.get(f"http://localhost:1616/update/{job_id}")
    j = r.json()
    if 'text' in j:
        break
    print('waiting...')
    time.sleep(1)
print(j['text'])

POST parameters (a usage sketch follows this list):

  • lang: set the language (options: 'de', 'en', 'es', 'fr')
  • fast: faster but less thorough processing (default: False)
  • tables: whether to extract tables (default: False)
  • experimental: whether to extract text in experimental mode (footnotes to endnotes, deduplicate page header / footer) (default: False)
  • check_ocr: whether to first check if all pages were OCRd (default: True, cannot be modified in the GUI)
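As an illustration, a small helper could combine these parameters with the upload from the example above. This is only a sketch: the endpoint and the pdf and lang fields come from the example above, the extract helper is hypothetical, and sending boolean flags only when enabled is an assumption (it avoids the server receiving a truthy 'False' string as a form field).

import os

import requests

def extract(path, lang='en', tables=False, experimental=False):
    # Build the form data; flags are sent only when enabled, since the
    # exact boolean encoding the server expects is an assumption here.
    data = {'lang': lang}
    if tables:
        data['tables'] = 'true'
    if experimental:
        data['experimental'] = 'true'
    with open(path, 'rb') as f:
        files = {'pdf': (os.path.basename(path), f)}
        response = requests.post('http://localhost:1616', files=files, data=data)
    return response.json()['id']

job_id = extract('/dir/test.pdf', lang='de', tables=True)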

You have to poll /update/<uuid> to track the progress. The JSON response tells you the status of the processing job.

Fields (a polling sketch using them follows this list):

  • log: always present; text output from the job
  • text, tables, and filename: only present when the job has finished successfully
  • position: present if the job is on the waiting list; the position is returned as an integer
  • running: present if the job is running
  • failed: present if the job has failed
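Building on the polling loop from the example above, a sketch that reacts to these fields could look as follows. The field names are taken from the list above; everything else (the wait_for_result helper, the printed messages) is illustrative.

import time

import requests

def wait_for_result(job_id):
    # Poll /update/<uuid> until the job finishes or fails.
    while True:
        j = requests.get(f"http://localhost:1616/update/{job_id}").json()
        if 'failed' in j:
            raise RuntimeError(j.get('log', 'job failed'))
        if 'text' in j:
            return j  # also contains 'tables' and 'filename'
        if 'position' in j:
            print(f"waiting, position in queue: {j['position']}")
        elif 'running' in j:
            print('running...')
        time.sleep(1)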

Scaling

You can also run more workers with this:

docker-compose up --scale worker=3

To increase the frontend threads, change this line in docker-compose.yml:

command: gunicorn app:app --workers=5 --bind=0.0.0.0:5000

You may also create a new docker-compose file to override certain settings. Take a look at docker-compose.prod.yml:

docker-compose -f docker-compose.yml -f docker-compose.prod.yml up --scale worker=2

House Keeping

You will see three folders:

  • pd3f-data-uploads: input & output files
  • pd3f-data-cache: caches model files so you don't have to download them over and over again
  • pd3f-data-to-ocr: temporary location for PDFs to get OCRd. Files get deleted, but logs are kept.

Results are kept for 24 hours by default. However, no files are deleted automatically; only the results in the queue (e.g. the extracted text) expire.

Run this command from time to time to schedule jobs that delete files in pd3f-data-uploads and pd3f-data-to-ocr:

docker-compose run --rm worker rqscheduler --host redis --burst