Usage
Using the GUI
Access the Web-based GUI at http://localhost:1616. The first time you upload a PDF, pd3f will download some large language models, so the first job takes a while to finish. After uploading a PDF you will be redirected to a page displaying the progress and results of the job.
Using the API
import time
import requests

# upload the PDF and start a processing job
with open('/dir/test.pdf', 'rb') as f:
    files = {'pdf': ('test.pdf', f)}
    response = requests.post('http://localhost:1616', files=files, data={'lang': 'de'})
id = response.json()['id']

# poll until the job is done ('text' is only present on success)
while True:
    r = requests.get(f"http://localhost:1616/update/{id}")
    j = r.json()
    if 'text' in j:
        break
    print('waiting...')
    time.sleep(1)

print(j['text'])
Post params:
lang: set the language (options: 'de', 'en', 'es', 'fr')
fast: whether to check for tables (default: False)
tables: whether to check for tables (default: False)
experimental: whether to extract text in experimental mode (footnotes to endnotes, deduplicate page header / footer) (default: False)
check_ocr: whether to check first if all pages were OCRd (default: True, cannot be modified in GUI)
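As an example, the optional flags can be sent as form data alongside the file. The sketch below is an assumption about how such a request could look: the file path is a placeholder and the exact encoding the service expects for boolean values is not specified here.

import requests

# '/dir/test.pdf' is a placeholder path; adjust to your PDF
with open('/dir/test.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:1616',
        files={'pdf': ('test.pdf', f)},
        data={'lang': 'en', 'tables': 'true', 'experimental': 'true'},
    )
print(response.json()['id'])  # use this id to poll /update/<uuid>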
You have to poll /update/<uuid> to track the progress. The JSON response tells you the status of the processing job.
Fields:
log: always present, text output from the job
text, tables and filename: only present when the job finished successfully
position: present if the job is on the waiting list, returns the position as an integer
running: present if the job is running
failed: present if the job has failed
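A polling loop can branch on these fields to report progress and stop on failure. The following is a minimal sketch based on the field descriptions above; the helper name and the 1-second interval are arbitrary choices.

import time
import requests

def wait_for_result(job_id, base_url='http://localhost:1616'):
    # poll /update/<uuid> until the job succeeds or fails
    while True:
        j = requests.get(f'{base_url}/update/{job_id}').json()
        if 'failed' in j:
            raise RuntimeError(f"job {job_id} failed, log:\n{j['log']}")
        if 'text' in j:
            return j  # also contains 'tables' and 'filename'
        if 'position' in j:
            print(f"queued at position {j['position']}")
        elif 'running' in j:
            print('still running...')
        time.sleep(1)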
Scaling
You can also run more parsr workers with this:
docker-compose up --scale worker=3
To increase the number of frontend workers, adapt the environment variable WEB_CONCURRENCY in docker-compose.yml:
WEB_CONCURRENCY=2
You can also create a new docker-compose file to override certain settings; take a look at docker-compose.prod.yml for an example:
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up --scale worker=2
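For illustration, an override file could look roughly like this. It is only a sketch, not the contents of docker-compose.prod.yml, and the frontend service name (web) is an assumption, so check the service names in your docker-compose.yml:

version: '3'
services:
  web:  # assumed name of the frontend service
    environment:
      - WEB_CONCURRENCY=2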
Housekeeping
Docker uses three volumes:
pd3f-data-uploads: input & output files, mapped to ./data/pd3f-data-uploads/
pd3f-data-cache: internal volume, storing data so you don't have to download model files over and over again
pd3f-data-to-ocr: internal volume, temporary location for PDFs to get OCRd. Files get deleted but logs are kept.
Results are kept for 24 hours by default, but no files are deleted automatically; only the results in the queue (e.g. the extracted text) expire. Run the following command from time to time to schedule jobs that delete files in pd3f-data-uploads and pd3f-data-to-ocr:
docker-compose run --rm worker rqscheduler --host redis --burst