python/python-pdfminer/README


PDFMiner is a tool for extracting information from PDF documents. Unlike
other PDF-related tools, it focuses entirely on getting and analyzing
text data. PDFMiner allows one to obtain the exact location of text in a
page, as well as other information such as fonts or lines. It includes a
PDF converter that can transform PDF files into other text formats (such
as HTML). It has an extensible PDF parser that can be used for purposes
other than text analysis.

PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py.

pdf2txt.py

pdf2txt.py extracts the text contents of a PDF file. It cannot recognize
text drawn as images. It also extracts text locations, font names and
sizes, and writing direction. A password is required for
password-protected PDF documents, and no text can be extracted from a
document that does not grant extraction permission.
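
As a rough sketch of how pdf2txt.py might be driven from a script, the
helper below builds a command line and hands it to subprocess. The -o
(output file), -t (output type) and -P (password) options are assumed
from the upstream pdf2txt.py usage text and may differ in the packaged
version. The function names and file names are made up for illustration.

```python
import subprocess


def build_pdf2txt_command(pdf_path, out_path, output_type="text", password=None):
    """Return an argv list for pdf2txt.py (hypothetical helper).

    Assumes pdf2txt.py is on PATH and accepts -o, -t and -P as in the
    upstream usage text.
    """
    cmd = ["pdf2txt.py", "-o", out_path, "-t", output_type]
    if password is not None:
        # Only needed for password-protected documents.
        cmd += ["-P", password]
    cmd.append(pdf_path)
    return cmd


def run_pdf2txt(pdf_path, out_path, **options):
    """Run pdf2txt.py and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_pdf2txt_command(pdf_path, out_path, **options), check=True)
```

For example, run_pdf2txt("report.pdf", "report.html", output_type="html")
would ask for an HTML rendering of a hypothetical report.pdf.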

dumppdf.py

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML
format. This program is primarily for debugging purposes, but it's also
possible to extract some meaningful contents (e.g. images).
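
A similar sketch for dumppdf.py captures the pseudo-XML dump as a string
for inspection. The -a option (dump all objects) is assumed from the
upstream usage text, and the helper names are hypothetical.

```python
import subprocess


def build_dumppdf_command(pdf_path, dump_all=True):
    """Return an argv list for dumppdf.py (hypothetical helper).

    Assumes dumppdf.py is on PATH and that -a dumps every object.
    """
    cmd = ["dumppdf.py"]
    if dump_all:
        cmd.append("-a")
    cmd.append(pdf_path)
    return cmd


def dump_pdf(pdf_path):
    """Run dumppdf.py and return whatever it writes to standard output."""
    result = subprocess.run(
        build_dumppdf_command(pdf_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

The returned text can then be searched or saved for debugging.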