pdftotext, part of the xpdf set of tools. This tool will quickly convert searchable PDFs to a text file, which you can read and parse with Python. Hint: use the -layout argument. And by the way, not all PDFs are searchable, only those that contain text. Some PDFs contain only images with no text at all.

The scraperwiki.pdftoxml()
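function returns an XML structure in which the text sits inside <text> tags. Each page is separated with a </page> tag, if that's any consolation. To work with what is inside the <text> tags, which might include headings wrapped in <b>, for example, use line.contents (the child nodes, with nested markup kept) or line.getText() (the plain text only).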
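
As a rough illustration of walking that structure page by page, here is a minimal sketch. It makes several assumptions: the sample XML is hypothetical and much simpler than real output (the pdf2xml root and the top/left attributes are only stand-ins), the helper name lines_by_page is made up for this example, and it uses the standard library's xml.etree.ElementTree rather than the line.contents / line.getText() calls mentioned above, so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, loosely shaped like the XML described above:
# <page> elements containing <text> elements, some with <b> inside.
SAMPLE_XML = """<pdf2xml>
<page number="1">
<text top="50" left="40"><b>Annual Report</b></text>
<text top="80" left="40">Total income: 1200</text>
</page>
<page number="2">
<text top="50" left="40">Notes to the accounts</text>
</page>
</pdf2xml>"""


def lines_by_page(xml_string):
    """Return one list per <page>, holding (plain_text, is_bold) per <text>."""
    root = ET.fromstring(xml_string)
    pages = []
    for page in root.iter("page"):
        lines = []
        for text in page.iter("text"):
            # itertext() gathers the text of the element and its children,
            # so markup such as <b> is dropped from the result.
            plain = "".join(text.itertext()).strip()
            # A direct <b> child is treated here as a sign of a heading.
            is_bold = text.find("b") is not None
            lines.append((plain, is_bold))
        pages.append(lines)
    return pages


pages = lines_by_page(SAMPLE_XML)
for number, lines in enumerate(pages, start=1):
    for plain, is_bold in lines:
        print(number, "HEADING" if is_bold else "text", plain)
```

Checking for a <b> child is only one possible way to spot headings; real documents may need the position attributes or font information as well.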