Searching Pdf Files For Text

Active 9 months ago

Problem
I'm trying to determine what type a document is (e.g. pleading, correspondence, subpoena, etc.) by searching through its text, preferably using Python. All the PDFs are searchable, but I haven't found a way to parse them with Python and run a search script over the text (short of converting each one to a text file first, which could be resource-intensive for n documents).

What I've done so far
I've looked into pypdf, PDFminer, the Adobe PDF documentation, and any questions here I could find (though none seemed to solve this issue directly). PDFminer seems to have the most potential, but after reading through the documentation I'm not even sure where to begin.

Is there a simple, effective method for reading PDF text, either by page, line, or the entire document? Or any other workarounds?

Insarov

6 Answers

This is called PDF mining, and is very hard because:

PDF is a document format designed to be printed, not to be parsed. Inside a PDF document, text is in no particular order (unless order is important for printing). Most of the time the original text structure is lost: letters may not be grouped as words, words may not be grouped in sentences, and the order in which they are placed on the page is often random.
There are tons of programs that generate PDFs, and many of them are defective.

Tools like PDFminer use heuristics to group letters and words again based on their position on the page. I agree, the interface is pretty low level, but it makes more sense when you know what problem they are trying to solve (in the end, what matters is choosing how close to its neighbors a letter/word/line has to be in order to be considered part of a paragraph).
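To make the grouping problem concrete, here is a toy sketch of that kind of heuristic. It is not PDFminer's actual implementation: the `Glyph` class and the `line_tol` and `word_gap` thresholds are hypothetical names made up for illustration, and real tools deal with many more cases (rotated text, columns, varying font sizes).

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page
    y: float      # baseline position (PDF y grows upward)
    width: float


def group_into_lines(glyphs, line_tol=2.0, word_gap=1.5):
    """Rebuild lines of text from positioned glyphs using distance thresholds."""
    lines = []  # list of (baseline_y, [glyphs on that line])
    # Visit glyphs top to bottom, then left to right.
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)      # close enough vertically: same line
        else:
            lines.append((g.y, [g]))    # otherwise start a new line

    result = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > word_gap:          # a wide gap is treated as a space
                text += " "
            text += cur.char
        result.append(text)
    return result


# Glyphs in the arbitrary order a PDF might store them.
scrambled = [
    Glyph("o", 15, 100, 5), Glyph("k", 5, 80, 5), Glyph("H", 0, 100, 5),
    Glyph("y", 10, 100, 5), Glyph("o", 0, 80, 5), Glyph("i", 5, 100, 2),
]
print(group_into_lines(scrambled))  # ['Hi yo', 'ok']
```

Changing `word_gap` or `line_tol` changes where words and lines break, which is exactly the kind of tuning the low-level interface exposes.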

An expensive alternative (in terms of time/computer power) is generating an image of each page and feeding it to OCR. It may be worth a try if you have a very good OCR engine.

So my answer is no, there is no such thing as a simple, effective method for extracting text from PDF files. If your documents have a known structure, you can fine-tune the rules and get good results, but it is always a gamble.

I would really like to be proven wrong.

[update]

The answer has not changed but recently I was involved with two projects: one of them is using computer vision in order to extract data from scanned hospital forms. The other extracts data from court records. What I learned is:

Computer vision is within reach of mere mortals in 2018. If you have a good sample of already classified documents, you can use OpenCV or scikit-image to extract features and train a machine learning classifier to determine what type a document is.
If the PDF you are analyzing is 'searchable', you can get very far by extracting all the text with a tool like pdftotext and feeding it to a Bayesian filter (the same kind of algorithm used to classify spam).
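As a rough illustration of the Bayesian filter idea, here is a minimal naive Bayes classifier in plain Python. It assumes the text has already been extracted (for example with pdftotext). The class name, the tokenizer, and the three training snippets are all invented for the example; a real classifier would need many labeled documents per type.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, label, text):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus log likelihood of each word (Laplace smoothing).
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training snippets, one per document type.
clf = NaiveBayes()
clf.train("subpoena", "you are commanded to appear and testify subpoena witness")
clf.train("correspondence", "dear counsel please find enclosed sincerely yours letter")
clf.train("pleading", "plaintiff alleges defendant complaint cause of action")

print(clf.classify("Dear Mr. Smith, please find the enclosed documents. Sincerely"))
```

The appeal of this approach is that it tolerates messy extraction: word order and layout do not matter, only which words show up.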

So there is no reliable and effective method for extracting text from PDF files, but you may not need one in order to solve the problem at hand (document type classification).

Paulo Scardine

I am totally a green hand, but somehow this script works for me:

Emma Yu

I've written extensive systems for the company I work for to convert PDFs into data for processing (invoices, settlements, scanned tickets, etc.), and @Paulo Scardine is correct--there is no completely reliable and easy way to do this. That said, the fastest, most reliable, and least-intensive way is to use pdftotext, part of the xpdf set of tools. This tool will quickly convert searchable PDFs to a text file, which you can read and parse with Python. Hint: Use the -layout argument. And by the way, not all PDFs are searchable, only those that contain text. Some PDFs contain only images with no text at all.
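A minimal sketch of that workflow from Python, assuming the `pdftotext` executable is installed and on your PATH. The function names here are my own, and the whitespace-tolerant matching is an added assumption: `-layout` pads text with extra spaces, so a phrase may not appear with single spaces between words.

```python
import re
import subprocess


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")   # keep the physical layout of the page
    cmd += [pdf_path, "-"]      # "-" writes the text to stdout, not to a file
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def find_phrase(text, phrase):
    """Return the 1-based page numbers on which phrase occurs (case-insensitive)."""
    pages = text.split("\f")    # pdftotext separates pages with form feeds
    # Allow any run of whitespace between the words of the phrase.
    pattern = re.compile(r"\s+".join(map(re.escape, phrase.split())), re.IGNORECASE)
    return [i for i, page in enumerate(pages, start=1) if pattern.search(page)]


# Usage with a hypothetical file name:
#   text = pdf_to_text("document.pdf")
#   print(find_phrase(text, "notice of hearing"))
```

Sending the output to stdout avoids writing a text file per document, which addresses the concern about converting n documents to files first.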

MikeHunter

I recently started using ScraperWiki to do what you described.

Here's an example of using ScraperWiki to extract PDF data.

The scraperwiki.pdftoxml() function returns an XML structure.

You can then use BeautifulSoup to parse that into a navigable tree.

Here's my code for -

This code is going to print a whole, big ugly pile of <text> tags. Each page is separated with a </page>, if that's any consolation.

If you want the content inside the <text> tags, which might include headings wrapped in <b> for example, use line.contents

If you only want each line of text, not including tags, use line.getText()

It's messy, and painful, but this will work for searchable PDF docs. So far I've found this to be accurate, but painful.
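For readers who want to experiment with this kind of output without ScraperWiki, here is a small sketch. It uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, and `SAMPLE_XML` is a hand-written stand-in for the page/text structure described above, so the exact attributes in real output are an assumption.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed <page>/<text> shape; not real tool output.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="80" width="200" height="16" font="0"><b>NOTICE OF HEARING</b></text>
<text top="130" left="80" width="300" height="12" font="1">Please take notice that a hearing is set.</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="80" width="150" height="12" font="1">Dated: <i>today</i></text>
</page>
</pdf2xml>"""


def lines_by_page(xml_string):
    """Map each page number to its lines of text with inline tags stripped."""
    root = ET.fromstring(xml_string)
    pages = {}
    for page in root.iter("page"):
        number = int(page.get("number"))
        # itertext() walks nested tags such as <b> and <i>, keeping only the text.
        pages[number] = ["".join(t.itertext()) for t in page.iter("text")]
    return pages


for number, lines in lines_by_page(SAMPLE_XML).items():
    print(number, lines)
```

Joining `itertext()` plays the same role as `line.getText()` in the BeautifulSoup version: it drops the tags and keeps the text.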

JasTonAChair

I agree with @Paulo

Active 2 years ago

I need to search a PDF file to see if a certain string is present. The string in question is definitely encoded as text (i.e. it is not an image or anything). I have tried just searching the file as though it were plain text, but this does not work.

Is it possible to do this? Are there any libraries out there for .NET 2.0 that will extract/decode all the text out of a PDF file for me?

Cœur

Nathan

closed as off-topic by Bhargav Rao♦ May 1 '17 at 19:48

This question appears to be off-topic. The users who voted to close gave this specific reason:

'Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.' – Bhargav Rao

If this question can be reworded to fit the rules in the help center, please edit the question.

3 Answers

There are a few libraries available out there. Check out http://www.codeproject.com/KB/cs/PDFToText.aspx and http://itextsharp.sourceforge.net/

It takes a little bit of effort but it's possible.

volatilsis

You can use Docotic.Pdf library to search for text in PDF files.

Here is a sample code:

The library can also extract formatted and plain text from the whole document or any document page.

Disclaimer: I work for Bit Miracle, vendor of the library.

Bobrovsky

In the vast majority of cases, it's not possible to search the contents of a PDF directly by opening it up in Notepad -- and even in the minority of cases (depending on how the PDF was constructed), you'll only ever be able to search for individual words due to the way that PDF handles text internally.
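A small Python illustration of why a plain-text search of the raw file tends to fail (Python is used only for the demonstration; it is not a .NET solution). The byte strings below are hand-written stand-ins for PDF page content, not taken from a real file. Page content is usually stored compressed with the FlateDecode filter, which is zlib/deflate, and even uncompressed text can be split into fragments by kerning adjustments inside a TJ array.

```python
import zlib

# Stand-in for a page's content stream: the Tj operator draws a string.
content_stream = (
    b"BT /F1 12 Tf 72 720 Td (Notice of Hearing) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td (Please take notice that a hearing is set.) Tj ET\n"
)

# What a PDF writer would typically store in the file for that stream.
stored = zlib.compress(content_stream)

# Searching the stored bytes misses the text; decompressing first finds it.
print(b"Hearing" in stored)
print(b"Hearing" in zlib.decompress(stored))

# Even without compression, a TJ array can break a word into pieces,
# with numbers between the pieces adjusting the spacing.
kerned = b"BT /F1 12 Tf 72 720 Td [(Not) 15 (ice of Hear) -10 (ing)] TJ ET"
print(b"Notice" in kerned)
```

This is why a library that decodes the streams and reassembles the text is needed before searching.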

My company has a commercial solution that will let you extract text from a PDF file. I've included some sample code for you below, as shown on this page, that demonstrates how to search through the text from a PDF file for a particular string.

Rowan
