[TriLUG] To The Oracle:

Wed Apr 30 11:41:34 EDT 2014

On 04/29/2014 09:42 PM, Brian McCullough wrote:
> Greetings, all.
>
> Once again, I have what I hope is an interesting question that some or
> many of you can help with.
>
>
> Last fall, I learned about creating PDFs from PHP code.  Now I need to
> go the other way and extract data from PDFs.
>
> I have found more than one method in PHP for reading PDFs, but,
> unfortunately, even the newest methods don't seem to be able to deal
> with "modern" PDFs, version 1.4.
>
> Here, instead of text with other markup, as we see in older PDFs, there
> seem to be blocks of binary code intermixed with markup.
>
>
> Does anybody have any suggestions for dealing with this new version of
> PDF?

Hello Brian,

Just a couple of observations.  If you are working with a limited range
of PDF documents, or ones that share a common structure, this may not
apply.

My development team examined how to extract data from PDF files a few
years ago.  We were using Java at that time.  What we found was that PDF
often mangles the text data that it contains in ways that make it very
painful to extract.  For example, multi-column data may extract as
sentence fragments in an unreadable sequence.  When we tried to extract
tables of words and numbers, we found that the information we needed was
arbitrarily organized and intermingled with binary control sequences.
In some cases, such as proportionally spaced text, words are broken up
into individual letters with binary sequences in between.
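
To make the fragmentation concrete, here is a minimal Python sketch (not
the Java code from that project).  It assumes the "binary blocks" are
Flate-compressed content streams and that the text is drawn with the PDF
TJ operator, which interleaves string fragments with numeric spacing
adjustments.  The sample stream, the function names, and the -200 gap
threshold are all made up for illustration; real files with embedded or
multi-byte font encodings need a proper PDF library.

```python
import re
import zlib

# Hypothetical content stream: one TJ array that splits two words into
# fragments separated by spacing adjustments (thousandths of a text unit).
RAW = b"BT /F1 12 Tf 72 720 Td [(T) 80 (able) -250 (W) 60 (idth)] TJ ET"

# Matches the body of a "[ ... ] TJ" array.
TJ_ARRAY = re.compile(rb"\[(.*?)\]\s*TJ", re.S)
# Matches either a literal string "(...)" or a number inside the array.
# Simplification: escape sequences inside strings are kept, not decoded.
TJ_ITEM = re.compile(rb"\((?P<text>(?:\\.|[^\\()])*)\)|(?P<num>-?\d+(?:\.\d+)?)")


def inflate(stream_bytes):
    """Decompress a FlateDecode stream body (zlib format)."""
    return zlib.decompress(stream_bytes)


def join_tj(content, gap=-200.0):
    """Rejoin TJ fragments; a large negative adjustment is treated as a space.

    The gap threshold is an assumed heuristic, not part of the PDF format.
    """
    lines = []
    for arr in TJ_ARRAY.finditer(content):
        parts = []
        for item in TJ_ITEM.finditer(arr.group(1)):
            if item.group("text") is not None:
                parts.append(item.group("text").decode("latin-1"))
            elif float(item.group("num")) <= gap:
                parts.append(" ")
        lines.append("".join(parts))
    return lines


if __name__ == "__main__":
    compressed = zlib.compress(RAW)   # looks like binary noise in the file
    print(join_tj(inflate(compressed)))  # prints ['Table Width']
```

Even in this toy case the word boundaries have to be guessed from the
spacing numbers, which is why columns and tables tend to come out
scrambled.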

Our takeaway was that PDF is a print-ready document format that makes
no attempt to preserve the human-readable information that it contains
in a consistent, extractable way.  We gave up and found other ways to
get what we were after.

You may have read that some PDF extraction libraries use the annotations
that occur within some versions of PDF to help guide the process.  The
Java library that we were using depended on this.  We were not able to
find out how these annotations get into the PDF documents.  If you have
input into the PDF creation process, you may be able to use this to your
advantage.

Good luck,

Scott C.

-- 
Scott Chilcote
scottchilcote at ncrrbiz.com
Cary, NC USA