


Firstly, you have to understand what a PDF is. PDFs are designed to mimic a printed page, and they are designed only as an output format, not an input format. A PDF is basically a map containing the exact location of characters (individual letters or punctuation, etc.) or images. In most cases, a PDF does not even store information about where one word ends and another begins, much less things like soft breaks vs. hard breaks. (A few recent PDFs do store some information about this stuff, but that's a new technology, and you'd be lucky to find PDFs like that. Even if you did, your PDF viewer might not know about it.)

Anyway, it's up to your software to implement some kind of "artificial intelligence" to work out, merely from the locations of individual characters, what is a word, what is a paragraph, and so on. Some software is going to do this better than others, and it's also going to depend on how the PDF was made. In any case, you should never expect perfect results. Having the output PDF is not the same as having the source document. Far better to try to obtain that if you can.
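To make the "guess words from character positions" idea concrete, here is a minimal sketch in Python. It is not how any particular PDF tool works. The `Glyph` record, the function names, and the two thresholds (`y_tolerance`, `gap_ratio`) are all made up for illustration, and it assumes the characters have already been pulled out of the PDF with coordinates where y grows upward.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character (hypothetical record, not a real library type)."""
    char: str
    x: float      # left edge of the character
    y: float      # baseline; assumed to grow upward on the page
    width: float  # horizontal advance of the character


def group_into_lines(glyphs, y_tolerance=2.0):
    """Bucket glyphs whose baselines are within y_tolerance of each other."""
    lines = []
    # Visit glyphs from the top of the page to the bottom.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    return lines


def line_to_text(line, gap_ratio=0.3):
    """Join one line's glyphs, guessing a space wherever the gap is wide."""
    line = sorted(line, key=lambda g: g.x)  # left to right
    out = [line[0].char]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        # Heuristic: a gap wider than a fraction of the previous
        # character's width is treated as a word boundary.
        if gap > gap_ratio * prev.width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def extract_text(glyphs):
    """Rebuild plain text from positioned glyphs, one output line per text line."""
    return "\n".join(line_to_text(line) for line in group_into_lines(glyphs))
```

Even this toy version has to guess at two thresholds, and it would fall over on multiple columns, rotated text, or tightly kerned fonts. That is why results vary so much between tools and between PDFs.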
