When is text in a PDF not text?

I see this confusion so often it seems worth addressing.

If you scan a page of text, what you have is a picture. A computer sees it not as letters, numbers, and punctuation but as pixels: bits of light and shade and color, just like the pixels in your favorite family photo on Flickr.

You can't search for, extract, highlight, or cut-and-paste such "text." It doesn't matter whether you embed the picture in a PDF; you still can't search it. Ceci n'est pas un texte!
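To make the point concrete, here is a minimal sketch in Python. The tiny "scan" of the word HI and the helper name contains_word are made up for illustration; a real scan has millions of pixels and usually compressed image data, but the principle is the same: the file holds brightness values, not character codes.

```python
# A tiny hypothetical "scan" of the word HI: '#' is ink, '.' is paper.
SCAN_ROWS = [
    "#...#.###",
    "#...#..#.",
    "#####..#.",
    "#...#..#.",
    "#...#.###",
]

# What an image stores: one brightness byte per pixel (0 = black, 255 = white).
pixels = bytes(0 if ch == "#" else 255 for row in SCAN_ROWS for ch in row)

# What a word processor stores: character codes.
text = "HI".encode("ascii")


def contains_word(data: bytes, word: str) -> bool:
    """Search raw bytes for the character codes of a word."""
    return word.encode("ascii") in data


print(contains_word(text, "HI"))    # True
print(contains_word(pixels, "HI"))  # False
```

A human looking at the rows above reads "HI" at once; the search finds nothing, because the pixel bytes contain only the values 0 and 255 and never the codes for H and I.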

Compare this to creating a PDF from a word-processing or page-layout document. The computer already thinks of the text in these documents as text, so it can embed the text in the PDF as text. The text is thus searchable, extractable, and all that good stuff. (Within limits. PDF is horrible for text-mining, for reasons I may decide to discuss sometime.)

To make the text in a scanned picture searchable, you must use Optical Character Recognition (OCR) technology on the picture. OCR tools look at the picture and try to figure out what letters, numbers, and punctuation it contains. Once you've OCRed the picture, you may embed the text in the PDF along with the picture, whereupon you may be able to search and extract it.
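As a rough illustration of what "try to figure out" means, here is a toy recognizer in Python. It is only a sketch under generous assumptions: the three glyph templates, the fixed 3x5 letter size, and the function names are all invented for this example, and real OCR tools use far more sophisticated methods than comparing pixels against stored shapes.

```python
# Hypothetical 3x5 glyph templates; '#' is ink, '.' is paper.
TEMPLATES = {
    "H": ["#.#", "#.#", "###", "#.#", "#.#"],
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
}


def distance(a, b):
    """Count the pixels where two glyph bitmaps disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize_glyph(glyph):
    """Return the template character whose bitmap is closest to the glyph."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))


def ocr(rows, glyph_width=3, gap=1):
    """Cut a scanned line into fixed-width glyphs and recognize each one."""
    step = glyph_width + gap
    chars = []
    for x in range(0, len(rows[0]), step):
        glyph = [row[x:x + glyph_width] for row in rows]
        chars.append(recognize_glyph(glyph))
    return "".join(chars)


scan = [
    "#.#.###",
    "#.#..#.",
    "###..#.",
    "#.#..#.",
    "#.#.###",
]
print(ocr(scan))  # HI
```

The output is a guess: the recognizer always returns whichever known shape is nearest, whether or not the picture actually contains that letter. That guessed text is what gets embedded in the PDF alongside the picture.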

But no OCR, no text, as far as computers are concerned.

Was that clear?

The only thing I think needs to be added to this explanation is the fact that OCR is astonishingly unreliable. God help you if your document contains the word "bum"; most OCR software will render that as "burn". Flecks of dirt will be interpreted as punctuation or as stray bits of letters. Poor-quality or old-fashioned type (which is more crowded than modern type, usually) will also contribute scads of "scannos". Nobody has ever managed to write a program that can reliably recover the clean, abstract letter-sequences that underlie the blurry blobs of ink that make up real printing.
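One modest way to hunt for this kind of error is to flag words that differ only by look-alike letter sequences. The Python sketch below assumes a short, hand-picked list of confusable pairs ("rn" and "m", "cl" and "d", "vv" and "w"); the list and the function names are illustrative, not taken from any particular OCR or proofreading tool.

```python
# Assumed, non-exhaustive list of look-alike character sequences.
CONFUSABLE = [("rn", "m"), ("cl", "d"), ("vv", "w")]


def fold(word: str) -> str:
    """Collapse look-alike sequences so that confusable words compare equal."""
    word = word.lower()
    for seq, lookalike in CONFUSABLE:
        word = word.replace(seq, lookalike)
    return word


def suspicious_groups(words):
    """Group distinct words that become identical after folding."""
    groups = {}
    for w in set(words):
        groups.setdefault(fold(w), set()).add(w.lower())
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


print(suspicious_groups(["bum", "burn", "modern", "type"]))
# [['bum', 'burn']]
```

This only tells a proofreader where to look. It cannot say which of "bum" and "burn" the printed page actually contains; that still takes a human eye on the scan.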

Project Gutenberg has been crowd-sourcing the proofreading of scanned texts for years, and it's an eye-opening exercise. (Virtuous, too, if you'll forgive the plug.) Every page has OCR errors.