The Book of Trogool

E-research, cyberinfrastructure, data curation, open access... an academic librarian examines how computers change research and libraries.

Book of Trogool bloggers are Elizabeth Brown, Dorothea Salo, and Sarah Shreeves.

When is text in a PDF not text?

Category: Miscellanea
Posted September 9, 2009 by Dorothea Salo.

I see this confusion so often it seems worth addressing.

If you scan a page of text, what you have is a picture. A computer sees it not as letters, numbers, and punctuation but as pixels: bits of light and shade and color, just like the pixels in your favorite family photo on Flickr.
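To make that concrete, here is a minimal Python sketch. The 5x5 bitmap and the helper names (`SCANNED_T`, `pixels_to_bytes`, `contains_text`) are made up for illustration; a real scan has millions of pixels and color values, but the principle is the same: the data holds brightness values, not character codes.

```python
# A tiny, hypothetical 5x5 black-and-white "scan" of the letter T.
# 1 = dark pixel (ink), 0 = light pixel (paper).
SCANNED_T = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]


def pixels_to_bytes(rows):
    """Flatten the pixel grid into raw bytes, roughly as an uncompressed image stores it."""
    return bytes(pixel for row in rows for pixel in row)


def contains_text(data, needle):
    """Search raw bytes for the character codes of `needle`."""
    return needle.encode("ascii") in data


typed = "T".encode("ascii")            # real text: the character code for T
scanned = pixels_to_bytes(SCANNED_T)   # a picture of a T: 25 pixel values

print(contains_text(typed, "T"))    # True
print(contains_text(scanned, "T"))  # False: there are only 0s and 1s in here
```

A person glancing at the grid sees a T. The search sees twenty-five numbers, none of which is the code for "T".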

You can't search for, extract, highlight, or cut-and-paste such "text." It doesn't matter whether you embed the picture in a PDF; you still can't search it. Ceci n'est pas un texte!

Compare this to creating a PDF from a word-processing or page-layout document. The computer already thinks of the text in these documents as text, so it can embed the text in the PDF as text. The text is thus searchable, extractable, and all that good stuff. (Within limits. PDF is horrible for text-mining, for reasons I may decide to discuss sometime.)
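A rough sketch of the difference, in Python. Inside a PDF, each page carries a content stream of drawing instructions. Text is drawn with text-showing operators such as `Tj` and `TJ`, while an image is painted with `Do`. The two sample streams below are hand-written and simplified, and `shows_text` is a hypothetical helper. Real content streams are usually compressed and would have to be decoded first, so treat this as an illustration rather than a working detector.

```python
import re

# Hand-written, simplified page content streams (assumptions for illustration).
# A page made from a word-processing document: the text is drawn as text.
TEXT_PAGE = b"BT /F1 12 Tf 72 720 Td (When is text not text?) Tj ET"
# A page made from a scan: one big image is painted, and nothing else.
SCANNED_PAGE = b"q 612 0 0 792 0 0 cm /Im0 Do Q"

# Matches a string operand followed by Tj, or an array operand followed by TJ.
TEXT_SHOWING = re.compile(rb"\)\s*Tj\b|\]\s*TJ\b")


def shows_text(content_stream):
    """Return True if the stream contains a text-showing operator."""
    return TEXT_SHOWING.search(content_stream) is not None


print(shows_text(TEXT_PAGE))     # True
print(shows_text(SCANNED_PAGE))  # False
```

The first stream carries the actual characters between the parentheses, which is what makes searching and copying possible. The second only says "paint image Im0 here."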

To make the text in a scanned picture searchable, you must use Optical Character Recognition (OCR) technology on the picture. OCR tools look at the picture and try to figure out what letters, numbers, and punctuation it contains. Once you've OCRed the picture, you may embed the text in the PDF along with the picture, whereupon you may be able to search and extract it.
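What "try to figure out" means can be sketched with a toy recognizer in Python. This one compares each scanned glyph against a few reference shapes and picks the closest match. The templates, the 5x5 grid, and the function names are all assumptions made for this example; real OCR software is far more sophisticated, but the input (pixels) and the output (characters) are the same.

```python
# Hypothetical 5x5 reference shapes ("templates") for three letters.
TEMPLATES = {
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
    "L": ["#....", "#....", "#....", "#....", "#####"],
    "I": ["..#..", "..#..", "..#..", "..#..", "..#.."],
}


def difference(glyph, template):
    """Count the pixels where the scanned glyph and the template disagree."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Guess the letter whose template differs from the glyph the least."""
    return min(TEMPLATES, key=lambda letter: difference(glyph, TEMPLATES[letter]))


def ocr(glyphs):
    """Turn a sequence of pixel grids into a string of guessed letters."""
    return "".join(recognize(glyph) for glyph in glyphs)


# A "scanned" word: two clean glyphs, plus a T with one stray dark pixel.
smudged_t = ["#####", "..#..", "..#..", "..#..", "..#.#"]
scan = [TEMPLATES["L"], TEMPLATES["I"], smudged_t]

text = ocr(scan)
print(text)          # LIT
print("IT" in text)  # True: the recognized text is searchable
```

Only after this step does the computer have characters it can search. The string that comes out is the part that gets embedded in the PDF alongside the picture.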

But no OCR, no text, as far as computers are concerned.

Was that clear?

Comments

1

The only thing I think needs to be added to this explanation is the fact that OCR is astonishingly unreliable. God help you if your document contains the word "bum"; most OCR software will render that as "burn". Flecks of dirt will be interpreted as punctuation or as stray bits of letters. Poor-quality or old-fashioned type (which is more crowded than modern type, usually) will also contribute scads of "scannos". Nobody has ever managed to write a program that can reliably recover the clean, abstract letter-sequences that underlie the blurry blobs of ink that make up real printing.
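The "bum"/"burn" problem can be sketched in a few lines of Python. The idea: on the page, "rn" and "m" look nearly alike, so any OCR output containing one might really have been the other. The function names and the five-word list below are assumptions for illustration, not part of any real OCR tool.

```python
def confusable_spellings(word):
    """Return every spelling reachable by swapping 'm' and 'rn' anywhere in word."""
    seen = {word}
    pending = [word]
    while pending:
        current = pending.pop()
        for i in range(len(current)):
            if current[i] == "m":
                swapped = current[:i] + "rn" + current[i + 1:]
            elif current.startswith("rn", i):
                swapped = current[:i] + "m" + current[i + 2:]
            else:
                continue
            if swapped not in seen:
                seen.add(swapped)
                pending.append(swapped)
    return seen


# A tiny stand-in word list (an assumption; a real checker would use a full dictionary).
WORDS = {"bum", "burn", "modem", "modern", "corner"}


def plausible_readings(ocr_word):
    """List the dictionary words the OCR output might really have been."""
    return sorted(confusable_spellings(ocr_word) & WORDS)


print(plausible_readings("bum"))     # ['bum', 'burn']
print(plausible_readings("modem"))   # ['modem', 'modern']
print(plausible_readings("corner"))  # ['corner']
```

When both readings are real words, a spell-checker cannot tell which one the page actually said. That is part of why human proofreading is still needed.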

Project Gutenberg has been crowd-sourcing the proofreading of scanned texts for years, and it's an eye-opening exercise. (Virtuous, too, if you'll forgive the plug.) Every page has OCR errors.

Posted by: ACW | September 10, 2009

© 2006-2011 ScienceBlogs LLC. ScienceBlogs is a registered trademark of ScienceBlogs LLC. All rights reserved.