Update 10 April: It pays to report problems like the one described below to Google's customer support. Seven weeks ago I discovered the problem. One week ago I reported it. Today the problem was suddenly gone, probably because Google updated the two ebooks involved and pushed new versions of the files to my phone.
I usually shop around for a good price when I buy e-books, and lately Google's bookstore has received my custom. It's not a very high-profile store – you see, this isn't the well-known Google Books, where they offer scanned paper books in your browser. This is something called, clunkily, Google Play Books or Books On Google Play, where you can get copy-protected e-books for off-line reading.
A funny thing about this service is that many or all of Google's e-book files contain original bitmaps scanned from paper books [or are they PDF images of the layout?]. You can toggle between the real e-book, which is the product of Optical Character Recognition probably followed by human proofreading, and the scanned pages. This won't do you much good on a little phone screen, but anyway.
Now, the two most recent books I bought from Google Play Books have a strange glitch. When I complained about it to customer service, I received prompt, friendly help. When none of their suggested fixes worked, I was offered a refund. So this is not a disgruntled customer blog entry. Still, the problem is so strange that I want to blog about it just as a technical conundrum.
On my Android smartphone, the OCRed texts in my e-book copies of Adam Roberts' Jack Glass (2012) and Neal Stephenson's REAMDE (2011) have lost all their apostrophes. All their quotation marks. All their long dashes. And all their diacritic characters. When Stephenson writes “naïveté”, my e-book says “navet”, which is French for turnip. When the problem first showed up, in Roberts' book, I actually thought he wrote non-standard English as a futuristic device.
When you run operating systems in non-English language modes, like Swedish or even Chinese, you get used to misidentified characters, with ÅÄÖÜ becoming all kinds of junk symbols. But this doesn't look like a case of that. Google's reader software is just quietly omitting some of the most common characters in English novels!
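The set of missing characters (curly apostrophes, curly quotation marks, long dashes, diacritics) is what you would lose if text were forced into plain ASCII and everything that didn't fit were silently discarded. Here is a minimal Python sketch of that effect. It is only an illustration of the symptom under that assumption, not a description of what Google's reader actually does, and the function name strip_non_ascii is made up for the example.

```python
import unicodedata


def strip_non_ascii(text: str) -> str:
    """Hypothetical helper: silently drop every character outside ASCII."""
    # Normalize to precomposed form so that e.g. "ï" is one code point;
    # in decomposed form only the combining mark would be dropped.
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("ascii", errors="ignore").decode("ascii")


# Curly quotes (U+201C/U+201D), curly apostrophe (U+2019),
# long dash (U+2014) and two accented letters (U+00EF, U+00E9).
sample = "\u201cIt\u2019s na\u00efvet\u00e9\u2014really,\u201d she said."

print(strip_non_ascii(sample))                 # Its navetreally, she said.
print(strip_non_ascii("na\u00efvet\u00e9"))    # navet
```

Plain ASCII punctuation such as commas and full stops survives, which matches what I see on the phone.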
The problem isn't new. I've found references to it on-line starting December 2010. Strangely, most of the complaints are about science fiction novels. Dear Reader, what's your take on this?
I haven't specifically encountered this with Google Play Books (I have not yet moved into the e-book world), but OCR is a longstanding problem, especially with accented characters. Even then, the OCR deployments I have had to deal with at least try to turn characters into something, however nonsensical (such as é becoming 6). Apostrophes and quote marks are especially tricky, because they might be interpreted as blemishes on the page, rather than punctuation (although commas and periods, which should be just as vulnerable, seem not to get that treatment). Simply omitting a character is the worst possible response, so it sounds like Google need to check their OCR settings and try again with some of these books. Does anybody do quality control there?
If you use TeX or LaTeX, be careful when copying and pasting text from other sources. These programs were written at a time when keyboard input was limited to the 95 printable characters of 7-bit ASCII (characters 0-31 and 127 in ASCII are control codes, not printable), so there are ways of getting most of the characters that appear in languages based on the Latin alphabet (the thorn, used in Icelandic and Old English, is a curious exception). So if, for example, you want an Å, you type \AA. But TeX will ignore characters that aren't among those 95 printable characters, as I have had occasion to observe.
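One way to guard against this when pasting is to convert the accented letters to TeX commands before the text goes into the source file, and to list whatever could not be converted. A rough Python sketch follows. The table is deliberately tiny and only illustrative, and the names TEX_ESCAPES, to_tex and unmapped_non_ascii are invented for this example.

```python
import unicodedata

# Small, incomplete table of accented letters and their classic TeX commands.
# The trailing {} after \AA and \aa ends the command name cleanly.
TEX_ESCAPES = {
    "\u00c5": r"\AA{}",   # Å
    "\u00e5": r"\aa{}",   # å
    "\u00c4": r'\"A',     # Ä
    "\u00e4": r'\"a',     # ä
    "\u00d6": r'\"O',     # Ö
    "\u00f6": r'\"o',     # ö
    "\u00dc": r'\"U',     # Ü
    "\u00e9": r"\'e",     # é
}


def to_tex(text: str) -> str:
    """Replace every character found in TEX_ESCAPES; leave the rest alone."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(TEX_ESCAPES.get(ch, ch) for ch in composed)


def unmapped_non_ascii(text: str) -> list:
    """List non-ASCII characters the table does not cover, for manual checking."""
    composed = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in composed if ord(ch) > 127 and ch not in TEX_ESCAPES})


print(to_tex("\u00c5land"))                        # \AA{}land
print(unmapped_non_ascii("na\u00efvet\u00e9"))     # ['ï']
```

Anything reported by the second function is a character that would need its own entry in the table, or a manual fix, before typesetting.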
The two ebooks I mentioned look like they were proofread quite diligently after OCR. There is no way that the proofreaders would have accepted an entire lack of apostrophes in these novels. My guess is that there is something the matter with the Swedish version of the reader software. The reason that I've seen the problem only in the two latest books I've bought may be that a recent upgrade introduced a new bug.
While I don't doubt your report of a technical error, this has nothing to do with OCR.
The ebooks you mention were not converted from the paper copy. They are both new enough that they had to have been published digitally.
Just an FYI from a tech expert. ;)
I find it inexplicable that they would include the scanned bitmaps in the file if the e-text weren't an OCR product of the same.
So you saw the scanned images? Are you sure they weren't PDF page images?
I don't know why page images would have been included in a book published in 2012 or 2011. There is no reason for them to have been scanned.
Good point. I can't tell if it's a scan or a PDF page image.
Thanks for the mention of Google Play. Looks good.
(OT) Anybody with a background in biology that can make a credibility assessment of this article? If the claims are well-founded, it is a rather spectacular discovery.
"Breast cancer research uncovers the fountain of youth" http://medicalxpress.com/news/2015-04-breast-cancer-uncovers-fountain-y…
"-With the TIMP1 and TIMP3 “architects” missing, the pool of stem cells expanded and remained functional throughout the lifetime of these mice."
Birger@8: It's a very preliminary study, but it doesn't really seem to say what the (poorly written) press release wants us to think it says. It's an interesting finding in a mouse model, but given that these are heavily modified mice, I don't think it is going to have any immediate impact on human lifespans.
And haven't you heard "We've got plenty of Youth, how about a fountain of Smart?" :)
I have had some similar issues with non-ASCII characters being replaced with an empty box as I move between word processors, between editable and PDF format, or between operating systems. Just deleting the accented characters is a new one. I would have thought that, with Unicode, issues with converting characters were over and done with.
There is some serious computer science research trying to figure out how to digitize accented Greek (especially since a page of Greek can be mixed with Latin letters and Arabic numerals in the margins, so the OCR needs to distinguish ί from i and β from 13).
"Earth ate a Mercury-like body early in its history, study finds" http://phys.org/news/2015-04-earth-ate-mercury-like-body-early.html
(voice of old geezer) "Play book ate the apostrophes? When I was a kid we had to worry about the Earth eating an entire planet! We would have been happy if we only had had to worry about the bloody apostrophes!"
@Birger: I hope that that phys.org link is not an accurate description of the article (which apparently happens with some regularity). The part about the Earth absorbing a Mercury-like body isn't new; that's how the Moon is thought to have originated. But the part about the collision being necessary for our magnetic field sounds highly implausible to me. Mercury, which is obviously a Mercury-like body, has an internal magnetic field. And the factors which allow an iron/nickel core to separate from mantle rock would tend to push non-oxidized uranium and thorium toward the center of the earth, as those metals are denser than iron.