I'd like to start our tour of book and library information-management techniques with a glance at the humble back-of-book index. I started the USDA's excellent indexing course back in the day, and while it became clear fairly quickly that I do not have the chops to be a good indexer and so I never finished the course, I surely learned to respect those who do have indexing chops. It's not an easy job.
Go find a book with an index and flip through it. Seriously, go ahead. I'll wait. Just bask in the lovely indentedness and order of it all.
Now answer me a question: Should Google be calling that huge mass of crawled web data it computes upon an index?
Arguably, it shouldn't, though this is absolutely a lost battle; the word "index" is polysemous and always will be. What Google has is more along the lines of a concordance of the web. What's a concordance, you ask? A list of words in a given corpus of text, along with pointers to where those words are used in the corpus. Way back in the day, compiling concordances to important literature (e.g. religious texts) was considered worthy scholarly work. Today, of course, feeding a text into a computer can yield a concordance in seconds—I'm no great shakes as a programmer, but even I could hack up some concordance software if I had to.
Google's index is a bit more than a straight-up concordance: they do stemming and some n-gram analysis and other fancy-pants tricks. But it is still qualitatively different from a back-of-book index. How? I'll adduce three major differences: human intervention, terminological insularity, and intentional grouping.
There is a standard documenting what an index is for and how to create one. I'm not paying over $250 to own it, but I'll happily give you the gist.
An indexer presented with a book reads it at least twice, with concentrated attention. She is looking for concepts that the book treats usefully and/or in some depth, because an index containing every passing mention of everything is usually useless to someone asking "does this book have useful, original information on topic X?"
(I did say "usually." Sometimes a topic is so terribly remote or abstruse that even the slightest mention is useful. That's when a concordance can be superior to an index. Google Books is a godsend to lovers of minutiae.)
Please note that I said concepts, not "words" or even "phrases." A recurring problem in information management is that human language is truly annoying about using different words for the same thing, in various sorts of ways that this post is already too long to discuss in depth. Suffice to say that part of the indexer's job is to tease out concepts in the text that aren't necessarily labeled consistently or even labeled at all. A text on web design may never actually use the word "usability," for example, but that doesn't mean it has nothing to say about the subject! A good indexer will work that out.
So how does an indexer label the concepts she finds? Well, ideally, the text has done that for her; that's why an index is more insular than Google, which makes considerable use of other people's labels for web pages insofar as those are discoverable through links. (That's what Googlebombing is all about, of course.) The indexer is not slavishly bound to the text's language, however. She is allowed to take into account the text's readers, and (what she believes to be) their language use.
An indexer will not lightly discard the text's usage. What she will do is use "See" entries to connect likely reader usage to the text's usage. If the aforementioned web-design text casually throws in "HCI" without ever expanding it (shame on the editor! but it does happen), a smart indexer will throw in an entry "Human-computer interaction. See HCI." Remember this trick. We will see it in other forms later.
A See entry is not the same as a See also entry. See entries are intended for more-or-less synonymous terms. Rather than wastefully repeat the entire litany of page numbers for every synonym of a given term, pick the likeliest term (probably the text's most-often-used term, but again, the indexer has some discretion) and point the other synonyms to it. See also entries are for related terms, other information that in the indexer's judgment a reader might be interested in.
See also entries are another example of the grouping function of an index, alongside the entire idea of bringing together in a single entry mentions of the same concept that have been scattered throughout the text. Google does not do this save by haphazard accident. A few other search engines try (have a look at Clusty), but the results typically don't make entire sense—and why should they? They're using algorithms to do a human's job!
Purely mechanical questions such as page count enter into index compilation as well; publishers reserve a certain number of pages for the index (or in the hurly-burly of typesetting, a certain number of pages become available), and the index must be chopped to fit. You can imagine, I'm sure, that it's much harder to do a short index than a longer one!
Indexing electronic books introduces user-interface and workflow questions. The print book has the immensely convenient "page" construct to use as a pointer. The electronic book may have pages—or it may scroll, or the page boundaries may change according to font size, or… you see the UI problem, I trust. It's not insoluble, but it's annoying. The workflow problem is simple: how (and when in the production process) does the poor indexer mark the places a given entry should point to?
When I was doing ebooks back in the day, these problems hadn't been solved yet. I worry sometimes that if they remain unsolved, the noble art of book indexing will wither and die—and the search engine, as I hope you now understand, is not an entire replacement.
Go back and flip through that book index again. Appreciate it a little more? Excellent.
An index is a consciously designed method of finding mentions of subjects, persons, or ideas. A Google search return is merely a software-derived order based on rankings of key words, embedded words, and other data designed to return the 'mostest.'
Google is an incredible tool, but to think a Google search is anything other than an easily manipulated return of key phrases is to compare apples to a bushel of acorns.
Indexers, like bibliographers, are unsung heroes of enlightenment and progress. Overworked and poorly paid, they delight in obsessively creating something that even a child can use and profit by.
This was quite illuminating. We occasionally think about Semantics and the differences between data, metadata, and meaning.
Thank you - a fine reading for students in organization and indexing classes.
Sigh: I was hoping to point you to the NISO standard for indexing, since NISO standards are (ahem) free (fairly rare for an accredited standards organization)--but Z39.4 was withdrawn, possibly because of the ISO equivalent you link to (which isn't free).
"The electronic book may have pages—or it may scroll, or the page boundaries may change according to font size, or… you see the UI problem, I trust. It's not insoluble, but it's annoying. The workflow problem is simple: how (and when in the production process) does the poor indexer mark the places a given entry should point to?"
That problem has been solved with embedded indexing (which actually goes back to the 1980s). The index headings are hidden in the text by means of special codes so that they are not displayed to the reader, and the index is then generated by displaying the headings and some form of link to where it refers to in the text. That could be a page number or even a hyperlink.
Embedded indexes are more work for the indexer to create but means that the index will stay accurate even when the text is repaginated for display, for example, in an e-book reader.
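The mechanics can be sketched in a few lines (the `{ix:...}` marker syntax here is invented for illustration; real tools use their own hidden codes): markers are stripped from the display text, and each heading records where its marker sat, so locators can be re-resolved to pages or hyperlinks after any repagination.

```python
import re
from collections import defaultdict

MARKER = re.compile(r"\{ix:([^}]+)\}")  # hypothetical embedded-index marker syntax

def build_embedded_index(source):
    """Strip hidden {ix:heading} markers from the source and record each
    marker's character offset in the marker-free display text."""
    entries = defaultdict(list)
    display = []
    removed = 0
    last = 0
    for m in MARKER.finditer(source):
        display.append(source[last:m.start()])
        entries[m.group(1)].append(m.start() - removed)  # offset in display text
        removed += m.end() - m.start()
        last = m.end()
    display.append(source[last:])
    return "".join(display), dict(entries)

text, index = build_embedded_index(
    "Links are covered here.{ix:hyperlinks} Fonts come later.{ix:typography}")
print(text)   # the marker-free text shown to the reader
print(index)  # heading -> character offsets, resolvable to pages or hyperlinks
```

Because the locators live in the text itself rather than in a frozen list of page numbers, regenerating the index after reflow is just a matter of re-running the extraction.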
Thank you for this wonderful analysis. I'll present it the next time someone tells me "Google can do it," where "it" can be nearly any library function except storytime.
Glad you liked it! Stay tuned for more...
Walt mentioned NISO and ISO standards for indexing. The reason there is no NISO standard is that agreement on a final version couldn't be reached, I believe because of disagreements on the inclusion of automated indexing as a valid alternative to human (manual) indexing. The NISO technical report (NISO TR02-1997 Guidelines for Indexes and Related Information Retrieval Devices by James D. Anderson) is freely available for download from www.niso.org/standards/resources/tr02.pdf.