Actual comps response: Information Retrieval

By cpikas on August 9, 2009.

Now that I'm not scared to look at my responses... This one doesn't look so bad, so I'm sharing. Please do keep in mind that this was written in 2 hours, by a tired person, with tired fingers!

---

Christina K. Pikas

Comps Information Retrieval (Minor)

July 20, 2009

Question F2: Design an information retrieval system for scientists that covers full-text peer-reviewed articles as well as blogs and wikis

0. Introduction

Today, scientists use more than just the peer-reviewed journal literature in their work, but our information retrieval systems such as our library research databases and online catalogs only cover the journal literature, conference papers, books by title, and to a limited extent, protocols and technical reports. We must support a more comprehensive view of information to stay relevant and to support scientific work. Accordingly, this essay describes the design of an information retrieval system to provide access to this broader variety of scientific information a user might need. The essay starts with an outline of the design elements of the system and continues by identifying the major issues in designing such a system. It ends with a discussion of the best ways to evaluate the system.

1. The design elements of the system

An information retrieval system must take an input that is a representation of a user's information need, match that input to representations of information in the system, present search results, and take feedback. A modern system must go further. It should support learning and exploration, provide the users with a workspace, and be standards-compliant and interoperable with the ecology of other information systems used by the expected users. This section describes the following design elements:

• Coverage
• Query formulation for expected search types
• Representation of information/internal information organization
• Matching
• Presentation of search results
• Working with search results
• Interoperability

1.1 Coverage

This section describes the types of information that should be included in the system and the sources for that information.

1.1.1. Types of information

There are many different types of information a scientist needs to do her work. Some types of information might include:
• research articles written on the subject to refine her research question
• protocols and textbooks to learn about research methods
• catalog information for suppliers for lab consumables
• grant submission information
• policy information for her research subjects or for conducting certain types of research
• conference and event information
• contact information for potential collaborators, mentors, or assistants/technicians
• information to keep up with colleagues' work or new advances in the field

With such diverse and complicated information needs, of which those listed above are a small sample, this system should probably start by covering external information such as the research literature, publicly available blogs and wikis, protocols, and grant resources. Organization-specific information, such as that from animal research boards, would add value but would require customization for each location. Likewise, full integration of personal information management tools would be ideal, but might be too difficult for the initial system.

1.1.2 Sources of information

1.1.2.1 Research information

The vast majority of the research literature is available online, for a fee, with a license. No institution, no matter how wealthy, has access to all of the information it needs. Frequently, abstracts and tables of contents are available for free even for journals to which the institution does not subscribe. Research articles are well covered by research databases such as Medline, Inspec, and Compendex. Most research databases and digital libraries can be added to a federated search using a web service or a Z39.50 connection. The major exception to this rule is in chemistry: SciFinder (Chemical Abstracts) cannot be federated. The source for the research information should be all of the research databases in science and the digital libraries as well as general searches.

1.1.2.2 Blogs and wikis

A general web search of all blogs and wikis will not be useful, as it will be very difficult to avoid introducing lots of noise. Nature Publishing Group has created a directory of science blogs. Each blog must be nominated by another blogger who is already in the system and then reviewed by the system managers. Likewise, the ResearchBlogging site run by Dave Munger and sponsored by Seed requires a review of the blog before it can use the logo and have posts included in the listing. Membership in a vetted listing like these should be required for inclusion in the system. Once a source is included, there could also be a way to report abuse or to flag a blog as inappropriate.

1.1.2.3 People information

A useful system should help link scientists to other people. There are directories from the professional societies, but these are probably not open for use. Likewise, COS offers an opt-in product, but it might not be available for federation. Other sources are lists of grant recipients, university or research institution directories, authors of published articles, and site users/profile creators.

1.2 Query formulation

Query formulation can be quite difficult, particularly in a system that covers such a diversity of information. For example, in blog searching many of the searches are filtering searches to set up an alert or to find a new blog to follow instead of ad hoc searches or navigational searches. This system should support these filtering searches, to enable the user to set up alerts on new topics or find continuing resources (like journals and blogs) to follow. Most searchers still use 2-3 keywords to "teleport" and then refine and iterate their search. Ideally, the system could know what information you already have in your personal stores and use that to show only novel information and to get a better representation of your information need. Also, search could be initiated from within Microsoft Office documents and web browsers by highlighting text or a drawing of a chemical, and then right clicking or otherwise invoking the retrieval tool.

1.2.1 Keyword search

A simple keyword box and a guided keyword search are expected in most interfaces. The system can support query formulation by spell-checking and auto-completing the search. It should also be able to recognize if the term input is a chemical, organism, person, or other category of controlled information and ask the user if she wants to search using that index.
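The query support described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny `VOCABULARY` dictionary, the `suggest` function, and the 0.8 similarity cutoff are all hypothetical, and a real system would draw on full controlled vocabularies rather than a hand-typed list. Spelling suggestions here use Python's standard `difflib` module.

```python
import difflib

# Hypothetical mini-vocabulary mapping controlled terms to their category.
VOCABULARY = {
    "benzene": "chemical",
    "benzaldehyde": "chemical",
    "drosophila melanogaster": "organism",
    "escherichia coli": "organism",
}

def suggest(term, vocabulary=VOCABULARY, limit=5):
    """Return auto-completions, spelling suggestions, and a category guess."""
    cleaned = term.strip().lower()
    # Auto-complete: vocabulary entries that start with what was typed.
    completions = sorted(v for v in vocabulary if v.startswith(cleaned))[:limit]
    # Spell-check: entries that closely resemble what was typed.
    corrections = difflib.get_close_matches(
        cleaned, list(vocabulary), n=limit, cutoff=0.8
    )
    # Category recognition: exact hit on a controlled term, else None.
    category = vocabulary.get(cleaned)
    return {
        "completions": completions,
        "corrections": corrections,
        "category": category,
    }

if __name__ == "__main__":
    print(suggest("benz"))
    print(suggest("benzen"))
    print(suggest("Escherichia coli"))
```

When `category` is not `None`, the interface could ask the user whether she wants to search the matching index instead of free text.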

1.2.2 Query by example

The user could input a document or other information and get more like that or get citing or cited documents.

1.2.3 Known item search

The user might want to retrieve a known document to get access to the full text or to see how it is represented in the system. The system should allow the user to enter a citation or even just a PMID or DOI to retrieve an item.
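As a minimal sketch of how the search box might tell a DOI or PMID apart from a free-text citation, assuming simple pattern checks are good enough for a first pass: the function name `classify_known_item`, the accepted prefixes, and the 8-digit cap on PMIDs are assumptions made for illustration, not part of any standard.

```python
import re
from typing import Tuple

# A DOI starts with "10.", then a numeric registrant code, a slash, and a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
# A PMID is a plain integer; the 8-digit cap is an assumption of this sketch.
PMID_PATTERN = re.compile(r"^\d{1,8}$")
DOI_PREFIXES = ("https://doi.org/", "http://dx.doi.org/", "doi:")

def classify_known_item(query: str) -> Tuple[str, str]:
    """Return (kind, value) where kind is 'doi', 'pmid', or 'citation'."""
    text = query.strip()
    lowered = text.lower()
    # Explicitly labelled DOIs.
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            candidate = text[len(prefix):].strip()
            if DOI_PATTERN.match(candidate):
                return ("doi", candidate)
            return ("citation", text)
    # Explicitly labelled PMIDs.
    if lowered.startswith("pmid:"):
        candidate = text[len("pmid:"):].strip()
        if PMID_PATTERN.match(candidate):
            return ("pmid", candidate)
        return ("citation", text)
    # Unlabelled input: fall back on the shape of the string.
    if DOI_PATTERN.match(text):
        return ("doi", text)
    if PMID_PATTERN.match(text):
        return ("pmid", text)
    return ("citation", text)

if __name__ == "__main__":
    for example in ("doi:10.1000/xyz123", "PMID: 12345678", "Smith 2008 Nature"):
        print(example, "->", classify_known_item(example))
```

Anything classified as `citation` would still need to be parsed and matched against the index, which is a harder problem than this sketch covers.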

1.2.4 Browse

The system should allow the user to browse by any controlled vocabulary used including those representing people, chemicals, organisms, institutions, journals, and so forth.

1.3 Representation of information

Some of the information representations would just be the full text of the document, but machine-aided indexing or entity extraction using some of the many controlled vocabularies and information organization systems used within science would be helpful. An early decision would be whether the system is primarily a federated search or whether it spiders and caches. A federated search will have the freshest results, but will be slow to display the results to the user. It also relies on the native search of the targeted source. A spider and cache model would be preferable for speed and for the amount of pre-processing required to support entity extraction and other requirements listed in this essay, but one should not underestimate the policy negotiations required to make that possible (Summon seems to be doing this, so that might break some barriers). A federated search would not require the system to store representations of the information objects, whereas a spider and cache model would.
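To make the federated option concrete, here is a rough sketch in which each source is just a Python function that accepts a query and returns a list of records. The `federated_search` function, the record fields (`title`, `doi`), and the de-duplication rule are all assumptions for illustration; real connectors would speak Z39.50 or a web service API.

```python
from concurrent.futures import ThreadPoolExecutor

def federated_search(query, sources, timeout=5.0):
    """Send the query to every source at once and merge what comes back.

    `sources` maps a source name to a callable taking the query string.
    """
    merged, seen = [], set()
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = [(name, pool.submit(fn, query)) for name, fn in sources.items()]
        for name, future in futures:
            try:
                records = future.result(timeout=timeout)
            except Exception:
                # A slow or failing source is skipped rather than blocking
                # the others. (The pool still waits for it on shutdown.)
                continue
            for record in records:
                # De-duplicate on DOI when present, else on source and title.
                key = record.get("doi") or (name, record.get("title"))
                if key in seen:
                    continue
                seen.add(key)
                merged.append({**record, "source": name})
    return merged

if __name__ == "__main__":
    # Stand-in sources with made-up records.
    def database(query):
        return [{"title": "A", "doi": "10.1000/a"}]

    def blogs(query):
        return [{"title": "Post about A"}, {"title": "A", "doi": "10.1000/a"}]

    print(federated_search("a", {"database": database, "blogs": blogs}))
```

The sketch also shows the weakness noted above: the merged list is only as fast as the slowest source that answers, and only as good as each source's native search.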

1.4 Matching

There are several well-accepted information retrieval models that would work well here including Boolean, vector space, probabilistic, and language models. Their exact details are beyond the scope of this essay.
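For readers who want a feel for one of these models, the following is a toy sketch of vector space ranking with TF-IDF weights and cosine similarity. It assumes whitespace tokenization and one common IDF variant (`log(N/df) + 1`); the function names and sample documents are made up, and a production system would use an established search engine instead.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return text.lower().split()

def rank(query, documents):
    """Return (score, document_index) pairs, best match first."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}

    def vectorize(tokens):
        counts = Counter(tokens)
        # Terms absent from the collection carry no weight.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(tokenize(query))
    scores = [(cosine(query_vec, vectorize(t)), i) for i, t in enumerate(doc_tokens)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))

if __name__ == "__main__":
    docs = [
        "protein folding simulation",
        "blog post about conference travel",
        "protein structure prediction blog",
    ]
    for score, index in rank("protein folding", docs):
        print(round(score, 3), docs[index])
```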

1.5 Presentation of search results

The presentation of the search results must give the user enough information to judge the relevance of the results and to understand how the system interpreted the query in order to provide feedback. It should further support the exploration of the information space.

1.5.1 How scientists judge relevance

Scientists judge relevance by topical measures, both direct and indirect, as well as by other measures such as novelty, timeliness or recency, authority, and subject discipline. Authority is judged using the author's name and affiliation and the publication venue. The number of citations an article has received is also a proxy for utility and authority.

Full text availability is important. In a large JHU libraries study, we found that users want to have a representation of how long it will take to get the information in the results (i.e., available immediately online, walk over to the library, 2-3 days from elsewhere in the institution, 2-3 weeks for ILL).

For web pages, the URL is often used to make a first guess about the relevance. Key-word-in-context snippets are also helpful.

For items that are not "citable," some representation of usage or in-links can be used to indicate potential utility of the item. Likewise, for wikis the number and frequency of edits can show if the article is controversial or active, and the number of comments received for blog posts is sometimes informative.

Enabling the users to rate items and then showing a user rating in stars next to the item can be useful.

1.5.2 Faceted presentation of search results

Faceted presentation of search results enables the user to explore the information space by showing the categories occurring in the retrieved set and how many times they occur. The user can then use these to further narrow the set.
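A small sketch of the counting behind facets, assuming each result is a dictionary of metadata fields. The field names (`type`, `year`, `organism`) and the sample records are hypothetical.

```python
from collections import Counter

def facet_counts(results, fields):
    """Count how often each value of each facet field occurs in the result set."""
    counts = {field: Counter() for field in fields}
    for record in results:
        for field in fields:
            value = record.get(field)
            if value is None:
                continue
            # A field may hold a single value or several (e.g. multiple organisms).
            values = value if isinstance(value, (list, tuple, set)) else [value]
            counts[field].update(values)
    return counts

def narrow(results, field, value):
    """Keep only the results whose facet field contains the chosen value."""
    kept = []
    for record in results:
        current = record.get(field)
        values = current if isinstance(current, (list, tuple, set)) else [current]
        if value in values:
            kept.append(record)
    return kept

if __name__ == "__main__":
    results = [
        {"title": "A", "type": "article", "year": 2008, "organism": ["mouse", "rat"]},
        {"title": "B", "type": "blog post", "year": 2009, "organism": ["mouse"]},
        {"title": "C", "type": "article", "year": 2009},
    ]
    print(facet_counts(results, ["type", "year", "organism"]))
    print(narrow(results, "organism", "mouse"))
```

Because the counts are computed over the retrieved set, re-running `facet_counts` on the narrowed results gives the user an updated picture after each refinement.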

1.5.3 Pivot browsing

Each person's name and controlled vocabulary term should link to allow the user to create a new search using that term.

1.6 Working with search results

The ideal system should allow the user to annotate search results and to save some or all of them within the system or export them to another system. The search results should be sortable by any field and the user should be able to search within the results. The system should also allow users to share their work with others in the system in small group areas.

1.7 Interoperability

This system cannot be designed as an isolated system; it must take information from other information systems and be able to export information out. At a minimum, it should be interoperable with the following systems:
• the library catalog and WorldCat - for availability information as well as intellectual access information
• OpenURL resolvers and electronic entitlement systems
• usage statistic compilation software (COUNTER and SUSHI compliant)
• interlibrary loan software
• bibliographic management or PDF management software
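As one illustration of what interoperability with a link resolver involves, the sketch below builds an OpenURL in the key/encoded-value style of the Z39.88-2004 standard from a result record. The resolver address, the `build_openurl` function, and the record field names are all hypothetical, and the sample DOI is a placeholder; each institution's resolver has its own base URL and may expect additional keys.

```python
from urllib.parse import urlencode

# Hypothetical resolver address; every institution runs its own.
RESOLVER_BASE = "https://resolver.example.edu/openurl"

# Maps this sketch's (assumed) record fields to OpenURL journal-format keys.
FIELD_MAP = {
    "article_title": "rft.atitle",
    "journal": "rft.jtitle",
    "year": "rft.date",
    "volume": "rft.volume",
    "start_page": "rft.spage",
}

def build_openurl(record, base=RESOLVER_BASE):
    """Build a link that asks the resolver where the user can get this article."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
    ]
    for field, key in FIELD_MAP.items():
        if record.get(field):
            params.append((key, str(record[field])))
    if record.get("doi"):
        params.append(("rft_id", "info:doi/" + record["doi"]))
    return base + "?" + urlencode(params)

if __name__ == "__main__":
    # Made-up record for demonstration only.
    print(build_openurl({
        "article_title": "Folding rates",
        "journal": "Journal of Examples",
        "year": 2008,
        "doi": "10.1000/xyz123",
    }))
```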

2. Design issues

The primary design issues unique to this system are traceable to the diversity of information covered. There are known issues associated with searching research databases that cover journal articles and conference papers. There are known issues with searching the web, and some unresolved issues with searching social computing technologies. By making these sources all available in the same tool, we compound these issues and add issues related to conveying authority (and how assessment of authority differs in the various sources), conveying freshness/recency/timeliness, and mixing structured and unstructured data. It is clear that this is an open issue, because the current federated searches and overlay discovery tools do not do this well and they typically only search library catalogs, institutional repositories, research databases, and digital libraries. We should also address whether scientists want blog and wiki content to be surfaced in the same tool as the research literature. Some may prefer to keep this content separate.

A secondary design issue is getting access to the content and most likely mixing federated content with spidered content.

A third issue is getting enough user feedback from practicing scientists who are quite busy. It is more straightforward to get LIS or CS graduate students to test a system, but this system should be co-developed with and tested by the scientists who form the end user group.

3. Evaluation

A user orientation is critical to the success of this system. There should be formative evaluations while the system is being developed and summative evaluations at key points in the process. The first evaluations can be with paper prototypes, asking potential users to give subjective feedback on the design of the system. Other evaluations can be done with limited functionality prototypes and then beta and production systems.

3.1 Experimental or lab evaluations

During the development process, early evaluations might require the use of assigned tasks so that the designers can be certain that the users test the desired features and functionality. Participants should be scientists and should be working on a topic of interest to them, provided the topic is one the system can support. Once a training period is complete, the interaction with the system can be monitored to see which paths people follow and how they use the system. A survey can be given at the end to surface any participant complaints.

3.2 Naturalistic or field evaluations

The final evaluation of the system should be in actual use by the expected users for an extended period of time. While they are using it, the users can provide feedback in the form of e-mails and comment forms, the developers can unobtrusively capture system logs, and there can be a survey near the end of the trial period. From the system logs, the developers can see what types of queries are being used and the path that users follow. The navigation paths and time for each stage along with the documents the users save can show the developers how various features are working.
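The log analysis described above might start with something like the following sketch. It assumes each log event has already been reduced to a dictionary with `session`, `time` (in seconds), and `action` fields; these field names and the sample events are assumptions, since real system logs would need parsing and anonymizing first.

```python
from collections import Counter, defaultdict

def summarize_logs(events):
    """Return (navigation path counts, mean seconds spent at each stage)."""
    # Group events by session, in time order.
    sessions = defaultdict(list)
    for event in sorted(events, key=lambda e: (e["session"], e["time"])):
        sessions[event["session"]].append(event)

    path_counts = Counter()
    stage_seconds = defaultdict(list)
    for session_events in sessions.values():
        # The navigation path is the ordered sequence of actions in a session.
        path_counts[tuple(e["action"] for e in session_events)] += 1
        # Time at a stage is the gap until the next event in the same session.
        for current, following in zip(session_events, session_events[1:]):
            stage_seconds[current["action"]].append(following["time"] - current["time"])

    mean_seconds = {
        action: sum(gaps) / len(gaps) for action, gaps in stage_seconds.items()
    }
    return path_counts, mean_seconds

if __name__ == "__main__":
    # Made-up events for demonstration only.
    events = [
        {"session": "a", "time": 0, "action": "search"},
        {"session": "a", "time": 30, "action": "view"},
        {"session": "a", "time": 90, "action": "save"},
        {"session": "b", "time": 5, "action": "search"},
        {"session": "b", "time": 15, "action": "view"},
    ]
    paths, timings = summarize_logs(events)
    print(paths.most_common())
    print(timings)
```

The last action in each session has no following event, so this sketch records no duration for it.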

4. Conclusion

An information retrieval system that covers social computing technology content along with peer-reviewed content is quite complex. This essay has provided a high-level overview of the requirements for such a system given what we know about scientists' information needs and information retrieval system design and use.
