I wrote this up at the request of a colleague who recently heard my talk on open data. I'm posting it here for comment and adding some hyperlinks...
Moving from a Web of documents to a Web of data (or of Linked Open Data) is an oft-cited goal in the sciences. The Web of data would allow us to link together disparate information from unrelated disciplines, run powerful queries, and get precise answers to complex, data-driven questions. It's an undoubtedly desirable extension of the way that the existing networks increase the value of documents and computers through connectivity - Metcalfe's Law applied to more complex information and systems.
However, making the Web of data turns out to be a deeply complex endeavor. Data - used here as a catchall word covering databases and datasets, and generally meaning information gathered in the sciences through either experimental work or environmental observation - require a much more robust and complete set of standards to achieve the same "web" capabilities we take for granted in commerce and culture.
Unlike documents, most data is ultimately intended to be read by a machine: search engines, analytic software, database back ends, and more. There is simply too much data in production to place people on the front lines of analysis. When data scales easily into the petabytes, we just can't keep up using the existing systems.
This machine-readability requirement is very different from the Web of documents, which was designed to standardize the way information is shown to people. Machine readability means we have to think, early and often, about the level of interoperability in any given chunk of data. "How 'connectable' is it to other data?" should be the first question we ask of new data, because the level of effort required to make data connectable post-hoc is significant - frequently unbearable.
The connectability quotient creates significant pressure to build interoperability deep into the Web of data. It implies a level of rigor in the design of data that recognizes that the data is intended for use in a network context. Thus, we need to turn to the concept of interoperability and examine what it means in a data context.
There are three interlocking dimensions to interoperability in data: legal, technical, and semantic. By legal, we mean the contractual and intellectual property rights associated with the data; by technical, the standard systems (especially the computer languages) in which the data is published; and by semantic, the actual meaning of the data itself - what it describes, and how it relates to the broader world.
Each of these dimensions is complex on its own. Taken together, the three represent unsolvable complexity. The semantic layer alone requires an almost miraculous level of agreement on "what things mean," and anyone who has witnessed argument among scientists, be they economists or physicists, knows that even apparently simple topics turn contentious over matters as basic as definitions. Consensus on the technical layer is somewhat easier - the Web and the Semantic Web "stack" of standard technologies have begun to take a leadership position in data networking - but it is still difficult, slow, and open to argument. One of the few opportunities we have is in the legal layer, where we can look to a broad set of successes in legal interoperability through the use of a simple, flat standard: the public domain.
The public domain is a very simple concept - no rights are reserved to owners, and all rights are granted to users. The public domain exists as a counterweight to copyright in the creative space, but in some countries - especially the United States - it also serves as a first option for data that is not considered "creative."
The public domain option currently underpins a wide variety of linked data that is already well on its way to achieving Web scale. From the International Virtual Observatory, whose members build an international data net on norms of "acknowledgment" rather than contracts of "attribution", to the world of genomics, where entire genomes and related data are harmonized nightly across multiple countries, the public domain creates complete interoperability at the legal layer of the data network, and serves as a foundation for the next layer of technical interoperability.
Interestingly, we have yet to observe similar network effects emerging in cases where the underlying data is treated in a more conservative "intellectual property" context, through copyright licenses or database licenses inspired by copyright. Indeed, in the case of the international consortium mapping human genomic variation, the implementation of a "click through" license was found in practice to impede integration of that mapped variation with other public domain data, limiting the value of the map. The license was removed, the public domain option instated, and the database was immediately technically integrated with the rest of the international web of gene data.
The legal element is of course just the beginning. The entities inside the databases themselves must be named and linked in a standard way. Consensus on a dizzying array of technical standards must be achieved through working groups and hard-won agreement. Semantic agreement - or disagreement - must be enabled where possible, and managed through savvy technology where not possible. But if the entire system must begin with a complex set of legal terms and conditions, and be subject to the kinds of injunctions and property claims so familiar from the creative world, it is inherently unstable and unlikely to interoperate.
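To make the "named and linked in a standard way" step a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the example.org identifiers, the field names, and the link_records helper are invented and do not describe any real database. It only shows why a shared identifier matters: records that use the same name for the same entity connect trivially, and records that don't are left stranded.

```python
# Hypothetical records from two unrelated databases. Both are assumed to
# name entities with the same identifier URIs (the example.org URIs are
# made up for this sketch).
gene_records = [
    {"id": "http://example.org/gene/G1", "symbol": "G1", "chromosome": "7"},
    {"id": "http://example.org/gene/G2", "symbol": "G2", "chromosome": "X"},
]
variation_records = [
    {"gene": "http://example.org/gene/G1", "variant": "V100"},
    {"gene": "http://example.org/gene/G1", "variant": "V101"},
    {"gene": "http://example.org/gene/G3", "variant": "V200"},
]


def link_records(genes, variations):
    """Attach each variation to the gene record that shares its identifier.

    Returns the linked gene records plus any variations whose identifier
    matched no gene record.
    """
    # Index the gene records by identifier, copying each so the input
    # lists are left unmodified.
    by_id = {g["id"]: {**g, "variants": []} for g in genes}
    unlinked = []
    for v in variations:
        target = by_id.get(v["gene"])
        if target is None:
            unlinked.append(v)  # no shared name, so no link is possible
        else:
            target["variants"].append(v["variant"])
    return list(by_id.values()), unlinked


linked, unlinked = link_records(gene_records, variation_records)
```

In this sketch the two variants recorded against G1 end up attached to the G1 gene record, while the variant that points at G3 cannot be connected to anything. Real systems face the harder version of this problem, where the two databases do not already agree on the identifiers.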
We have seen the public domain option work, again and again, across the scientific disciplines. Implementing the public domain as the interoperability standard for the legal dimension of the web of data holds the greatest promise for scalability and long-term achievement of the network effect for data, as it permits the widest range of experimentation and development at the technical and semantic layers.