This post was prompted by the combination of three events: a visit with the founder of PubGet, an invitation to keynote at a conference on publishing, and an interview with Bora about the Science Online 2009 conference last January in RTP.
The past year has seen an explosion of talk about the future of the scientific article. It's wonderful to see, even if the results are either depressingly complicated to achieve or depressingly incremental. Both of those results are better than when I got into this - I remember at a conference in Sweden in 2006 hearing a grand high priest of the publishing industry argue that they'd gotten this whole digital publishing thing sorted right out...that attitude was the first thing that needed to change. Glad it has.
I've been hammering for years now on the need to enrich articles with semantics. My talk at that conference in Sweden was probably the first good one I gave on the topic, and it's been a leitmotif for me going back to the mid-1990s when I was studying epistemology and getting my first real exposure to networked computers. For years I was convinced it was right around the corner.
That semantic publishing future now feels closer than it ever has. But I'm actually less convinced it's around the corner than in years past, and the reasons for that are human, not technical.
To be clear: in the following, I'm going to be talking about narratives and text, not about databases. The semantic future for databases and data is already here, but to paraphrase William Gibson, it's just unevenly distributed. Those who argue that the Semantic Web isn't going to work have already lost the argument. You just don't see it, because it's an infrastructure upgrade to the back-end of the Web to make it work for data.
But the impact of formal semantics on text, which is what humans interface with, has been negligible. It's had nowhere near the impact of tagging and folksonomy. That's driven me, and many others who like formal semantics, crazy.
The benefits of a formal semantic approach to text are so obvious: we can start to treat knowledge as a graph, and we can even maybe start to get some network externality benefits from that knowledge. Make it more valuable via the network...one fact is like one fax machine, but many facts build a hypothesis, etc. etc. etc.
Beautiful dream. Not going to happen anytime soon.
The problem is that people are the writers. Humans. Not machines. Machines luuuuuv semantics. Otherwise they can't tell the difference between a picture and a pitcher (or between a pitcher of water and a baseball pitcher). This is why one should never send one's mother to buy jewelry via Google without the safe browsing mode enabled.
And people don't like formal semantics. I majored in formal semantics, and it's a topic that still gives me headaches.
People like stories.
Scientists are people.
Scientists like stories.
A paper is a story. It tells, in its own way, the story of years of work. Of building expertise. Of designing falsifiable hypotheses. Of the results found in the lab. Of the search to balance those results against the canon and dogma. Of the potential ramification of the results.
It's a story of science. And the telling of it is an important part of being a human who does science.
A recent article in PLoS Genetics states that "Fission Yeast Tel1ATM and Rad3ATR Promote Telomere Protection and Telomerase Recruitment" - now, those are the key "facts" asserted. They could be written into machine-readable format. I will spare you what that would look like. Suffice to say it's eye-bleedingly ugly, and requires lots of agreement about unique identifiers. It's doable. It's being done for the databases and that will eventually make it possible for the literature. It's just not fun. And it ignores the story.
It reduces the research tale to a few assertions, nested into a massive graph of stuff other people asserted. While this is great for machines, it is lousy for people.
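For readers who want a rough sense of it anyway, here is a minimal sketch in Python of how the title's assertions might be reduced to subject-predicate-object statements. Everything in it is an assumption for illustration: the "ex:" identifiers are placeholders rather than real database IDs, the "promotes" relation is invented, and reading the title as "each kinase promotes each process" is just one interpretation.

```python
from itertools import product

# Hypothetical identifiers. Real markup would need community-agreed
# unique IDs (for genes, processes, relations) instead of these placeholders.
SUBJECTS = ["ex:Tel1", "ex:Rad3"]
PREDICATE = "ex:promotes"
OBJECTS = ["ex:telomere_protection", "ex:telomerase_recruitment"]


def title_to_triples(subjects, predicate, objects):
    """Pair every subject with every object under a single predicate."""
    return [(s, predicate, o) for s, o in product(subjects, objects)]


triples = title_to_triples(SUBJECTS, PREDICATE, OBJECTS)

# Print one statement per line, loosely in the style of a triple listing.
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Even this toy version shows the trade-off: four tidy assertions a machine can link into a graph, and none of the story.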
This is all leading up to an idea I'm working on for the talk later this month. Publishers need to be in the business of providing the service that translates the stories for the machines to understand. The Web makes it trivial to publish stories in human readable form. All the beautiful layout services and print services that used to be worth paying for...aren't. Peer review isn't free, but it's nowhere near as expensive as it's made out to be - and it's going to get transformed by the Web, too. The Web makes peer review massively more powerful as it makes it massively more democratic. The Web kills a lot of things that used to drive value in content, especially controlled content.
After all, I can't remember the last time I used a Zagat's guide. Not when I have Chowhound. It's going to come to science. Don't know exactly how, but it's coming.
But this only covers one piece of science - the telling of the story. There's another key, which is the ability to use the information to write a new tale. The ability to take this massive corpus of story and turn it into something that can be modeled, that can be used by humans and machines together to draft new stories...that ability is going to require the emergence of publishers who understand their role in the new content economy. It's not as printers who use bits rather than ink. It's as translators between the human stories and the machines who have to take those stories, integrate them into a web of linked data, and make it possible for humans to ask questions, dream dreams, and tell new stories.
The semantic article isn't going to come from individual scientists rebelling and marking up their own text. It's going to be a publisher value-added service - "let us make your article integrated, and comprehensible, so that you maximize your citation count and potential collaboration."
Sounds good, doesn't it?
Focusing on the control of copies of the article, of the story, isn't just a losing strategy because of the open access movement, although it is that as well. It's the wrong concept entirely. Translation is a service for which authors would gladly pay. For which searchers would gladly pay. And it's a market that is going to get more valuable as a result of open systems, not less valuable, as the cost of controlled scientific published content drops thanks to green and gold open access.
Think about Clayton Christensen's law of conservation of attractive profits: "When attractive profits disappear at one stage in the value chain because a product becomes commoditized, the opportunity to earn attractive profits with proprietary products usually emerges at an adjacent stage."
Publishers are trying to fight the commoditization of the story. They shouldn't. The vast majority of the stories are bought and paid for by the public one way or the other. Publishers should be looking at the place where they can compete on proprietary services, and taking over those markets before their competitors - or startups - beat them to it. There is enormous opportunity in the emerging open access world to make money without needing to vigilantly police the movement of content.
Help the scientists tell their stories in a way that lets those stories integrate into the digital web. Don't just gussy up a paper version of a story with hyperlinks. Don't focus on controlling the movement of stories. They're sand in your hands once they're on the network. Embrace that fact. Find the value in the next layer, the service layer.
Be a guide. Be a search engine. Be a translator.
Even some of the top pseudonymous bloggers who are out there busily defending pseudonymousness can't seem to avoid conflating opinions and statements that people are making, taking any use of the word "pseudonym" as an anti-pseudonymous act regardless of what is actually being said, if the term is uttered by someone they have classified as "against" them.
I like this vision.
I do not think publishers will.
We'll see, I suppose...
I agree - we as humans are trained from an early age in storytelling, not semantic structure. I wonder when, in the Information Age, we'll start teaching semantics to young children. It sounds almost outlandish, but if we think about how integral machine understanding is to human culture today it may become an important part of the curriculum...
Great post, John. I believe your argument that services are where publishers will make their revenue is 100% spot-on, and providing easy means of encoding unstructured knowledge to make articles more usable is definitely one of the value-addition steps they can provide.
I do wonder, however, whether scientists will bite. Since I'm one myself I can say they are notoriously cheap when it comes to information tools, or at least, biologists are. This situation is a special case of the poor valuation of software and is not unique to scientists. As far as I can figure out, there are only three realms of human activity that believe in paying for the true value of information: the military, the financial world, and, to a certain extent, attorneys. Hopefully, scientists will eventually mature toward a better valuation model of information and associated tools to enable publishers to take the plunge.
Could you use some combination of the crowd (readers/users) and a starting mark-up to generate an evolving, dynamic picture similar to what is used in a 3D thesaurus (see visual thesaurus as example: http://www.thinkmap.com/)? In my head, when I envision the relationship among articles, they form a mindmap similar to what thinkmap generates -- nodes, brightness, distance from a core, etc. (and dynamic, which is something the Google Wonder Wheel doesn't seem to do). So, what if a science publication were entered into a web (perhaps using MeSH headings as a starting point?) with some pre-defined relationships (there have been conversations about this on FF...), where those could be continually refined as users made comments? Webs aren't exactly individual narratives, but are representative of narrative universes, and any individual piece could be seen as more or less representative of different starting points. Starting points could be as different as topics and time points -- users could abstract chronologies of a research idea just based on this kind of web or even (I think) argument structures -- lines of thought pertinent to a research topic.
What a wonderful narration of the relationship between humans and machines, ways of earning profit from open sources, etc. in this story... Great job, John.
It was a great pleasure reading such an informative post.
O this is wonderful.
In my experience with managing metadata in a newspaper environment, what I learned was that humans are lousy at it, and that systems without natural feedback incentives quickly fall into entropy (as in "What's this got to do with a newspaper? Fuck metadata!"). Solution? We need machine-generated metadata systems, and to get that, you have to teach taxonomies and controlled vocabularies to machines. And yes, you could then put humans in the position of doing QA behind the machines, but it's probably better to run an audit program across data sets first and then just review the results.
How many scientific disciplines have generated their own XML schema? Their own detailed taxonomies?
What are your thoughts on inline semantic coding via RDF?
Oh, and to the comment on entities that will pay for quality information in the right formats, I'd add insurance companies, fantasy sports enthusiasts, gamblers, marketers and demographic researchers.
Anyway, great post.
Does a conference on publishing actually smell like death? I really marvel at how these buggy-whip manufacturers keep persisting and telling themselves that the automobile really isn't that big of a threat. Publishers exist, primarily, to solve the distribution problem. That used to be extremely valuable... now, it's worthless. In fact, having a large organization in place makes the way they can solve it less than worthless; it actually destroys value. I can solve the distribution problem better from my bedroom thanks to the Internet.
Publishers may still have something to offer (I seriously doubt it, but I grant the possibility) in the way of concentrating on the other services they can provide (though I imagine editing, marketing, and many others will transition to individuals working from home doing it online; the corporate structure again destroys value rather than creating it), but they will have to completely reorganize themselves and prepare for massive downsizing. They might be a $10 million per year industry if they manage to really outperform all the individuals who will be offering the same services cheaper and better. I'm not sure what would make a lumbering corporation better at providing semantic markup for work than an individual working part time from home over the Internet, but maybe they could try it as a last gasp. Let's just hope that in their death throes they don't do what the video distribution companies are trying to do and put legal and practical limits on Internet access for most people just to try to invent artificial value for their worthless services.