NIH/PubMed, etc.: The future of large open databases

Thanks to Peter Suber:

The Database Revolution, an editorial in Nature, January 18, 2007 (accessible only to subscribers).  Excerpt:

…Which strategies best support the collection, analysis and dissemination of large databases of related information?…At a meeting in Bethesda, Maryland, last month, it was clear that the NIH is struggling to find a middle road between two diametrically opposed approaches to the development of such databases. Top-down pressure by the agency on researchers to use certain software or formats would probably impede their development. But a bottom-up strategy that merely encourages cross-project cooperation, while allowing researchers total freedom to devise their own databases, is bound to be chaotic, does not guarantee cross-compatibility of data, and ultimately reduces the likelihood that their contents will be used to maximally benefit research.

What is clear is that individual labs can no longer make much progress alone. Currently, many researchers feel they are drowning in data. For all they know, a database might contain answers to patient safety issues or glimmers of new therapeutics — but this is being lost through an inability to effectively harvest the data already available. Other opportunities are missed because both experts and data are ‘silo’-ed in isolated and often inaccessible systems. On top of these issues is the fact that neither databases nor the experts that create them are permanent or inseparable.

The NIH and its equivalent agencies elsewhere in the world are now turning their attention to working out how best to assist the growth of validated and accessible databases. This should involve, at the least, development of policies for evaluating proposals on databases and associated analytic tools, for their sustained funding, and for ensuring that the data deposited remain accessible long after the project originators have moved on.

The NIH itself, if it chose, could aim for something grander. It could take it upon itself to define a broad reference model and the basic architecture for knowledge environments. It could even build a centralized warehouse with Google-like storage, a veritable National Biomedical Resource of raw data and the tools to access and analyse them….

But perhaps it will prove more realistic for the US agency to concentrate on improving the inter-operability of databases, rather than pushing for their merger, and to provide incentives for building in ‘joins’ from the start. The NIH should work on this with industrial companies and other government agencies….

Researchers also need stronger incentives to sustain their own participation in building knowledge environments. At a minimum, contributors should receive a citable acknowledgement of depositions. Leadership and trust are required to ensure that primary researchers personally benefit from storing their data in open databases….

Comment.  It’s striking how similar this question is to the choice between central and distributed eprint repositories.  One difference is that interoperability will be much harder to ensure for data repositories than for text repositories.