Tuesday, 21 September 2010

Reclaiming Data in Digital Curation

For a while I also subscribed to the view that digital curation and data curation may be the same thing, a view reaffirmed upon reading Alex Ball’s detailed and informative survey of the digital curation/preservation field. Recently, however, I had to re-examine my position. I realized that such an approach makes the work of curating the actual data sets much more difficult: it diverts the curator's attention from the point where the data are actually created, accessed and worked with, and instead makes one focus on the preservation-oriented stages that come much later in the data life-cycle. The shibboleth of OAIS reinforces this tendency. The OAIS life-cycle does not even consider the creation of data: creation is an input, a SIP-to-be of which we know nothing, an unspecified activity on the fringe. Only after a data object has been supplemented with representation information and properly packaged does it enter the OAIS picture. The file stream itself can be in a known and valid format, and we can know who created the data object, when, and even what application was used to generate it, yet the data themselves can have minimal information value because they were not collected or annotated properly.

In this sense, the DCC data model appears to be more useful, as it depicts the whole trajectory of the data, and it is beneficial for everyone involved in the data life-cycle to be aware of all those phases. Elizabeth Yakel, in her overview of the digital curation field*, rightly stresses the “active and potentially interactive process” of curation. Although I have rarely seen the term “records creators” in the curation literature, the terms “data producers” and “data creators” are used more frequently.

I am also not opposed to the term curation as a pendant to preservation. From the perspective of an archivist or librarian, collection may seem more appropriate, but to me that implies possession of, or some kind of control over, the data, and in the research data lifecycle this does not have to be the case. The data are often in active use in the Re-use and Transform phase: they can be transferred to different units and used for plotting and further analysis. The data files in this phase are very transient and may not be preserved in that state at all.

The active involvement of data curators in the process of data production, often from the very beginning, is probably the greatest difference between curators and archivists, who usually deal with material that is past its “active life.” I do not think that the major thinkers behind archival theory argued for “original order” out of professional shyness, but because they thought that the original creators of records kept their records in an effective and functional order. That may have been more or less true in their time. However, business practices and means of personal communication have changed dramatically in the last two or three decades. Nowadays even archivists try to be pro-active and advise lay audiences on how to take care of their digital files and their personal communications through initiatives like How to Preserve Your Own Digital Materials.

It seems to me that many of the articles that try to define digital curation and describe its current state of affairs emphasize the preservation aspects of the lifecycle. This may be because the mechanics and infrastructure for preservation are more general and transferable from one domain to another, whereas the treatment of access to and transformation of the actual research data files has not always been adequate, perhaps because the need for subject expertise is an effective barrier to a general treatment of that issue. Many texts have elaborated on preservation extensively, but data curation is usually just mentioned as a term replaced or superseded by digital curation, as if we knew everything we need to know about how to create and manage research data. That is just not the case. Not yet.

Tuesday, 6 July 2010

Scoping Study and Implementation Plan Released

The JISC-funded Incremental project has released its scoping study into researchers' data management needs at the Universities of Cambridge and Glasgow. The report is also available from a blog dedicated to "improving and increasing electronic research data management in UK higher education institutions."

The report contains several recommendations that should improve access to research data. Some of them are oriented towards researchers, others towards repositories.

1) Produce simple, visual guidance on creating, storing and managing data
Produce flow charts, diagrams, FAQs, checklists
It is possible to use current tools, but they need to be simplified, the language needs to be free of professional jargon, and if possible they should contain some visual cues.
The report mentions the ongoing Mellon Foundation-funded revision of the Archaeology Data Service (ADS) Guides to Good Practice as an example for implementation.

2) Offer practical data training with discipline-specific exemplars
The report stated that researchers were keen on consulting the guidelines or "flexible modes of training such as online tutorials, video case studies, etc." The main targets for this kind of training are PhD students and researchers just starting their careers. This is particularly interesting and important, since many of them are later assigned data curatorial tasks.

3) Connect researchers with support staff for tailored advice, guidance and partnering
This point is mainly concerned with guidance and support for writing the sections of grant applications that deal with data management. This is of growing importance, as many major grant agencies want to see some kind of data management plan. See “NSF to Ask Every Grant Applicant for Data Management Plan.”

4) Work towards the development of a comprehensive data management infrastructure
Researchers asked for more storage options, including a slow but reliable system that would make digital preservation possible. While storage is of great importance, sharing of data and ease of access also need to be taken into consideration. Such a repository should also provide support for researchers, so procedure documents and guidelines should be readily available.

Attached to the report are interview templates that can help with crafting the internal policies needed for a digital preservation audit.

The language of those policies is really important, as the blog entry on the scoping study makes clear: "Researchers and support staff tended to be suspicious of ‘policies,’ which sound like hollow mandates, but were sometimes receptive to ‘procedures’ or ‘advice’ which may be essentially the same thing, but convey a sense of purpose and assistance rather than requirement". This may seem like a linguistic nuance, but the word choice sends an essential message.