On the Digital Perimeter

Tuesday, 21 September 2010

Reclaiming Data in Digital Curation

For a while I also subscribed to the view that digital and data curation may be the same thing - a view reaffirmed upon reading Alex Ball’s detailed and informative survey of the digital curation/preservation field. Recently, however, I had to re-examine my position. I realized that such an approach makes the work of curating the actual data sets much more difficult, as it diverts the attention of the curator from the point where the data are actually created, accessed and worked with, and instead focuses it on the more preservation-oriented stages that come much later in the data life-cycle. The shibboleth of OAIS reinforces this situation even further. The OAIS life-cycle does not even consider the creation of data: the data are an input, a SIP-to-be of which we know nothing, an unspecified activity on the fringe. Only after a data object is supplemented with representation information and properly packaged does it enter the OAIS picture. The file stream itself can be in a known and valid format, and we can know who created the data object, when, and even what application was used to generate it, but the data themselves can still have minimal information value because they were not collected or annotated properly.

In this sense, the DCC data model appears to be more useful, as it depicts the whole trajectory of the data, and it will be beneficial for everyone involved in the data life-cycle to be aware of all those phases. Elizabeth Yakel in her overview of the digital curation field* rightly stresses the “active and potentially interactive process” of curation. Although I have rarely seen the term “records creators” in the curation literature, the terms “data producers” and “data creators” are used more frequently.

I am also not opposed to the term curation as a counterpart to preservation. From the perspective of an archivist or librarian, collection may seem more appropriate, but to me that implies possession of, or some kind of control over, the data, and in the research data lifecycle this does not have to be the case. The data are often in active use in the Re-use and Transform phase: they can be transferred to different units, or used for plotting and further analysis. The data files in this phase are very transient and may not be preserved in that state at all.

The active involvement of data curators in the process of data production, often from the very beginning, is probably the greatest difference between curators and archivists, who usually deal with material that is past its “active life.” I do not think that the major thinkers behind archival theory argued for “original order” out of professional shyness, but because they thought that the original creators of records kept their records in an effective and functional order. That may have been more or less true in their time. However, business practices and means of personal communication have changed dramatically in the last two or three decades. Nowadays even archivists try to be pro-active and advise lay audiences on how to take care of their digital files and personal communications through initiatives like How to Preserve Your Own Digital Materials.

It seemed to me that many of the articles that tried to define digital curation and describe its current state of affairs take a position that accents the preservation aspects of the lifecycle. This may be because the mechanics and infrastructure for preservation are more general and also transferable from one domain to another, whereas the treatment of access to and transformation of the actual research data files was not always adequate, maybe because the need for subject expertise is an effective barrier to a general treatment of that issue. Many texts have elaborated on preservation extensively, but data curation is usually just mentioned as a term replaced or superseded by digital curation, as if we knew everything we need to know about how to create and manage research data. That is just not the case. Not yet.

Tuesday, 6 July 2010

Scoping Study and Implementation Plan Released

The JISC-funded Incremental project has released its scoping study into researchers' data management needs at the Universities of Cambridge and Glasgow. The report is also available from a blog dedicated to "improving and increasing electronic research data management in UK higher education institutions."

The report contains several recommendations that should improve access to research data. Some of them are oriented towards researchers, some towards the repository.

1) Produce simple, visual guidance on creating, storing and managing data
Produce flow charts, diagrams, FAQs, checklists
It is possible to use current tools, but they need to be simplified, the language needs to be free of professional jargon, and if possible they should contain some visual cues.
The report mentioned the ongoing Mellon Foundation-funded revisions of the Archaeology Data Service (ADS) Guides to Good Practice as an example for implementation.

2) Offer practical data training with discipline specific exemplars
The report stated that researchers were keen on consulting the guidelines or

"flexible modes of training such as online tutorials, video case studies, etc. The main targets for this kind of research are PhD students and researchers just starting their careers. This is particularly interesting and important, since many of them are later assigned data curatorial tasks."

3) Connect researchers with support staff for tailored advice, guidance and partnering
This point is mainly concerned with guidance and support for writing the sections of grant applications that deal with data management. This is of growing importance, as many major grant agencies want to see some kind of data management plan. See "NSF to Ask Every Grant Applicant for Data Management Plan."

4) Work towards the development of a comprehensive data management infrastructure
Researchers asked for more storage options, including a slow but reliable system that would make digital preservation possible. While storage is of great importance, sharing of data and ease of access also need to be taken into consideration. Such a repository should also provide support for the researchers, so the procedure documents and guidelines should be readily available and at hand.

Attached to the report are interview templates that can help with crafting the internal policies needed for the digital preservation audit.

The language of those policies is really important, as the blog entry on the scoping study makes clear: "Researchers and support staff tended to be suspicious of ‘policies,’ which sound like hollow mandates, but were sometimes receptive to ‘procedures’ or ‘advice’ which may be essentially the same thing, but convey a sense of purpose and assistance rather than requirement". This may seem like a linguistic nuance, but the word choice sends an essential message.

Tuesday, 17 November 2009

Virtual Machine - Pre-installed?

The idea of a pre-installed virtual machine is appealing. While a manual installation served its instructional purpose in 672, when it was important to understand how a server is set up and to see all the tasks that need to be accomplished in order to have an operational server, in 675 it became rather a waste of time to start every new software package with a new, clean, standard installation of the VM. It became a mechanical routine which I did automatically, so I was not learning anything new from the process, and the installation was taking valuable time that I could have dedicated to installing the actual software package. Occasional typos in /etc/network/interfaces were a particular source of frustration, and I do not think I learned much from retyping 168 instead of 186. The other problem was related to the hosts file, which resides on the host machine and which many may not even have changed, using the fixed IP 192.168.X.3 for all their applications anyway.

I would probably be satisfied with a pre-configured "standard" Ubuntu install, with a manual install reserved for less typical installations, as was the case with DSpace, which does not use the usual LAMP set-up. That would refresh the whole process from time to time, and the pedagogical purpose of the exercise would still be valid.

Tuesday, 3 November 2009

OAI-PMH and Collection Development

I have been thinking for a while about the opportunities OAI-PMH has to offer. At the beginning my acceptance of OAI-PMH was rather unreflective: it was just the right thing to do, something that helps to connect various digital resources. The literature usually did not provide many guidelines as to how OAI-PMH can be employed as a collection development tool. There were a lot of articles about OAI-PMH, but many of them were rather technical, and OAI-PMH was treated just as a tool to populate databases with records from other repositories. Interoperability was often mentioned, but there were few case studies that showed how these distributed resources can be aggregated in a meaningful way that complements the material offered by the institution. As a result, records are not offered for harvesting with sufficient granularity, and few institutions offer meaningfully created sets. Those sets which exist are often indiscriminate aggregations of resources produced under different projects or by different agencies.

OAI-PMH was developed for institutional repositories, which often exist separately from special collections or archives within university libraries, and that may be one of the reasons why this technology has been underused in heritage repositories. Those huge pools of records with no clearly defined scope and audience had little to offer in terms of collection development: they could hardly supplement one's own material with complementary resources from other repositories with a similar area of interest. Recently, however, a number of OAI-PMH service providers have appeared that harvest records from more narrowly specified sets. These services are mostly tied to a project, so the metadata seem more consistent. This is the case with the Sheet Music collection, which even posted its cataloging guidelines. Another project with a clearly defined scope, even if broader than the Sheet Music project, is American Social History Online. Projects like this allow a high level of customization of their services. They can provide users with more precise and useful results: a metadata filter can bring browsing users almost effortlessly several levels deep into the collection hierarchy, to the resources they seek.

Web 2.0 tools and intelligent use of client-side scripting can make browsing and searching useful even within more general aggregations. I was particularly impressed with the ELib service administered by the University of Bremen, Germany, which presents harvested material in a very intuitive and effective way. It takes advantage of subject headings to create browsable hierarchies, and of tag clouds for keywords and further refinement of the query.

The DLF (Digital Library Federation) OAI Portal is a reminder of an earlier period when OAI-PMH served merely for aggregating material from various resources without any further manipulation or repurposing of metadata. Users can search or browse two hierarchies: one based on subject headings, which are still very broad, and one based on data providers. No other tools for narrowing the record sets are available. The records obviously originated in various formats and were based on different rules, and the effort to normalize them was rather limited.

In order to take advantage of OAI-PMH and make it a useful tool for metadata sharing that can help to round out and complement virtual collections, the data providers need to make sure that the records are available in meaningful, granular sets. Clearly defined metadata sets, and cataloging that follows accepted standards and takes into account the new contexts in which metadata can exist, make it possible for other repositories to integrate these records into their collections and so provide additional exposure to those resources.
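
To make the point about granular sets concrete, here is a minimal sketch of how a harvester might ask for a single set instead of a whole repository. The verb ListRecords and the arguments metadataPrefix and set come from the OAI-PMH specification; the base URL and the set name are hypothetical and stand in for whatever a real data provider would advertise through ListSets.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request, optionally limited to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Selective harvesting: only records belonging to this set are returned.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository and set name, for illustration only.
url = list_records_url("https://repository.example.org/oai",
                       set_spec="sheetmusic:waltzes")
print(url)
```

The set argument is only as useful as the sets behind it: if the provider exposes one undifferentiated set, the harvester still has to filter the records after the fact.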

Tuesday, 27 October 2009

Consistency of Metadata

I am a strong believer in the sharing and interoperability of metadata, and therefore I try to advocate descriptive standards regarding both the syntax of elements and their content. I decided to use controlled vocabularies for my collection; most of them are well established and time-tested within cultural and heritage institutions, such as the Thesaurus for Graphic Materials and the Library of Congress Subject Headings. However, for the style element, which is a VRA-inspired extension of the DCTerms set, I use a short local list that was prepared for the collection.

It is not only subject headings that I try to control in order to minimize errors and typos. In all the applications I tested I tried to set up a drop-down menu for the languages used within my collection, because terms like Yiddish or Lithuanian can cause difficulties even for experienced catalogers.
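
What the drop-down menu enforces can be sketched in a few lines. This is only an illustration under my own assumptions: the list of languages is hypothetical apart from the two named above, and the function name is made up for the example rather than taken from any of the tested applications.

```python
# Hypothetical controlled list; only Yiddish and Lithuanian are named in the text.
LANGUAGES = ("English", "German", "Lithuanian", "Yiddish")


def normalize_language(value, vocabulary=LANGUAGES):
    """Return the vocabulary term matching value, ignoring case and
    surrounding whitespace; raise ValueError for anything else."""
    key = value.strip().casefold()
    for term in vocabulary:
        if term.casefold() == key:
            return term
    raise ValueError(f"{value!r} is not in the controlled vocabulary")


print(normalize_language(" yiddish "))
```

A misspelling such as "Lithuanain" is rejected outright instead of quietly creating a second browse entry, which is the same guarantee the drop-down gives the cataloger.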

In order to make the retrieval of metadata functional and effective I try to keep the values of metadata fields simple and relatively short, so that the data display properly and do not conflict with the layout of the page. Shorter entries are also easier for users to scan on the result screen.

One of the difficulties I have to face is that the collection is relatively small and thematically dispersed, so it is relatively difficult to come up with good browsing categories. I therefore tried to choose broader access terms rather than specific ones.

Tuesday, 13 October 2009

OAI-PMH and Benefits of DC

For a while now I have been wondering about metadata interoperability. Günter Waibel and Mary W. Elings demonstrated* that interoperability is possible even if different communities use different metadata standards, or, more in the spirit of the article, even if different materials are described by different standards. OAI-PMH is essential for this type of interoperability, but OAI-PMH is just a tool - a protocol for exchange or sharing. What in fact makes the exchange possible is Dublin Core.

I was never a big fan of Dublin Core, whether in its qualified or unqualified form. I always suspected that the effort to generalize the concept of description and remove it from the material to be described did not bode well for practices in cultural and heritage institutions. However, a title is undeniably a title, whether it is the title of a book, of a painting or of an archival artefact. Depending on the descriptive standard, the title does not always have to be constructed the same way and look the same, but the basic concept is understandable - the title is the property under which an object is usually known. Once I accepted this truism, my opposition to DC as an intermediary layer became less intense.

OAI-PMH was one reason I changed my mind, but the other was DigiTool and its mapping file harvesting_schema, which is based on a modified, extended qualified DC and which effectively manages to channel data from various metadata standards into descriptive facets that are then presented to the user in resource discovery. There is little chance that the user will recognize the native format of the metadata; some residual delimiters may give away a MARC record, but content-wise the harvesting_schema allows for a lot of flexibility. It is also extensible, so one is not bound by the DCTerms set.
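
The idea of DC as an intermediary layer can be sketched as a small crosswalk. This is not DigiTool's harvesting_schema, whose actual contents I am not reproducing here; it is a minimal illustration that assumes flat records and uses a handful of well-known equivalences (MARC 245, 100 and 650; MODS titleInfo/title, name/namePart and subject/topic). The function name and record layout are my own.

```python
# Illustrative crosswalk only; a real mapping file covers many more fields.
CROSSWALK = {
    "marc": {"245": "title", "100": "creator", "650": "subject"},
    "mods": {
        "titleInfo/title": "title",
        "name/namePart": "creator",
        "subject/topic": "subject",
    },
}


def to_dc(record, standard):
    """Map a flat {native_field: value} record onto Dublin Core elements.
    Fields without a mapping are dropped, which is the price of the
    lowest common denominator."""
    mapping = CROSSWALK[standard]
    dc = {}
    for field, value in record.items():
        element = mapping.get(field)
        if element is not None:
            dc.setdefault(element, []).append(value)
    return dc


marc_record = {"245": "Moonlight Waltz", "300": "5 p."}
mods_record = {"titleInfo/title": "Moonlight Waltz"}
print(to_dc(marc_record, "marc"))
print(to_dc(mods_record, "mods"))
```

Both records end up with the same title element, so a discovery layer built on the DC side does not need to know which standard the description started in.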

When it comes to description, I am in favour of MODS, and I can live with MARC as well, but I have started to appreciate the fact that there is a light-weight DC somewhere out there. And I am glad that we can make metadata available through OAI-PMH in both formats in addition to DC elements, or more precisely OAI_DC.

*Metadata for All: Descriptive Standards and Metadata Sharing across Libraries, Archives and Museums by Mary W. Elings and Günter Waibel. First Monday, volume 12, number 3 (March 2007), URL: http://firstmonday.org/issues/issue12_3/elings/index.html (Accessed on 2009-10-13)

Monday, 28 September 2009

Drupal Experiences

Installing Drupal was quite an experience. I downloaded several image-rendering modules, mostly because I was not sure how they worked, so some are probably installed without being used and merely duplicate the image.module.
I felt that the taxonomy part needed some additional features. I downloaded and installed the tag cloud block Cumulus and Taxonomy Breadcrumb. Cumulus provides a user-friendly overview of the available terms, and the quantity of these terms is also presented visually in an intuitive way.

Taxonomy Breadcrumb, on the other hand, provides users with a functional clue about their current location and allows them to backtrack within a collection. However, this feature would be much more useful if my collection were more structured and more deeply nested.

I feel that Drupal is a useful presentation tool, relatively easy to use and to set up. The set-up is definitely the more difficult part, but I think that a Drupal site can indeed be designed in a very user-friendly way, both for the visitor of the site and for the content creator, who in turn does not need web design or HTML skills in order to produce aesthetically appealing and easy-to-navigate pages.

Drupal, however, is more about access to digital resources and is not necessarily suitable for managing them or for the tasks required for digital preservation. I can imagine that Drupal could work well in tandem with a DAMS application that would handle the storage and manipulation of the various manifestations, while Drupal would enable access to the view manifestations and the appropriate descriptive metadata.