Digital Library Thoughts: October 2009

Friday, 9 October 2009

Notes from Sun Pasig - Fri morning session

A more discursive entry today

Listening to the various presentations over the past couple of days, it occurs to me that the kind of solutions that people come up with and the kind of issues they face are influenced by a number of factors:

The problem that they are trying to solve
The remit of the organisation that they work for
The culture of that organisation
The character of the person that is trying to solve the problem.

In discussion this comes out as a tension between people who emphasise adherence to standards and planning exactly what you are going to do, and how, before you do it, and those who emphasise being flexible to people's needs rather than trying to change them, and aiming for 80% rather than 100% (I'm really trying hard not to let my prejudices show here).
To give a couple of examples:

The Family Search people are dealing with vast amounts of scanned images of family records. So for them the scale is the problem but file formats and metadata aren't (they're all the same).
The Oxford University people are trying to provide services for people who want to run their own archives and to do something sensible with stuff that literally turns up at their door. So for them, scale is a bit of a problem (although nowhere near that of Family Search) but they have a huge problem deciding even how to classify what they have, let alone encoding that in some sort of metadata schema.

The danger is that people push each other apart and we end up with two camps, the Calvinists and the Maenads (worshippers of Bacchus or Dionysus), shouting at each other from opposite sides of the conference floor.

I think that in reality there is a spectrum of needs and issues and we should acknowledge that each institution inhabits a different range in the spectrum.
I have had a bit of a think and have identified a number of axes that determine the kind of needs that an organisation will have and the issues that the organisation will face. I have also had a go at placing the BL on these axes/planes and showing the direction of travel (that is, this is where we are now and this is where we expect to be in the future):

There is an axis with Linear Ingest workflow at one end (mainly Libraries and Archives) and iterative creation workflow (Universities and Research groups using a publishing workflow or Virtual Research Environment) at the other

There is a triangle with unstructured heterogeneous data as one vertex, structured homogeneous data as a second and structured heterogeneous data as the third

There is an axis relating to the amount of engagement (ceremony) with providers of data before the providers start to deposit. At one end there is low ceremony, where you tell the provider where to put it (could even just be your postal address) and you sort it out, and at the other end is high ceremony, where you spend months deciding on data formats and metadata standards before you allow the depositor to place a single object.

There is a triangle with document-centric deposit at one vertex, database-centric at the second and system-centric (e.g. deposit of a web application) at the third

There is an axis with preservation at one end and access at the other.

Thursday, 8 October 2009

Notes from Sun Pasig - Thurs afternoon session

Columbia University Libraries

Using Fedora 3
Two copies on disk, two copies on tape
Capacity of 70TB
Developed Hypatia, a tool to enable non-programmers to create input forms and workflows for metadata schemas and then catalogue them in a controlled environment
Investigating Blacklight

P2N - University of Southampton/Oxford University

Distributed archiving using bittorrent concepts. Each object is split into parts with redundancy so that parts of the file can be lost without losing the whole file
The idea is that, if you want to take part, you install the software and donate half of your storage to the other participants
Proof of concept done
Use a REST API (PUT, GET, POST, HEAD, DELETE) to access the files
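
The notes above do not say which redundancy scheme P2N uses. As a minimal sketch of the general idea only (splitting an object into parts so that a lost part can be rebuilt), here is plain XOR parity, which tolerates the loss of exactly one part. The function names are hypothetical and this is not P2N's actual method.

```python
from __future__ import annotations

from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size parts plus one XOR parity part."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")  # zero-pad so all parts match in length
    parts = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, parts)
    return parts + [parity]


def recover(parts: list[bytes | None], original_length: int) -> bytes:
    """Rebuild the data when at most one part (data or parity) is missing."""
    missing = [i for i, p in enumerate(parts) if p is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can only repair one lost part")
    parts = list(parts)
    if missing:
        present = [p for p in parts if p is not None]
        # XOR of all surviving parts equals the missing one.
        parts[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity part and the zero padding.
    return b"".join(parts[:-1])[:original_length]
```

A real system would want to survive more than one loss, which needs an erasure code rather than a single parity part.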

University of Queensland - Keith Webster

Fez - Content Management System on top of Fedora
Australian academics have to deposit all publications into institutional repositories. Australian Government rewrote the copyright laws to make this possible
Using the ResearcherID service (from Thomson Reuters) to identify researchers uniquely
ResearcherID shows citations info from Web of Science

Notes from Sun Pasig - Thurs morning session

Planets - Adam Farquhar

Planets are establishing a not-for-profit organisation to ensure the long-term sustainability of the Planets framework

University of Witwatersrand, Johannesburg - Derek Keats

Built a private cloud infrastructure for the entire university that hosts standard services (e.g. e-mail, eLearning portal) and Digital Library/Archive functions

National Library of New Zealand - Kevin de Vorsey

Use criterion of ability to render or migrate a file (formats can hide a multitude of sins e.g. container formats like MPEG 4) to determine the level of risk
[opinion] Perhaps adoption of file-formats and tools should be a criterion? If a file format is popular then there is a much greater chance of it being preservable.

Family Search - Gary Wright

Ingest and publish 2.75million images per week (~30TB)
Volunteer-based programme to index images. 100,000 volunteers indexing 1million names per day
Want to double ingest rate by 2010
Low quality access copy on disk (90% compression)
Archive high quality JP2K on tape (50% compression over TIFF)
Simplify rights management - only staff user roles can access archive. Rights management is asserted at search stage so users only see what they can access.
Goal to validate each tape at least once a year but not doing it yet
Refresh of on-line storage is a looming problem

USC Shoah Foundation Architecture

Preserve visual testimony of holocaust survivors
52,000 interviews, 105,000 hours
Adding Rwanda genocide testimony
235,000 video tapes to be digitised
Indexed manually in 1 min segments (people, places, events). Indexing takes 2 hours per hour of video at a cost of $25million
Store low quality access copy on disk. Volume is 135TB
Store high quality archival copy on tape. Capacity of library is 8PB
Replace tapes every 3 years
All objects are signed and checked at every stage in the process. Biggest problem is network cards flipping bits during transfer!
Use physical transfer to bring back video from Rwanda, in chunks of ~140TB
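
The notes do not say how the Shoah Foundation signs and checks its objects. As a rough sketch of the checking half only, assuming a SHA-256 digest is recorded when the object is created and recomputed after every transfer (the algorithm and function names are my assumptions):

```python
import hashlib


def digest_of(data: bytes) -> str:
    """Return the SHA-256 hex digest recorded alongside an object."""
    return hashlib.sha256(data).hexdigest()


def arrived_intact(received: bytes, recorded_digest: str) -> bool:
    """Recompute the digest after a transfer and compare with the record."""
    return digest_of(received) == recorded_digest


# A single flipped bit in transit changes the digest.
original = b"one minute of testimony video"
recorded = digest_of(original)
corrupted = bytes([original[0] ^ 0x01]) + original[1:]
print(arrived_intact(original, recorded))   # True
print(arrived_intact(corrupted, recorded))  # False
```

A digest only detects accidental damage such as the flipped bits mentioned above; signing, in the sense of proving who produced the object, would need a key as well.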

CDL - John Kunze

Permanent Objects, Disposable Systems.

Wednesday, 7 October 2009

Notes from Sun Pasig - Weds afternoon session

Sun PASIG

Library of Congress

Planning to digitize 1Exabyte of content by 2012. Estimate that this will cost $30m per year in power!

Internet Archive

4.5PByte in the Archive
Planning to add 2PByte to the web archive this year
Planning to change web crawling policy. At the moment, decide what domains to crawl and then store everything found. In future will crawl more widely and then decide what to store
Serve 400-500 requests per second
Using Reed-Solomon algorithms for replication & restoration
Moved from 100MByte ARC files to 1GByte WARC files

Oxford University - Neil Jeffries

Lots of legacy digital data ~200TB (mostly maps)
Lots of discrete archives with different technologies implemented using virtualisation and a common storage infrastructure
Acquire digital donations as PCs, disk images, tapes and floppies
Have to archive web servers
Have an approach of preservation before curation: that is, ensure that the data is stored first and catalogue it later
Charge £5k/TB one-off cost for archiving
Use Fedora
Restore from backup is unrealistic so don't backup to tape
Objects are versioned rather than being deleted/overwritten

Stanford - Tom Cramer

80TB in repository, well over 200TB not in repository
Started with preservation focus, now realized that other requirements are equally important
Implementing the Digital Object Repository (DOR) using Fedora - similar to DLMACS
Moved away from METS to much simpler object model, based on FOXML
Took too long to add new content types so they de-emphasised curation and moved away from 'Just in case' to 'Just in time'

Hathi Trust

Over 4 million items (mostly Google digitised books)
Use the OCLC number to identify and deduplicate items
Use Aleph
Only ingest TIFF, JP2K, OCR text file and METS file
QA 1% of what is ingested
Uses the Pairtree directory structure
Planning to roll out full-text search via Solr
Sharded indexes using 5 instances of Solr
People can build their own collections which are indexed separately
Using Digital rights management to allow disabled users (with Shibboleth authentication) to access the full text of in-copyright books
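
For readers who have not met Pairtree: it maps an object identifier onto a directory path made of two-character segments. The sketch below is simplified from my reading of the Pairtree specification (it leaves out the pairtree_root prefix and the object's own encapsulating directory) and says nothing about how Hathi Trust lays out its own store. The function names are hypothetical.

```python
# Characters the spec hex-encodes as ^hh, in addition to anything
# outside the visible ASCII range 0x21-0x7e.
HEX_ENCODE = set('"*+,<=>?\\^|')

# Single-character substitutions applied after hex-encoding.
SINGLE_CHAR = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_identifier(identifier: str) -> str:
    """Make an identifier safe to use as a run of directory names."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    return "".join(out).translate(SINGLE_CHAR)


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character path segments."""
    cleaned = clean_identifier(identifier)
    return "/".join(cleaned[i:i + 2] for i in range(0, len(cleaned), 2))


print(pairtree_path("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3
```

The appeal of the scheme is that no directory ever holds a huge number of entries, and the path can be worked out from the identifier alone without a database lookup.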

SHAMAN

'Securing Communication with the Future'
There are over 500,000 libraries in the world and over 70,000 national archives so digital preservation has a very low penetration into the market

Notes from Sun Pasig - Weds morning session

Intro from Art Pasquinelli:

Libraries and IT departments will move to the side in Digital Archiving and Content will be at the centre
There is a problem with 'Long Tail Data' - we don't know what we want to keep
The cost of power to keep the disks spinning is a hot issue
How do you transfer a Petabyte?
There is an increasing amount of discussion in companies around how to build 100 year archives
Pharmachem industries will push the linking of research data and papers.

Mike Keller

Federating large repositories is starting to happen
Lots of people are talking about storage in the cloud but are worried about security and Service Levels (or lack of)
Discovery is a huge problem and current metadata standards (MARC, METS) do not solve it
Audit is an issue that more people are becoming aware of - can we prove that the stuff is there and will continue to be there? Should we be publishing our audits?

Thorny Staples - Duraspace

Fedora scalability - Sun have just tested a Fedora instance with 150million objects and found that ingest performance was flat over that volume as was access performance
Plan to make Fedora more modular and attract new developers to the Fedora community
Improving Fedora docs
Duracloud - adds integrity checking and other archiving services to the basic cloud storage.
They are talking to a number of vendors of cloud storage to put Duraspace on top of their clouds
Opportunity for the DLS to be a 'cloud' for Duraspace

Islandora

Very impressive automated workflow for book digitisation including conversion from TIFF to JP2K, OCR, TEI (significant terms, people, orgs etc) extraction and an editor for correcting OCR
Virtual Research Environment including auto DRM
iPhone app for data collection - uses user id to present the correct data collection interface to the user
FeSL offers better rights management than XACML in Fedora

Biodiversity Heritage Library

Part of the Natural History Museum
Similar Archiving Architecture model to DLS. Fedora for metadata access, storage abstracted through Duracloud
Building a large datacentre off J16 of the M6 on an old airfield
500 year business plan
Keen to collaborate with other long-term organisations

EPrints

10 years old
Implemented OAI-ORE and migrated an entire archive from EPrints to Fedora and vice versa
Impressive demonstration showed links to a citation service so you could see the citations of articles in the archive
Offer support to enterprises on a commercial model

iRods

Policy-based federated data archiving
Very impressive but I suspect that deciding what your policies are and designing the workflows to implement the policies would be hard (i.e. it might take years)
