Digital Library Thoughts: 2009

Friday, 9 October 2009

Notes from Sun Pasig - Fri morning session

A more discursive entry today

Listening to the various presentations over the past couple of days, it occurs to me that the kind of solutions that people come up with and the kind of issues they face are influenced by a number of factors:

The problem that they are trying to solve
The remit of the organisation that they work for
The culture of that organisation
The character of the person that is trying to solve the problem.

In discussion this comes out as a tension between people who emphasise adherence to standards and planning exactly what you are going to do, and how, before you do it, and those who emphasise being flexible to people's needs rather than trying to change them, and aiming for 80% rather than 100% (I'm really trying hard not to let my prejudices show here).
To give a couple of examples:

The Family Search people are dealing with vast amounts of scanned images of family records. So for them the scale is the problem but file formats and metadata aren't (they're all the same).
The Oxford University people are trying to provide services for people who want to run their own archives and to do something sensible with stuff that literally turns up at their door. So for them, scale is a bit of a problem (although nowhere near that of Family Search) but they have a huge problem deciding even how to classify what they have, let alone encoding that in some sort of metadata schema.

The danger is that people push each other apart and we end up with two camps, the Calvinists and the Maenads (worshippers of Bacchus or Dionysus), shouting at each other from opposite sides of the conference floor.

I think that in reality there is a spectrum of needs and issues and we should acknowledge that each institution inhabits a different range in the spectrum.
I have had a bit of a think and have identified a number of axes that determine the kind of needs that an organisation will have and the issues that the organisation will face. I have also had a go at placing the BL on these axes/planes and showing the direction of travel (that is, this is where we are now and this is where we expect to be in the future):

There is an axis with Linear Ingest workflow at one end (mainly Libraries and Archives) and iterative creation workflow (Universities and Research groups using a publishing workflow or Virtual Research Environment) at the other

There is a triangle with unstructured heterogeneous data as one vertex, structured homogeneous data as a second and structured heterogeneous data as the third

There is an axis relating to the amount of engagement (ceremony) with providers of data before the providers start to deposit. At one end there is low ceremony, where you tell the provider where to put it (could even just be your postal address) and you sort it out, and at the other end is high ceremony, where you spend months deciding on data formats and metadata standards before you allow the depositor to place a single object.

There is a triangle with document-centric deposit at one vertex, database-centric at the second and system-centric (e.g. deposit of a web application) at the third
There is an axis with preservation at one end and access at the other.

Thursday, 8 October 2009

Notes from Sun Pasig - Thurs afternoon session

Columbia University Libraries

Using Fedora 3
Two copies on disk, two copies on tape
Capacity of 70TB
Developed Hypatia, a tool to enable non-programmers to create input forms and workflows for metadata schemas and then catalogue them in a controlled environment
Investigating Blacklight

P2N - University of Southampton/Oxford University

Distributed archiving using BitTorrent concepts. Each object is split into parts with redundancy so that parts of the file can be lost without losing the whole file
The idea is that, if you want to take part, you install the software and donate half of your storage to the other participants
Proof of concept done
Use a REST API (PUT, GET, POST, HEAD, DELETE) to access the files
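
The "split into parts with redundancy" idea can be illustrated with a very small sketch. This is not the P2N implementation, whose actual coding scheme was not described in the session; it assumes the simplest possible redundancy, a single XOR parity part, which survives the loss of any one part. The function names `split_with_parity` and `recover` are hypothetical.

```python
from functools import reduce


def _xor_columns(parts):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*parts))


def split_with_parity(data: bytes, k: int) -> list:
    """Split data into k equal-size parts plus one XOR parity part.

    Assumption: single-parity redundancy, chosen only for illustration.
    """
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")  # pad so every part has the same size
    parts = [padded[i * size:(i + 1) * size] for i in range(k)]
    return parts + [_xor_columns(parts)]


def recover(parts: list, original_length: int) -> bytes:
    """Rebuild the data when at most one part (marked None) has been lost."""
    missing = [i for i, part in enumerate(parts) if part is None]
    if len(missing) > 1:
        raise ValueError("a single XOR parity part can only repair one loss")
    if missing:
        present = [part for part in parts if part is not None]
        rebuilt = _xor_columns(present)  # XOR of the survivors is the lost part
        parts = parts[:missing[0]] + [rebuilt] + parts[missing[0] + 1:]
    # Drop the parity part and the padding added during the split.
    return b"".join(parts[:-1])[:original_length]
```

A real system would use a stronger erasure code so that several parts can be lost at once; the sketch only shows why losing a part need not mean losing the file.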

University of Queensland - Keith Webster

Fez - Content Management System on top of Fedora
Australian academics have to deposit all publications into institutional repositories. Australian Government rewrote the copyright laws to make this possible
Using the ResearcherID service (from Thomson Reuters) to identify researchers uniquely
ResearcherID shows citations info from Web of Science

Notes from Sun Pasig - Thurs morning session

Planets - Adam Farquhar

Planets are establishing a not-for-profit organisation to ensure the long-term sustainability of the Planets framework

University of Witwatersrand, Johannesburg - Derek Keats

Built a private cloud infrastructure for the entire university that hosts standard services (e.g. e-mail, eLearning portal) and Digital Library/Archive functions

National Library of New Zealand - Kevin de Vorsey

Use the ability to render or migrate a file as the criterion for determining the level of risk (formats can hide a multitude of sins, e.g. container formats like MPEG-4)
[opinion] Perhaps adoption of file-formats and tools should be a criterion? If a file format is popular then there is a much greater chance of it being preservable.

Family Search - Gary Wright

Ingest and publish 2.75 million images per week (~30TB)
Volunteer-based programme to index images. 100,000 volunteers indexing 1 million names per day
Want to double ingest rate by 2010
Low quality access copy on disk (90% compression)
Archive high quality JP2K on tape (50% compression over TIFF)
Simplify rights management - only staff user roles can access archive. Rights management is asserted at search stage so users only see what they can access.
Goal to validate each tape at least once a year but not doing it yet
Refresh of on-line storage is a looming problem
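
To get a feel for what those ingest figures mean, here is a back-of-envelope calculation. It assumes that the ~30TB figure is the weekly volume for the 2.75 million images and that TB means decimal terabytes; neither was stated explicitly in the talk.

```python
# Figures quoted in the session.
IMAGES_PER_WEEK = 2_750_000
# Assumption: ~30TB is the weekly volume, in decimal terabytes.
BYTES_PER_WEEK = 30 * 10**12
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

avg_image_mb = BYTES_PER_WEEK / IMAGES_PER_WEEK / 10**6
images_per_second = IMAGES_PER_WEEK / SECONDS_PER_WEEK
throughput_mb_per_s = BYTES_PER_WEEK / SECONDS_PER_WEEK / 10**6

print(f"average image size: {avg_image_mb:.1f} MB")
print(f"ingest rate:        {images_per_second:.2f} images/s")
print(f"sustained rate:     {throughput_mb_per_s:.1f} MB/s")
print(f"after doubling:     {2 * throughput_mb_per_s:.1f} MB/s")
```

Under those assumptions that is roughly 11 MB per image, about 4.5 images per second and about 50 MB/s sustained around the clock, so doubling the ingest rate means sustaining about 100 MB/s.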

USC Shoah Foundation Architecture

Preserve visual testimony of Holocaust survivors
52,000 interviews, 105,000 hours
Adding Rwanda genocide testimony
235,000 video tapes to be digitised
Indexed manually in 1-minute segments (people, places, events). Indexing takes 2 hours per hour of video at a cost of $25 million
Store low quality access copy on disk. Volume is 135TB
Store high quality archival copy on tape. Capacity of library is 8PB
Replace tapes every 3 years
All objects are signed and checked at every stage in the process. Biggest problem is network cards flipping bits during transfer!
Use physical transfer to bring back video from Rwanda, in chunks of ~140TB
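
The "checked at every stage" idea can be sketched as a fixity check. The talk did not say which signing or hashing scheme is used, so this assumes a plain SHA-256 digest compared before and after each transfer; the function names `sha256_of` and `verify_copy` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source_path, copy_path):
    """True if the copy is bit-for-bit identical to the source.

    Assumption: comparing digests stands in for whatever signature
    check the real pipeline performs at each stage.
    """
    return sha256_of(source_path) == sha256_of(copy_path)
```

A single flipped bit anywhere in the copy changes the digest, which is exactly the kind of network-card corruption described above.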

CDL - John Kunze

Permanent Objects, Disposable Systems.

Wednesday, 7 October 2009

Notes from Sun Pasig - Weds afternoon session

Sun PASIG

Library of Congress

Planning to digitize 1Exabyte of content by 2012. Estimate that this will cost $30m per year in power!

Internet Archive

4.5PByte in the Archive
Planning to add 2PByte to the web archive this year
Planning to change web crawling policy. At the moment, decide what domains to crawl and then store everything found. In future will crawl more widely and then decide what to store
Serve 400-500 requests per second
Using Reed-Solomon algorithms for replication & restoration
Moved from 100MByte ARC files to 1GByte WARC files

Oxford University - Neil Jeffries

Lots of legacy digital data ~200TB (mostly maps)
Lots of discrete archives with different technologies implemented using virtualisation and a common storage infrastructure
Acquire digital donations as PCs/disk images/tapes/floppies
Have to archive web servers
Have an approach of preserve before curation: that is, ensure that the data is stored and catalogue later
Charge £5k/TB one-off cost for archiving
Use Fedora
Restore from backup is unrealistic so don't back up to tape
Objects are versioned rather than being deleted/overwritten

Stanford - Tom Cramer

80TB in repository, well over 200TB not in repository
Started with preservation focus, now realized that other requirements are equally important
Implementing the Digital Object Repository (DOR) using Fedora - similar to DLMACS
Moved away from METS to much simpler object model, based on FOXML
Took too long to add new content types so they de-emphasised curation and moved away from 'Just in case' to 'Just in time'

Hathi Trust

Over 4 million items (mostly Google digitised books)
Use the OCLC number to identify and deduplicate items
Use Aleph
Only ingest TIFF, JP2K, OCR text file and METS file
QA 1% of what is ingested
Uses the Pairtree directory structure
Planning to roll out full-text search via Solr
Sharded indexes using 5 instances of Solr
People can build their own collections which are indexed separately
Using Digital rights management to allow disabled users (with Shibboleth authentication) to access the full text of in-copyright books
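
For anyone who has not met Pairtree: it maps an object identifier onto a directory path by splitting the identifier into two-character segments, so that no single directory ends up holding millions of entries. The following is a simplified sketch rather than a full implementation of the specification: it handles only the single-character substitutions (`/` to `=`, `:` to `+`, `.` to `,`) and leaves out the hex-encoding step for other special characters. The function name `pairtree_path` is hypothetical and this is not Hathi Trust's code.

```python
def pairtree_path(identifier: str) -> str:
    """Map an identifier to a Pairtree-style directory path.

    Simplification (an assumption of this sketch): only the three
    single-character substitutions are applied; characters that the
    full specification hex-encodes are not handled.
    """
    cleaned = identifier.translate(str.maketrans({"/": "=", ":": "+", ".": ","}))
    # Split into two-character segments; the last one may be a single character.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs) + "/"


print(pairtree_path("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3/
```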

SHAMAN

'Securing Communication with the Future'
There are over 500,000 libraries in the world and over 70,000 national archives so digital preservation has a very low penetration into the market

Notes from Sun Pasig - Weds morning session

Intro from Art Pasquinelli:

Libraries and IT departments will move to the side in Digital Archiving and Content will be at the centre
There is a problem with 'Long Tail Data' - we don't know what we want to keep
The cost of power to keep the disks spinning is a hot issue
How do you transfer a Petabyte?
There is an increasing amount of discussion in companies around how to build 100 year archives
Pharmachem industries will push the linking of research data and papers.

Mike Keller

Federating large repositories is starting to happen
Lots of people are talking about storage in the cloud but are worried about security and Service Levels (or lack of)
Discovery is a huge problem and current metadata standards (MARC, METS) do not solve it
Audit is an issue that more people are becoming aware of - can we prove that the stuff is there and will continue to be there? Should we be publishing our audits?

Thorny Staples - Duraspace

Fedora scalability - Sun have just tested a Fedora instance with 150 million objects and found that ingest performance was flat over that volume, as was access performance
Plan to make Fedora more modular and attract new developers to the Fedora community
Improving Fedora docs
Duracloud - adds integrity checking and other archiving services to the basic cloud storage.
They are talking to a number of vendors of cloud storage to put Duraspace on top of their clouds
Opportunity for the DLS to be a 'cloud' for Duraspace

Islandora

Very impressive automated workflow for book digitisation including conversion from TIFF to JP2K, OCR, TEI (significant terms, people, orgs etc) extraction and an editor for correcting OCR
Virtual Research Environment including auto DRM
iPhone app for data collection - uses user id to present the correct data collection interface to the user
FeSL offers better rights management than XACML in Fedora

Biodiversity Heritage Library

Part of Natural History Museum
Similar Archiving Architecture model to DLS. Fedora for metadata access, storage abstracted through Duracloud
Building a large datacentre off J16 of the M6 on an old airfield.
500 year business plan
Keen to collaborate with other long-term organisations

EPrints

10 years old
Implemented OAI-ORE and migrated an entire archive from EPrints to Fedora and vice versa
Impressive demonstration showed links to a citation service so you could see the citations of articles in the archive
Offer support to enterprises on a commercial model

iRODS

Policy-based federated data archiving
Very impressive but I suspect that deciding what your policies are and designing the workflows to implement the policies would be hard (i.e. it might take years)

Thursday, 9 July 2009

The Myth of Right First Time

The major difference between working in a public institution and working for a private company is a perception that delivery dates are not important. Or rather, that delivery dates are not as important as getting it "right first time". I saw a presentation by Ken Schwaber a few years ago where he started by asking, "When you start a project, what is the first thing you know about it?". The answer of course, is "the date". This has certainly been true of every private company I have ever worked for.
I now work for a major National library and this is definitely not true. The date is hardly ever known. People still talk about time pressure, and there is this vague sense that we are too slow, but nothing compared to what I have experienced elsewhere.
What many people say however, is that it is important, nay vital, to get it 'right first time'.
This manifests itself in a couple of ways: First, most projects start with a huge, costly and lengthy requirements gathering exercise where everyone is invited to give their input and their opinions are collected in the greatest detail. Secondly, there is an emphasis on defining the data that must be collected (the metadata) and the form that it is saved in (usually some flavour of xml).
I have a number of problems with this:

For any non-trivial system, it is not even theoretically possible to completely specify the requirements.
Even if it were, by the time you had implemented a system based on these perfect requirements, time would have moved on and some of the requirements would be obsolete
A lot of the requirements are speculative, based on what we might want to do in the future, rather than what we know we need to do now
No company I have worked at has ever had the money or the time to implement all the requirements of any system. I suspect that not even Google has enough money or people to implement all the things people want them to do.

So what is the solution? Well, we need to set people's expectations right from the start. Go to the customer and say, "We can't possibly afford to build a system that does everything that you might want to do, but what we can do is build a system that supports what you need to do. What we want to do is to get you up and running as quickly as possible. What we'll do is work together to define the simplest thing that can possibly work, we'll implement that and you can start using it. When you've had it for a while you'll know the bits that need improving and we can work together to fix those bits. We'll keep going, improving things bit by bit, until the system is good enough or the business has something more important that it needs us to do."

Do you think it might work?

Wednesday, 17 June 2009

Digital Britain Report

Here's my take on the Digital Britain report. I must admit I am only going to read the summary and not the full report but that puts me ahead of all those who prefer their opinions pre-digested.

Firstly, they've missed a trick on the vision for a broadband future. The way I see it is that, by 2015 or thereabouts, the distinction between wireless, mobile, bluetooth and broadband will have disappeared - whatever device you use will just be connected. It will sort out what is the best way to connect without you having to intervene. I think the Government should facilitate this by announcing a competitive process for the 4th generation mobile network. Rather than simply awarding the spectrum to the highest bidder, it should go to the bidder who has the best combination of price, coverage (geographic, not population - we need to recognise that people actually move around!) and quality of service. If we were to take this approach then there would be no need for a subsidy from Government.

Secondly, the idea of moving radio to DAB-only is pointless. What they should do is move all radio and television to the internet. The digital switch-over should be followed as quickly as possible by a digital switch-off. This would not mean you would need a computer to listen to the radio or watch TV. There are already some radios that connect to the internet via a wireless network. A clear statement from Government that we were going to switch off both analogue and digital broadcast would speed the introduction of consumer products that streamed content from the internet. This would also free up a lot of bandwidth that could be used for the 4th gen mobile network.

Third, I think it's interesting that they've essentially given up on trying to prevent people file sharing and downloading non-licensed content. I think this is right - the game is up for the traditional publishing companies and the brighter ones know it. I can't blame them for a moment for trying to protect their business model - I would be doing exactly the same thing if I were in their position. However, they're still going to lose. It's interesting that they've said that they'll go after the pirates who make money selling illegal copies - when you think about it, the pirates' business model is subject to exactly the same pressures as the publishers': who's going to pay for a dodgy copy of a movie when they can download it for free?

Nice to see that they propose to do something about orphan works but disappointing that there's not a peep about getting the legal deposit regulations sorted (six years on and counting).

Taking some of the TV licence fee and giving it to Channel 4 or ITV is a bad idea. There would be no way to effectively ring-fence it or to prove value for money. It is more important than ever that the BBC exists to provide reporting that the other broadcasters (both in the UK and abroad) won't touch. Whether the BBC needs to be as big as it is, is another matter. If we think in footballing terms, at the moment, the BBC is Man United (with Sky as Chelsea and ITV as... Sunderland, perhaps?). Maybe a BBC more like Everton (well run, punches above its weight but never winning anything) is all we need as a nation.

All the stuff about the other TV channels is irrelevant. Content production (or acquisition) aligned with a specific delivery mechanism (i.e. broadcast television) has no future in the always-on world we're looking at.

The stuff about internet security is a bit ho-hum and there is a serious whiff of complacency around access to personal data stored by Government. There needs to be a realisation that Government databases are going to be hacked, people's identities are going to be stolen and what they need to do is to recognise this by ensuring:

Personal data is treated like money as far as security goes;
There are strict limits on the amount of personal data that can be held in one place by Government. The data held by any arm of Government should be the minimum necessary.
There should be robust procedures in place that allow people whose identities have been stolen to alert the authorities and to sort out all the mess.

Thursday, 29 January 2009

Wikipedia and personal responsibility

Reading this article about Wikipedia reminds me of my time as an outdoor activity instructor about 10 years ago. The governing bodies for canoeing (the BCU) and for mountaineering/climbing (the MLTB) took very different attitudes to personal responsibility in the outdoor setting. The MLTB took the view that you could not hand over responsibility for your own safety and decision-making in the outdoors. Even if you are with someone who is more experienced and qualified than yourself, you still need to take responsibility for your own decisions. So, if you think it's too dangerous then don't go. The BCU, on the other hand, would claim that by going out with one of "its" qualified instructors, in a certified setting, the activity was made "safe".
You can guess which approach I believe in.
Returning to Wikipedia, I have thought for over 20 years that the sooner people stop believing things just because they come out of a computer, the better.
