Friday, 9 October 2009

Notes from Sun Pasig - Fri morning session

A more discursive entry today

Listening to the various presentations over the past couple of days, it occurs to me that the kind of solutions that people come up with and the kind of issues they face are influenced by a number of factors:
  • The problem that they are trying to solve
  • The remit of the organisation that they work for
  • The culture of that organisation
  • The character of the person that is trying to solve the problem.
In discussion this comes out as a tension between people who emphasise adherence to standards and planning exactly what you are going to do, and how, before you do it, and those who emphasise being flexible to people's needs rather than trying to change them, and aiming for 80% rather than 100% (I'm really trying hard not to let my prejudices show here).
To give a couple of examples:
  • The Family Search people are dealing with vast amounts of scanned images of family records. So for them the scale is the problem but file formats and metadata aren't (they're all the same)
  • The Oxford University people are trying to provide services for people who want to run their own archives and to do something sensible with stuff that literally turns up at their door. So for them, scale is a bit of a problem (although nowhere near that of Family Search) but they have a huge problem deciding even how to classify what they have, let alone encoding that in some sort of metadata schema.
The danger is that people push each other apart and we end up with two camps, the Calvinists and the Maenads (worshippers of Bacchus or Dionysus), shouting at each other from opposite sides of the conference floor.

I think that in reality there is a spectrum of needs and issues and we should acknowledge that each institution inhabits a different range in the spectrum.
I have had a bit of a think and have identified a number of axes that determine the kind of needs that an organisation will have and the issues that the organisation will face. I have also had a go at placing the BL on these axes/planes and showing the direction of travel (that is, this is where we are now and this is where we expect to be in the future):
  1. There is an axis with Linear Ingest workflow at one end (mainly Libraries and Archives) and iterative creation workflow (Universities and Research groups using a publishing workflow or Virtual Research Environment) at the other



  2. There is a triangle with unstructured heterogeneous data as one vertex, structured homogeneous data as a second and structured heterogeneous data as the third




  3. There is an axis relating to the amount of engagement (ceremony) with providers of data before the providers start to deposit. At one end there is low ceremony, where you tell the provider where to put it (could even just be your postal address) and you sort it out. At the other end is high ceremony, where you spend months deciding on data formats and metadata standards before you allow the depositor to place a single object.



  4. There is a triangle with document-centric deposit at one vertex, database-centric at the second and system-centric (e.g. deposit of a web application) at the third



  5. There is an axis with preservation at one end and access at the other.




Thursday, 8 October 2009

Notes from Sun Pasig - Thurs afternoon session

Columbia University Libraries

  • Using Fedora 3
  • Two copies on disk, two copies on tape
  • Capacity of 70TB
  • Developed Hypatia, a tool to enable non-programmers to create input forms and workflows for metadata schemas and then catalogue them in a controlled environment
  • Investigating Blacklight
P2N - University of Southampton/Oxford University

  • Distributed archiving using BitTorrent concepts. Each object is split into parts with redundancy so that parts of the file can be lost without losing the whole file
  • The idea is that, if you want to take part, you install the software and donate half of your storage to the other participants
  • Proof of concept done
  • Use a REST API (PUT, GET, POST, HEAD, DELETE) to access the files
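
To make the "split into parts with redundancy" idea concrete, here is a minimal sketch. It is not the P2N implementation, which I have not seen; it assumes the simplest possible scheme, a single XOR parity part, which can survive the loss of any one part. The function names `split_with_parity` and `recover` are my own, chosen for illustration.

```python
from functools import reduce
from typing import List, Optional


def _xor_columns(parts: List[bytes]) -> bytes:
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*parts))


def split_with_parity(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-size parts plus one XOR parity part."""
    part_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(part_len * k, b"\0")  # pad so every part has the same length
    parts = [padded[i * part_len:(i + 1) * part_len] for i in range(k)]
    return parts + [_xor_columns(parts)]


def recover(parts: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the data when at most one part (data or parity) is None."""
    missing = [i for i, part in enumerate(parts) if part is None]
    if len(missing) > 1:
        raise ValueError("a single XOR parity part can only repair one lost part")
    if missing:
        # XOR of all surviving parts equals the lost part.
        rebuilt = _xor_columns([part for part in parts if part is not None])
        parts = parts[:missing[0]] + [rebuilt] + parts[missing[0] + 1:]
    # Drop the parity part and the padding.
    return b"".join(parts[:-1])[:original_len]
```

A real system would presumably use a stronger erasure code so that several parts can be lost at once; the XOR version just shows why losing a part need not mean losing the file.
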
University of Queensland - Keith Webster
  • Fez - Content Management System on top of Fedora
  • Australian academics have to deposit all publications into institutional repositories. Australian Government rewrote the copyright laws to make this possible
  • Using the ResearcherID service (from Thomson Reuters) to identify researchers uniquely
  • ResearcherID shows citations info from Web of Science

Notes from Sun Pasig - Thurs morning session

Planets - Adam Farquhar

  • Planets are establishing a not-for-profit organisation to ensure the long-term sustainability of Planets framework

University of Witwatersrand, Johannesburg - Derek Keats
  • Built a private cloud infrastructure for the entire university that hosts standard services (e.g. e-mail, eLearning portal) and Digital Library/Archive functions

National Library of New Zealand - Kevin de Vorsey
  • Use the ability to render or migrate a file as the criterion for determining its level of risk (formats can hide a multitude of sins, e.g. container formats like MPEG-4)
  • [opinion] Perhaps adoption of file-formats and tools should be a criterion? If a file format is popular then there is a much greater chance of it being preservable.

Family Search - Gary Wright
  • Ingest and publish 2.75 million images per week (~30TB)
  • Volunteer-based programme to index images. 100,000 volunteers indexing 1 million names per day
  • Want to double ingest rate by 2010
  • Low quality access copy on disk (90% compression)
  • Archive high quality JP2K on tape (50% compression over TIFF)
  • Simplify rights management - only staff user roles can access archive. Rights management is asserted at search stage so users only see what they can access.
  • Goal to validate each tape at least once a year but not doing it yet
  • Refresh of on-line storage is a looming problem
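
A quick back-of-envelope calculation puts those ingest figures in perspective. This is my own arithmetic, not something presented in the session. It assumes that the ~30TB is the weekly volume for the 2.75 million images, and that TB means decimal terabytes.

```python
# Figures from the notes above; the interpretation of ~30TB as a weekly
# volume in decimal terabytes is an assumption.
IMAGES_PER_WEEK = 2_750_000
BYTES_PER_WEEK = 30 * 10**12
SECONDS_PER_WEEK = 7 * 24 * 3600

avg_image_mb = BYTES_PER_WEEK / IMAGES_PER_WEEK / 10**6
images_per_second = IMAGES_PER_WEEK / SECONDS_PER_WEEK
doubled_tb_per_year = 2 * 30 * 52  # if the ingest rate doubles as planned

print(f"average image size: {avg_image_mb:.1f} MB")
print(f"sustained ingest:   {images_per_second:.1f} images per second")
print(f"doubled rate:       {doubled_tb_per_year} TB per year")
```

On those assumptions each image averages roughly 11MB, and a doubled ingest rate would add about 3PB a year.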

USC Shoah Foundation Architecture
  • Preserve visual testimony of Holocaust survivors
  • 52,000 interviews, 105,000 hours
  • Adding Rwanda genocide testimony
  • 235,000 video tapes to be digitised
  • Indexed manually in 1-minute segments (people, places, events). Indexing takes 2 hours per hour of video at a cost of $25 million
  • Store low quality access copy on disk. Volume is 135TB
  • Store high quality archival copy on tape. Capacity of library is 8PB
  • Replace tapes every 3 years
  • All objects are signed and checked at every stage in the process. Biggest problem is network cards flipping bits during transfer!
  • Use physical transfer to bring back video from Rwanda, in chunks of ~140TB
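
The "checked at every stage" point is easy to illustrate. The sketch below is not the Shoah Foundation's tooling. It swaps in a plain SHA-256 checksum rather than a digital signature, and the function names are hypothetical. It shows how a single flipped bit in transit is caught by comparing digests before and after a transfer.

```python
import hashlib


def fixity_digest(payload: bytes) -> str:
    """Return the SHA-256 digest recorded when the object enters the workflow."""
    return hashlib.sha256(payload).hexdigest()


def check_stage(payload: bytes, expected_digest: str) -> bool:
    """Re-compute the digest after a stage and compare it with the recorded one."""
    return fixity_digest(payload) == expected_digest


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Simulate a faulty network card by flipping one bit of the payload."""
    damaged = bytearray(payload)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


original = b"one-minute testimony segment"
recorded = fixity_digest(original)

print(check_stage(original, recorded))               # intact copy passes
print(check_stage(flip_bit(original, 5), recorded))  # damaged copy fails
```
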
CDL - John Kunze
  • Permanent Objects, Disposable Systems.

Wednesday, 7 October 2009

Notes from Sun Pasig - Weds afternoon session

Sun PASIG

Library of Congress
  • Planning to digitize 1 exabyte of content by 2012. Estimate that this will cost $30m per year in power!
Internet Archive
  • 4.5PByte in the Archive
  • Planning to add 2PByte to the web archive this year
  • Planning to change web crawling policy. At the moment, they decide what domains to crawl and then store everything found. In future they will crawl more widely and then decide what to store
  • Serve 400-500 requests per second
  • Using Reed-Solomon algorithms for replication & restoration
  • Moved from 100MByte ARC files to 1GByte WARC files
Oxford University - Neil Jeffries

  • Lots of legacy digital data ~200TB (mostly maps)
  • Lots of discrete archives with different technologies implemented using virtualisation and a common storage infrastructure
  • Acquire digital donations as PCs, disk images, tapes and floppies
  • Have to archive web servers
  • Have an approach of preserve before curation; that is, ensure that the data is stored and catalogue it later
  • Charge £5k/TB one-off cost for archiving
  • Use Fedora
  • Restore from backup is unrealistic so don't back up to tape
  • Objects are versioned rather than being deleted/overwritten
Stanford - Tom Cramer
  • 80TB in repository, well over 200TB not in repository
  • Started with preservation focus, now realized that other requirements are equally important
  • Implementing the Digital Object Repository (DOR) using Fedora - similar to DLMACS
  • Moved away from METS to much simpler object model, based on FOXML
  • Took too long to add new content types so they de-emphasised curation and moved away from 'Just in case' to 'Just in time'
Hathi Trust
  • Over 4 million items (mostly Google digitised books)
  • Use the OCLC number to identify and deduplicate items
  • Use Aleph
  • Only ingest TIFF, JP2K, OCR text file and METS file
  • QA 1% of what is ingested
  • Uses the Pairtree directory structure
  • Planning to roll out full-text search via Solr
  • Sharded indexes using 5 instances of Solr
  • People can build their own collections which are indexed separately
  • Using Digital rights management to allow disabled users (with Shibboleth authentication) to access the full text of in-copyright books
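
For anyone who hasn't met Pairtree: it maps an object identifier onto a directory path by cleaning awkward characters and then splitting the result into two-character directory names. The sketch below is my reading of the Pairtree specification's cleaning and splitting rules, not Hathi Trust's code, and the function names are my own.

```python
def clean_identifier(identifier: str) -> str:
    """Apply Pairtree's two-step character cleaning to an identifier."""
    encoded = []
    for byte in identifier.encode("utf-8"):
        char = chr(byte)
        # Step 1: hex-encode non-visible ASCII and a set of troublesome characters.
        if byte < 0x21 or byte > 0x7E or char in '"*+,<=>?\\^|':
            encoded.append("^%02x" % byte)
        else:
            encoded.append(char)
    # Step 2: single-character substitutions for common identifier punctuation.
    return "".join(encoded).translate(str.maketrans("/:.", "=+,"))


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_identifier(identifier)
    return "/".join(cleaned[i:i + 2] for i in range(0, len(cleaned), 2))


print(pairtree_path("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3
```

The attraction is that the path can be derived from the identifier alone, so no database lookup is needed to find an object on disk.
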
SHAMAN
  • 'Securing Communication with the Future'
  • There are over 500,000 libraries in the world and over 70,000 national archives so digital preservation has a very low penetration into the market

Notes from Sun Pasig - Weds morning session

Intro from Art Pasquinelli:

  • Libraries and IT departments will move to the side in Digital Archiving and Content will be at the centre
  • There is a problem with 'Long Tail Data' - we don't know what we want to keep
  • The cost of power to keep the disks spinning is a hot issue
  • How do you transfer a Petabyte?
  • There is an increasing amount of discussion in companies around how to build 100 year archives
  • Pharmachem industries will push the linking of research data and papers.
Mike Keller

  • Federating large repositories is starting to happen
  • Lots of people are talking about storage in the cloud but are worried about security and Service Levels (or lack of)
  • Discovery is a huge problem and current metadata standards (MARC, METS) do not solve it
  • Audit is an issue that more people are becoming aware of - can we prove that the stuff is there and will continue to be there? Should we be publishing our audits?
Thorny Staples - Duraspace

  • Fedora scalability - Sun have just tested a Fedora instance with 150 million objects and found that both ingest and access performance were flat over that volume
  • Plan to make Fedora more modular and attract new developers to the Fedora community
  • Improving Fedora docs
  • Duracloud - adds integrity checking and other archiving services to the basic cloud storage.
  • They are talking to a number of vendors of cloud storage to put Duraspace on top of their clouds
  • Opportunity for the DLS to be a 'cloud' for Duraspace
Islandora

  • Very impressive automated workflow for book digitisation including conversion from TIFF to JP2K, OCR, TEI (significant terms, people, orgs etc) extraction and an editor for correcting OCR
  • Virtual Research Environment including auto DRM
  • iPhone app for data collection - uses user id to present the correct data collection interface to the user
  • FeSL offers better rights management than XACML in Fedora
Biodiversity Heritage Library

  • Part of Natural History Museum
  • Similar Archiving Architecture model to DLS. Fedora for metadata access, storage abstracted through Duracloud
  • Building a large datacentre off J16 of M6 on old airfield.
  • 500 year business plan
  • Keen to collaborate with other long-term organisations
EPrints

  • 10 years old
  • Implemented OAI-ORE and migrated an entire archive from EPrints to Fedora and vice versa
  • Impressive demonstration showed links to a citation service so you could see the citations of articles in the archive
  • Offer support to enterprises on a commercial model
iRODS

  • Policy-based federated data archiving
  • Very impressive but I suspect that deciding what your policies are and designing the workflows to implement the policies would be hard (i.e. it might take years)