Wednesday, 7 October 2009

Notes from Sun PASIG - Wednesday afternoon session

Sun PASIG

Library of Congress
  • Planning to digitize 1 exabyte of content by 2012. They estimate that this will cost $30m per year in power!
Internet Archive
  • 4.5PByte in the Archive
  • Planning to add 2PByte to the web archive this year
  • Planning to change web crawling policy. At the moment they decide which domains to crawl and then store everything found. In future they will crawl more widely and then decide what to store
  • Serve 400-500 requests per second
  • Using Reed-Solomon algorithms for replication & restoration
  • Moved from 100MByte ARC files to 1GByte WARC files
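The notes do not say how the Internet Archive applies Reed-Solomon, so the sketch below is not their implementation. It shows the simplest relative of the idea, a single XOR parity block that can rebuild one lost data block; Reed-Solomon codes generalise this to survive several losses at once. The function names and sample blocks are made up for illustration.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def restore_missing(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical data blocks, all four bytes long.
blocks = [b"warc", b"data", b"demo"]
parity = xor_parity(blocks)

# Pretend the middle block was lost and rebuild it.
rebuilt = restore_missing([blocks[0], blocks[2]], parity)
```

The parity block costs one extra block of storage regardless of how many data blocks it protects, which is why erasure coding is attractive compared with keeping full copies.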
Oxford University - Neil Jeffries

  • Lots of legacy digital data ~200TB (mostly maps)
  • Lots of discrete archives with different technologies, implemented using virtualisation and a common storage infrastructure
  • Acquire digital donations as PCs, disk images, tapes and floppies
  • Have to archive web servers
  • Take a 'preserve before curation' approach: ensure that the data is stored first and catalogue it later
  • Charge £5k/TB one-off cost for archiving
  • Use Fedora
  • Restoring from backup is unrealistic, so they don't back up to tape
  • Objects are versioned rather than being deleted/overwritten
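The "versioned rather than deleted/overwritten" policy can be sketched as an append-only store. This is a minimal illustration only, not how Oxford's Fedora-based system works; the class, method names and object identifier are all hypothetical.

```python
class VersionedStore:
    """Toy append-only store: every write adds a version, nothing is overwritten."""

    def __init__(self):
        # Maps object id -> list of contents, oldest first.
        self._versions = {}

    def put(self, object_id, content):
        """Store a new version and return its 1-based version number."""
        history = self._versions.setdefault(object_id, [])
        history.append(content)
        return len(history)

    def get(self, object_id, version=None):
        """Return the requested version, or the latest if none is given."""
        history = self._versions[object_id]
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise KeyError(f"{object_id} has no version {version}")
        return history[version - 1]

    def version_count(self, object_id):
        """Return how many versions of the object are held."""
        return len(self._versions.get(object_id, []))


store = VersionedStore()
first = store.put("map-001", b"first scan")       # hypothetical object id
second = store.put("map-001", b"corrected scan")  # the earlier version is kept
```

Because nothing is ever removed, an accidental bad write can be undone by reading an earlier version.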
Stanford - Tom Cramer
  • 80TB in repository, well over 200TB not in repository
  • Started with preservation focus, now realized that other requirements are equally important
  • Implementing the Digital Object Repository (DOR) using Fedora - similar to DLMACS
  • Moved away from METS to much simpler object model, based on FOXML
  • It took too long to add new content types, so they de-emphasised curation and moved away from 'Just in case' to 'Just in time'
Hathi Trust
  • Over 4 million items (mostly Google digitised books)
  • Use the OCLC number to identify and deduplicate items
  • Use Aleph
  • Only ingest TIFF, JP2K, OCR text files and METS files
  • QA 1% of what is ingested
  • Uses the Pairtree directory structure
  • Planning to roll out full-text search via Solr
  • Sharded indexes using 5 instances of Solr
  • People can build their own collections which are indexed separately
  • Using digital rights management to allow disabled users (with Shibboleth authentication) to access the full text of in-copyright books
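The Pairtree layout mentioned above maps an identifier to a directory path by cleaning awkward characters and then splitting the result into two-character segments. The sketch below follows the character rules of the published Pairtree specification as I understand them; the function names are my own, and it does not attempt to reproduce how Hathi Trust lays out its own repository.

```python
# Characters the Pairtree specification hex-encodes as ^xx.
_HEX_ENCODE = set('"*+,<=>?\\^|')
# Single-character substitutions applied after hex-encoding.
_SUBSTITUTIONS = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_identifier(identifier):
    """Apply the two-step Pairtree character cleaning to an identifier."""
    cleaned = []
    for byte in identifier.encode("utf-8"):
        char = chr(byte)
        # Non-visible ASCII, non-ASCII bytes and the reserved set become ^xx.
        if byte < 0x21 or byte > 0x7E or char in _HEX_ENCODE:
            cleaned.append("^%02x" % byte)
        else:
            cleaned.append(char)
    return "".join(cleaned).translate(_SUBSTITUTIONS)


def pairtree_path(identifier):
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_identifier(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs)


# Example identifier used in the Pairtree specification.
path = pairtree_path("ark:/13030/xt12t3")
```

Short directory names keep any single directory from holding too many entries, and the mapping is reversible, so the identifier can be recovered from the path alone.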
SHAMAN
  • 'Securing Communication with the Future'
  • There are over 500,000 libraries in the world and over 70,000 national archives so digital preservation has a very low penetration into the market
