
Collection Loading Strategy

The loading strategy developed for building the initial collections of ADL to be made available over the WWW involves the loading of both information already in digital form and information being digitized as part of the ADL project. Conditions that materials already in digital form should satisfy in order to be eligible for loading into ADL include the following.

  1. The materials should currently be in the collections of the MIL or the new ADL clone at SDSC. Since the long-term objectives of ADL include the current objectives of the MIL, this is a useful criterion.

  2. In general, the materials should be unencumbered in terms of intellectual property rights. Nevertheless, we will attempt to load some encumbered materials to demonstrate our ability to handle intellectual property rights issues, for example items for which the collection of fees is mandated, or data that has been licensed to UC, such as California SPOT imagery.

  3. In terms of the choice of collection items, we believe that, in order of importance, the materials should initially be chosen to:

    1. serve particular target user groups within the context of ADL;
    2. be of value as "experimental" datasets for ADL's research, such as evaluating how users respond to digital items of various classes or observing what unexpected uses are made of geospatial information when it is made available in digital form;
    3. satisfy criteria for general use, as long as ADL does not unnecessarily duplicate other collection efforts.
    In particular, the experimental datasets could be used to exploit ADL's comparative advantage, which is the use of spatial footprints to make surprising linkages among information items. Conceptual examples of such datasets are the Bible, the works of Shakespeare, and information on palaeoenvironments.

  4. The materials should be ingestible into ADL at relatively low cost.

We believe that an important initial focus for ADL's collection should be information that supports basic science, including the Earth and Social Sciences. We have therefore begun the process of identifying important datasets in certain areas of science, such as seismology (see below). It is also clear that an important goal for ADL is to support collections in a manner that allows users to integrate diverse information concerning specified geographic areas.

We have identified several candidate datasets that each satisfy at least some of these criteria. These datasets, together with their approximate sizes, include:

  1. items of value to specific communities, such as
    1. Sierra Nevada Ecologic Project data: 3 GB;
    2. Mojave Ecologic Project data: 4 GB (this database is currently under construction).

  2. Experimental datasets, such as
    1. Two dates of scanned aerial photographs of Santa Barbara County and the associated texture/color characterizations of features in the photographs: 30 GB.

  3. Generally useful items, such as
    1. US Geological Survey Digital Raster Graphics (topographic maps, production in process): 30 GB;
    2. US Geological Survey Digital Elevation Models at 3 arc seconds and 1 arc second for California: 17 GB;
    3. US Bureau of the Census TIGER and STF files (in vector format): 10 GB;
    4. US Geological Survey Digital Line Graphs: 10 GB;
    5. ETOPO5: 0.6 GB;
    6. Digital Chart of the World (DCW): 2.5 GB.

These datasets have an aggregate volume of approximately 100 GB.
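
As a quick check on the aggregate figure, the listed sizes can be summed directly. The sketch below simply transcribes the sizes from the list above; the dictionary keys are shorthand labels chosen for illustration and are not ADL identifiers.

```python
# Sizes in GB, transcribed from the candidate dataset list above.
# The keys are illustrative shorthand labels, not ADL identifiers.
CANDIDATE_DATASETS_GB = {
    "Sierra Nevada Ecologic Project": 3.0,
    "Mojave Ecologic Project": 4.0,
    "Santa Barbara County aerial photographs": 30.0,
    "USGS Digital Raster Graphics": 30.0,
    "USGS Digital Elevation Models (California)": 17.0,
    "Census TIGER and STF files": 10.0,
    "USGS Digital Line Graphs": 10.0,
    "ETOPO5": 0.6,
    "Digital Chart of the World": 2.5,
}


def aggregate_volume_gb(sizes):
    """Return the total volume in GB of all datasets in the mapping."""
    return sum(sizes.values())


total_gb = aggregate_volume_gb(CANDIDATE_DATASETS_GB)
print(f"Aggregate volume: {total_gb:.1f} GB")
```

The listed sizes sum to about 107 GB, which is consistent with the rounded figure of approximately 100 GB.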

In addition to the items in the preceding loading plan, the UIE team at UCSB has been developing a list of items desired for loading, based on discussions with three classes of "target" user groups. These target user groups (TUGs) comprised panels of earth scientists, information intermediaries (including librarians), and education specialists. Some of the collections identified overlap those in the current ADL loading plan; others will have to be considered as resources become available. The questions to which the user groups responded are discussed below. We present a small sampling of the materials that these groups indicated as useful.

Materials of general interest
  1. APSRS (Metadata base);
  2. extended Gazetteer (extraterrestrial items?);
  3. external sources of data such as the GeoRef thesaurus;
  4. spatially indexed text, such as geoindexing of MELVYL catalog;
  5. user-definable content / local library group/user contributed data;
  6. the ability to index/query other sites;
  7. travel information or catalog links to existing sites such as Excite and Yahoo;
  8. content to support full GIS functionality online;
  9. National Geographic/Britannica (and other packaged sets of information suitable for classroom use).

Materials of interest to an Earth Science target user group

  1. Digital Orthophoto Quadrangles (DOQs);
  2. CIA maps;
  3. Soils;
  4. Air photos - historical;
  5. High-level index to regional holdings, e.g. National Spatial Data Infrastructure (NSDI), CERES, ...;
  6. Faults/seismic: coarse and fine;
  7. Precipitation data;
  8. GeoRef metadata;
  9. Land use and land cover maps (LULC);
  10. Knowledge of and connection with other spatial data sites and projects.

Materials of interest to an information intermediary target user group

  1. An interesting subset of MIL's aerial photography with catalog records;
  2. Census statistics (US and foreign);
  3. Historic maps (pre-1950?; pre-1900; world history; local area (2,2). E.g. Times Atlas of World History);
  4. A selection of historic texts (e.g. History of Mongolia);
  5. Foreign country 1:200K/1:250K topos.

Materials of interest for an education target user group
  1. "Here is what is available": synoptic view of ADL's holdings specified in terms of geographic footprint;
  2. Historical (dates): place name lineage;
  3. Types of data: cultural (maps for race, religion, custom, dress, food, etc. - with pointers to other sites when not in ADL), historical, economics (industry, earnings, agriculture), population (demographic information for Census);
  4. distribution of income in city of SB; shading to show population density, political, geology, land forms: roads, local rivers, mountains, water source and consumption, flora and fauna;
  5. Directions: e.g. from Santa Barbara to Los Angeles; produce a road map.
We note that a significant proportion of these materials are not currently available in digital form and would have to be digitized before they could be loaded into ADL. Digitizing a significant collection of these materials would require a larger scanning capability than currently exists within ADL. For example, the current daily rate of production of digital materials is only 1 GB/day from the digitization of airphotos and 0.5 GB/day from the downloading of AVHRR data, giving a current collection growth rate of 1.5 GB/day. We therefore believe that an important component of the ADL loading strategy should be, at least initially, to digitize only the most interesting subsets of large analog collections.

Because much of the relevant material is not currently in digital form, and because of our currently limited ability to scan such materials, the project has focused significant attention on increasing the growth rate of ADL's collections of maps and images through an augmented scanning capability. As part of this activity, the project investigated the scanning of analog geo-spatial materials, such as maps of various sizes and imagery in various formats. The MIL does not currently support production-level scanning of spatial materials, and current scanning technology consists of a small flatbed unit supported by a PC workstation connected to the MIL internal network.

A successful proposal was presented to NSF for supplementary funds to investigate content-based, intelligent searching of images based on texture patterns and a visual thesaurus. This subproject involved ADL, the University of Arizona, and the University of Illinois. Since digitized imagery was needed to support the research, it was decided to use images scanned from the MIL collection as part of the database. The images did not have to be scanned at a preservation level (approximately 5-10 microns); a 600 dpi gray-scale level was sufficient. MIL committed its one scanning station (a JX-610 scanner connected to a networked PC) for the 90-day experiment. University of Arizona researcher Dr. Hsinchun Chen hired a local person with a GIS and systems background to scan the images, write scripts for file management, determine geo-coordinate information, and generate metadata. ArcInfo is used to generate geo-coordinate footprints, and ADL's Access metadata form is used for cataloging.

The scanning started in January 1997, with written procedures that included a set of scanning instructions for the black and white National Aerial Photo Project imagery. The following statistics characterize the first month of the scanning operation:

  1. each scanned black and white image file is approximately 32 MB when scanned at 600 dpi;
  2. total time for scanning and file management is 15 to 19 minutes per image, not including footprint generation;
  3. extrapolating to an 8-hour work-day, approximately 1 GB is generated;
  4. extrapolating to a 12-hour work-day, approximately 1.5 GB is generated.
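
The daily figures follow from the per-image statistics. The sketch below assumes decimal units (1 GB = 1000 MB) and an operator scanning continuously for the whole work-day; both are assumptions made for illustration and are not stated in the scanning procedures. The function and variable names are hypothetical.

```python
IMAGE_SIZE_MB = 32.0           # per scanned image at 600 dpi (from the list above)
MINUTES_PER_IMAGE = (15, 19)   # fastest and slowest reported time per image


def daily_output_gb(hours_per_day, minutes_per_image, image_size_mb=IMAGE_SIZE_MB):
    """Estimate GB produced per day by one scanning station.

    Assumes continuous scanning with no breaks and 1 GB = 1000 MB.
    """
    images_per_day = hours_per_day * 60 / minutes_per_image
    return images_per_day * image_size_mb / 1000.0


for hours in (8, 12):
    fast = daily_output_gb(hours, MINUTES_PER_IMAGE[0])
    slow = daily_output_gb(hours, MINUTES_PER_IMAGE[1])
    print(f"{hours}-hour day: {slow:.2f} to {fast:.2f} GB")
```

Under these assumptions, the quoted figures of 1 GB and 1.5 GB correspond to the faster end of the 15 to 19 minute range; at 19 minutes per image the daily output would be closer to 0.8 GB and 1.2 GB respectively.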
One person could run two scanning stations configured as above, in which case the output would double to about 3 GB per day. Extrapolating these figures to an annual rate of production, approximately 750 GB of data could be scanned and loaded. We note, however, that ADL currently has only about 500 GB of available storage (375 GB of disk and 100 GB of optical). Hence, if two scanning stations were operated 12 hours a day, it would take only about six months to fill the existing storage array. Excess full-resolution data could be stored temporarily on tape, with only browse images and metadata made available through ADL.
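
The extrapolation in the preceding paragraph can be reproduced with a few lines of arithmetic. The sketch assumes 250 working days per year, which is the value that makes 3 GB per day come out to the quoted 750 GB; the source does not state the number of working days, so this is an assumption. All names are hypothetical.

```python
STATIONS = 2
GB_PER_STATION_PER_DAY = 1.5   # 12-hour day, from the scanning statistics
STORAGE_GB = 500.0             # total available storage quoted in the text
WORKING_DAYS_PER_YEAR = 250    # assumption: not stated in the source


def annual_output_gb(stations, gb_per_station_per_day, working_days):
    """Total GB scanned in one year at a constant daily rate."""
    return stations * gb_per_station_per_day * working_days


def days_to_fill(storage_gb, stations, gb_per_station_per_day):
    """Number of scanning days needed to fill the given storage."""
    return storage_gb / (stations * gb_per_station_per_day)


yearly = annual_output_gb(STATIONS, GB_PER_STATION_PER_DAY, WORKING_DAYS_PER_YEAR)
fill_days = days_to_fill(STORAGE_GB, STATIONS, GB_PER_STATION_PER_DAY)
print(f"Annual output: {yearly:.0f} GB")
print(f"Scanning days to fill storage: {fill_days:.0f}")
```

Roughly 167 scanning days are needed to fill 500 GB. That is about five and a half months if scanning runs every calendar day, or about eight months at 250 working days per year, so the estimate of about six months depends on the working schedule assumed.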

The ADL project is currently planning an augmentation of its digitizing capability based on the preceding analysis.






Terence R. Smith
Thu Feb 20 13:50:53 PST 1997