Ingest Procedure: Outline
Alexandria Digital Library
Ingest Procedure: Outline Version
Metadata from raw to ready-to-search
Overview: Metadata will arrive at ADL in a range of formats, styles,
and types, and in varying states of cleanliness. We encourage
participants to format metadata before we receive it. Even so, most
metadata will need some transformation before it is ready to be
inserted into the Catalog. This document outlines the process of
ingesting metadata.
1. Metadata, solicited or unsolicited, will come to ADL. We will
actively encourage participants to format and structure data so
that it arrives in relatively good working order. When appropriate,
ADL will provide forms and explanations to participants. Perform
a preliminary crosswalk on the metadata.
- 1.1 Obtain description of dataset, with accompanying documentation.
- 1.2 Prepare a crosswalk based on the dataset's fields and how
they map onto ADL fields. Add fields if necessary.
- 1.3 Prepare an MS Access form for the participant to use.
- 1.4 Deliver the MS Access form to the participant with an explanation
of how to enter metadata.
2. Whether metadata comes to ADL in an MS Access form or in plain
ASCII files, ADL will parse and further organize the metadata.
If necessary (if the participant did not or could not enter metadata
in the proper format, e.g., dates), run calculations on the data, format
it with correct delimiters, and organize it into the proper fields. Revisit
the original crosswalk; update it to reflect the metadata in hand.
- 2.1 Clean the metadata in one or more of the following ways.
- 2.2 If the metadata arrived in MS Access, do manipulations
and cleaning inside that program.
- 2.3 If it is more efficient to work in a Unix environment
rather than in MS Access, export the metadata to an ASCII file.
- 2.4 Parse/clean/change the metadata with Perl, a script, vi,
or some other mechanism.
- 2.4.1 Convert degrees, minutes, seconds into decimal degrees.
- 2.4.2 Convert time/date fields into the proper format.
- 2.4.3 If necessary, add fields (e.g., browse files, file
locations).
- 2.4.4 Remove unwanted characters such as stray delimiters, "/",
double quotation marks, etc.
- 2.4.5 Delimit fields with the accepted character ("|").
- 2.5 Generate footprints if not already available. See following procedures.
- 2.6 Update crosswalk if necessary.
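The cleaning steps in 2.4 can be sketched as follows. This is a minimal illustration in Python rather than the Perl or vi work described above. The function names, the assumed input date format (MM/DD/YYYY), and the YYYY-MM-DD output format are all assumptions made for the example; the outline does not specify what ADL's "proper format" is.

```python
import re
from datetime import datetime


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus N/S/E/W into signed decimal degrees."""
    value = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    # South and west are negative by the usual convention.
    return -value if hemisphere.upper() in ("S", "W") else value


def normalize_date(text, input_format="%m/%d/%Y"):
    """Reformat a date string. Both formats here are assumptions."""
    return datetime.strptime(text.strip(), input_format).strftime("%Y-%m-%d")


def clean_field(text):
    """Remove characters that collide with the delimiter: '|', '/', and '"'."""
    return re.sub(r'[|"/]', "", text).strip()


def to_delimited(fields, delimiter="|"):
    """Clean each field, then join the fields with the accepted delimiter."""
    return delimiter.join(clean_field(str(field)) for field in fields)
```

Dates should be normalized before `clean_field` runs, since that step strips the "/" characters an unconverted date may still contain.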
3. After formatting the metadata as much as possible, ADL will
then place it into a raw table structure in the ingest
database. Right now, that raw database is called "ingest."
At this point, the metadata may need additional manipulations,
transformations, or corrections.
- 3.1 Create tables in ingest to hold the metadata.
- 3.1.1 These tables are not given s_schema table names. Rather,
there is usually only one table holding all the metadata (and maybe
some lookup tables).
- 3.1.2 Field names in this raw table typically match the original
field names from the participant.
- 3.1.3 Create fields as close as possible to the desired datatype.
Use int, varchar, and numeric for fields that take these datatypes.
- 3.2 Copy the parsed metadata (with the "bcp" utility) into
the raw tables in the database.
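As a rough illustration of step 3, the sketch below creates one raw table and bulk-loads pipe-delimited rows into it. It uses Python's built-in sqlite3 module as a stand-in for the ingest database and for bcp, which is a substitution made for the example only. The table name `raw_metadata` and its columns are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw table; real field names would follow the participant's own.
RAW_DDL = """
CREATE TABLE raw_metadata (
    item_id    INTEGER,
    title      VARCHAR(255),
    west_lon   NUMERIC,
    north_lat  NUMERIC
)
"""


def load_raw(connection, delimited_text, delimiter="|"):
    """Insert delimited rows into the raw table and return the row count."""
    reader = csv.reader(
        io.StringIO(delimited_text), delimiter=delimiter, quoting=csv.QUOTE_NONE
    )
    rows = [tuple(row) for row in reader if row]  # skip blank lines
    connection.executemany("INSERT INTO raw_metadata VALUES (?, ?, ?, ?)", rows)
    connection.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(RAW_DDL)
sample = "1|Goleta quadrangle|-119.875|34.5\n2|Santa Barbara quadrangle|-119.75|34.5\n"
loaded = load_raw(conn, sample)
```

Declaring the columns with their intended datatypes, as 3.1.3 asks, lets the database coerce the incoming text at load time, so type problems show up before Q/A.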
4. Data checking will now take place. Q/A the data again, inside
the tables. When the metadata reaches a "clean" state,
it will then be placed in the s_schema tables within the "ingest"
database. At this point, the metadata should be formatted, correct,
free of error, and matching field definitions. If not, return
to data checking (Q/A) until the metadata is clean.
- 4.1 Q/A metadata now in the intermediate database. Revisit
crosswalk.
- 4.2 Determine how to sort out parent records if applicable. Create parent/child records.
- 4.3 Decode coded fields.
- 4.4 Attach MIL local call numbers to records.
- 4.4.1 Does MIL hold 90% or more of the items? Then attach a local call number to all records; remove it from a record when a user requests an item not held by MIL.
- 4.4.2 Does MIL hold fewer than 90% of the items? Then match the MIL shelflist against the dataset.
- 4.4.2.1 Determine from the shelflist or the actual items which ones MIL holds. Enter this information (and whatever else may be needed, e.g., area) into database software such as Access.
- 4.4.2.2 Determine whether any field, or any combination of fields, in the dataset matches or composes the MIL local class number.
- 4.4.2.3 Use the database resulting from step 4.4.2.1. Match it against the metadata dataset. For any metadata record that matches an item held by MIL, enter the local call number in that record.
- 4.5 Distribute metadata from raw tables in ingest to s_schema
tables in ingest.
- 4.5.1 Distribute directly from raw table into the s_schema
if dataset is rather small.
- 4.5.2 Or distribute the metadata with "select into" statements
into tables that have the same structure as the s_tables (e.g.,
s_alex_mds_man, x_alex_mds_man) and then bcp the information out
of x and into s.
- 4.6 Q/A metadata again.
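The call-number decision in 4.4 can be sketched as below. This is an illustration only. The record layout, the `held_fraction` estimate, and the `derive_call_number` helper are hypothetical; the outline does not say how a call number is composed from a dataset's fields.

```python
def attach_call_numbers(records, held_fraction, derive_call_number,
                        held_ids=None, threshold=0.90):
    """Attach local call numbers following the 90% rule in step 4.4.

    records            -- list of dicts, each with an 'item_id' key (assumed)
    held_fraction      -- estimated share of the dataset that MIL holds
    derive_call_number -- hypothetical function building a call number
                          from a record's fields
    held_ids           -- item ids matched against the shelflist (4.4.2.1)
    """
    if held_fraction >= threshold:
        # 4.4.1: attach to every record; exceptions are removed later,
        # when a user requests an item MIL turns out not to hold.
        for record in records:
            record["call_number"] = derive_call_number(record)
    else:
        # 4.4.2: attach only where the shelflist match found the item.
        held = set(held_ids or ())
        for record in records:
            if record["item_id"] in held:
                record["call_number"] = derive_call_number(record)
    return records
```

Taking the held fraction as an input rather than computing it reflects the point of 4.4.1: when coverage is high, the shelflist match is skipped altogether.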
5. When metadata is ready, distribute from ingest into the "prod"
database.
- 5.1 Distribute the metadata from ingest, with bcp, into the
s_schema in the prod database.
- 5.2 bcp all the tables out into ASCII files with unique field
and row delimiters so they are ready to copy into another database
(e.g., Illustra, Oracle).
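Step 5.2 might look like the sketch below, which writes a table out as ASCII with multi-character field and row delimiters. It uses Python and sqlite3 in place of bcp, and the delimiter strings `|~|` and `|#|` are assumptions chosen for the example; any sequence that never occurs in the data would do.

```python
import sqlite3

FIELD_SEP = "|~|"   # assumed field delimiter
ROW_SEP = "|#|\n"   # assumed row delimiter


def format_rows(rows, field_sep=FIELD_SEP, row_sep=ROW_SEP):
    """Render rows as delimited text, refusing data that contains a delimiter."""
    row_marker = row_sep.strip() or row_sep
    lines = []
    for row in rows:
        fields = ["" if value is None else str(value) for value in row]
        for field in fields:
            if field_sep in field or row_marker in field:
                raise ValueError(f"delimiter found inside field: {field!r}")
        lines.append(field_sep.join(fields) + row_sep)
    return "".join(lines)


def dump_table(connection: sqlite3.Connection, table: str, path: str) -> int:
    """Write every row of a table to an ASCII file; returns the row count.

    The table name is interpolated directly, so it must come from a
    trusted list of table names, not from user input.
    """
    rows = connection.execute(f'SELECT * FROM "{table}"').fetchall()
    with open(path, "w", encoding="ascii", newline="") as out:
        out.write(format_rows(rows))
    return len(rows)
```

Rejecting any field that contains a delimiter is what keeps the delimiters unique, so the receiving database can split the file without ambiguity.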
6. Document above steps.
7. Back up databases regularly onto tape.
Last modified on 1996-10-16 at 19:54 GMT by the Alexandria Web Team