Thank you. My current area is primarily digital libraries. I am involved with the NSF-NASA-DARPA digital library project on the Berkeley campus, which is now in its second funding round, and I am also the PI on one of the new NSF international digital library grants, where I will be cooperating with a lot of folks in the UK.
My perspective is as an information retrieval researcher, so I am primarily concerned with the IR aspects of digital gazetteers for text retrieval and indexing, and also for retrieving other types of materials, which I will lump under multi-media retrieval - everything from images to videos, whatever you can imagine. These are things that may have places associated with them, and in a digital library we are combining all these different types of information. It would be very nice to be able to index and retrieve them based on geographic information. The term I use for this is GIR, or Geographic Information Retrieval, which I think I coined. There was discussion a few years back on the Geolist about what to call this kind of work, where we are using the World Wide Web and everything else to do searching based on geography, and I suggested, why not just call it GIR - Geographic Information Retrieval. That has become a very important component of digital libraries these days.
I was very pleased to see the freebie copies of Distributed Geolibraries. I had already bought mine from Amazon, so I had it here to show around and of course they beat me to it.
The issues that we are dealing with in digital libraries are these: we have distributed resources scattered all over the place for various purposes; there will be users of those resources from everywhere; and services will be provided by many different organizations. What Doug was just talking about is a service we are in dire need of. So access for a very broad population is critical for digital libraries. You can't assume a defined user population that will be capable of understanding, for example, where Sri Lanka is well enough to point it out on a map. So georeferenced information is one of the organizational aspects of digital libraries, but only one. There are other common perspectives, such as topical classification, or temporal and historical organizations. Projects like ECAI (the Electronic Cultural Atlas Initiative), also run out of Berkeley by Lewis Lancaster, which is attempting to put together centuries of cultural information with its geographic locations as well as its temporal perspective, have very important needs for gazetteer information, but that gazetteer has to include not just today's place names but also the place names of 5,000 years ago. So there are many issues in that area.
Digital libraries can give you multiple views of the same information from any one of those perspectives. So we are aiming at a very large and broad user base. There are going to be varying levels of expertise in the content. People using these digital libraries will range from school kids doing little reports to scientists doing their day-to-day scientific work, and they may be referring to the same content to do it. There are also varying requirements for access methods. Some people want to just browse through the World Wide Web; others may want to use protocols to retrieve large bodies of information they can then integrate with datasets for their own analysis, and so on. There is also the very difficult case of a simple expression of interest, something I want to know about some place or some thing, where the system has to have some capability for dealing with natural language expressions. One of the issues is mapping from these natural language expressions to controlled vocabularies, and that includes gazetteers. That is an area of research that I have been working on with Professor Buckland and some others at Berkeley for some time. We have some pretty good methods for taking people's normal expressions of how they'd ask for things and mapping them into controlled vocabularies, like the Library of Congress classification scheme, like the INSPEC thesaurus, and so on. We want to do the same sort of thing with gazetteers.
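To give a flavor of that mapping, here is a minimal sketch of the general statistical idea: learn associations between the words people actually type and controlled-vocabulary headings from training pairs, then rank headings for a new free-text request. The training pairs, headings, and simple count-based scoring are illustrative assumptions, not the actual Berkeley method.

```python
from collections import defaultdict

# Illustrative training pairs: (free-text phrase, controlled heading).
training = [
    ("water supply in southern california", "Water resources development"),
    ("dams and reservoirs on rivers", "Dams"),
    ("flooding along coastal streams", "Floods"),
]

# Count how often each free-text word co-occurs with each heading.
assoc = defaultdict(lambda: defaultdict(int))
for phrase, heading in training:
    for word in phrase.lower().split():
        assoc[word][heading] += 1

def suggest_headings(query):
    """Rank controlled-vocabulary headings for a free-text query."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for heading, count in assoc[word].items():
            scores[heading] += count
    return sorted(scores, key=scores.get, reverse=True)

print(suggest_headings("flooding near dams on the rivers"))
# -> ['Dams', 'Floods']
```

The same machinery applies to a gazetteer: the "headings" simply become gazetteer entries with coordinates attached.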
In digital libraries we have various needs. One is geographic and spatial querying. Another is spatial browsing. There is also geographic and spatial indexing; we have to index this stuff as well. I probably won't have time, but I will show you some examples if I do.
In geographic and spatial querying, what we are talking about is basically search by location, search by place names, or search by subject, theme, and time period. All of these are important. In my international digital library project we are dealing with folks who have encoded archival information from archives in the UK. This includes everything from the complete cathedral records of Durham Cathedral, which go back to property transactions in the 1300s, to the war journals of famous generals in the Sudan campaign, and so on. These are things that people want to find and locate, and sometimes they have not been nicely georeferenced in any normal sense, but they will have place names and they will have ways of identifying places.
In a more detailed model, we are talking about geographic and spatial querying. Both of these really imply querying on relationships within a particular coordinate system. Spatial querying is really the more general term; remember, some digital libraries are not geographically bound. Some of them, the NASA ones, will be talking about other planetary bodies as well. You can define these as queries about spatial relationships, like intersection, containment, boundary, adjacency, and proximity, for entities that are geometrically defined and located in some sort of space. Geographic coordinates give you geometric relationships where you can actually measure and calculate things, so you can figure out what 5.21 miles north of Crystal City means. You can also have spatial relationships that are topological in nature. "Inside the Beltway", which has been referred to several times, is kind of a vague notion in our minds; it's really more a state of mind than a geographic location. Or, in the directions you were given, there may be some mention of the little round building on the right side of the Smithsonian Castle. Well, that's not exactly a geographic coordinate. That's more topological. It depends on which way you came out of the Metro to decide what's left or right.
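To illustrate those relationships, here is a minimal sketch using the Shapely library (one common geometry engine; any library with these predicates would do). The regions and coordinates are made up for illustration.

```python
from shapely.geometry import Point, Polygon

# Two made-up regions that share an edge, and a point site.
region_a = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
region_b = Polygon([(4, 0), (8, 0), (8, 4), (4, 4)])
site = Point(2, 2)

print(region_a.contains(site))          # containment: True
print(region_a.intersects(region_b))    # intersection: True (shared edge counts)
print(region_a.touches(region_b))       # adjacency: True (boundary only, no overlap)
print(region_a.distance(Point(10, 2)))  # proximity: 6.0 units to an outside point
```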
The types of spatial queries we look at, and are concerned with, are really the point-in-polygon sort of thing. Somebody can click on a map at some point and say, "What have you got about this spot?", and what they really mean is "What have you got that kind of surrounds this area, that encompasses this point?" There are also queries like "What about this area?" "What about this region?" "What kinds of things that are point-encoded do you have there?" "What are, for example, the city names included in this space?" "What kinds of lines or borders cross this or intersect with it?" and "What areas might overlap with it as well?" The areas might be layers in a GIS context that are about particular thematic content. Other types of queries are distance or buffer-zone queries: "What is within 40 miles of the border between Northern and Southern Ireland?" That sort of thing is very common when you look at city planning issues, and so on. "What's 200 yards left and right of this?" There are also path queries, which follow particular established paths set out in the geographic domain. "What is the shortest route from San Francisco to Los Angeles when you are driving?" You have to take roads; therefore you are following a particular path that is not just the geographic distance between San Francisco and Los Angeles. There are also questions of putting together lots of different types of information based on a geospatial reference. If you want to know the names of the farmers affected by flooding in Monterey and Santa Cruz counties, when we had some rivers flood not too long ago, first you need to zoom in to find the area you are talking about, the areas that were actually flooded. Then you look at the cadastral information to find the property lines; you look up the current records to find out who the affected people are, and so on. So you are combining all types of information, and it may not be just a single database. One point of access is having a gazetteer that will at least get you to the areas that you are interested in.
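Here is a minimal sketch of how a digital library might answer the first few of those query types over a catalog of georeferenced items, again using Shapely. The catalog, item names, and footprints are hypothetical.

```python
from shapely.geometry import Point, Polygon

# Hypothetical catalog: each item has a name and a footprint geometry.
catalog = [
    {"name": "Dam safety report",
     "footprint": Point(3, 3).buffer(0.5)},
    {"name": "County general plan",
     "footprint": Polygon([(0, 0), (6, 0), (6, 6), (0, 6)])},
]

def covering(point):
    """Point-in-polygon: what have you got that encompasses this spot?"""
    return [it["name"] for it in catalog if it["footprint"].contains(point)]

def intersecting(region):
    """Region query: what overlaps or crosses this area?"""
    return [it["name"] for it in catalog if it["footprint"].intersects(region)]

def within_buffer(geometry, distance):
    """Buffer-zone query, e.g. 'what is within 40 miles of the border?'"""
    zone = geometry.buffer(distance)
    return [it["name"] for it in catalog if it["footprint"].intersects(zone)]

print(covering(Point(3, 3)))             # both items cover this spot
print(within_buffer(Point(10, 3), 4.5))  # only the county plan is that close
```

Path queries would need a road network and a shortest-path algorithm on top of this; the geometric predicates alone are not enough.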
When you add a temporal constraint, which is very common for things like ECAI where you are looking at historical issues, you are combining all the previous stuff with temporal information as well. You have constraints that depend on an area which may not be accurately defined and a time period which may not be accurately defined, and the question becomes "What do you know about that four-dimensional space?" There are many issues in temporal querying that are also important.
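A minimal sketch of that four-dimensional filtering, under the simplifying assumptions that footprints are Shapely geometries and time spans are plain year numbers: an item matches when its footprint intersects the query area and its time span overlaps the query period.

```python
from shapely.geometry import Polygon

def matches(item, query_area, query_start, query_end):
    """An item matches when its (possibly approximate) footprint
    intersects the query area AND its time span overlaps the query
    period - the four-dimensional question above."""
    spatially_ok = item["footprint"].intersects(query_area)
    temporally_ok = item["start"] <= query_end and query_start <= item["end"]
    return spatially_ok and temporally_ok

# Hypothetical record: a cultural artifact dated 200-400 CE.
item = {"footprint": Polygon([(70, 30), (90, 30), (90, 45), (70, 45)]),
        "start": 200, "end": 400}
print(matches(item, Polygon([(80, 35), (100, 35), (100, 50), (80, 50)]),
              300, 600))   # -> True: both area and period overlap
```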
Browsing, of course, is what we see a lot of these days, particularly in Web-based applications. That is sometimes referred to as the hyper-map concept: look at the map, zoom in, pan around, and so on. When you get to certain levels, perhaps other information will appear. One thing that has been done with this is to geographically place other sorts of materials, so you get down to a certain level, and little icons or whatever appear that say, "OK, this is the book that talks about this particular spot". So there are many things that can be done with this. We have really just started to look at it.
There are lots of advantages in spatial browsing. You may not need the complete accuracy that you have with a full GIS, and you have a comprehensible search metaphor for many sorts of materials. But there are problems. If you are really trying to show everything that you have in a very large digital library in a georeferenced sense, you have clutter. You may click on a spot and the system will show you every level of geographic information that covers the entire world - every book about the entire world, every picture that can be said to be about the entire world, and so on. If we put icons for all of those on the screen, that doesn't work. The display has to take your level and your perspective into account at the same time as it shows the information. So you may only be concerned with the book about the Battle of Gettysburg when you are down to the point of the battlefield itself; you may not care if you are just looking at the southern half of the United States. What you need is good, and preferably accurate, geographic indexing, and you also need some way of deciding, either algorithmically or by setting it manually, at what level of coverage a thing really is of interest and should be displayed.
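A minimal sketch of one such algorithmic rule: compare the area of an item's footprint with the area of the current view, and display the item only when the two are commensurate. The threshold values here are arbitrary assumptions that would be tuned, or overridden manually per item.

```python
def should_display(item_area, view_area, min_ratio=0.001, max_ratio=10.0):
    """Show an item only when its footprint is commensurate with the
    current view: a book about the whole world is hidden when you are
    zoomed in on one battlefield, and vice versa."""
    ratio = item_area / view_area
    return min_ratio <= ratio <= max_ratio

# A world-scale item while viewing a battlefield-scale window:
print(should_display(item_area=5.1e8, view_area=25.0))  # False - too coarse
# A battlefield-scale item in the same view:
print(should_display(item_area=12.0, view_area=25.0))   # True
```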
The other issue, of course, is indexing. You have to put this stuff into the database somehow. Traditional geographic indexing has involved things like indexing books by adding place names from the Library of Congress subject headings and their name authorities. Of course, there are all the problems we have already heard about: names are not unique; the places referred to change size, shape, and so on over time; there may be spelling variations; and some places are temporary conventions, such as study areas. We have already heard about most of these.
So geographic coordinates are a natural solution to much of that. They are persistent, regardless of name changes, political boundary changes, or other changes. They can be simply connected to spatial browsing interfaces and to GIS data, and they provide a consistent framework for geographic information retrieval applications and for spatial querying. But geographic names are, and will remain, the primary entry vocabulary for digital library spatial queries. People (remember, we are talking about a wide range of people) often do not know their geography well enough to look at a map and say, "Well, I know where Germany is, so I can click here and find it." You may have seen in the papers some tests where school kids were asked to point out countries on a map, and so on. I even had a colleague with a Ph.D. not long ago who was writing a report. He was looking at economic data and noted that there were oil exports from and arms imports to the Sultanate of Brunei, so he naturally assumed it was in the Middle East: they export oil, they import arms, they must be in the Middle East. So he put that into a report, and I pointed out to him that there was a little problem there. So it's not just school kids. A lot of people who should know better often don't know their geography well enough to look at an unlabeled map and say what country is where. But they often will know the name, so he could have typed Brunei into the Alexandria gazetteer and found out a lot of information about it. Querying has to support spatial reasoning as well. When we are talking about the sorts of things we have been looking at (this is a little north of that, or this is within the approximate area of that), you have to be able to use a little bit of spatial reasoning based on the gazetteer information and other geo-information that is in the system or accessible over the network.
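A minimal sketch of name-based entry plus a crude bit of spatial reasoning; the gazetteer entries and the latitude-comparison notion of "north of" are illustrative assumptions, not the Alexandria gazetteer's actual interface.

```python
# Rough illustrative coordinates, not authoritative gazetteer data.
gazetteer = {
    "brunei": {"lat": 4.5, "lon": 114.7, "type": "country"},
    "santa barbara": {"lat": 34.42, "lon": -119.70, "type": "city"},
    "los angeles": {"lat": 34.05, "lon": -118.24, "type": "city"},
}

def lookup(name):
    """Name-based entry: the user types a name instead of clicking a map."""
    return gazetteer.get(name.lower())

def is_north_of(name_a, name_b):
    """Crude spatial reasoning: compare latitudes of two looked-up places."""
    a, b = lookup(name_a), lookup(name_b)
    return a is not None and b is not None and a["lat"] > b["lat"]

print(lookup("Brunei"))                             # on Borneo, not the Middle East
print(is_north_of("Santa Barbara", "Los Angeles"))  # True
```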
One thing we worked on originally in the Sequoia project, which has been carried on some since (a new version is now planned), is GIPSY, which does automatic georeferencing of text. The idea behind GIPSY is that you take a document, a piece of text, and feed it into GIPSY. It extracts names from the text in an attempt to identify the coordinates of those names and of the places discussed in the text, using a combination of evidence. What we have used, of course, is GNIS as the primary basis for it, as well as some GRASS information. The idea is to associate place names with coordinates. The essential feature is that you identify places as well as you can. Often there are discussions of multiple places within a text, and they will usually have some sort of geographic proximity if it's a focused discussion of a particular area.
We used examples of reports from the Department of Water Resources in the State of California on the Santa Barbara area, talking about dams and the various streams there, and so on. Each time there is a mention of an area, its approximate outline is determined. We essentially stack that up as evidence; think of it as stacking layers on top of the map. So you are building a sort of elevation map of potential relevance to a particular area, and the resulting map is then analyzed to find out which areas are most probably being talked about. In effect, for that document talking about Santa Barbara, we had a map which looked like this: it said there were other mentions of similar things up in Northern California, but the primary areas being discussed were here. We ended up suggesting that this approximate footprint would be the area discussed by the document.
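To make the evidence-stacking idea concrete, here is a minimal sketch in the spirit of GIPSY, not the actual implementation: each recognized place name contributes its approximate footprint (simplified here to a disc on a raster grid) as one layer of evidence, and the stack is thresholded to suggest the document's footprint. The grid size, footprints, and threshold are all illustrative assumptions.

```python
import numpy as np

GRID = 100                            # a coarse raster over the region of interest
relevance = np.zeros((GRID, GRID))
yy, xx = np.mgrid[0:GRID, 0:GRID]

def add_evidence(center_y, center_x, radius, weight=1.0):
    """Stack one place name's approximate footprint onto the evidence map."""
    mask = (yy - center_y) ** 2 + (xx - center_x) ** 2 <= radius ** 2
    relevance[mask] += weight

# Three place names recognized in the text, with overlapping footprints:
for cy, cx, r in [(20, 20, 15), (25, 18, 10), (22, 24, 12)]:
    add_evidence(cy, cx, r)

# The suggested document footprint is wherever enough evidence piles up.
footprint = relevance >= 2
print(footprint.sum(), "cells in the suggested footprint")
```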
Now we will look a bit at some of the UC Berkeley Digital Library work. This is the testbed that we are dealing with, and we have a lot of people participating in the project. Our testbed has been built over the past eight years (well, not quite that long); it started with the Sequoia 2000 project and continued through others. It is a quite diverse collection of material relevant to California's key habitats. The users, and there are real users of this data, include some State agencies, development corporations, some private corporations, and so on. It has been a prototype for, and had an impact on, the CERES system, where it is being used today. The main users are people at the California Resources Agency, the State Department of Water Resources, the California Department of Fish and Game, and so on.
The kinds of material we have in the database are environmental technical reports, bulletins, and so on. These were all received in paper form, scanned, and run through OCR. We also include a complete set of county general plans for the State of California, covering everything from air pollution control to noise abatement and everything else. All of them, again, were scanned as paper documents, OCR'd, and included in the database. There is a large collection of aerial and ground photography. A lot of this comes from the Department of Water Resources; they have pictures of all the dams in the State of California and how they were constructed, and fields of wild flowers, and all sorts of things. We have included all kinds of information. The collection is about three-quarters of a terabyte of data. That includes 70,000 digital images, 300,000 OCR'd pages of environmental documents, and over a million records in geographic and botanical databases.
The botanical information is some of the most interesting. This is referred to as CalFlora. It started out as a single collection of photographs of California plants, taken by a Brother of a religious order who traveled around the state photographing plants and writing down where he took each picture. They were donated; we scanned them and put them up. People doing botanical research were saying, "Oh, wow, I found a picture of this plant by looking it up. Hey, can you link my data to these things? I am doing all my research on these sorts of plants and where they are, and here's all my sightings information." Slowly other people have been adding to it. It is now one of the largest botanical information centers for Western America - or the largest, I guess we should say. So we have loads of information about plants in the State of California, and we are now starting to add animals as well, linked to photographs, linked to site data, and so on. All of it, certainly the carefully collected scientific data, has very accurate site annotations. Some of the other things, such as the early photographs by the Brother, weren't quite so accurate; the annotation might just give the county the plant was in.
We also have a lot of geographic data. This includes the TIGER files, a chunk of GNIS, and information from the State of California on dams. We also have lots of other information. We have a GIS viewer, constructed in Java, that allows you to view this information and overlay it. This also includes digital orthophotoquads and other information from the San Francisco Bay Area. As mentioned before, there are about 300,000 pages of digital documents, including all sorts of printed material. All of this, you might note, is in the public domain. Part of what we tried to do in this digital library project was to use entirely public domain materials, so we didn't spend our entire budget and time negotiating with lawyers over whether we had the right to show anything. There is a lot of information there, all State information. We also have some things from the World Conservation Digital Library that we are linking to CalFlora and other information.
There are 17,000 images of California natural resources and 17,000 photos from St. Mary's College (that's our Brother). The California Academy of Sciences has donated photos as well. We also use about 40,000 Corel stock photos, primarily for the vision researchers in the group, who are trying to figure out how to do better tiger finders and things like that.
There have been some success stories in this. One is LUPIN, the Land Use Planning Information Network. Since we have all the county general plans and other information, this has become a widely used resource within the state. Not only can you see your own general plan; you can find out what your neighbors are doing as well.
Also, when we had El Niño a couple of years back and there were floods going on all over the state, we discovered that one of our databases was one of the few things online that allowed you to find which dams were on which particular streams, and therefore who was in trouble downstream if there were problems with a dam.
CalFlora, of course, I mentioned. There are folks who are using our information: our services are starting to be used at the California State Library, and some folks at the FBI have started trying to use some of this for Freedom of Information Act work.
So we have a research agenda of understanding user needs, extending functionality to documents, enlivening legacy documents, and so on.
I want to go quickly through some of this. We have what we call multivalent documents. This is a new approach to putting documents together so you can view them in different ways. This is my image of it. It treats a document not as a single page image but as a composition of layers and pages, just as you would have layers in a GIS. You may have a scanned page image as your primary document; you may then have an OCR layer, and a mapping between the OCR and the page image. That mapping allows you to do things like run a search and have the results highlighted on the scanned page image. There are lots of interesting capabilities. We also have particular behaviors built into these things, such as table behaviors. If you have a table in your document, you can identify it and pull it out, and then do things like sort that table by different columns. What was formerly just a scanned page image now becomes a live document. Suppose your scanned page image has a picture of a map in it. You click on the map (this is mostly done by saying, "OK, here are the basic coordinates and outline for that map") and you are taken to a GIS layer, which brings up a browsable version of that map that you can pull other information into. There is also a layer that is linked to the network using information retrieval protocols, like Z39.50 and a new one we are working on called the Digital Library Information Exchange Protocol.
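A minimal sketch of the layering idea: a scanned page image as the primary layer, an OCR layer aligned to it, and a mapping that lets a text search be highlighted on the scan. The class and field names here are illustrative assumptions, not the actual multivalent document implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OcrWord:
    text: str
    box: Tuple[int, int, int, int]   # x, y, width, height on the page image

@dataclass
class MultivalentPage:
    image_path: str           # primary layer: the scanned page image
    ocr_layer: List[OcrWord]  # OCR layer, aligned word-by-word to the image

    def highlight(self, term: str) -> List[Tuple[int, int, int, int]]:
        """Use the OCR-to-image mapping to find where a search term
        should be highlighted on the scanned page."""
        return [w.box for w in self.ocr_layer
                if w.text.lower() == term.lower()]

page = MultivalentPage(
    image_path="bulletin_page12.png",   # hypothetical scanned page
    ocr_layer=[OcrWord("Santa", (120, 340, 48, 14)),
               OcrWord("Barbara", (172, 340, 70, 14))])
print(page.highlight("barbara"))        # -> [(172, 340, 70, 14)]
```

Other layers (table behaviors, map regions linked to a GIS viewer, network layers speaking Z39.50) would hang off the same page object in the same way.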
Another thing you can do is use the same techniques on HTML. Here's one of those scanned page images with a map. We can pull out that map, put it into the GIS viewer, and then start overlaying other layers on that scanned map, to whatever level you want. If you have them, you can even pull in orthophotoquads and overlay them on top of the information. There are also capabilities for annotation and so on.
I mentioned the dams within the State of California. That started life as a paper document, which was a standard reference work for people in the Department of Water Resources. It was scanned and OCR'd with some special OCR software, which allowed us to do things like pull out the coordinates of all the dams and lay them on a map, so that if people wanted to know about a dam in a particular county or area, they could look at it that way. We also have a relational database that you can query.
Well, I think I am out of time.
Lola Olsen - Yes, you are. Thank you very much.