Doug Nebert (USGS, FGDC) - NSDI and Gazetteer Data

I work with the U.S. Federal Geographic Data Committee Secretariat, which is housed at the U.S. Geological Survey in Reston, Virginia. I am going to give a brief overview of what FGDC and NSDI - two acronyms right there, the Federal Geographic Data Committee and the National Spatial Data Infrastructure - are about.

The National Spatial Data Infrastructure is an inter-organizational community designed to promote and share geospatial information. It started out with a primarily federal focus, allowing federal agencies to advertise the fact that they were collecting information and to share it with others so that there was no duplication of data collection. It has expanded significantly to embrace data collection at all levels, including state and local government and the private sector. With shrinking budgets, everybody has to spread their resources farther and collaborate through cost sharing and creative means of funding, collecting, and organizing information. We have a secretariat staff of around 12 people in Reston. We have many stakeholders; probably four or five hundred people consider themselves members of the Federal Geographic Data Committee. We have 15 federal agencies and 30 official state coordination councils for GIS around the states. The National States Geographic Information Council is a formal liaison member, as are the National Association of Counties and the National League of Cities. So if you extend that pyramid down, it means that something like 20,000 municipalities or local governmental units fall under the umbrella of FGDC activities.

One of the main ideas of the NSDI is to try and conform where possible to international consensus practices. We want to leverage as much as possible the things that are out there by adopting standards.

I am going to go through some building blocks of the NSDI. A central one is metadata - the descriptive properties of datasets. We have all heard a little about metadata. You could think of these as the fields that you catalog, the fields that can be searched, and the fields the end user can use to evaluate the quality of spatial data. We have two styles of spatial data, and I was a little distressed to hear someone say that there are only seven layers of data in the NSDI, when in fact they were describing the framework data layers that are currently being specified. In reality the NSDI is probably 98% populated by these other things: datasets that people have described and collected in ad hoc scientific projects in communities around the country that don’t adhere to a specific standard. So one of the goals of the NSDI and FGDC is to make it easy for people to publish these datasets, describe them in consistent ways, and make it so that other people can find them. Even if they don’t adhere to a standard, it is better to be able to find these data resources, compare them, and use them, along with the principle of truth in labeling. We don’t have a targeted scale, we don’t necessarily have a targeted content; we have an interest in making it so that people don’t collect the same data more than once.

So our two boxes here: Framework would be standard themes or common base themes that people can use and specify - a minimum collection target for things that you’d see pretty much on a topographic map. And what I’ll call geodata: the geodata would be anything else that’s geospatial and can be packaged and distributed, whether it’s an image, a GIS dataset, or a tabular dataset with geographic references in it.

The access mechanism to the entire NSDI is through the Clearinghouse. The clearinghouse uses metadata as a proxy, a surrogate for the information stored in the framework or geodata. In fact the metadata may be stored in and with the data: in a GIS that is managed in a relational database there is no reason that the metadata and the data would need to be separated. As an artifact of cataloguing we often separate the metadata out, put it in a catalog, and the things we are pointing to, although they are digital, might be somewhere else. We are starting to see these things merge together. But we see the metadata as well-known public fields that can be queried, packaged, and presented back to the user. And they may be synthesized from or derived from the data fields themselves.

Wrapped around and touching all of these boxes are Standards. If we don’t have some conventions for how we describe the datasets (which is our metadata), how we package framework data and what structures are present, or how the clearinghouse protocols operate, then we are not going to be able to grow a network that will be interoperable.

And underneath it all, or connecting it all in a kind of matrix, are Partnerships. We rely on people voluntarily participating in the NSDI, contributing effort, making these linkages work, collecting the metadata even though it is a pain, putting it into the clearinghouse, and making it searchable.

So the clearinghouse is more than just a catalog; it’s really a distributed collection or federation of searchable servers that hold FGDC or similar metadata. We have several instances around the world now where they are not using FGDC metadata, but they have mapped the attributes of FGDC metadata to their own local attributes. A good example is the 14 servers in Australia that use ANZLIC metadata. They call the field dataset name; we call it title. Either way it is actually attribute number 4, and the search goes through. When you get the results back, it looks different than you might be expecting in the U.S., but it still works.
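As a rough sketch of that mapping (a hedged illustration only: the use of attribute number 4 for title follows the example above, while the function and dictionary names are invented and do not reflect any real server configuration):

```python
# A sketch of attribute mapping: two servers with different local field names
# answer the same search because both map their field to a shared, numbered
# search attribute. The mappings below are illustrative only.

FGDC_SERVER_MAP = {"title": 4}           # U.S. node: local field "title" -> attribute 4
ANZLIC_SERVER_MAP = {"dataset name": 4}  # Australian node: "dataset name" -> attribute 4

def to_common_query(local_field, value, server_map):
    """Translate a locally named field into a query on the shared attribute number."""
    return {"use_attribute": server_map[local_field], "term": value}

# Both servers end up receiving the same wire-level query, so one search spans both.
print(to_common_query("title", "hydrography", FGDC_SERVER_MAP))
print(to_common_query("dataset name", "hydrography", ANZLIC_SERVER_MAP))
```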

The same thing can be done with DIF and other metadata formats. There are over 150 servers online worldwide that all conform to the same search interface. So there is an API, a programming interface, that uses the library protocol known as Z39.50 for catalog access. These metadata point to data characteristics, and we want them to include handles for data order or download. So in the metadata, usually prominent up near the top of the record, is a URL that the user should be able to click on to gain access to that spatial dataset. Now it might be a proxy: if you click on it, you get an order form because it is a terabyte dataset and I can’t send that to you over the Internet, or it is copyrighted, or it is packaged in such a way that you can only get it on CD-ROM. We see that approximately 20% of the datasets being described in the clearinghouse worldwide are directly downloadable, because of their size and the lack of restrictions on redistribution. The other 80% require some kind of referral or negotiation; they are for sale, specially packaged, or copyrighted. There may be a misconception that the FGDC or the NSDI is only for free data. It is not. We want free discovery of information, so that users can compare available data over the same geography to see which data suit their purpose. We do have several commercial data providers online in the clearinghouse, selling their data but advertising it through the metadata for free.

In 1994 the FGDC published a Content Standard for Digital Geospatial Metadata, the CSDGM - an acronym that is not easy to spell and does not sound like anything. It includes over 300 fields and structures to describe spatial datasets in great detail. Most of them are optional fields. They are there if you have that property to describe; it is better to put it in a predictable container than to invent one or dump it into a general notes bin. So a lot of people will look at FGDC metadata and say, "oh, it’s too big and too scary", when in reality what you are doing is a progressive description of the information.

If you have spatial coordinate reference systems, then there are bins to fill that stuff in; if you don’t, you don’t. If you don’t have source information to describe, you don’t describe it. If your only distribution information is online, you just put a URL. If it’s not, you need to describe things in more detail. The interface should allow you to give progressive levels of detail.
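To illustrate the idea of progressive description (a sketch only; the section and field names below loosely echo CSDGM headings such as identification, lineage, and distribution, but are simplified and are not the standard’s actual element tags):

```python
# Sketch of "progressive description": start from a small required core and attach
# optional sections only when the dataset actually has that property to describe.

def describe(title, bbox, abstract, online_url=None, spatial_reference=None, sources=None):
    record = {
        "identification": {
            "title": title,
            "abstract": abstract,
            "bounding_coordinates": bbox,  # west, east, north, south in decimal degrees
        }
    }
    if spatial_reference:                  # only present if the data are georeferenced
        record["spatial_reference"] = spatial_reference
    if sources:                            # lineage only if there are sources to describe
        record["lineage"] = {"sources": sources}
    if online_url:                         # online distribution may be just a URL
        record["distribution"] = {"online_linkage": online_url}
    return record

# A minimal record: no lineage, no coordinate system details, just a download URL.
minimal = describe(
    title="County road centerlines",
    bbox=(-77.6, -77.0, 39.2, 38.8),
    abstract="Road centerlines digitized from 1:24,000 sources.",
    online_url="http://example.gov/roads.zip",  # placeholder URL
)
print(minimal)
```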

Metadata is not used just for cataloging or discovery but also to help the user appraise the resource they are about to get: is this appropriate for my use? And finally, when you get the dataset, it should tell you something about its source information, where it came from, what it should and shouldn’t be used for, and relevant scales or details that might be pertinent to the end user - kind of like the instructions inside the box when you’ve bought something. That is much further than most cataloging systems go.

So in the metadata, spatial datasets are discovered by geographic coordinates, and we assume these bounding coordinates are polygons expressed in latitude and longitude decimal degrees. That is the conventional data form. So for search it is easy: no matter where in the world it is, your metadata record needs to transform its bounding coordinate rectangle into lat/long, so we can reliably ask where things are across all of the clearinghouse sites. Spatial datasets can also be discovered by the time period of content - either a beginning and an ending date, or a specific date of coverage; in the case of an image, a certain date and time. There are fields with enumerated domains or pick lists that you can search against, such as the spatial data format. There are also many fields with free text, where the encoder is encouraged to put in lots of gory detail and tell a story about the dataset and how it came to be. The narrative is very important. And finally we support not just search against the fields but also full-text search. So one can search for a word in full text that also falls in a geographic region and within a time period of content of interest. You can build a very complicated query or a very simple one, depending on how many records you would like to get back. It is important to see here, and relevant for the discussions of a gazetteer, that right now we expect all metadata entries to have a geographic footprint or bounding coordinates, because that is one of the primary discovery mechanisms in the clearinghouse.
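A sketch of how such a combined query might be composed on the client side (the clause names, the bounding-box tuple order, and the build_query function are assumptions for illustration; a real client would express this in the catalog protocol’s own query syntax):

```python
# Sketch of composing a clearinghouse query that mixes full text, a geographic
# footprint, and a time period of content.

def build_query(text=None, bbox=None, begin_date=None, end_date=None, data_format=None):
    clauses = []
    if text:
        clauses.append({"anywhere": text})                    # full-text search
    if bbox:
        west, east, north, south = bbox
        clauses.append({"bounding_box_overlaps":              # lat/long decimal degrees
                        {"west": west, "east": east, "north": north, "south": south}})
    if begin_date or end_date:
        clauses.append({"time_period_of_content":             # beginning/ending dates
                        {"begin": begin_date, "end": end_date}})
    if data_format:
        clauses.append({"spatial_data_format": data_format})  # pick-list field
    return {"and": clauses}

# "Find anything mentioning 'wetlands', over this footprint, collected in the 1990s."
query = build_query(text="wetlands",
                    bbox=(-124.0, -120.0, 46.0, 42.0),
                    begin_date="1990-01-01", end_date="1999-12-31")
print(query)
```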

The FGDC metadata does also support the use of thesauri in keywords for place, theme, and time periods, so you can put a word in and reference a thesaurus. But what is the most popular thesaurus? Why, it’s this one called "None". And this doesn’t really help. So what people do is, they see that field, they figure they want to put in keywords, they put in whatever words they want, and they say, "oh, it’s not by any authority", which is just as good as not putting in any words at all, because there are no bounds on what somebody might use. Is it about hydrography, hydrographic, water, water resources? There is no discipline on this. So we can’t build a pick list from it, we can’t use it; it’s not a controlled vocabulary. And because geography is the intersection of a thousand fields, we don’t have a single, agreed-upon thesaurus that we could use for theme or place or time. And in an international setting this gets even worse, as you can imagine, with many languages and the linguistic nuances of different words. So the use of thesauri is difficult, but at the same time people say they want it. This is a conundrum.
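A small sketch of what even a modest controlled vocabulary buys you (the THESAURUS mapping and the normalize_keyword helper are invented for illustration; they are not any real thesaurus):

```python
# Without a named thesaurus, keyword variants scatter and cannot feed a pick list;
# with a small controlled vocabulary, variants collapse to one preferred term.

THESAURUS = {
    "hydrography": "hydrography",
    "hydrographic": "hydrography",
    "water": "hydrography",
    "water resources": "hydrography",
    "roads": "transportation",
    "highways": "transportation",
}

def normalize_keyword(raw):
    """Return the preferred term, or None if the word is outside the vocabulary."""
    return THESAURUS.get(raw.strip().lower())

free_text_keywords = ["Hydrographic", "water resources", "streamflow"]
print([normalize_keyword(k) for k in free_text_keywords])
# ['hydrography', 'hydrography', None] - the unrecognized term still needs review
```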

One thing that is happening in the metadata world is that ISO TC211, a special committee working on geospatial standards, is building a standard for geospatial metadata. It was influenced by and drew contributions from the FGDC metadata standard, the ANZLIC metadata standard, the European CEN TC287 standard, and NATO’s DIGEST/DGIWG work. There are many sources and many styles of metadata being built around the world, and they have coalesced into a draft international standard which is in committee draft right now. Its new number, in case you didn’t know (it used to be 15046), is 19115.

One of the things relevant to the gazetteer discussion is that people were complaining, mostly from the user interface point of view, "well, I don’t want to type in the coordinates for my dataset; I just want to type in a place name and be done with it." Well, I say that is a user interface problem: your GIS should be volunteering the coordinates for you when you are encoding that stuff. And it was the consensus of the working group that you can have spatial reference either by identifiers such as place names or by coordinates. For search we want the controlled-vocabulary idea; we want some rigorous tie between the place names and their locations. So if you do use a thesaurus or the equivalent of a gazetteer - if you say this is really this place and there can be an unambiguous translation to some kind of coordinate - then our searches can continue to operate in the ISO TC211 environment. If not, our searches are going to become less reliable. We won’t have a coordinate to go and search on, and we will have ambiguity in how that place name gets translated. Future encoding of place will require us to translate these place names into coordinates for use in the clearinghouse and similar kinds of services.

What would be really handy, for someone putting together metadata or somebody using metadata in user interfaces, would be an online service protocol on the Web that would let software query geographic name servers, or a hierarchy of these servers, to assist in relevant place name and coordinate assignment. If somebody had a place name and some context, you could get a whole bunch of structured information back that could be used for human presentation or, more importantly, for use by software to go and do something interesting - put a dot on a map, show where that place is, get some feedback from the user. Metadata collection and cataloging, whether you are inside a GIS package or in your own metadata collection tool, would benefit from access to this Web service, so that we would get place names and coordinates, populated where possible, to flesh out the metadata record. On the flip side, on the user side, when you are looking for things, your interfaces could also benefit from a stronger match of place to coordinates, so that you could do searches by place name and have the search actually passed as coordinates to all these servers around the world.

The general context of interaction, then, is that a user with some kind of Web-based application, whether it’s a metadata collection tool or some kind of search interface, would volunteer a place name in some context - say a country or a ZIP code, as in the earlier example - and pass it to the equivalent of a DNS sitting out there, an omniscient Geographic Name Service, let’s call it a GNS, and get back some kind of structure that says, here are one or more possible matches based on the query you gave. Right now there are many different interfaces out there for different gazetteers, basically home-built. That doesn’t mean they are bad, but the specification for what I ask for and how I get results back is different for each one, so I have to build a custom interface to each gazetteer collection. Moreover, each of those gazetteers doesn’t know about the others. They can’t interact in a predictable way or forward requests if something isn’t known at the highest level. Cliff Kottman was talking about a creek that flows behind his cabin in West Virginia but is not on the map, so it is not in BGN, so it is not in GNIS. He knows it’s there, and he’d like to have a place where he could push this in as a well-known place name, even if it doesn’t have much of an authority behind it. Wouldn’t it be nice if there were a fail-over to say, well no, I don’t recognize that place name, I will pass you to another authority, who could pass you to another authority, until eventually it is resolved?
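A sketch of what a shared response contract could look like, so that a client would not need a custom interface per gazetteer (the PlaceMatch fields, the score, the authorities, and the choose helper are all assumptions for illustration, not an agreed specification):

```python
# One common shape for lookup results, regardless of which gazetteer answered.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PlaceMatch:
    name: str           # the matched place name
    feature_type: str   # coarse class, e.g. "populated place" or "stream"
    latitude: float     # representative point in decimal degrees
    longitude: float
    authority: str      # who asserts this match
    score: float = 1.0  # how well the query matched, 0..1

def choose(matches: List[PlaceMatch]) -> Optional[PlaceMatch]:
    """Pick the best candidate a name service returned (here, simply the highest score)."""
    return max(matches, key=lambda m: m.score) if matches else None

candidates = [
    PlaceMatch("Springfield", "populated place", 39.80, -89.64, "authority A (illustrative)", 0.9),
    PlaceMatch("Springfield", "populated place", 37.21, -93.29, "authority B (illustrative)", 0.6),
]
best = choose(candidates)
print(best.name, best.latitude, best.longitude)
```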

This is actually how the domain name system works - the DNS on the Internet. No single name server out there knows about every name in the whole world; if it doesn’t know one, it passes the query to another one and another one until it gets resolved. This is the notion of a "resolver", a geographic names resolver. So from an encoding point of view, sitting in my metadata entry system, I might actually go to a service that cascades down through a bunch of known services linked in predictable ways. Perhaps national or international authorities take the top rung; if they don’t know that term, they pass the request through to other ones. Some kind of result, or multiple results, might get passed back up to the name service and then encoded into a structure - not just an HTML picture, not just a pretty file, but actually a structured chunk of information, probably XML - that can come back to the encoding system, which can then do something with it. Say there are four or five things that came back: which one of these five do you think best matches what you were talking about? You pick one, and it pushes it into the metadata for you - only the salient fields it needs. It doesn’t need everything in that gazetteer record; it may only need a translation of the coordinates, a name, and an authority.
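A minimal sketch of that fail-over behavior (the two "services" below are local stand-ins for remote authoritative and public name servers, and every name, coordinate, and field in them is invented; a real resolver would issue network requests and parse the structured, probably XML, responses):

```python
# DNS-style fail-over: try a trusted chain of name services in order and stop at
# the first one that recognizes the name.

def national_authority(name, context):
    known = {("Reston", "US-VA"): {"lat": 38.95, "lon": -77.35,
                                   "authority": "national authority (illustrative)"}}
    return known.get((name, context))

def public_donor_service(name, context):
    known = {("Unnamed Creek", "US-WV"): {"lat": 39.0, "lon": -80.0,
                                          "authority": "public donor (illustrative)"}}
    return known.get((name, context))

RESOLVER_CHAIN = [national_authority, public_donor_service]

def resolve(name, context):
    """Cascade through the chain until some service resolves the place name."""
    for service in RESOLVER_CHAIN:
        match = service(name, context)
        if match:
            return match
    return None

# Only the salient fields get pushed back into the metadata record.
match = resolve("Unnamed Creek", "US-WV")
if match:
    metadata_place = {"place_keyword": "Unnamed Creek",
                      "place_authority": match["authority"],
                      "point": (match["lat"], match["lon"])}
    print(metadata_place)
```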

The same thing would go for search. If I am searching the clearinghouse or the master directory, I would like to be able to give it a place name and context out of the blue - not just from a pick list - and have a GNS again help me resolve what I mean, offer me some choices so I can say, "yeah, that’s the one I mean, I would like that", and then push the search out to one or more distributed search servers. So I could query those places by their coordinates, because they all speak coordinates, based on something I supplied as a name.

So what’s needed? We need a protocol; we have to pick a protocol. We want to do something on top of TCP/IP. We need a syntax for the request, in the spirit of the Web mapping test bed, where they did this "getmap" - basically a URL with a lot of parameters packed onto the end of it. What are those parameters? Which ones are optional, which ones are repeatable? How do you spell them? How do you separate and delimit them? What are the things you could request of a GNS, and how do you package that request? And then when you get the thing back, how is it going to be structured? What chunks are going to be in it? Are they ordered? Are they optional? Are they repeatable? And then further, what is the packaging of the response? It isn’t just the structure coming back. You might like to see that structure presented as HTML, or as XML, or as full text, or as an Access database file. There are any number of ways you could package it. The packaging is separate from the structure; it says how you want the result actually exported to you.
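A sketch of what such a getmap-style request for names might look like as a URL with packed-on parameters (the operation name "getplace", every parameter name, and the host are assumptions for illustration, not an agreed syntax):

```python
# Build a name-service request as one base URL plus named parameters.
from urllib.parse import urlencode

def build_getplace_url(base_url, name, context=None, max_results=10, packaging="xml"):
    params = {
        "request": "getplace",      # which operation is being asked for
        "name": name,               # the place name to resolve
        "format": packaging,        # packaging of the response: xml, html, text, ...
        "maxresults": max_results,  # cap on how many candidate matches come back
    }
    if context:
        params["context"] = context  # optional qualifier, e.g. a country or ZIP code
    return base_url + "?" + urlencode(params)

print(build_getplace_url("http://gns.example.org/lookup", "Rock Creek",
                         context="US-MD", packaging="xml"))
# http://gns.example.org/lookup?request=getplace&name=Rock+Creek&format=xml&maxresults=10&context=US-MD
```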

We probably also need an ability to certify or authenticate responses. What is to keep someone from spoofing an official name of something? We want to know that this really is the official name. We want a certificate, some kind of digital signature, that we could check to find out that the BGN said this. How do I know they actually said that? How do I know somebody didn’t get in the middle somewhere, intercept that response, and generate something else? This is particularly important in a cascade of authorities, where you are handing off from one system to another. You would want to know exactly which one it came back from.
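A minimal sketch of the verification idea (a shared-secret HMAC is used here only to keep the example self-contained and runnable; the certificates and digital signatures discussed above imply public-key signing in a real deployment, and all values below are placeholders):

```python
# Check that a response really came from the authority it claims to come from.
import hashlib
import hmac

def sign(payload: bytes, secret: bytes) -> str:
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str, secret: bytes) -> bool:
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(sign(payload, secret), signature)

secret = b"shared-secret-with-the-authority"  # illustrative only
response = b"<match name='Rock Creek' lat='38.97' lon='-77.04'/>"
signature = sign(response, secret)

print(verify(response, signature, secret))                 # True: untampered
print(verify(response + b" (edited)", signature, secret))  # False: altered in transit
```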

We finally need the DNS-like implementation that can traverse these trusted hierarchies of name servers, eventually and possibly cascading down to, or accessing, public servers, where you have donations from people who live in communities and have place names that they want to push into the GNS. They might not have any other authority except that they are on the Net. So we need support for both authoritative and public or donor name services on the Internet. And these are for places that are not necessarily the traditional place names we would see in GNIS or a gazetteer. These could be features such as businesses, and there would be great commercial appeal in being able to push somebody’s business into a gazetteer, just as GO2 was talking about.

We need agreement on feature type definitions. We saw a classification crisis yesterday. Well, if we don’t agree on what these feature types are, or have a very tight translation between them, a GNS and a cascade are not going to work well. What are these named things? What are the classes of named things we can agree on at a very coarse level? The grouping of features into themes is probably less important as a classification issue. I would suggest that you might want to focus on populated places or locales first and then work into more complex structures.
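A sketch of a coarse crosswalk between feature classifications (the coarse class names, the per-gazetteer codes, and the crosswalk table are all invented for illustration):

```python
# Two gazetteers with different local type codes mapped onto a small shared set of
# coarse classes, so a cascade can compare their results.

COARSE_CLASSES = {"populated place", "locale", "stream", "administrative area"}

CROSSWALK = {
    "gazetteer_a": {"ppl": "populated place", "stm": "stream", "adm1": "administrative area"},
    "gazetteer_b": {"city": "populated place", "creek": "stream", "state": "administrative area"},
}

def to_coarse_class(source, local_code):
    """Translate a source-specific feature code into the shared coarse class, if any."""
    mapped = CROSSWALK.get(source, {}).get(local_code)
    return mapped if mapped in COARSE_CLASSES else None

print(to_coarse_class("gazetteer_a", "ppl"))    # populated place
print(to_coarse_class("gazetteer_b", "creek"))  # stream
```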

You will also want to work on the formal semantics of relationships among features. That is going to be important for exploiting the geographic richness of information - that things are adjacent to, or upstream from, or near one another. What do those things mean in the context of a search on the Internet? Again, we are thinking beyond just the places into real-world things.

So what are the challenges of levels of place? Traditionally we have political subdivisions. Those aren’t too challenging: they nest in pretty nice hierarchies, and there are lots of gazetteers that do those kinds of things. There are some places where you can find natural features; some gazetteers include places like mountains and rocks and headlands and such. We also need streams. We need to know not just that there is a stream; we need to know what stream flows into that stream. There is a topology of streams, globally, and it would be very helpful to know it. Which Rock Creek is it that you are talking about? In one state you can have hundreds of Rock Creeks, and without some context you might have ambiguity in the discovery of things. We need what Scott was talking about - location by address - but not just along roads. We also need things like you find on township grids and land surveys. There is an awful lot of information out there that has a legal description to it, and in most cases it isn’t quite yet in a GIS. The property that you own has a legal description. It is in a parcel description and it has got some metes and bounds associated with it. It doesn’t have a latitude and longitude, and it may not have state plane feet. How do we geolocate those kinds of things? Again, one of the services of a GNS could be to translate some estimate of that: you say, "I am in Township 5 North, Range 3 West, Section 12; what does that mean?", and it returns a latitude and longitude or some kind of polygon that defines that extent. And finally, navigation along rivers. I say I am 12 miles above the confluence of X and Y, on such and such a stream. Where is that? And what are the things around me? A specialized GNS could respond to that kind of query, too. A lot of these things are more inventive, and they build on the capabilities of integrating GIS with traditional databases. And finally relative position, or bearing from: I am a block down from the church. There are a lot of things that are across the street, that are 20 meters north of the bus stop, and so on.
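As one small illustration of the stream-topology point (the DOWNSTREAM table and its entries are placeholders rather than real hydrographic data; the idea is that the downstream chain supplies the context needed to tell one Rock Creek from another):

```python
# A stream topology as a simple downstream mapping: each named stream points to
# the water body it flows into.

DOWNSTREAM = {
    ("Rock Creek", "MD"): ("Potomac River", "MD/VA"),
    ("Potomac River", "MD/VA"): ("Chesapeake Bay", "MD/VA"),
    ("Rock Creek", "OR"): ("Tualatin River", "OR"),
}

def downstream_chain(stream):
    """Follow the topology from a stream down to its terminal water body."""
    chain = [stream]
    while chain[-1] in DOWNSTREAM:
        chain.append(DOWNSTREAM[chain[-1]])
    return chain

# The two Rock Creeks resolve to different chains, so a query can be disambiguated.
print(downstream_chain(("Rock Creek", "MD")))
print(downstream_chain(("Rock Creek", "OR")))
```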

And finally, when we think about multiple dimensions, some people, particularly in the hazards community, want to worry about phenomena. They would like to classify information that occurred in space and time as a named phenomenon. So if you have hurricane X, you would like to know where it was and when it was, in order to discover information associated with that phenomenon even if it was not classified under that phenomenon’s name. You want to know its spatial and temporal footprint, so you can query by that footprint elsewhere in a GIS. So encoding phenomena could be a GNS kind of service: if somebody said, "here is the extent in time and space of this phenomenon", they could push that into a GNS and have some kind of reference to it.
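A sketch of a space-time footprint and an overlap test against it (the Footprint fields, the hurricane extent, and the dates are placeholders for illustration; ISO date strings compare correctly as plain strings, which keeps the example short):

```python
# A named phenomenon carries a footprint in space and time; a dataset can then be
# discovered by testing whether its own footprint intersects that of the phenomenon.
from dataclasses import dataclass

@dataclass
class Footprint:
    west: float
    east: float
    south: float
    north: float
    begin: str  # ISO dates kept as strings for simplicity
    end: str

def overlaps(a, b):
    """True if two footprints intersect in both space and time."""
    in_space = (a.west <= b.east and b.west <= a.east and
                a.south <= b.north and b.south <= a.north)
    in_time = a.begin <= b.end and b.begin <= a.end
    return in_space and in_time

hurricane = Footprint(-82.0, -70.0, 24.0, 40.0, "1999-09-13", "1999-09-18")  # placeholder extent
survey = Footprint(-77.5, -75.5, 34.5, 36.5, "1999-09-15", "1999-10-05")     # a post-storm survey

# The survey overlaps the hurricane's footprint in both space and time, so a search
# keyed on the phenomenon would find it even though it was never tagged "hurricane X".
print(overlaps(hurricane, survey))  # True
```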

To do this, we are going to need standards. I would suggest using XML for packaging, with a DTD, which is a kind of structure file. There is an emerging XML Schema activity in the World Wide Web Consortium to define even more advanced structures, including binary structures. We should use the ISO/ANSI thesaurus structure standard. We should reconcile with the ISO TC211 feature catalog that was discussed yesterday. And behind it all, to make something like the DNS work, the GNS is going to have to use Internet conventions - kind of a "best of show" from the IETF or W3C. (IETF is the Internet Engineering Task Force, and W3C is the World Wide Web Consortium.) Both are organizations - more so the W3C these days - for developing rapid agreement on standards for the World Wide Web; the IETF is the engineering group that put together some of the basics of the early Internet. And I would also offer that we might consider the Open GIS Consortium as an implementation venue, where vendors and implementers could get together, discuss what this should be, and put it out as a formal specification. We have an interest in the geographic community that might not yet be recognized or shared in the World Wide Web Consortium - it is not big enough yet - but as a standards venue the OGC would allow us to put a specification together.

Your roles collectively: I don’t really care about a gazetteer per se. I see that as the meat that would help drive a service. I don’t care about the structure of the gazetteer, as long as I’ve got coordinates, some kind of place name, and an authority. You need to work on that. But it has got to be disciplined enough that we can dock it to a service, so that you can interface your existing name servers to this as a back door or a side door. You can still keep your front-door Web interface, with the selections and things that you already do.

Some of the most important things would be to build policies for name authorities, hierarchies, authentication, and Net-wide participation in a global context - not just a traditional one - and also to embrace the public or donor kinds of naming services that are going to happen. People like to play on the Internet, they like to play on the Web, and they want to contribute things. They are going to do this whether you like it or not. We are going to have to have policies for how to knit these things together, between the public donor name spaces and the well-known authoritative name spaces.

That’s my presentation. Questions?