Sharon Tahirkheli (GeoRef)

I just wanted to take a brief moment to say that we are here during Earth Science Week. Did anybody know this? One, two. It’s not exactly a Hallmark occasion, but this is a national initiative on the part of many geoscience groups to promote the geosciences to the global community. So I wanted to promote it this morning for a brief second.

I have been with GeoRef, as you have noticed, for twenty-some years. I hate to admit that at times. GeoRef is a bibliographic geoscience reference database. We currently have 2.2 million references in the file. We have been adding references to this file since 1969. We have been in the business of trying to define areas geographically since 1969. Quite a learning experience - maybe we’re just slow learners. We cover North American literature since 1785 and global literature since 1933.

As I sat through many of your presentations this week, I have been fascinated by the fact that we are all facing the same problems. Geology, of course, is tied very intimately to geography. Everything geologic usually happens in the geographic context. It also happens in a temporal context; I think my definition of temporal might be a little bit longer than the standard definition that most of you are talking about. I tend to think of things in millions of years, so I’m not sure that that would apply in this case.

We found early on that, in trying to define areas geographically, we had several problems. First of all, we had a general need to define things in a political way, so that our searchers or the users of our system could depend on at least one type of descriptor to be available to them. We also found that we had to define things by physical features as well.  Many of our searchers approach the database, and they think of "Monterey Bay", as Nancy mentioned in her earlier description. So we found that we had to be as all-inclusive as possible. We had to include the very specific geographic political entities, the physical features, and we also had to include information in a hierarchical relationship, because the standard search engines available in the early 70s did not allow for any type of sophisticated searching. So whatever the searcher put into the system, that is exactly what they got out of the system. We had to go to great lengths to try to describe our areas with geographic information. I think the slides that Nancy posted up there had something like seven or eight geographic descriptors to one area that we had identified.

We also had a problem with uniqueness. We discovered that there are, I think, 24 Washington counties in the U.S. That is a great big problem if you are searching a database and you throw in everything about Washington County. (From the audience - There are 52 Springfields - 2 in one state) That's even worse.

Roger Payne - We get that all the time - for some reason people think there needs to be a Springfield in every state. Not even close. Only in 32 states.

Sharon - Well, in any case, we reached a conclusion that we had to develop a series of geographic identifiers that were unique. Each was unique. And I guess we can’t claim that for Springfields anymore.
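
[Editorial illustration] The uniqueness problem described above can be sketched in a few lines of Python. This is only a minimal sketch, not GeoRef's actual data model: the sample list, `qualified_name`, and `ambiguous_names` are hypothetical names introduced here. It shows that a bare county name can match several places, while a name tied to its parent political unit matches one.

```python
from collections import defaultdict

# Hypothetical sample of (county, state) pairs; a real gazetteer holds far more.
PLACES = [
    ("Washington County", "Maryland"),
    ("Washington County", "Oregon"),
    ("Washington County", "Utah"),
    ("Fairfax County", "Virginia"),
]


def qualified_name(name, parent):
    """Build an identifier by attaching the parent political unit."""
    return f"{name}, {parent}"


def ambiguous_names(places):
    """Return the bare names that occur under more than one parent."""
    parents_by_name = defaultdict(list)
    for name, parent in places:
        parents_by_name[name].append(parent)
    return {n: ps for n, ps in parents_by_name.items() if len(ps) > 1}


if __name__ == "__main__":
    print(ambiguous_names(PLACES))
    print([qualified_name(n, p) for n, p in PLACES])
```

In this sample the qualified names are all distinct, which is the property a set of unique identifiers needs; whether qualification by state alone is always enough is a separate question, as the Springfield remark suggests.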

We also had the problem that a lot of geology relates to the ocean areas. There are no good textual indicators for the ocean areas. So, around about 1977 we said, we are going to have to develop some solutions to this to help our users search our database. So, as I mentioned already, we decided on unique political units. I just want to say that the American Geological Institute is not a government agency. So I am not expecting the police to come knocking on my door. I just wanted to make sure you understand that, since what I am going to show you next is going to freak some of you out probably.

We also decided that the very specific thing that could be used that we thought everyone understood (ha) was latitude and longitude. So we decided to begin defining areas by latitude and longitude.

Here’s an example of what we call our controlled vocabulary: "Fairfax County, Virginia." We have tied a political entity to the first portion of the name, so that it is solid and complete and you know exactly, sort of, where it is at any given time. We also have a brief description. In this case, it is on the Potomac River in northeastern Virginia. That is sort of a free text kind of description. We also found that at that time, the searchers needed help, because they don’t know where things are, and they don’t know some of the quirks about geography. Virginia is one of those funny little places where you can have a county, but a county can have a city in the county, and the city is not part of the county. The user may not be aware of that, but we certainly cannot put in our system something that is hierarchically part of something that is not. So we also wound up having to add Search Notes, as we call them: directions to "also search" a list of areas that we felt needed to be indicated in that way. We have a cut-off point. We say that if something is to be relevant, it has to have been necessary for us to cope with it at least four times in the past. We have a minimum threshold. We also began to use the coordinates, and you see that the variety of latitude and longitude that we decided to use was the boundary box. We decided to assign the southernmost, northernmost, easternmost and westernmost corners. That was the direction that we took. There was a beautiful slide yesterday of the hundreds of ways that you can do latitude and longitude, and I don’t think this one was on it.

We also have hierarchical information; so of course the broader term for Fairfax County is Virginia, and it is in the United States.
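
[Editorial illustration] A controlled-vocabulary entry with a broader term and "also search" notes might be modelled roughly as follows. This is a sketch under stated assumptions, not the actual GeoRef authority file: the `AuthorityTerm` class, its field names, the `index_terms` helper, and the "Fairfax, Virginia" search note are all hypothetical. The helper walks up the hierarchy so that an exact-match search engine of the kind described earlier could still find the reference under any broader term.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AuthorityTerm:
    """One entry in a hypothetical geographic authority file."""
    name: str
    broader: Optional[str] = None          # name of the parent term, if any
    also_search: List[str] = field(default_factory=list)  # search notes


# Illustrative entries only; the also_search value is an assumption.
TERMS: Dict[str, AuthorityTerm] = {
    t.name: t
    for t in [
        AuthorityTerm("United States"),
        AuthorityTerm("Virginia", broader="United States"),
        AuthorityTerm(
            "Fairfax County, Virginia",
            broader="Virginia",
            also_search=["Fairfax, Virginia"],
        ),
    ]
}


def index_terms(name: str, terms: Dict[str, AuthorityTerm] = TERMS) -> List[str]:
    """Return the term and every broader term above it, narrowest first."""
    chain = []
    current: Optional[str] = name
    while current is not None:
        chain.append(current)
        current = terms[current].broader
    return chain


if __name__ == "__main__":
    print(index_terms("Fairfax County, Virginia"))
    print(TERMS["Fairfax County, Virginia"].also_search)
```

Storing the search note as a separate list, rather than as a broader term, keeps the independent city out of the county's hierarchy while still pointing the searcher to it.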

How do we find all of this information, how do we derive all of our coordinates?  I have a staff of about 32 people who are busy exploring the geologic literature and investigating where things are happening, identifying the geographic locations for each reference. They are embedding these boxes into each reference. They are not necessarily using the ones that we have in our controlled authority file. They use what is actually applicable to each reference. I think we now have somewhere in the 400,000 or 500,000 range of references that have the boundary boxes contained within the actual references themselves.

What would I like to see happen? That’s what I have been asking myself as I have been sitting here. I would just like to say that, given my experience with the use of coordinate systems in the past - I have 32 people who go out and look at gazetteers all the time - speed is an issue. We have on our shelves back in our office all the old printed volumes from the Board on Geographic Names. A lot of my staff prefer to use paper because it takes them five minutes to find something on the Web. It’s just too slow. So speed is a very important issue.

Another issue is polygons. Your statement there was very interesting about polygons. We have used bounded boxes in our authority file, but we have used them selectively. We have used them for political entities, and we go back and change them when they change. We find that challenging when we don’t know where the political boundaries are, but we find that our users demand that of us, because we are supplying information to them. They search for it, they pay for it as they use it, and they have to have something that they feel they can trust to get the right data. So for us the polygons are absolutely essential. We had a vision that someday we would be able to have a map of the world and focus in and have a little box, and click on that, and say, I want everything about the geology of this area that I have just circled here. We are getting closer and closer to that, but I would really like to see a faster, speedier and more flexible way of getting to that point.
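
[Editorial illustration] The "click on a box on the map" search described above amounts to finding every reference whose box overlaps the query box. The sketch below is an assumption-laden illustration, not GeoRef's implementation: the `Box` class, the `search` function, and the reference identifiers and coordinates are invented for the example, and boxes that cross the 180-degree meridian are not handled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Box:
    """Latitude/longitude box in decimal degrees (west longitudes negative)."""
    south: float
    north: float
    west: float
    east: float

    def intersects(self, other: "Box") -> bool:
        """True if the two boxes overlap or touch (no antimeridian handling)."""
        return (
            self.south <= other.north
            and other.south <= self.north
            and self.west <= other.east
            and other.west <= self.east
        )


# Hypothetical references with made-up boxes, for illustration only.
REFERENCES: Dict[str, Box] = {
    "ref-A": Box(38.0, 39.0, -78.0, -77.0),
    "ref-B": Box(36.0, 37.0, -123.0, -121.0),
    "ref-C": Box(38.5, 40.0, -77.5, -76.0),
}


def search(query: Box, references: Dict[str, Box] = REFERENCES) -> List[str]:
    """Return the ids of references whose box overlaps the query box."""
    return sorted(rid for rid, box in references.items() if box.intersects(query))


if __name__ == "__main__":
    print(search(Box(38.6, 39.1, -77.6, -77.0)))
```

A linear scan like this would be slow over hundreds of thousands of references; a spatial index would be the usual answer to the speed concern, but that is beyond this sketch.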

I think I’m out of time.