Googley Metadata

We can all agree that search engines like Google have really defined the way that we search for information on the internet. So the question is, why aren't we using Google to search for our geologic datasets? Do we have to do all that formal, structured metadata? What the hell is CSW anyways?

As an experiment, I built a new site: metadata.usgin.org. This site is built with Node.js, a really cool way to build a web server by running server-side JavaScript code. It's so easy. Here's my code. So what does it do?
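If you haven't seen Node.js before, here's a minimal sketch of the idea (this isn't the metadata.usgin.org code, just the hello-world shape of a Node server):

    var http = require('http');

    // The whole web server is just a JavaScript callback that handles requests.
    http.createServer(function (req, res) {
      res.writeHead(200, { 'Content-Type': 'text/plain' });
      res.end('Hello from Node.js\n');
    }).listen(8080);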

It acts as a proxy server. You ask it for http://metadata.usgin.org/record/a386a4ba-e892-11e0-9e4a-0024e880c1d2, and it reads the file identifier out of that URL and issues a CSW GetRecordById request to a CSW server. The response to that CSW request is an XML document, and nobody (including Google) really cares about that, so the server reformats the record as a very simple HTML page.
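The heart of it is something like the sketch below. The CSW endpoint shown is a placeholder, not the real server, and the real app does a proper XML-to-HTML transformation rather than just escaping the response, but the shape is the same:

    var http = require('http');

    // Placeholder CSW endpoint -- not the real server address.
    var cswHost = 'catalog.example.org';
    var cswPath = '/csw?service=CSW&version=2.0.2&request=GetRecordById' +
                  '&elementSetName=full&id=';

    http.createServer(function (req, res) {
      var match = req.url.match(/^\/record\/([^\/?]+)/);
      if (!match) {
        res.writeHead(404);
        return res.end('Not found');
      }

      // Forward the file identifier to the CSW server as a GetRecordById request.
      http.get({ host: cswHost, path: cswPath + match[1] }, function (cswRes) {
        var xml = '';
        cswRes.on('data', function (chunk) { xml += chunk; });
        cswRes.on('end', function () {
          // The real app reformats the record as a simple HTML page;
          // here we just escape the XML so there's something to look at.
          res.writeHead(200, { 'Content-Type': 'text/html' });
          res.end('<pre>' + xml.replace(/</g, '&lt;') + '</pre>');
        });
      });
    }).listen(80);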

So, for any record in our CSW catalog, you can make the request to metadata.usgin.org and it will show you something nicer than XML. But still, if you want XML, you can tack ?f=xml onto the end of the URL and you'll get your XML. You can get JSON too: ?f=json.
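Inside the app, that ?f= switch can be handled with a few lines like these (the names are illustrative, not the actual code):

    var url = require('url');

    // record is assumed to already hold xml, json, and html versions of the metadata.
    function respond(req, res, record) {
      var format = url.parse(req.url, true).query.f || 'html';
      if (format === 'xml') {
        res.writeHead(200, { 'Content-Type': 'application/xml' });
        res.end(record.xml);
      } else if (format === 'json') {
        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify(record.json));
      } else {
        res.writeHead(200, { 'Content-Type': 'text/html' });
        res.end(record.html);
      }
    }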

But what we're really experimenting with here are Google search results.

For starters, the Node.js application automatically generates a sitemap.xml file for all the metadata records in the CSW catalog. Sitemaps help search-engine robots know what URLs on your site should be indexed. Right now, building the sitemap.xml file is a tightly-coupled process requiring that you're using an ESRI Geoportal to provide a CSW service. When a search-engine crawler bot goes to http://metadata.usgin.org/sitemap.xml (don't go there! you don't want to look at it!), Node.js queries the Geoportal database to find all the metadata file identifiers that are out there, and then automatically generates the sitemap.xml file from that information.
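Once the file identifiers are in hand, writing the sitemap is straightforward; something along these lines (assuming the identifiers have already been read from the Geoportal database):

    // Build sitemap.xml from a list of metadata file identifiers.
    function buildSitemap(fileIdentifiers) {
      var urls = fileIdentifiers.map(function (id) {
        return '  <url><loc>http://metadata.usgin.org/record/' + id + '</loc></url>';
      });
      return '<?xml version="1.0" encoding="UTF-8"?>\n' +
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
             urls.join('\n') + '\n' +
             '</urlset>';
    }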

Maybe more interestingly, this is an opportunity to experiment with microdata, which is basically some additional information to plug into your HTML pages that helps search engine robots understand better what your page is about. To some extent, the microdata on your page can influence the way that your search results appear in Google. Because this environment gives me really tight and simple control over the HTML content, it is an ideal way to play with microdata and learn how it influences search results.
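For example, the record pages could be built with something like this, where the itemscope/itemtype/itemprop attributes are the microdata. The schema.org Dataset vocabulary here is just one plausible choice, not necessarily what the site emits today:

    // Render a metadata record as HTML with microdata attributes sprinkled in.
    function recordToHtml(record) {
      return '<div itemscope itemtype="http://schema.org/Dataset">\n' +
             '  <h1 itemprop="name">' + record.title + '</h1>\n' +
             '  <p itemprop="description">' + record.abstract + '</p>\n' +
             '</div>';
    }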

What is the end-game? Well, for starters, I would like to be able to search the metadata catalog by going to google.com and searching for things like "site:metadata.usgin.org arizona thermal springs", or "site:metadata.usgin.org long valley caldera". Later, we may be able to implement a Custom Search Engine, or maybe even find out if we can use Google to provide structured responses to search queries...

Comments

vision for data and service search

srichardAzgs

Where I'd really like to see this go is the ability to use the commercial search engines (Google, Yahoo, Bing, etc.) to index metadata that we (as data providers) expose using standard metadata formats, and to present search results in such a way that a search client can find and access data from within a user's application. The search client wouldn't be any more complicated than the standard file-open dialogs (with a search box) currently presented by standard GUI interfaces, maybe with an option to access a map-based interface for geospatial data.

The idea is that from within a web application like the USGS National Map, or ArcGIS Online, ESRI's ArcMap, MS Excel, ModFlow, GOCAD, or whatever data analysis or map browsing program you're using, you can search for data (pick your search engine, like OpenSearch enables...), get results, look at descriptions of what you found to pick the one you need, and 'get' it. Bingo: you have a new worksheet in your workbook or a new layer in your map layout.

To do this, the search result has to return machine-actionable links that client software can parse. The links need to include sufficient information that the client can pick the link that accesses a service it knows how to use and get data in a format it can use. There are a variety of specs out there for smarter links (Web Linking (RFC 5988), XLink, ESIP service cast, OGC OWS Context documents, ISO 19115 CI_OnlineResource, atom:link...); we just need some convergence on which properties need to be associated with a link to make it machine-actionable...

Here's a proposal for the properties of interest, based on the specs mentioned above (a sketch of a link that uses them follows the list)...

href: the link URL; this is what the client will actually 'get'.

title: free text used to label the link in a GUI.

type: MIME type of the response. Specifies the encoding and, optionally, the native software application environment.

rel: URI from the IANA rel vocabulary, for consistency with IETF RFC 5988. The semantics of the link come from a global vocabulary, for interoperability; "semantics" in this context means calculable (see discussion in Coyle, 2010, p. 19).

purpose: analogous to ISO19115 CI_OnlineFunctionCode. A controlled vocabulary that tells the client why it would use this link. The purpose property provides a mechanism for a more granular, application-specific indication of link semantics.

protocol: ftp, http, .... Default is http if the attribute is not specified.

serviceType: URI that identifies a service protocol and version (one attribute or two?). This specifies protocols for network layers above http/ftp etc. Should be a URI that can dereference to some kind of service specification document.

outputscheme: profile for the content of the message retrieved by the href URL; a URI for an XML schema, a JSON schema, or some other description of the data structure and content? Needs conventions and a clear definition. Clients look at this to pick the link that will get a representation they can use. This is the information scheme in the layers on top of the MIME-type encoding; note that the same output scheme might be encoded using different MIME types, so the two are somewhat orthogonal. MIME types have been conflating this property with the lower-level encoding (.vnd, +xml ... stuff).
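To make this concrete, here's a hypothetical link expressed as a plain JavaScript object; every value is invented for illustration, and the serviceType and outputscheme URIs are placeholders rather than real identifiers:

    // One possible shape for a machine-actionable link (illustrative values only).
    var link = {
      href: 'http://services.example.org/wfs?service=WFS&request=GetCapabilities',
      title: 'Arizona thermal springs (WFS)',
      type: 'application/xml',
      rel: 'related',                                           // from the IANA rel vocabulary
      purpose: 'download',                                      // controlled vocabulary term
      protocol: 'http',
      serviceType: 'http://example.org/spec/wfs/2.0',           // placeholder URI
      outputscheme: 'http://example.org/schemas/thermal-spring' // placeholder URI
    };

    // A client would scan a record's links, keep the ones whose serviceType and
    // outputscheme it understands, and 'get' the href of the best match.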