ETL Debug Blog

Post to this group blog about your Extract-Transform-Load (ETL) efforts.

Helpful Regular Expression (regex) links

Debugging Python scripts with Pylint

I am currently using Pylint to check for bugs and formatting errors in my Python scripts. I used Brandon Corfman's nice Komodo Hacks: Integrating Pylint tutorial to set up Pylint and how to integrate it with the Komodo Edit IDE (see Easy setup of a Python 2.6 development environment on Windows about setting up Python).

Here is my updated Pylint execution macro for Komodo Edit 5 which includes a verbose and an errors only reporting command.

Converting FGDC XML metadata records for WMS and WFS services to ISO 19139 via Python and GeoNetwork

This post describes a series of Python scripts to convert FGDC XML metadata for WMS and WFS services to ISO 19139 service metadata via GeoNetwork's OGC harvest service. Here are the steps:

  1. Extract WMS and WFS GetCapabilities URLs from FGDC XML metadata and write them to a file.
  2. Create a GeoNetwork Harvest Node from extracted GetCapabilities URLs and let GeoNetwork's OGC WMS/WFS harvester create ISO 19139 metadata from GetCapabilities response. Note: the resulting ISO 19139 metadata is only as good as the one in the GetCapabilities response. In addition, GetCapabilities responses do not include all fields necessary for minimum ISO 19139 metadata.
  3. Copy some FGDC metadata entries (Title, Abstract, etc.) to newly created ISO 19139 metadata.

Our goal is to have working WMS and WFS metadata records in our CSW catalog that can be used to add the described services to analytical software such as ESRI's C-SW Client for ArcGIS Desktop. Following are the reasons why I am currently choosing this convoluted process instead of a direct FGDC to ISO 19139 metadata conversion:

Create a GeoNetwork OGC Harvest Node (WMS or WFS GetCapabilities to ISO 19139 metadata) through xml.harvesting.add request

GeoNetwork offers a harvest service that, among other, follows a OGC WMS or WFS GetCapabilities URL and transforms the response into an ISO 19139 metadata record. Bear in mind that the resulting ISO metadata record is only as good as the GetCapabilities response and that the metadata can never be entirely ISO 19139 conformant without dummy values due to limitations of the OGC GetCapabilities schema.

GeoNetwork offers a handy user interface to add those "GeoNetwork Harvesting Nodes" and also exposes their functionality through a even handier XML service. See chapter 19.3 on "Harvesting Services" of the GeoNetwork opensource V 2.4 The Complete Manual. Following are my notes on creating an OGC harvesting node through GeoNetwork's harvesting service.

Authoritative fgdc-std-001-1998 (FGDC CSDGM) schema (XSD) and data definition document (DTD) URLs

Following are authoritative FGDC Content Standard for Digital Geospatial Metadata (CSDGM) XML schema document (XSD) and document type definition (DTD) URLs for FGDC-STD-001-1998. These documents are used to validate a FGDC XML metadata file.

There are also annotated and modified FGDC-STD-001-1998 schema available for download:

FGDC XML schema woes

I am working on transforming FGDC XML metadata records to the USGIN 1.1 version of ISO 19139 metadata and having a hard time finding formal FGDC schema, name spaces, and schema locations. This is what I found so far:

XSLT transformations in Python through the Gnome libxml C parser

This is an example script on how to do XSLT transformations in Python 2.6 through the Gnome libxml XML C parser and toolkit. The relative fast C library is available on multiple platforms but you will also need Python bindings for libxml2 and libxslt. There is a handy libxml2 and libxslt Python bindings installer for Windows which also includes the C libraries in DLL form.

Import MEF metadata archive into GeoNetwork through Python

This is a Python example script for importing GeoNetwork Metadata Exchange Format 1.1 (MEF) archives to GeoNetwork 2.4.2's mef.import service. The mef.import service requires a multipart/form-data POST through a modified MultipartPostHandler.py library which now supports Unicode (urllib2). It has been tested in Windows XP and Python 2.6.

Create GeoNetwork MEF files from ISO 19139 XML through Python

This is an example Python script that showcases the creation of GeoNetwork Metadata Exchange Format 1.1 (MEF) archives from ISO 19139 metadata XML files.It has been tested in Windows XP and Python 2.6.

GeoNetwork authentication and CSW transactions through Python

 This is an example Python script that showcases GeoNetwork authentication, session handling, and CSW transactions. 

XSLT to Transform WMS GetCapabilities response to CSW Insert transaction XML

Tested with deegree-csw 2.3pre
Read the XSLT file for more information.

Attached is an example XSLT1 script to transform a WMS GetCapabilities 1.1.1 response to a CSW Insert transaction. The script is based on deegree's wms2iso19119.xsl (http://www.deegree.org/).
Note that currently it only supports WMS version 1.1.1 (<WMT_MS_Capabilities>) responses because it chokes on 1.3.0 (<WMS_Capabilities>) responses.

XML ETL presentation at Data Preservation Workshop

Following is a presentation on XML Extract-Transform-Load (ETL) we did at the Geoscience Data Preservation Techniques Workshop, Indiana Geological Survey, Bloomington, IN on July 2009.

Although this presentation was geared towards National Geological and Geophysical Data Preservation Program (NGGDPP) metadata, the same tools can be used for other ETL needs.

Syndicate content