Gawk XML extension

Introduction

The XML extension for gawk (the GNU implementation of awk) is a useful tool when you want to quickly extract specific data from an XML file. Awk is a scripting language designed for processing data stored as text. By default, awk reads a text file line by line and carries out specified actions whenever a line matches a specified pattern. The XML extension works in much the same way, except that it processes XML files node by node. A simple example of its use is given below.
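
As a point of reference for the pattern-and-action model, here is a minimal sketch in plain awk, without the XML extension. The search word "error" and the choice of field are made up for illustration only.

```awk
# Plain awk: every input line is tested against each pattern in turn.

# Regular-expression pattern: runs for each line that contains "error"
# and prints the second whitespace-separated field of that line.
/error/ { print $2 }

# Expression pattern: runs only for the first line of the input.
NR == 1 { print "first line: " $0 }
```

Saved under a hypothetical name such as example.awk, it would be run as gawk -f example.awk followed by the name of a text file.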

Since 2012 the XML extension has been part of gawkextlib, the gawk extension libraries. gawkextlib is written by Andrew Schorr and Jürgen Kahrs. The home page for gawkextlib is http://sourceforge.net/projects/gawkextlib.
The predecessor, XMLgawk, was written by Jürgen Kahrs, with help from Stefan Tramm, Manuel Collado and Andrew Schorr. The XMLgawk home page is still maintained at http://gawkextlib.sourceforge.net.

A couple of years ago I started building my own binaries for XMLgawk, since I found the concept very useful and no binaries were available that ran natively on Windows. The method for building XMLgawk on Windows is still available on the page Building XMLgawk.

At the end of 2012 and in May 2013 I made some attempts to build the gawk extension libraries on Windows. The last attempt was successful; see Building gawkextlib. The binaries for the XML extension are available in the Downloads section.

Example

Consider the XML file VectorDb, which represents an index file to a very limited geographical database of the Netherlands. The following simple script will extract the paths and file names of all files contained in the index file.

Running the command:  gawk -f Parse-VectorDb.awk VectorDb.xml
gives the following output:

Explanation of the script

With the command @load "xml.dll", the script loads the XML extension, after which gawk works in “XML mode”. The pattern XMLSTARTELEM == "file" triggers the code block that follows it each time a node named “file” is entered.

With path = XMLATTR["path"], the script variable path gets the value of the attribute “path” of the XML node “file”. The next line does the same for a second attribute. Finally, the pattern XMLENDELEM == "file" triggers the print statement each time the node “file” is left.
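
The structure described in this explanation can be sketched as follows. This is a reconstruction from the description, not the original Parse-VectorDb.awk: the second attribute name ("name"), the variable that holds it, and the "/" separator in the output are assumptions.

```awk
# Sketch only; requires gawk with the XML extension from gawkextlib.

# Load the XML extension so that gawk reads the input node by node.
@load "xml.dll"

# Entering a <file> node: remember the attributes of interest.
XMLSTARTELEM == "file" {
    path = XMLATTR["path"]    # attribute named in the explanation above
    name = XMLATTR["name"]    # hypothetical second attribute
}

# Leaving the <file> node: print what was collected for it.
XMLENDELEM == "file" {
    print path "/" name       # the separator is an assumption
}
```

Storing the attribute values in script variables when the node is entered keeps them available for the print statement, which only runs when the node is left.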

The full possibilities of XMLgawk are explained in the XMLgawk manual.

Downloads

The zip file contains gawk.exe version 4.1.0 and all binaries needed to run gawk with the XML extension on a PC with a standard Windows installation.
Download: gawkextlib_win_v05.zip

Installation

Generally, we have the following options for the file locations:

  • All binaries from the archive except xml.dll need to be in the current directory or in a directory listed in your PATH (e.g. c:\WINDOWS\system32 or c:\Programs\MinGW32\bin), and
  • xml.dll needs to be in the current directory, in a directory listed in AWKLIBPATH, or in a directory defined in deflibpath. On my system it is in c:\Programs\MinGW32\lib\awk.

You can set AWKLIBPATH yourself by issuing:

set AWKLIBPATH=[your_path]

deflibpath is defined in gawk.exe, with the current definition:

.;c:/MinGW32/lib/awk;c:/Programs/MinGW32/lib/awk
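
To check which extension search path is in effect, a short script can print the variable as gawk sees it. This is a minimal sketch, and the file name check-libpath.awk is hypothetical. The gawk manual states that gawk places the search path it used into ENVIRON["AWKLIBPATH"], so the default may show up even when the variable was never set; whether this Windows build behaves the same way is an assumption.

```awk
# Print the extension search path as seen from inside gawk.
BEGIN {
    if ("AWKLIBPATH" in ENVIRON)
        print "AWKLIBPATH = " ENVIRON["AWKLIBPATH"]
    else
        print "AWKLIBPATH is not set"
}
```

Run it with gawk -f check-libpath.awk; no input file is needed because all the work happens in the BEGIN block.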