Author: Betty Harvey
Email: harvey@eccnet.com

Introduction

The purpose of this article is to provide some insight into using and converting Microsoft Excel© spreadsheet for authoring codelist and using XSLT to convert the codelist to genericode.   Genericode is an OASIS-Open specification for encoding codelists in XML.   Genericode is easily consumed by applications.
I have done many conversions to and from Excel 2003.  One of my conversions is a schema analysis spreadsheet that creates a spreadsheet from a W3C schema.  An Excel format provides an easy way for non-technical people to understand relationships and definitions of schema components.

Another ‘goto’  conversion that I have created and is Excel 2003 XML to Genericode. This conversion is currently in place in a production environment.  This conversion is fairly straight-forward.  Many of the codelists have empty  cells.  When Excel saves a spreadsheet to 2003 XML is does not output the empty cells.  This is true for both 2003 and 2007 XML.

For the purpose of this article I will use the following simple table that represents an Excel authority list.

Table Example

My Code List
Column A Column B Column C
a1 c1
a2 b2 c2
b3 b4

Why Not Continue Using 2003 Excel XML

Conversion from Excel 2003 is very straight forward and relatively easy. However, it does require that you must open the Excel file and save the file as a 2003 XML file. If you have many  codelists this mechanism can time consuming.  For example the project I wrote this XSLT application for has approximately 30 codelists they maintain.

I searched for a tool that would provide a batch conversion capability for Excel. I wasn’t able to find one – I am sure there must be a tool somewhere but I couldn’t find one.

Excel 2007 is native XML application. The file xslx (and docx) is just a zip file that contains file folders and XML files. This enables conversion to happen without physically opening the files. Unzipping the Excel files in batch mode is possible and running XSLT in batch mode is possible. Life is good!

Genericode XML Structure

Before beginning any conversion project you must have an understanding of the starting structure and the ending structure.

A simple Genericode structure is pretty straight-forward. Genericode can get complicated but for probably 95% of the codelists that people use it is relatively straightforward. A simple codelist is broken into 3 separate sections. :

  • Identification
  • Column Specifcation
  • CodeList Data

Below is a graphical view of the XML Schema for Genericode.

Excel 2007 Unzipped File Structure

Excel 2007 XML isn’t quite as straight-forward as Genericode.  In fact IMHO the structure is rather bazaar.  As stated before the xlsx file is really a zip file containing multiple files.  The structure of the internal xlsx (zip) file is:

Root

  • _rels
  • docProps
  • xl
    • _rels
    • printerSettings
    • theme
    • worksheets

There are many XML files in each directories.  For the conversion we won’t need to concern ourselves with all the files.  Quite a few of the files are used internally by Excel to control the Excel application.

We only care about the data!!!

The files that we are interested in for conversion are:

  1. /docProps/app.xml – Obtain organization information
  2. /docProps/core.xml – Obtain metadata for time and author, etc.
  3. /xl/worksheets/sheet1.xml – contains mappings to the data
  4. /xl/sharedStrings.xml – contains the actual data

There is logic to the files but I have to admit that it was “twisted mind” that defined the logic.

Excel Data Mapping 101

The sharedString.xml and sheet1.xml file work together.  Lets look at the sharedString.xml file from the example above.  The file structure for this XML file is pretty straight-forward:

<sst>
   <si>
     <t>My Data 25</t>
   </si>
   <si>
     <t>My Data 10</t>
   </si>
...
</sst>

The interesting fact about this file is there doesn’t seem to be a logical order to how the data is included in the file.  The file is basically a flat listing of all the cells within the Excel spreadsheet.

The sheet1 file provides a mapping to the data. Below is the structure of the sheet1 file:

<worksheet>
   ...
   <sheetData>
      <row r="1" spans="1:3">
          <c r="A3" s="1">
               <v>22</v>
          </c>
          <c r="A1" t="s">
              <v>0</v>
           </c>
      </row>
   </sheetData>
</worksheet>

The meat of the data is within the <row> element.  Let’s delve further into the row element.  It is pretty obvious that the tag names are really intuitive and can be understood at a glance.

row The beginning of a row in the Excel spreadsheet.
@r Row number
@spans The number of columns the row spans.  The first number is the beginning, the second is the end.
c Information about a cell.
@r The physical location of the cell, for example B10 would be column B the 10th row.
@s=”1″ The actual value of the cell.  In this case a lookup is not required into the “sharedString.xml” file.
@t The datatype, for example ‘s’ represents a string.  When the @t is available, then the value of the <v> element is used as a lookup into the “sharedStrings.xml”
v Contains the number of the <si> +1  where the actual data is located in the sharedString.xml file.

The value in the element is the XPath number (+1) of the <si> element in the file sharedString.xml.  I am not sure why they start with the first <si> being 0 but it does confuse the issue.

Developing the Stylesheet

You must select an XML file as your base XML file. I chose to use the sheet1.xml file because it contains the mapping to the data.  The data file (sharedStrings.xml) is static and it would be hard to deduce intelligent navigation from this file. As stated above Genericode has 3 main containers of information.  We will walk through the conversion for each of the containers.

Housekeeping – Global Variables

You will need to do lookups in several files.  Rather than using the full pathname of the various files I found it easier and more readable to create global variables.

This parameter sets the root directory for the XML files.   If you are converting multiple codelists using a batch job, you can send the value of the parameter on the commandline.

 <xsl:param name="ExcelRoot">MyCodelist</xsl:param>

The following variable is used to get the complete path and filenameof the file that contains the information.

<xsl:variable name=”DataFile”><xsl:value-of select=”$ExcelRoot”/>/xl/sharedStrings.xml</xsl:variable>

Root Template

The <worksheet> template is the starting point for the conversion.  Below is a sample of the template for the <worksheet> element.  Basically, my beginning template calls the conversion of each of the different sections.  The file that you want to start the conversion is the sheet1.xml file.  The namespace prefix is ‘Excel’.  You can change it if you prefer.  The root element is <worksheet>.  You will see that my template matches “Excel:worksheet”.

Beginning Template

 <xsl:template match="Excel:worksheet">
 <CodeList xmlns="http://docs.oasis-open.org/codelist/ns/genericode/1.0/"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:era="http://archives.gov/era/schema"
 xsi:schemaLocation="http://docs.oasis-open.org/codelist/ns/genericode/1.0/
 http://docs.oasis-open.org/codelist/cs-genericode-1.0/xsd/genericode.xsd">
 <xsl:call-template name="commentBlock"/>
 <xsl:call-template name="identificationBlock"/>
 <xsl:call-template name="createColumnSet"/>
 <xsl:apply-templates/>
 </CodeList>
 </xsl:template>

You will also want to create a default template to suppress unwanted Excel specific data:

Default Template

<xsl:template match=”*”/>

For the purposes of this article I am only going to concentrate on the body of the of the codelist (<SimpleCodeList>) because once you have this logic developing the other sections are a piece of cake. Each organization will have their own methodology for Excel codelist templates.

Creating the <SimpleCodeList> and <Row>

The <sheetData> element is a direct child of the <worksheet> element.  The <SimpleCodeList> element is sparked from the <sheetData>.  The <Row> element is sparked from the <row> element. Really straightforward.  I don’t think it could be any easier.

<xsl:template match="Excel:sheetData">
 <SimpleCodeList>
 <xsl:apply-templates/>
 </SimpleCodeList>
 </xsl:template>

<xsl:template match="Excel:row">
 <Row>
    <xsl:apply-templates/>
 </Row>
 </xsl:template>

Creating Column References

Genericode requires that column references be used for every entry in the Genericode.  This information is captured in the <ColumnSet> element.  As stated before I am not going to show how to create the <Identification> and <ColumnSet> entries because they will be different depending on the template that an organization uses to maintain their codelists in Excel.  However, the column reference information is required to populate the @ColumnRef information in the <Value> element.  Notice in the table example above the second line is used to define the column headings.

I have manually set the variable for each one of the headings.  There is probably a more elegant way of setting these variables but this mechanism works.  I create a variable for columns.

In the code below I am selecting the value of the <v> element in the A2 cell.  Remember because the numbering starts with 0, we need to add 1 to the <v> value.

 <xsl:variable name="ColumnALocation">
       <xsl:value-of
       select="//Excel:worksheet/Excel:sheetData[1]/Excel:row[2]/Excel:c[1]/Excel:v + 1 "/>
 </xsl:variable>

Next we need to get the actual value of the cell.  In this case (1) we do a test to make sure the column location isn’t null; (2) if the value isn’t null then we do a lookup in the sharedStrings.xml file where the position axis (position()) equals the value of the value in the <v> element in  sheet1.xml.  This variable will be used later on in determining the @ColumnRef attribute.

 <xsl:variable name="Column-A">
 <xsl:choose>
 <xsl:when test="$ColumnALocation = ''"/>
 <xsl:otherwise>
     <xsl:value-of select="document($DataFile)/Excel:sst/Excel:si[position() = $ColumnALocation]"/>
 </xsl:otherwise>
 </xsl:choose>
 </xsl:variable>

Creating Values – Heavy Lifting

If we look back at the table example above you will see that column A1′s value is ‘a1′.  We want our Genericode representation to look like below:

Example Genericode snippet

<Value ColumnRef="ColumnA">
  <SimpleValue>a1</SimpleValue>
 </Value>

We need to set 3 variables.  The @r attribute in the <c> element contains the cell coordinate.  For example A2 represents column A and the 2 row.  The table example above shows that are column headings are A2, B2 and C2.  You will need to know which cells the title and column headings are located.

The 3 variables:

  1. column = column number.  For example B represents the second column.  The @r value contains this information.
  2. dataLoc = data location.  This is XPath + 1 position of where the data located in the sharedStrings.xml file.
  3. cellValue = Actual data of the cell.  This variable is set with a straight lookup in the sharedStrings.xml file.

After setting the variables then we start populating the data.  The second part of this process is to populate the @ColumnRef attribute.  This could be accomplished in the same template but I chose to create a second template and send the column variable to find the name of the column.

Example ‘c’ template

<xsl:template match="Excel:c">
  <xsl:variable name="column"><xsl:value-of select="substring(@r, 1, 1)"/></xsl:variable>
  <xsl:variable name="dataLoc"><xsl:value-of select="number(Excel:v) + 1"/></xsl:variable>
  <xsl:variable name="cellValue">
           <xsl:value-of select="document($DataFile)/Excel:sst/Excel:si[position() = $dataLoc]"/>
   </xsl:variable>
  <xsl:choose>
     <xsl:when test="Excel:v[@s='1']">
     <Value>
          <xsl:attribute name="ColumnRef">
             <xsl:call-template name="getColref">
                <xsl:with-param name="column"><xsl:value-of select="$column"/></xsl:with-param>
             </xsl:call-template>
          </xsl:attribute>
          <SimpleValue><xsl:value-of select="."/></SimpleValue>
      </Value>
     </xsl:when>
     <xsl:when test="Excel:v[@t='s']">
        <Value>
          <xsl:attribute name="ColumnRef">
              <xsl:call-template name="getColref">
                <xsl:with-param name="column"><xsl:value-of select="$column"/></xsl:with-param>
              </xsl:call-template>
          </xsl:attribute>
              <SimpleValue><xsl:value-of select="$cellValue"/></SimpleValue>
        </Value>
    </xsl:when>
    <xsl:otherwise/>
 </xsl:choose>

Get Column Reference Name

The last piece of the puzzle is to obtain the column name.  There is probably a more elegant mechanism for accomplishing setting the element names but this method is pretty straight forward.  As long as you don’t have codelists that have many columns then this method is adequate.  Most codelists tend to be in the 2 -5 column range.

<xsl:template name="getColref">
  <xsl:param name="column"/>
  <xsl:choose>
      <xsl:when test="$column = 'A'">
           <xsl:value-of select="translate($Column-A, ' ', '')"/>
     </xsl:when>
     <xsl:when test="$column = 'B'">
         <xsl:value-of select="translate($Column-B, ' ', '')"/>
     </xsl:when>
     <xsl:when test="$column = 'C'">
        <xsl:value-of select="translate($Column-C, ' ', '')"/>
    </xsl:when>
</xsl:template>

Conclusion

Owners of codelists tend to be business users who are familiar with Microsoft tools.  Business users can update and manage codelist data in a tool they know fairly well.  Using XSLT to translate the business requirements from the technical requirements makes perfect sense and with very little pain the workflow to enable separation from business subject matter experts and technical needs works very well.

If any part of this information is unclear feel free to comment and I can refine the article.

Betty Harvey (harvey@eccnet.com)