Parsing HTML Pages with NSGMLS

April 8, 2010 at 8:29 pm, by Betty

James Clark’s SP (http://www.jclark.com/sp/) library can be used to validate HTML files. SP has several useful tools for parsing and transforming Standard Generalized Markup Language (SGML) data. You might ask, “How does SGML relate to HTML?” I am glad you asked!

SGML is an International Organization for Standardization (ISO) standard, ISO 8879, published in 1986. It was used mainly in the publishing world. SGML lets users define their own ‘tags’ for identifying information. As a result, many standard SGML vocabularies were defined and widely used, and HTML was one of them: HTML is an application of SGML. Therefore, SGML tools can be used with various flavors of HTML.

SP has several utilities. These tools can be a little daunting to use because they are command-line tools. For the purposes of this article, we are only going to deal with parsing HTML using NSGMLS.

NSGMLS is not the easiest tool to use, so I will try to walk you through parsing HTML files. At this point you might ask, “If NSGMLS is so hard to use, why not just use another HTML validation tool like the W3C HTML Validator?” Again, you are asking very good questions. Personally, I find SP useful for validating batches of files. It is very easy to create a batch file, parse large volumes of files, and output the results to a report. Now that we have that settled, let’s talk about parsing!

HTML DTDs

Before parsing HTML files you need to ensure that you have a link to the appropriate HTML DTD. There are many different flavors of HTML and you need to know which version your documents should parse against.

There are two ways of using the DTD to validate the HTML: (1) include a DOCTYPE statement in your HTML file; or (2) provide the link to the DTD in the command to NSGMLS. If your HTML has a DOCTYPE statement, it will look similar to the statement below:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
   ...
</html>

The DTDs for some common flavors of HTML are located at:

XHTML 1.0 Strict – http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
HTML 4.01 Strict – http://www.w3.org/TR/html4/strict.dtd
HTML 4.01 Transitional – http://www.w3.org/TR/html4/loose.dtd
HTML 4.01 Frameset – http://www.w3.org/TR/html4/frameset.dtd
And many more. For a full list of available HTML DTDs, refer to the W3C: http://www.w3.org/QA/2002/04/valid-dtd-list.html
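Before choosing a DTD, it can help to check which one a file already declares. The following is a minimal sketch in Python, not part of SP; the name find_doctype is made up for illustration, and the pattern assumes double-quoted identifiers and no internal subset in the DOCTYPE statement.

```python
import re

# Matches a DOCTYPE with a PUBLIC identifier (and optional system identifier)
# or with a SYSTEM identifier alone. Assumes double-quoted identifiers.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[^\s>]+)\s+'
    r'(?:PUBLIC\s+"(?P<public>[^"]*)"(?:\s+"(?P<system_pub>[^"]*)")?'
    r'|SYSTEM\s+"(?P<system>[^"]*)")\s*>',
    re.IGNORECASE,
)


def find_doctype(html_text):
    """Return (root, public_id, system_id) for the first DOCTYPE, or None."""
    match = DOCTYPE_RE.search(html_text)
    if match is None:
        return None
    # The system identifier sits in a different group for PUBLIC and SYSTEM forms.
    system_id = match.group("system_pub") or match.group("system")
    return match.group("root"), match.group("public"), system_id
```

Calling find_doctype on the text of an HTML file returns the document type name plus the public and system identifiers, so the system identifier can be compared against the list above.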

Parsing HTML with DOCTYPE Statement Pointing to DTD

The W3C no longer allows HTML files to be parsed directly against the DTDs on its website. Apparently, having millions of HTML files trying to parse against a DTD on the W3C website was detrimental to the website’s performance.

If your HTML files point to the W3C website, you will need to change the DOCTYPE statement to point to a location where SP has access to the DTD:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://mylocation.org/xhtml1-strict.dtd">
<html>
   ...
</html>

If your HTML DOCTYPE statement looks like this:

   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Download the DTD from the W3C site and save the file locally. Then modify the DOCTYPE statement to point to the location of the DTD. Here is an example:

      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">

Relative paths work. You can also replace the public identifier with a system identifier, but it isn’t necessary:

      <!DOCTYPE html SYSTEM "c:\mydtds\xhtml1-strict.dtd">
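When many files need the same change, the DOCTYPE edit can be scripted. The following is a minimal sketch in Python, not part of SP; the name point_doctype_at is made up for illustration, and it assumes a PUBLIC DOCTYPE with double-quoted identifiers like the ones shown above.

```python
import re

# Captures everything up to the opening quote of the system identifier,
# and the closing quote, so that only the identifier itself is replaced.
PUBLIC_DOCTYPE_RE = re.compile(
    r'(<!DOCTYPE\s+\S+\s+PUBLIC\s+"[^"]*"\s+")[^"]*(")',
    re.IGNORECASE,
)


def point_doctype_at(html_text, local_dtd):
    """Replace the system identifier of the first PUBLIC DOCTYPE with local_dtd."""
    # A function replacement keeps backslashes in Windows paths literal.
    return PUBLIC_DOCTYPE_RE.sub(
        lambda m: m.group(1) + local_dtd + m.group(2),
        html_text,
        count=1,
    )
```

For example, point_doctype_at(text, "DTD/xhtml1-strict.dtd") leaves the public identifier alone and swaps only the W3C URL for the relative path. Text without a matching DOCTYPE is returned unchanged.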

If the DOCTYPE statement points to a valid DTD location, then my favorite NSGMLS command is:

nsgmls -f errors.txt -grsuv -wall my.html

I realize this command isn’t very self-explanatory, so let’s break it down:

nsgmls – the executable program. You can use a full-path to get to the nsgmls command.
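-f – the location of the file where you want the error report to go. If you don’t use the -f parameter, the output is sent to the screen.
-grsuv – five single-letter options combined: -g shows the generic identifiers of open elements in error messages, -r warns about defaulted references, -s suppresses the normal output (errors are still reported), -u warns about undefined elements, and -v prints the version number. The full list is at http://www.jclark.com/sp/
-wall – warn about conditions that should usually be avoided (in the opinion of James Clark).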
my.html – the location of your HTML file.
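For scripted or batch use, the same command can be assembled and run from a script. The following is a minimal sketch in Python, assuming nsgmls is on the PATH; the names build_nsgmls_command and validate are made up for illustration and are not part of SP.

```python
import subprocess


def build_nsgmls_command(html_file, report_file, nsgmls="nsgmls"):
    """Build the argument list for the command broken down above."""
    return [nsgmls, "-f", report_file, "-grsuv", "-wall", html_file]


def validate(html_file, report_file, nsgmls="nsgmls"):
    """Run nsgmls on one file and return its exit status unchanged."""
    command = build_nsgmls_command(html_file, report_file, nsgmls)
    # check=False: a non-zero exit status is returned rather than raised.
    return subprocess.run(command, check=False).returncode
```

Calling validate("my.html", "errors.txt") runs the command from this article. Passing a full path as the nsgmls argument covers the case where the executable is not on the PATH. This sketch does not interpret the exit status; the error report is still the place to look.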

NOTE: If nsgmls cannot find the DTD, you will get massive quantities of validation errors. One of them will state “120:E: DTD did not contain element declaration for document type name”. This is a clue that there was a problem finding the DTD.
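In a long report, that one message is easy to miss among the rest. The following is a minimal sketch in Python that scans a report for it; the name find_dtd_problem is made up for illustration, and the message text is copied from the note above.

```python
# Message text quoted in the note above.
DTD_MISSING_TEXT = "DTD did not contain element declaration for document type name"


def find_dtd_problem(report_text):
    """Return the first report line containing the message, or None."""
    for line in report_text.splitlines():
        if DTD_MISSING_TEXT in line:
            return line
    return None
```

If find_dtd_problem returns a line, it is worth fixing the DTD location before reading any of the other errors.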

Understanding the Errors

There will be entity errors for the HTML DTD. You can ignore these errors because there isn’t anything you can do about them. These errors will identify the DTD:

nsgmls:location-to-dtd\xhtml1-strict.dtd:237:27:E:

Errors in your HTML will have the name of your HTML file:

nsgmls:location-to-HTML\myfile.html:7:8:E: end tag for "TITLE" omitted, but its declaration does not permit this

The number after the HTML file name (7) is the line in the HTML file where the error was found. The second number (8) is the character position on that line. NOTE: In some cases these numbers are only approximately close to the error. The E: says that NSGMLS has flagged an error. A W: in this location is only a warning that a best practice has been violated.

The error above says that the <title> tag is missing its </title> end tag. The HTML document has the following markup:

<head>
   <title>Title of the document
</head>
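The message layout described in this section (file, line, character, E or W, then the text) can be split into fields by a script. The following is a minimal sketch in Python; the names Message and parse_message are made up for illustration, and the pattern assumes the leading program name contains no colon (that is, the messages start with a plain nsgmls: prefix as in the examples above).

```python
import re
from typing import NamedTuple, Optional


class Message(NamedTuple):
    file: str
    line: int
    column: int
    severity: str  # "E" for an error, "W" for a warning
    text: str


# program:file:line:column:severity: text
# The file part is matched lazily so that a drive letter such as c:\ survives.
LINE_RE = re.compile(
    r'^[^:]+:(?P<file>.+?):(?P<line>\d+):(?P<column>\d+):'
    r'(?P<severity>[A-Z]):\s*(?P<text>.*)$'
)


def parse_message(report_line: str) -> Optional[Message]:
    """Split one report line into its fields, or return None if it does not match."""
    match = LINE_RE.match(report_line.strip())
    if match is None:
        return None
    return Message(
        match["file"],
        int(match["line"]),
        int(match["column"]),
        match["severity"],
        match["text"],
    )
```

Applied to the TITLE error above, parse_message yields the file name, line 7, character 8, severity E, and the message text as separate fields. This makes it straightforward to sort or filter a long report.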

Conclusion

The value of validation using NSGMLS is the ability to validate multiple files. I use NSGMLS regularly to parse multiple SGML, XML and HTML files. Normally I create a batch file so that all the files are validated in a single pass. NSGMLS is a little heavy when you only need to validate a single HTML page. I have also created an Omnimark script that produces an XML file, and an XSLT file that produces PDF output of the errors, so that the error messages are more human-readable when providing feedback to non-technical individuals. Alas – that is a blog entry for another day!
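As a rough illustration of turning a batch report into something easier to hand over, the following Python sketch counts messages per file and writes the counts as XML. It is not the Omnimark script mentioned above; the function names and the report and file element names are made up for illustration, and the pattern assumes message lines in the nsgmls:file:line:character:E: form.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Only the file name and the E/W flag are needed for a summary.
LINE_RE = re.compile(r'^[^:]+:(?P<file>.+?):\d+:\d+:(?P<severity>[A-Z]):')


def summarize(report_text):
    """Count messages per (file, severity) in an nsgmls error report."""
    counts = Counter()
    for line in report_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[(match["file"], match["severity"])] += 1
    return counts


def summary_to_xml(counts):
    """Render the counts as a small XML document (element names are made up)."""
    root = ET.Element("report")
    for (file_name, severity), count in sorted(counts.items()):
        ET.SubElement(
            root, "file", name=file_name, severity=severity, count=str(count)
        )
    return ET.tostring(root, encoding="unicode")
```

Feeding the text of errors.txt to summarize and then to summary_to_xml gives one file element per file and severity. A stylesheet could then format that XML for readers who do not want to look at raw parser output.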

Betty Harvey

harvey@eccnet.com

