Discussion of the EPCES DTD as relating to the Pilot Conversion of an Illustrated Parts Catalog

Author: Derek Millar, SoftQuad Inc.

Date: Dec. 10, 1996


Introduction

An Illustrated Service Parts Catalog has been converted from IBM Bookmaster format (GML coding) to SGML according to the EPCES DTD. The catalog was approximately 500 pages in length and contained 326 unique illustrations. Information about the EPCES standard and related documents was obtained from the Railroad Industry Forum website (http://www.eccnet.com/rif/).

Because of the well-defined and regular structure of the source files, it was possible to perform the conversion to SGML via a largely programmatic process. The creation of hyperlinked "hotspots" from drawings to part information was done manually. Manual tagging, when necessary, was done in an isolated and controlled environment.

This letter discusses areas where the data was not well modeled by the DTD, technical points relating to the DTD which were a source of confusion, and potential guidelines for authors of material which will be maintained and exchanged according to the EPCES standard. The objective of this letter is to provide feedback for the continuing development of the EPCES standard and to serve as a reference for those implementing or planning to implement the standard.


Areas in which the DTD did not model the data well

The following section is organized by element in the EPCES DTD. The points mentioned here do not necessarily imply a deficiency in the DTD. They may imply a need for rethinking how the data is prepared, presented and/or structured. Some points may also indicate areas of difficulty for legacy conversion to EPCES.

graphic

The content model of an epc-fig element requires a figure element and an optional parts-list element. The figure element requires at least one graphic element. The graphic element has an implied attribute, "filename", which references a drawing. It was observed in the data that a parts list is not always accompanied by a drawing. At present, the SGML must contain an empty graphic element which serves only to satisfy the DTD. Making the graphic element optional and its "filename" attribute required would provide a better solution.

title

Using the title element in its various contexts was often problematic. In the intro element, topics do not always have a title. This is probably best resolved by improved content creation. If a title is required, then it should be present.

For chapter, section, subsection and epc-fig elements, the title element is required. These elements were often found to nest immediately within one another. A shortcoming of this is that when the source does not supply a title for each of these elements, one must either duplicate data by inserting the same text in title elements which immediately follow each other, or insert a title element with no content. For this conversion, some data was repeated to resolve this problem. The reasoning behind the decision was that this would provide more context for the user in a hypertext-driven system. The main issue here is the proper identification of the structure to which a title belongs. This problem is inherent in a legacy conversion and is best resolved by a structured authoring environment.

assoc-text

Running text often occurred in the middle of a parts list and pertained to a number of item-groups. This was handled by using the refint element to provide a hyperlink from the related item-groups to the running text, which was moved to reside physically at the end of the parts-list as assoc-text, as prescribed in the DTD. While this is adequate, it does not seem like the best solution. Moreover, it required manual markup on a case-by-case basis. This is undesirable from the perspective of implementing EPCES for a large amount of legacy catalog data.

If it is genuinely desired to restrict the occurrence of running text to the end of parts lists, then the DTD is fine as it is and authoring practices will need to be modified. If the above occurrences are critical to the accurate interpretation of the information and a cross-reference does not guarantee that the information in the running text will be accessed, then an alternative method of handling this should be investigated.

kits

The kits element must be described in greater detail. It is not clear in the DTD and supporting documentation whether the kit element is intended to contain the part which is described as a "kit", the parts which comprise the kit, or otherwise. Also, kits occurred throughout a parts list, not only at the end. This required rearranging the order of the data. It may be helpful to implementors to mention that this is an acceptable practice. This information would better reside in an "Implementors' Guide" than in the standard itself.

ref

The element ref is used to enclose cross-referencing elements and revision-start and -end elements within the nomen-col element. The content model of ref is exactly the same as that of the para element, except that para additionally allows emphasis and equ elements. The element ref therefore seems superfluous. If its content were instead defined as a parameter entity, there would be more self-consistency within the DTD. The result would be a DTD which is easier to understand and use.

hotspot

Hotspots on graphics often refer to more than one call-out number. While one approach to handling this is a guideline which requires all call-outs to be explicitly present on the drawing, a complementary approach would be to make the "ref" attribute of the hotspot element of type "IDREFS". This would allow multiple item-groups to be referenced by a single hotspot. At present, this type of situation is handled by placing all parts referenced into a single item-group element. Changing the ref attribute to be of type IDREFS would make the DTD better suited to handle legacy data.

It should be noted that the hotspot element contains the attribute "synex-af". It is not clear whether this attribute is "proprietary" or suggests/endorses the use of a particular product.

refext

External references to other manuals were handled using the refext element in the title of a section. While this worked, there is certainly room for a better solution. Having these references occur in a title element does not seem to be the most appropriate way to encode this information.

part-nbr

It was observed that some parts in a parts list did not have a part number. This was the case for "common" parts such as "screw", "nut" or "bolt", and for some "kits". In these cases, the part-nbr element is left with no content. Ideally, every part should have a part number. A guideline for how an application should handle these cases is out of the scope of the DTD; however, it would be helpful to persons who must prepare applications or databases which use EPCES data.

In General

Examples of the intended use of certain elements would be useful, from an implementor's point of view. With this being one of the first implementations, there are no precedents to follow. It is hoped that the practices used in this conversion can serve as the basis for precedents and the development of guidelines for encoding information using the EPCES DTD.


Technical aspects of the DTD which caused confusion

The DTD which models the structure and information content of an illustrated parts catalog (and now prescribes it) was found to adequately support all of the information in the pilot catalog. There were, however, some aspects of the DTD which caused confusion when making data encoding decisions and when designing an application which uses the EPCES standard as a data input format.

ISO minimum literals

The EPCES DTD includes standard character entity sets. These sets are identified by public identifiers beginning with the minimum literal "ISO 8879-1986...". This is, strictly speaking, incorrect. These literals should be changed to "ISO 8879:1986...". This is a small point, but one which should be corrected to maintain the integrity of the standard.
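The correction described above is mechanical, and could be applied to a DTD's public identifiers with a small script. The following is a minimal sketch in Python, not part of the EPCES standard; the function name fix_iso_owner is hypothetical, and the sample identifier is the standard one for the ISO "Added Latin 1" entity set.

```python
import re

# Matches the hyphenated owner identifier only at the start of a public
# identifier, immediately before the "//" separator.
_HYPHENATED_OWNER = re.compile(r'^ISO 8879-1986(?=//)')


def fix_iso_owner(public_id):
    """Rewrite a leading "ISO 8879-1986" as "ISO 8879:1986".

    Identifiers that do not start with the hyphenated form are
    returned unchanged.
    """
    return _HYPHENATED_OWNER.sub("ISO 8879:1986", public_id)


if __name__ == "__main__":
    print(fix_iso_owner("ISO 8879-1986//ENTITIES Added Latin 1//EN"))
```

Anchoring the pattern to the start of the identifier keeps the substitution from touching any other text which happens to contain the same characters.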

Public Identifiers

The EPCES DTD uses DoD public identifiers, rather than ISO standard (and possibly registered) public identifiers, to declare non-SGML data notations. The use of public identifiers registered by the ISO wherever possible would maintain the integrity of the standard.

SoftQuad-related Note:
The EPCES DTD declares a CGM notation which is incompatible with Panorama Pro. While this is more Panorama's problem than that of the EPCES DTD, it is still bothersome/frustrating. Panorama does use a registered ISO public identifier for CGM, TIFF and CCITT4 notations.

Tag omission on EMPTY elements

While it is not an error to disallow end-tag omission for an element whose content model is declared as EMPTY, the SGML standard encourages allowing it, for the benefit of human readers of the DTD. The choice of allowing or disallowing end-tag omission for EMPTY elements is made inconsistently in the EPCES DTD. The DTD should be made consistent on this point.

Referring to external entities

The standard does not discuss or recommend any method for including external entities in an EPCES document instance. This can be done by either declaring the external entities in the document type declaration subset at the beginning of the instance or by creating a "driver DTD" which includes the EPCES DTD and any other entities required for the instance (this technique is used by the DocBook DTD). The former method is simpler to implement and was used for this project. This type of information would fit best into an "Implementors' Guide" rather than the standard itself.
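As an illustration of the first method, the declaration subset for an instance can be generated from a list of drawings. The sketch below is in Python and rests on several assumptions which are not taken from the EPCES DTD: that drawings are referenced as external data entities, and that the document type name "epc", the system identifier "epces.dtd", and the notation name "cgm" are placeholders to be replaced with the real values.

```python
def doctype_with_graphics(doctype, dtd_system_id, graphics, notation):
    """Build a document type declaration whose internal subset declares
    one external data entity per drawing.

    graphics maps an entity name to the drawing's file name. All names
    passed in are supplied by the caller; none come from the EPCES DTD.
    """
    lines = ['<!DOCTYPE %s SYSTEM "%s" [' % (doctype, dtd_system_id)]
    # Sort by entity name so the output is stable between runs.
    for name, filename in sorted(graphics.items()):
        lines.append('<!ENTITY %s SYSTEM "%s" NDATA %s>' % (name, filename, notation))
    lines.append(']>')
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(doctype_with_graphics("epc", "epces.dtd", {"fig001": "fig001.cgm"}, "cgm"))
```

With the placeholder values shown, the script prints a three-line declaration: the DOCTYPE line opening the subset, one ENTITY declaration for fig001, and the closing "]>".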

Content Model Inconsistencies

The elements serial-nbr, equip-id-nbr, comp-loc-nbr and lot-nbr all have a content model of "#PCDATA". This implies that the information is encoded as element content. The elements serial-range, equip-id-range, comp-loc-range and lot-range are all EMPTY elements with attributes to hold the information. It appears that both sets of elements serve the same purpose, the only difference being that for one set, the information is a single value and for the other it is a range of values. For consistency, these elements should all have the same content models. This would make the DTD easier to understand and easier to implement.


Suggestions for authoring practices

With any migration to an SGML-based information management strategy, there are techniques and practices for content creation which can streamline the changeover and increase the success of the new systems. A few are mentioned here, based on limited knowledge of the present content creation methods. It is not known how feasible it would be to implement some of these practices. This section does not directly address issues of legacy conversion.

Any standard can benefit from an "Implementors' Guide" which provides examples, recommendations and techniques which are out of the scope of the standard itself, but are invaluable for reducing the effort of implementing the standard and increasing the likelihood of success. The development of such a guide, which could be placed at the RIF website, should be considered for EPCES.

Creating Drawings

Many applications using EPCES data are expected to be "drawing-driven", where access to part information is via hyperlinks on areas of drawings. In this situation, call-out numbers of the form "2 thru 6" can make it difficult to access information about parts referenced by numbers 3, 4, and 5. While the motivation of saving space on paper is a legitimate one, call-out numbers should be explicitly listed wherever possible.

It should be noted that for this project, the "2 thru 6" problem was resolved at the client's request by including the parts referenced by numbers 2 to 6 in the same item-group element and resolving the link from the "2 thru 6" call-out to that item-group. While this provides access to the parts not explicitly mentioned on the drawing, it does create a relationship among the data due solely to presentation. It also does not readily accommodate a separate call-out for reference number 3, 4 or 5 occurring in addition to "2 thru 6".
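Where call-out labels of the "2 thru 6" form remain in legacy drawings, an application could expand them into the individual reference numbers they stand for. The following is a minimal sketch in Python; it was not part of this project, the function name expand_callout is hypothetical, and it assumes labels are either a single number or two numbers separated by the word "thru".

```python
import re

# Two integers separated by the word "thru", e.g. "2 thru 6".
_RANGE = re.compile(r'^\s*(\d+)\s+thru\s+(\d+)\s*$', re.IGNORECASE)


def expand_callout(label):
    """Return the list of reference numbers covered by a call-out label."""
    match = _RANGE.match(label)
    if match:
        first, last = int(match.group(1)), int(match.group(2))
        if first > last:
            raise ValueError("descending call-out range: %r" % label)
        return list(range(first, last + 1))
    # Not a range: treat the label as a single reference number.
    return [int(label)]


if __name__ == "__main__":
    print(expand_callout("2 thru 6"))
```

Each number in the expanded list could then be linked to its own item-group, which avoids grouping parts together solely because of how the drawing was labelled.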

There is a potential advantage for the engineering drawings to contain the call-out numbers and other text as actual text within the drawing and not as geometric shapes. This leaves open the possibility of programmatically identifying information about regions on graphics which require hyperlinked "hotspots". Data preparation for EPCES might be reduced significantly if such programs were developed and worked reliably.

Drawings should be prepared in proper orientation; there is no need to rotate an image for electronic delivery. Landscape printing of large drawings (if printing is required) should be handled by the application.

Cross-referencing

An alternative method for indicating cross-references based on context rather than physical locations should be used. "See page..." is not effective for electronic delivery. Content artifacts such as this tend to emphasize the "legacy" nature of the data.

Data consolidation

The EPCES standard (and SGML in general) offers an opportunity to exercise a great deal of control over the information in a catalog. In particular, it is now possible to address the "normalization" of the information in the sense of relational database techniques.

The present conversion started from a data source intended for producing printed output, and therefore contained many things which were done in order to produce a useful printed catalog. In particular, there may be parts which are referred to from multiple drawings for which the information is duplicated within the SGML document. Eliminating occurrences such as this was outside of the scope of the conversion. However, it would certainly be desirable to have this done.
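The kind of consolidation meant here can be sketched as follows. This Python example was not part of the conversion; the row layout (figure, call-out, part number, description) and the function name normalize are assumptions made for illustration, and it further assumes every row carries a part number, which was not always true of the pilot data.

```python
def normalize(rows):
    """Split duplicated part information into two structures.

    rows is an iterable of (figure, callout, part_nbr, description)
    tuples. Returns (parts, usages): parts holds each description once,
    keyed by part number, and usages records where each part is
    referenced from.
    """
    parts = {}
    usages = []
    for figure, callout, part_nbr, description in rows:
        # Keep the first description seen for a given part number.
        parts.setdefault(part_nbr, description)
        usages.append((figure, callout, part_nbr))
    return parts, usages


if __name__ == "__main__":
    # Hypothetical sample data.
    rows = [
        ("fig1", "1", "P-100", "Bolt"),
        ("fig2", "4", "P-100", "Bolt"),
        ("fig2", "5", "P-200", "Washer"),
    ]
    parts, usages = normalize(rows)
    print(parts)
    print(usages)
```

A real implementation would also need to decide what to do when the same part number appears with differing descriptions; this sketch simply keeps the first.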

The conversion to SGML was done via a largely programmatic process. In addition to the usual advantages of automated data conversion, there were three content-related benefits of the programmatic approach. The first was the detection of subtle inconsistencies in the data, for example the use of the letter "l" where the number "1" was intended. The second was the discovery of editorial errors, such as the duplication of a page of the catalog. The third was the detection of subtle errors in Bookmaster coding which only by chance did not result in incorrect printing and were therefore previously unnoticed. The ability to detect and eliminate these errors results in increased value both for the owner of the information and the recipients of it.
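A check of the first kind can be quite small. The following Python sketch is not the code used in the conversion; it simply flags tokens made up only of digits and the letter "l", containing at least one of each, as candidates for review. The function name suspicious_tokens and the sample text are hypothetical.

```python
import re

# A whole token consisting only of digits and lowercase "l", with at least
# one digit and at least one "l" (enforced by the two lookaheads).
_SUSPECT = re.compile(r'\b(?=[0-9l]*[0-9])(?=[0-9l]*l)[0-9l]+\b')


def suspicious_tokens(text):
    """Return tokens where the letter "l" may stand in for the digit "1"."""
    return _SUSPECT.findall(text)


if __name__ == "__main__":
    print(suspicious_tokens("Qty l0 of part 4l7 on line 12"))
```

Ordinary words such as "line" and purely numeric tokens such as "12" are not reported, since neither mixes digits with the letter "l". The results are only candidates; a person would still need to confirm each one against the source.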


Conclusion

To summarize the points made in this letter: