[Schematron] Help sought: implementation of Character Repertoire in XSLT2 for embedding in schematron
rjelliffe at allette.com.au
Thu Sep 18 13:14:13 EDT 2008
David Carlisle wrote:
> 2008/9/18 Rick Jelliffe <rjelliffe at allette.com.au>
> > The XSD spec defines a set, based on Perl. But XSLT2 refers back
> > to the Unicode regexes. The two have significant
> > differences: for example the availability of && operators, the
> > use of || rather than |, and the ability to have nested
> > [ items ].
> the reference to Unicode regex is only an informative note.
> XPath regex are defined to be the same as XSD's except for 5
> extensions as listed in 7.6.1 of the F&O spec. As far as I can see no
> use is made of ||, as an operator.
But it is in the Unicode regex spec, and | is not (the use of | in the
BNF in that spec may make it look like it is there, though). The
Unicode spec has explicit intersection and difference operators too,
which would come in handy. And I don't see any reference to
( or ) in the Unicode spec: if it is just a guideline for augmenting
other syntaxes that is frustrating but positive.
> Actually wouldn't it be easier to do some of the constructs at the
> xpath level rather than as a single regex.
> ie map intersection to
> matches(....) and matches (....)
> so using xpath and rather than trying to build a single regex?
Yes, I actually use that explicit 'and'. And top-level intersections
can be made using separate <assert> statements.
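
The explicit 'and' can be sketched outside XPath too. The Python below is only an analogue, not the XSLT2 code under discussion: Python's re module stands in for XPath matches(), and the two character classes and the name in_intersection are made up for illustration.

```python
import re

# Two hypothetical repertoires, each a plain character class.
LETTERS_OR_DIGITS = re.compile(r"[A-Za-z0-9]*")
NO_VOWELS = re.compile(r"[^AEIOUaeiou]*")

def in_intersection(text: str) -> bool:
    """True if every character of text belongs to both repertoires.

    The intersection is never written as a single regex; it is the
    logical 'and' of two independent whole-string matches.
    """
    return (LETTERS_OR_DIGITS.fullmatch(text) is not None
            and NO_VOWELS.fullmatch(text) is not None)
```

Each regex stays simple, so no && operator or nested class is needed in the regex dialect itself.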
It is also quite possible to make a weaker validator in which, for
example, intersections are modeled using unions. That accepts a
superset of the intended repertoire, so nothing valid is wrongly
rejected: no false negatives at least.
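
A small sketch of why the weaker form is safe in that direction, using Python sets of characters rather than regexes. The repertoires A and B and the helper make_validator are hypothetical, chosen only to show that whatever the strict validator accepts, the weak one accepts as well.

```python
def make_validator(allowed):
    """Return a function that accepts text drawn only from `allowed`."""
    allowed = frozenset(allowed)

    def validate(text):
        return all(ch in allowed for ch in text)

    return validate

# Hypothetical repertoires for illustration.
A = set("abcdef")
B = set("defghi")

strict = make_validator(A & B)  # true intersection: d, e, f
weak = make_validator(A | B)    # intersection approximated by union
```

Since A & B is always a subset of A | B, the weak validator can miss errors but never reports one that the strict validator would not.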
> > what is needed is to figure out what
> > kind of regular expressions SAXON 9 actually implements, and to
> > generate that. I suspect that every different
> > implementation of XSLT2 or EXSLT will have a different regex
> > library in practice!
> If you find saxon differs from what's specified I suspect Mike would
> pretty quickly fix that, but if you allow yourself to use xpath
> operators rather than just regex, most likely the darker corners of
> the regex handling can be avoided in any case.
No disrespect to Mike or Saxon intended. The cases I mentioned were the
ones I found. Also whether to use \uHHHH or \x#HHHH for character
But the good news is that most of the schemas I have seen use fairly
simple structures: I think Murata-san's examples all use top-level
<union>s containing <char>s.
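
For that simple case the translation is close to mechanical. A rough sketch in Python, assuming a simplified fragment in which each <char> holds one literal character; the sample below omits namespaces and attributes and is not taken from any real schema, and union_to_pattern is a made-up name.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative fragment only: no namespace, one literal character per <char>.
SAMPLE = ("<union><char>a</char><char>b</char>"
          "<char>-</char><char>]</char></union>")

def union_to_pattern(xml_text: str) -> "re.Pattern[str]":
    """Turn a top-level union of chars into one character-class regex."""
    root = ET.fromstring(xml_text)
    chars = [c.text for c in root.findall("char") if c.text]
    if not chars:
        raise ValueError("union contains no <char> items")
    # re.escape protects characters that are special inside [...],
    # such as ']' and '-'.
    char_class = "[" + "".join(re.escape(ch) for ch in chars) + "]"
    return re.compile(char_class + "*")
```

Use fullmatch on the result to test a whole string, for example union_to_pattern(SAMPLE).fullmatch(text).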