[Schematron] Help sought: implementation of Character Repertoire in XSLT2 for embedding in schematron

Rick Jelliffe rjelliffe at allette.com.au
Thu Sep 18 13:14:13 EDT 2008


David Carlisle wrote:
> 2008/9/18 Rick Jelliffe <rjelliffe at allette.com.au>
>
>
>     The XSD spec defines a set, based on Perl. But XSLT2 refers back
>     to the Unicode regexes. The two have significant
>     differences: for example the availability of && operators, the
>     use of || rather than |, and the ability to have nested
>     [ items ].
>
>
> The reference to Unicode regex is only an informative note.
> XPath regexes are defined to be the same as XSD's except for 5
> extensions as listed in 7.6.1 of the F&O spec. As far as I can see no
> use is made of || as an operator.
But || is in the Unicode regex spec, and | is not (though the use of | as
notation in that spec's BNF may make it look as if it is). The Unicode
spec also has explicit intersection and difference operators, which
would come in handy. And I don't see any reference to ( or ) in the
Unicode spec: if it is meant only as a guideline for augmenting other
syntaxes, that is frustrating but positive.
> Actually wouldn't it be easier to do some of the constructs at the 
> xpath level rather than as a single regex.
> ie map intersection to
> matches(....) and matches (....)
> so using xpath and rather than trying to build a single regex?
Yes, I actually use that explicit 'and'. Top-level intersections can
also be made using separate <assert> statements, since every assertion
has to hold.
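
As a rough illustration of the explicit 'and' approach, here is a minimal
sketch in Python rather than XPath. The two character classes and the
function name in_intersection are made up for the example; they stand in
for whatever two repertoire constraints are being intersected, and
Python's re module is not the regex dialect an XSLT2 processor uses.

```python
import re

# Hypothetical constraints, each written as its own simple regex.
BASIC_LATIN = re.compile(r"[\x00-\x7F]*")  # every character is in U+0000..U+007F
NO_DIGITS = re.compile(r"[^0-9]*")         # no character is an ASCII digit


def in_intersection(text: str) -> bool:
    """True when the text satisfies both constraints.

    The intersection is taken by the boolean 'and' outside the regexes,
    so neither pattern needs an intersection operator of its own.
    """
    return bool(BASIC_LATIN.fullmatch(text)) and bool(NO_DIGITS.fullmatch(text))
```

Each regex stays simple enough that dialect differences matter less,
which is the point of moving the intersection out of the pattern.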

It is also quite possible to make a weaker validator in which, for
example, intersections are modeled using unions. Such a validator
accepts a superset of the intended repertoire, so at least there are
no false negatives.
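
A small sketch of why the weaker form is safe in that direction, again in
Python with invented character classes (LETTERS, LOWER_OR_DIGIT) and
invented function names. Anything that passes the strict, intersection
based check also passes the union based one; the reverse does not hold.

```python
import re

# Hypothetical classes standing in for two repertoire definitions.
LETTERS = re.compile(r"[A-Za-z]")
LOWER_OR_DIGIT = re.compile(r"[a-z0-9]")


def strict_ok(text: str) -> bool:
    """Intersection: every character must belong to both classes."""
    return all(
        bool(LETTERS.fullmatch(c)) and bool(LOWER_OR_DIGIT.fullmatch(c))
        for c in text
    )


def weak_ok(text: str) -> bool:
    """Union used in place of the intersection: either class is enough."""
    return all(
        bool(LETTERS.fullmatch(c)) or bool(LOWER_OR_DIGIT.fullmatch(c))
        for c in text
    )
```

For instance, "Abc1" fails strict_ok but passes weak_ok, while a string
containing a character in neither class, such as "a-b", fails both.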
>  
>
>      what is needed is to figure out what
>     kind of regular expressions SAXON 9 actually implements, and to
>     generate that. I suspect that every different
>     implementation of XSLT2 or EXSLT will have a different regex
>     library in practice!
>
>
>
> If you find saxon differs from what's specified I suspect Mike would 
> pretty quickly fix that, but if you allow yourself to use xpath 
> operators rather than just regex, most likely the darker corners of 
> the regex handling can be avoided in any case.
No disrespect to Mike or Saxon intended. The cases I mentioned were the
ones I found. Another open question is whether to use \uHHHH or \x#HHHH
for character references.

But the good news is that most of the schemas I have seen use fairly 
simple structures: I think Murata-san's examples all use top-level 
<union>s containing <char>s.
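
For that simple case the generated regex is little more than one
character class. A minimal sketch, assuming the union has already been
read into a list of code points (the function name union_to_pattern and
that input shape are invented here, and the \UHHHHHHHH escape is Python's
own syntax, not necessarily what an XSLT2 processor accepts):

```python
import re


def union_to_pattern(code_points):
    """Build a regex accepting only strings drawn from the given code points.

    Every code point is written as an escape, so characters such as ']'
    or '-' cannot be misread as character-class syntax.
    """
    if not code_points:
        raise ValueError("a union needs at least one character")
    body = "".join(f"\\U{cp:08X}" for cp in sorted(set(code_points)))
    return re.compile(f"[{body}]*")


# Example: a repertoire of 'a', 'b' and HIRAGANA LETTER A (U+3042).
pattern = union_to_pattern([0x61, 0x62, 0x3042])
allowed = pattern.fullmatch("ab\u3042a") is not None
rejected = pattern.fullmatch("abc") is None
```

Nested unions, intersections and differences would need more than this,
which is where the dialect questions above start to matter.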

Cheers
Rick Jelliffe


