[Schematron] Help sought: implementation of Character Repertoire in XSLT2 for embedding in schematron

Rick Jelliffe rjelliffe at allette.com.au
Mon Sep 22 02:46:31 EDT 2008

David Carlisle wrote:
> 2008/9/20 Rick Jelliffe <rjelliffe at allette.com.au>
>     Dave Pawson wrote:
>     > So it would make sense to reduce the number of calls from one per
>     > character at the top of this heap? Processing a 'string' (however
>     > obtained) would reduce that overhead by len(string) calls?
>     >
>     All I am implementing at the moment is a string-level check.
> I don't see how you can do a string-level check in general. If the
> example is just a union of char (as are all the examples in the spec)
> then you could make a single regexp by just |-ing them all together,
> but even in this case the CREPDL spec warns that the number of cases,
> if made into a single regex, would likely be too large for some regex
> engines.
So you need to use a regex engine that is good enough. That is a matter 
for users to harass developers about!
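
A minimal sketch of the "one big regex" idea for a plain union of characters, written in Python purely for illustration (it is not the XSLT2 generator discussed in this thread). The repertoire in `ALLOWED_RANGES` and the name `build_repertoire_regex` are made up for the example, and Python's `re` only stands in for the XPath regex engine, whose syntax differs.

```python
import re

# Hypothetical repertoire: a union of inclusive code point ranges.
ALLOWED_RANGES = [(0x09, 0x0A), (0x20, 0x7E), (0xA0, 0xFF)]

def build_repertoire_regex(ranges):
    """Fold a union of code point ranges into one character class."""
    parts = []
    for lo, hi in ranges:
        lo_s = "\\U%08x" % lo          # \UXXXXXXXX escape understood by re
        hi_s = "\\U%08x" % hi
        parts.append(lo_s if lo == hi else lo_s + "-" + hi_s)
    # The whole string must consist of allowed characters only.
    return re.compile("[" + "".join(parts) + "]*")

regex = build_repertoire_regex(ALLOWED_RANGES)
```

Using one character class rather than a long alternation keeps the compiled pattern small, which is one way to sidestep the size concern when the repertoire really is just a union.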

The problem I have found with the regex engine in Java is stack overflow 
on large documents rather than with large regexes. Apart from increasing 
the JVM stack size, the way to overcome this is to avoid testing all the 
text in a document at once and instead test each text() node 
individually. So perhaps rather than generating 
  <rule context="x">
    <assert test="matches(., $theRegex)">...</assert>
  </rule>

I should generate something like 
  <rule context="x//text()">
    <assert test="matches(., $theRegex)">...</assert>
  </rule>
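
The effect of moving the context to individual text nodes can be sketched in Python (again only as an illustration, with Python's `re` standing in for `matches()`). The regex `THE_REGEX`, the element name `x` and the function `failing_text_nodes` are assumptions for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical repertoire: printable ASCII plus tab and newline.
THE_REGEX = re.compile(r"[\t\n\x20-\x7e]*")

def failing_text_nodes(xml_source, context="x"):
    """Return each text chunk at or below a <context> element that fails.

    Roughly mirrors context="x//text()": every chunk is matched on its
    own, so no single match has to span all the text of the document.
    """
    failures = []

    def walk(element, inside):
        inside = inside or element.tag == context
        if inside and element.text and not THE_REGEX.fullmatch(element.text):
            failures.append(element.text)
        for child in element:
            walk(child, inside)
            # child.tail is text of `element`, so it counts when inside.
            if inside and child.tail and not THE_REGEX.fullmatch(child.tail):
                failures.append(child.tail)

    walk(ET.fromstring(xml_source), False)
    return failures
```

Each call to `fullmatch` sees only one short chunk, so the depth of any backtracking is bounded by the longest text node rather than by the document size.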
> In general where you have differencing and hulls etc, I think it would 
> be rather hard to construct a regexp that checked the whole string. 
> Certainly the xpath in the code that I posted only works one character 
> at a time.
Yes, your code uses matches() for the <char> element and XPath logic 
for the other parts.
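
As a rough picture of that split, here is a Python sketch in which each leaf is a small single-character regex and the difference is plain boolean logic. It is not David's code; the two leaf repertoires and the names `is_latin1`, `is_control`, `in_repertoire` and `check_string` are invented for the example.

```python
import re

# Hypothetical leaf tests, one small regex per <char>-like leaf.
is_latin1 = re.compile(r"[\x00-\xff]").fullmatch
is_control = re.compile(r"[\x00-\x1f\x7f-\x9f]").fullmatch

def in_repertoire(ch):
    """Difference of two leaves, expressed as boolean logic, not regex."""
    return bool(is_latin1(ch)) and not is_control(ch)

def check_string(text):
    """One in_repertoire call per character, so len(text) calls in total."""
    return all(in_repertoire(ch) for ch in text)
```

The regexes stay tiny, at the price of one call per character.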

The regexes I was working on (before I got a brain spasm about the 
Unicode Regex syntax, now sorted out thanks) try to put everything into 
a big fat regex, except for some top-level items.

I guess the best approach for Schematron reporting would be:
  0) test each text() node individually
  1) generate separate assertions where possible, for better
     error-reporting granularity, in particular for top-level
     intersections and for top-level three-level logic
  2) convert regexes to use XPath logic as much as possible, to reduce
     the size of each compiled regex
  3) after that, use the closest regex possible, with a
     no-spurious-negatives policy for assertions
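
Point 1 can be sketched as follows, in Python for illustration only. A top-level intersection is split so that each member becomes its own check with its own message, instead of one combined pass/fail. The members in `MEMBERS` and the function `failed_assertions` are hypothetical.

```python
import re

# Hypothetical top-level intersection: text must satisfy every member.
MEMBERS = {
    "Latin-1 only": re.compile(r"[\x00-\xff]*"),
    "no C0/C1 controls except tab and newline":
        re.compile(r"[^\x00-\x08\x0b-\x1f\x7f-\x9f]*"),
}

def failed_assertions(text):
    """Return the label of every intersection member the text violates."""
    return [label for label, regex in MEMBERS.items()
            if regex.fullmatch(text) is None]
```

In Schematron terms each dictionary entry would correspond to a separate assert, so the report says which constraint failed and each compiled regex stays smaller.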

I wonder if there is a way to use tokenize() to actually locate problem 
characters? Hmmm...
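
One way the tokenize() idea might work, sketched in Python with `re.split` as a stand-in: split the text on runs of allowed characters, and whatever tokens remain are the problem characters. The repertoire in `ALLOWED_RUN` and the name `problem_runs` are assumptions for the example.

```python
import re

# Hypothetical repertoire: printable ASCII plus tab and newline.
# The "+" matters: the separator pattern must not match the empty string.
ALLOWED_RUN = re.compile(r"[\t\n\x20-\x7e]+")

def problem_runs(text):
    """Split on runs of allowed characters; the leftovers are the problems."""
    return [token for token in ALLOWED_RUN.split(text) if token]
```

The XPath 2.0 analogue would presumably be something like tokenize(., $allowedRun)[. != ''], though that is untested here and assumes the allowed repertoire can be written as a single character class.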

