[Schematron] Help sought: implementation of Character Repertoire in XSLT2 for embedding in schematron
Rick Jelliffe
rjelliffe at allette.com.au
Mon Sep 22 02:46:31 EDT 2008
David Carlisle wrote:
> 2008/9/20 Rick Jelliffe <rjelliffe at allette.com.au>
>
> Dave Pawson wrote:
> > So it would make sense to reduce the number of calls from one per
> > character at the top of this heap? Processing a 'string' (however
> > obtained) would reduce that overhead by len(string) calls?
> >
> All I am implementing at the moment is a string-level check.
>
>
> I don't see how you can do a string level check in general. If the
> example is just a union of char (as are all the examples in the spec)
> then you could make a single regexp just |-ing them all together, but
> even in this case the CREPDL spec warns that the number of cases if
> made into a single regex would likely be too large for some regex
> engines.
So you need to use a regex engine that is good enough. That is a matter
for users to harass developers about!
The problem I have found with the regex engine in Java is stack overflow
on large documents rather than with large regexes. Apart from increasing
the JVM thread stack size (e.g. with -Xss), the way to overcome this is
to avoid testing all the text in a document at once, and instead to test
each text() node individually. So perhaps rather than generating
    <rule context="x">
      <assert test="matches(., $theRegex)">
I should generate something like
    <rule context="x//text()">
      <assert test="matches(., $theRegex)">
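
Whether a rule context may select text nodes can depend on the Schematron
implementation, so that is worth checking before relying on the second
form. A minimal sketch of an alternative that keeps the element as the
context but still hands the regex engine one text node at a time follows.
The repertoire (Basic Latin) and the assertion message are placeholders
chosen for illustration, not part of the original proposal.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <!-- Hypothetical repertoire: Basic Latin only. The ^ and $ anchors make
       matches() test the whole string rather than any substring. -->
  <let name="theRegex" value="'^\p{IsBasicLatin}*$'"/>
  <pattern>
    <rule context="x">
      <!-- Each matches() call sees a single text node, never the
           concatenated text of the whole subtree. -->
      <assert test="every $t in .//text() satisfies matches($t, $theRegex)">
        Text outside the repertoire found in <name/>.
      </assert>
    </rule>
  </pattern>
</schema>
```

The trade-off is that a failure is reported against the x element rather
than against the individual text node.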
> In general where you have differencing and hulls etc, I think it would
> be rather hard to construct a regexp that checked the whole string.
> Certainly the xpath in the code that I posted only works one character
> at a time.
Yes, your code uses matches() for the <char> element and XPath logic
for the other parts.
The regexes I was working on (before I got a brain spasm about the
Unicode Regex syntax, now sorted out thanks) try to put everything into
a big fat regex, except for some top-level items.
I guess the best approach for Schematron reporting would be:
0) test each text() node individually
1) generate separate assertions where possible, for better error
reporting granularity, in particular for top-level intersections and
for top-level three-level logic
2) convert regexes to use XPath logic as much as possible, to reduce
the size of each particular compiled regex
3) after that, use the closest regex possible, with a
no-spurious-negatives policy for assertions
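
As a rough illustration of points 0 to 2, here is a sketch for a made-up
repertoire, "Basic Latin minus decimal digits". Instead of one regex for
the difference, it uses two small regexes combined by XPath logic, one
assertion each, so the report says which half failed. The repertoire, the
context element x, and the messages are assumptions for the example only.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern>
    <rule context="x">
      <!-- Part 1: every character must be in Basic Latin -->
      <assert test="every $t in .//text()
                    satisfies matches($t, '^\p{IsBasicLatin}*$')">
        <name/> contains characters outside Basic Latin.
      </assert>
      <!-- Part 2: no character may be a decimal digit; the exclusion is
           expressed with not() rather than inside the regex -->
      <assert test="every $t in .//text()
                    satisfies not(matches($t, '\p{Nd}'))">
        <name/> contains decimal digits, which the repertoire excludes.
      </assert>
    </rule>
  </pattern>
</schema>
```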
I wonder if there is a way to use tokenize() to actually locate problem
characters? Hmmm...
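
One possible way to do that, offered only as a sketch: when the repertoire
can be written as a single character class, tokenize() can split each text
node on runs of allowed characters, so that whatever is left over is the
offending text. The character class (Basic Latin) and the variable name
bad are hypothetical.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern>
    <rule context="x">
      <!-- Runs of allowed characters act as separators; the non-empty
           tokens that remain are runs of disallowed characters. -->
      <let name="bad"
           value="for $t in .//text()
                  return tokenize($t, '\p{IsBasicLatin}+')[. ne '']"/>
      <assert test="empty($bad)">
        Characters outside the repertoire in <name/>:
        <value-of select="string-join($bad, ' ')"/>
      </assert>
    </rule>
  </pattern>
</schema>
```

The separator pattern uses + rather than * because tokenize() raises an
error for a pattern that can match the empty string. This does not extend
directly to repertoires that need XPath logic on top of the regex.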
Cheers
Rick