[Schematron] Fwd: Help sought: implementation of Character Repertoire in XSLT2 for embedding in schematron

David Carlisle d.p.carlisle at googlemail.com
Mon Sep 22 05:02:03 EDT 2008


Sorry, this dropped off the list by mistake.

---------- Forwarded message ----------
From: David Carlisle <d.p.carlisle at googlemail.com>
Date: 2008/9/22
Subject: Re: [Schematron] Help sought: implementation of Character
Repertoire in XSLT2 for embedding in schematron
To: Rick Jelliffe <rjelliffe at allette.com.au>




2008/9/22 Rick Jelliffe <rjelliffe at allette.com.au>

> Yes, your regular expressions use match() for the <char> element and XPath
> logic for the other parts.
>
> The regexes I was working on (before I got a brain spasm about the Unicode
> Regex syntax, now sorted out thanks) try to put everything into a big fat
> regex, except for some top-level items.


This is clearly possible in principle, because every repertoire is a
partition of the Unicode characters into three disjoint sets, each of which
must necessarily be expressible as a regexp.

However, unless you really expand the CREPDL schema out into the underlying
sets of integers and then reconstitute a regexp from them (which may be hard
in XSLT), I don't see how you can do this in a single regexp.
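
As a rough illustration of that expand-and-reconstitute route, here is a
minimal sketch in Python rather than XSLT. It assumes the schema has already
been parsed into (first, last) character pairs; the function names expand and
to_char_class are made up for this example and are not part of CREPDL or any
library.

```python
import re


def expand(ranges):
    """Expand (first, last) character pairs into a set of code points."""
    return {cp for lo, hi in ranges for cp in range(ord(lo), ord(hi) + 1)}


def to_char_class(codepoints):
    """Rebuild a regex character class from a set of code points.

    Returns None for an empty set, since "[]" is not a usable class.
    """
    if not codepoints:
        return None
    runs = []
    run_start = prev = None
    for cp in sorted(codepoints):
        if prev is not None and cp == prev + 1:
            prev = cp  # still inside the current consecutive run
            continue
        if run_start is not None:
            runs.append((run_start, prev))  # close the previous run
        run_start = prev = cp
    runs.append((run_start, prev))  # close the final run
    body = "".join(
        re.escape(chr(a)) if a == b
        else re.escape(chr(a)) + "-" + re.escape(chr(b))
        for a, b in runs
    )
    return "[" + body + "]"


# Intersection of [a-c] and [c-d], done on the integer sets.
both = expand([("a", "c")]) & expand([("c", "d")])
char_class = to_char_class(both)
```

With the sets in hand, union, intersection and difference are ordinary set
operations, and only the final result has to be turned back into a regexp.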

Take the intersection of [a-c] and [c-d], for example. If you can spot that
this is just c, then you can generate the regexp ^c*$ to check that an
arbitrary-length string consists only of c's. At the XPath level you can generate

every $c in string-to-codepoints(.) satisfies
    (matches(codepoints-to-string($c), '[a-c]')
     and matches(codepoints-to-string($c), '[c-d]'))

which will check every character, but I don't know of a regexp that you can
apply to the whole string to do this check and that you can make just by
joining the regex fragments from the CREPDL schema.
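
To make the contrast concrete, here is a small Python sketch of the two
checks, with Python's re module standing in for XPath's matches(). The
FRAGMENTS list and both function names are invented for the example. The
per-character check only uses the fragments as given, while the whole-string
check relies on someone having worked out by hand that the intersection is c.

```python
import re

# Hypothetical regex fragments, as if copied from two <char> elements.
FRAGMENTS = ["[a-c]", "[c-d]"]


def every_char_matches(text, fragments=FRAGMENTS):
    """True when each character of text matches all of the fragments."""
    return all(
        re.fullmatch(fragment, ch) is not None
        for ch in text
        for fragment in fragments
    )


def whole_string_check(text):
    """Hand-derived equivalent: the intersection of the fragments is 'c'."""
    return re.fullmatch("c*", text) is not None
```

Both functions accept "ccc" and reject "abc", but only the first can be
generated mechanically from the fragments.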



>
> I guess the best approach for Schematron reporting would be
>  0) test each text() node individually


This would be good, but I don't know how to do it (which doesn't mean that
it is not possible).


>
> I wonder if there is a way to use tokenize() to actually locate problem
> characters? Hmmm...
>

If you _could_ generate a regex that matched the characters in the
repertoire, tokenize() would return the characters that did not match, yes.
Or, if you check every character separately, as above, then of course that
functionality is automatic.
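
Both ways of locating the problem characters can be sketched in Python, with
re.split playing the role of tokenize(). The REPERTOIRE pattern here is just
an assumed placeholder for whatever regex could be derived from the schema,
and the function names are made up for the example.

```python
import re

# Assumed repertoire regex; a real one would be derived from the schema.
REPERTOIRE = "[a-c]"


def offending_runs(text, repertoire=REPERTOIRE):
    """Split on runs of allowed characters; what is left did not match."""
    pieces = re.split("(?:" + repertoire + ")+", text)
    return [piece for piece in pieces if piece]  # drop empty edge tokens


def offending_chars(text, repertoire=REPERTOIRE):
    """Per-character check that also reports each offender's position."""
    return [
        (index, ch)
        for index, ch in enumerate(text)
        if re.fullmatch(repertoire, ch) is None
    ]
```

For the input "abXcaYYb" the split gives the runs X and YY, while the
per-character version reports X at index 2 and Y at indexes 5 and 6 (counting
from 0). The split needs a single repertoire regex to exist, whereas the
per-character version works with any test that can be applied to one character.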

>
> Cheers
> Rick
>



-- 
http://dpcarlisle.blogspot.com/


