[Schematron] Help sought: implementation of Character Repertoire in XSLT2 for embedding in schematron

Rick Jelliffe rjelliffe at allette.com.au
Mon Sep 22 04:08:16 EDT 2008


The context of this is the properties element I am proposing for the new 
Schematron, to meet several different requirements from various sources: 
datatyping (integration of CRepDL, XSD datatypes, DTLL, etc.), richer 
output in SVRL, and so on. The new skeleton will have these as an 
experimental feature, if I get the test code written today.

So the schema that the user would write would be

 <schema ...><title>Stubbed schema to show embedded CRepDL script</title>
 <pattern>
   <rule context="x">
    <assert test="true()" properties="iso8859-1-text">The <name/> element should only contain ISO 8859-1 text</assert>
   </rule>
 </pattern>

  <properties>
    <property id="iso8859-1-text">
       <cdrl:union>
           <cdrl:char>
           ....
       </cdrl:union>
    </property>
  </properties>
 </schema>
   
where the preprocessor takes this and generates

 <schema ...><title>Stubbed schema to show embedded CRepDL script</title>
 <pattern>

   <rule context="x">
    <assert test="true()" properties="iso8859-1-text">The <name/> element should only contain ISO 8859-1 text</assert>
   </rule>

   <rule context="x//text()">
    <assert test="matches(., 'theRegex')">The <name/> element should only contain ISO 8859-1 text</assert>
   </rule>
 </pattern>

  <properties>
    <property id="iso8859-1-text">
       <cdrl:union>
           <cdrl:char>
           ....
       </cdrl:union>
    </property>
  </properties>

 </schema>

I.e., when an assertion has a property that is a character repertoire, 
generate a rule with assertions for the appropriate regexes on the text 
under the context.*
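
To make that concrete, here is a hand-written sketch of what the 
generated rule might look like. It is not the output of any CRepDL 
compiler. The regex assumes the repertoire is tab, LF, CR, U+0020 to 
U+007E and U+00A0 to U+00FF, and the queryBinding and the path=".." on 
name are choices made for this illustration only.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <title>Hand-expanded sketch of a character repertoire check</title>
  <pattern>
    <!-- Each text node under x is tested on its own -->
    <rule context="x//text()">
      <!-- Assumed repertoire: tab, LF, CR, U+0020 to U+007E, U+00A0 to U+00FF.
           The anchors make the whole text node match, not just a substring.
           path=".." reports the parent element, since a text node has no name. -->
      <assert test="matches(., '^[\t\n\r&#x20;-&#x7E;&#xA0;-&#xFF;]*$')">The <name path=".."/> element should only contain ISO 8859-1 text</assert>
    </rule>
  </pattern>
</schema>
```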

Cheers
Rick

*  Actually it is the text under the "subject", i.e. concat(@context, 
@subject), because assertions do not define things about the context: 
they implement the assertion text as best they can, and the optimal 
split of XPaths between the rule/@context and assert/@test may make the 
context something different from the subject of interest in the text 
assertion.
 

Dave Pawson wrote:
> I may be a mile off. If so please tell me.
>
>
> 2008/9/22 Rick Jelliffe <rjelliffe at allette.com.au>:
>
>   
>> I should generate something like
>>  <rule context="x//text()">
>>         <assert test="matches(., $theRegex)">
>>     
>
>
> crdl talks about (at the top level)
> in, not-in, unknown.
>
> So shouldn't the assertion be one of those?
> The simplest being
> <assert test="in (. , $repertoire-reference)">
>
> Surely that's what the user wants to know?
> Everything else should be below the water level, so to speak.
>
>
>
>   
>> I guess the best approach for Schematron reporting would be
>>  0) test each text() node individually
>>  1) generate separate assertions where possible, so for better error
>> reporting granularity. In particular, for top-level intersections, and
>> for top-level three-level logic.
>>  2) convert regexes to use XPath logic as much as possible, to reduce
>> the size of each particular compiled regex
>>  3) after that, use the closest regex possible, with a
>> no-spurious-negatives policy for assertions
>>
>> I wonder if there is a way to use tokenize() to actually locate problem
>> characters? Hmmm...
>>     
>
>
> If (as David keeps asserting - and I believe him) the actual tests are done
> one by one, say with a recursive function, then the character will be
> available.... just that how do you return multiple values (false, \u1224) from
> a function?
>
> regards
>
>
>
>
>   
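
On the two open questions above (locating the problem characters, and 
returning more than one value from a function), one possible approach in 
XSLT 2.0 is to return a sequence: an empty sequence means "all in the 
repertoire", and a non-empty one carries the offending characters. The 
following is only a sketch. The f: namespace, the function name and the 
idea of passing the repertoire as the body of a regex character class 
are all hypothetical, not part of CRepDL or the skeleton.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.org/hypothetical/crepdl-helpers">

  <!-- Hypothetical helper: returns the distinct characters of $text that
       are outside the repertoire, as a sequence of one-character strings.
       $allowed-class is the body of a regex character class. -->
  <xsl:function name="f:not-in" as="xs:string*">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="allowed-class" as="xs:string"/>
    <!-- Delete every allowed character; whatever is left is a problem. -->
    <xsl:variable name="leftover"
        select="replace($text, concat('[', $allowed-class, ']'), '')"/>
    <!-- One string per distinct offending code point -->
    <xsl:sequence select="
        for $cp in distinct-values(string-to-codepoints($leftover))
        return codepoints-to-string($cp)"/>
  </xsl:function>

</xsl:stylesheet>
```

Assuming the implementation allows such a function to be pulled into the 
schema, the assertion test could then be empty(f:not-in(., $allowed)), 
and the message could list the characters with 
string-join(f:not-in(., $allowed), ' '), where $allowed is a variable 
holding the character class body. Using tokenize() with the allowed 
class (followed by +) as the separator pattern gives a similar result, 
except that it returns runs of problem characters and may include 
zero-length strings at either end.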



