Package: GNAT.Regpat

Description

GNAT.Regpat (files g-regpat.ads/g-regpat.adb) A complete implementation of Unix-style regular expression matching, derived from the regular expression library written in C by Henry Spencer. A regular expression is compiled into a byte-code Pattern_Matcher, which is then matched against strings.

Header

package GNAT.Regpat is
 
pragma Preelaborate (Regpat);

The grammar is the following:

   regexp ::= expr
          ::= ^ expr               -- anchor at the beginning of string
          ::= expr $               -- anchor at the end of string

   expr   ::= term
          ::= term | term          -- alternation (term or term ...)

   term   ::= item
          ::= item item ...        -- concatenation (item then item)

   item   ::= elmt                 -- match elmt
          ::= elmt *               -- zero or more elmt's
          ::= elmt +               -- one or more elmt's
          ::= elmt ?               -- matches elmt or nothing
          ::= elmt *?              -- zero or more times, minimum number
          ::= elmt +?              -- one or more times, minimum number
          ::= elmt ??              -- zero or one time, minimum number
          ::= elmt { num }         -- matches elmt exactly num times
          ::= elmt { num , }       -- matches elmt at least num times
          ::= elmt { num , num2 }  -- matches between num and num2 times
          ::= elmt { num }?        -- matches elmt exactly num times
          ::= elmt { num , }?      -- matches elmt at least num times
                                      non-greedy version
          ::= elmt { num , num2 }? -- matches between num and num2 times
                                      non-greedy version

   elmt   ::= nchr                 -- matches given character
          ::= [range range ...]    -- matches any character listed
          ::= [^ range range ...]  -- matches any character not listed
          ::= .                    -- matches any single character
                                   -- except newlines
          ::= ( expr )             -- parens used for grouping
          ::= \ num                -- reference to num-th parenthesis

   range  ::= char - char          -- matches chars in given range
          ::= nchr
          ::= [: posix :]          -- any character in the POSIX range
          ::= [:^ posix :]         -- not in the POSIX range

   posix  ::= alnum                -- alphanumeric characters
          ::= alpha                -- alphabetic characters
          ::= ascii                -- ascii characters (0 .. 127)
          ::= cntrl                -- control chars (0..31, 127..159)
          ::= digit                -- digits ('0' .. '9')
          ::= graph                -- graphic chars (32..126, 160..255)
          ::= lower                -- lower case characters
          ::= print                -- printable characters (32..127)
          ::= punct                -- printable, except alphanumeric
          ::= space                -- space characters
          ::= upper                -- upper case characters
          ::= word                 -- alphanumeric characters
          ::= xdigit               -- hexadecimal chars (0..9, a..f)

   char   ::= any character, including special characters
              ASCII.NUL is not supported.

   nchr   ::= any character except \()[].*+?^ or \char to match char
              \n means a newline (ASCII.LF)
              \t means a tab (ASCII.HT)
              \r means a return (ASCII.CR)
              \b matches the empty string at the beginning or end of a
                 word. A word is defined as a set of alphanumerical
                 characters (see \w below).
              \B matches the empty string only when *not* at the
                 beginning or end of a word.
              \d matches any digit character ([0-9])
              \D matches any non digit character ([^0-9])
              \s matches any white space character. This is equivalent
                 to [ \t\n\r\f\v] (tab, form-feed, vertical-tab, ...)
              \S matches any non-white space character.
              \w matches any alphanumeric character or underscore.
                 This includes accented letters, as defined in the
                 package Ada.Characters.Handling.
              \W matches any non-alphanumeric character.
              \A matches the empty string only at the beginning of the
                 string, whatever flags are used for Compile (the
                 behavior of ^ can change, see Regexp_Flags below).
              \G matches the empty string only at the end of the
                 string, whatever flags are used for Compile (the
                 behavior of $ can change, see Regexp_Flags below).

   ...    ::= is used to indicate repetition (one or more terms)

Embedded newlines are not matched by the ^ operator. It is possible to retrieve the substring matched by a parenthesized expression. Although the depth of parentheses is not limited in the regexp, only the first 9 substrings can be retrieved.

The highest value possible for the arguments to the curly operator ({}) is given by the constant Max_Curly_Repeat below.

The operators '*', '+', '?' and '{}' always match the longest possible substring. They all have a non-greedy version (with an extra ? after the operator), which matches the shortest possible substring. For instance:

   regexp="<.*>"   string="<h1>title</h1>"   matches="<h1>title</h1>"
   regexp="<.*?>"  string="<h1>title</h1>"   matches="<h1>"

'{' and '}' are only considered as special characters if they appear in a substring that looks exactly like '{n}', '{n,m}' or '{n,}', where n and m are digits. No space is allowed. In other contexts, the curly braces will simply be treated as normal characters.

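The difference between the greedy and non-greedy operators can be checked with a small program. This is a minimal sketch: the procedure name Greedy_Demo and the sample string are assumptions made for illustration, and only the Match procedure and the Match_Array type documented on this page are used.

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with GNAT.Regpat;  use GNAT.Regpat;

procedure Greedy_Demo is
   --  Sample input, chosen for this sketch only.
   Data    : constant String := "<h1>title</h1>";

   --  Only index 0 (the whole match) is needed here.
   Matches : Match_Array (0 .. 0);
begin
   --  Greedy: '*' extends the match as far as possible.
   Match ("<.*>", Data, Matches);
   if Matches (0) /= No_Match then
      Put_Line ("greedy     : "
                & Data (Matches (0).First .. Matches (0).Last));
   end if;

   --  Non-greedy: '*?' stops at the first '>' that completes a match.
   Match ("<.*?>", Data, Matches);
   if Matches (0) /= No_Match then
      Put_Line ("non-greedy : "
                & Data (Matches (0).First .. Matches (0).Last));
   end if;
end Greedy_Demo;
```

According to the description above, the first call should report the whole string and the second only the opening tag.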

Exceptions

Expression_Error
This exception is raised when trying to compile an invalid regular expression. All subprograms taking an expression as parameter may raise Expression_Error.
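A short sketch of how a caller might guard against an invalid expression. The procedure name Error_Demo and the deliberately malformed pattern are assumptions for illustration.

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with GNAT.Regpat;  use GNAT.Regpat;

procedure Error_Demo is
begin
   declare
      --  The opening parenthesis is never closed, so Compile is
      --  expected to reject this expression.
      Re : constant Pattern_Matcher := Compile ("a(b");
   begin
      Put_Line ("compiled, size" & Program_Size'Image (Re.Size));
   end;
exception
   when Expression_Error =>
      --  An exception raised while elaborating the inner block's
      --  declarations propagates to this handler.
      Put_Line ("invalid regular expression");
end Error_Demo;
```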

Type Summary

Match_Array
Match_Location
Pattern_Matcher
Primitive Operations:  Compile, Compile, Compile, Dump, Match, Match, Paren_Count
Program_Size
Regexp_Flags

Constants and Named Numbers

Case_Insensitive : constant Regexp_Flags;
The automaton is optimized so that the matching is done in a case insensitive manner (upper case characters and lower case characters are all treated the same way).
Max_Curly_Repeat : constant := 32767;
Maximum number of repetitions for the curly operator. The numbers in the {n}, {n,} and {n,m} operators cannot be higher than this constant, since they have to fit in two characters in the byte-compiled version of regular expressions.
Max_Parenthesis : constant := 255;
Maximum number of parenthesis pairs in a regular expression. This is limited by the size of a Character, as found in the byte-compiled version of regular expressions.
Max_Program_Size : constant := 2**15 - 1;
Maximum size that can be allocated for a program.
Multiple_Lines : constant Regexp_Flags;
Treat the Data as multiple lines. This means that ^ and $ will also match on internal newlines (ASCII.LF), in addition to the beginning and end of the string.

This can be combined with Single_Line.

No_Flags : constant Regexp_Flags;
No_Match : constant Match_Location := (First => 0, Last => 0);
The No_Match constant is (0, 0) to differentiate between matching a null string at position 1, which uses (1, 0), and no match at all.
Single_Line : constant Regexp_Flags;
Treat the Data we are matching as a single line. This means that ^ and $ will ignore \n (unless Multiple_Lines is also specified), and that '.' will match \n.
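Since Regexp_Flags is a modular type, flags can be combined with the "or" operator. The following is a minimal sketch, with Flags_Demo and the sample data being assumptions for illustration; it compiles the same expression with and without flags and prints whether each version matches.

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with GNAT.Regpat;  use GNAT.Regpat;

procedure Flags_Demo is
   --  Two lines of sample text separated by a line feed.
   Data : constant String := "Hello" & ASCII.LF & "WORLD";

   --  Compiled with the default No_Flags.
   Plain : constant Pattern_Matcher := Compile ("^world$");

   --  Case insensitive, with ^ and $ also matching at internal
   --  newlines.
   Both  : constant Pattern_Matcher :=
     Compile ("^world$", Case_Insensitive or Multiple_Lines);

   --  The Boolean variant of Match is selected by the expected type.
   Plain_Hit : constant Boolean := Match (Plain, Data);
   Both_Hit  : constant Boolean := Match (Both, Data);
begin
   Put_Line ("No_Flags : " & Boolean'Image (Plain_Hit));
   Put_Line ("combined : " & Boolean'Image (Both_Hit));
end Flags_Demo;
```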

Other Items:

type Program_Size is range 0 .. Max_Program_Size;
for Program_Size'Size use 16;
Number of bytes allocated for the byte-compiled version of a regular expression.

type Regexp_Flags is mod 256;
for Regexp_Flags'Size use 8;
Flags that can be given at compile time to specify default properties for the regular expression.

subtype Match_Count is Natural range 0 .. Max_Parenthesis;
Match_Array

type Match_Location is record
   First   : Natural := 0;
   Last    : Natural := 0;
end record;

type Match_Array is array (Match_Count range <>) of Match_Location;
The substring matching a given pair of parentheses. Index 0 is the whole substring that matched the full regular expression. For instance, if your regular expression is something like "a(b*)(c+)", then Match_Array (1) will be the indexes of the substring that matched "b*" and Match_Array (2) will be the indexes of the substring that matched "c+".

The number of parenthesis groups that can be retrieved is limited only by the range of Match_Count (0 .. Max_Parenthesis), and all the Match subprograms below can use a Match_Array of any size within that range. Indexes that do not have any matching parenthesis are set to No_Match.
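As an illustration of how a Match_Array is read back, here is a minimal sketch. Groups_Demo and the sample data are assumptions; the expression is the one used in the description above.

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with GNAT.Regpat;  use GNAT.Regpat;

procedure Groups_Demo is
   Re      : constant Pattern_Matcher := Compile ("a(b*)(c+)");
   Data    : constant String := "xxabbbccyy";

   --  One slot for the whole match plus one per parenthesis pair.
   Matches : Match_Array (0 .. Paren_Count (Re));
begin
   Match (Re, Data, Matches);

   if Matches (0) = No_Match then
      Put_Line ("no match");
   else
      for I in Matches'Range loop
         --  Groups that did not take part in the match are left as
         --  No_Match and must not be used to slice Data.
         if Matches (I) /= No_Match then
            Put_Line (Match_Count'Image (I) & " => "
                      & Data (Matches (I).First .. Matches (I).Last));
         end if;
      end loop;
   end if;
end Groups_Demo;
```

A group that matched an empty string has Last = First - 1, so the slice is simply empty.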


type Pattern_Matcher (Size : Program_Size) is private;
Pattern_Matcher Creation

function Compile (Expression : String;
                  Flags      : Regexp_Flags := No_Flags)
                 return       Pattern_Matcher;
Compile a regular expression into internal code. Raises Expression_Error if Expression is not a legal regular expression. The appropriate size is calculated automatically, but this means that the regular expression has to be compiled twice (the first time to calculate the size, the second time to actually generate the byte code).

Flags is the default value to use to set properties for Expression (case sensitivity,...).


procedure Compile
  (Matcher         : out Pattern_Matcher;
   Expression      : String;
   Final_Code_Size : out Program_Size;
   Flags           : Regexp_Flags := No_Flags);
Compile a regular expression into internal code. This procedure is significantly faster than the function Compile, as there is a known maximum size for the matcher. This procedure raises Storage_Error if Matcher is too small to hold the resulting code, or Expression_Error if Expression is not a legal regular expression.

Flags is the default value to use to set properties for Expression (case sensitivity,...).


procedure Compile (Matcher    : out Pattern_Matcher;
                   Expression : String;
                   Flags      : Regexp_Flags := No_Flags);
Same procedure as above, except that it does not return the final program size.
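The procedure form of Compile works on a matcher whose size is fixed by the caller. A minimal sketch follows; Compile_Demo, the size 1024 and the sample data are all assumptions chosen for illustration.

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with GNAT.Regpat;  use GNAT.Regpat;

procedure Compile_Demo is
   --  1024 is an arbitrary upper bound chosen for this sketch.
   Matcher : Pattern_Matcher (1024);
   Used    : Program_Size;
begin
   --  Single compilation pass into the preallocated matcher.
   Compile (Matcher, "a(b*)(c+)", Used);
   Put_Line ("byte code size:" & Program_Size'Image (Used));

   if Match (Matcher, "xxabbbccyy") then
      Put_Line ("matched");
   end if;
exception
   when Expression_Error =>
      Put_Line ("invalid regular expression");
   when Storage_Error =>
      Put_Line ("Matcher is too small for this expression");
end Compile_Demo;
```

Final_Code_Size can be used to pick a tighter discriminant once the expression is known.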

function Paren_Count (Regexp : Pattern_Matcher) return Match_Count;
Return the number of parenthesis pairs in Regexp. This is the maximum index that will be filled if a Match_Array is used as an argument to Match. Thus, if you want to be sure to get all the parenthesis pairs, you should do something like:

   declare
      Regexp  : Pattern_Matcher := Compile ("a(b*)(c+)");
      Matched : Match_Array (0 .. Paren_Count (Regexp));
   begin
      Match (Regexp, "a string", Matched);
   end;

function Quote (Str : String) return String;
Return a version of Str so that every special character is quoted. The resulting string can be used in a regular expression to match exactly Str, whatever character was present in Str.
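Quote is useful when text that is not under the programmer's control has to be matched literally. A minimal sketch, in which Quote_Demo and both sample strings are assumptions for illustration:

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with GNAT.Regpat;  use GNAT.Regpat;

procedure Quote_Demo is
   --  Text containing several regexp operators: + ( ? )
   Literal : constant String := "1+1=2 (really?)";
   Data    : constant String := "proof: 1+1=2 (really?)";

   --  Quoted: every special character is escaped, so the pattern
   --  stands for the literal text itself.
   Quoted_Hit : constant Boolean := Match (Quote (Literal), Data);

   --  Unquoted: '+', '?' and the parentheses act as operators.
   Raw_Hit    : constant Boolean := Match (Literal, Data);
begin
   Put_Line ("quoted pattern : " & Quote (Literal));
   Put_Line ("quoted match   : " & Boolean'Image (Quoted_Hit));
   Put_Line ("raw match      : " & Boolean'Image (Raw_Hit));
end Quote_Demo;
```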

procedure Match
  (Expression     : String;
   Data           : String;
   Matches        : out Match_Array;
   Size           : Program_Size := 0);
Match Expression against Data and store the result in Matches. This procedure raises Storage_Error if Size is too small for Expression, or Expression_Error if Expression is not a legal regular expression. If Size is 0, then the appropriate size is automatically calculated by this package, but this is slightly slower.

At most Matches'Length parenthesis pairs are returned.


function  Match
  (Expression : String;
   Data       : String;
   Size       : Program_Size := 0)
   return       Natural;
Return the position where Data matches, or (Data'First - 1) if there is no match. This function raises Storage_Error if Size is too small for Expression, or Expression_Error if Expression is not a legal regular expression. If Size is 0, then the appropriate size is automatically calculated by this package, but this is slightly slower.

function Match
  (Expression : String;
   Data       : String;
   Size       : Program_Size := 0)
   return       Boolean;
Return True if Data matches Expression. Match raises Storage_Error if Size is too small for Expression, or Expression_Error if Expression is not a legal regular expression.

If Size is 0, then the appropriate size is automatically calculated by this package, but this is slightly slower.
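The two function forms that take an Expression string are convenient for one-off tests, since no Pattern_Matcher has to be declared. A minimal sketch, with Position_Demo and the sample data as assumptions for illustration:

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with GNAT.Regpat;  use GNAT.Regpat;

procedure Position_Demo is
   Data : constant String := "hello world";
   Pos  : Natural;
begin
   --  The Natural variant is selected because Pos is a Natural.
   Pos := Match ("wor", Data);

   if Pos = Data'First - 1 then
      Put_Line ("no match");
   else
      Put_Line ("match at" & Natural'Image (Pos));
   end if;

   --  The Boolean variant is selected by the "if" condition.
   if Match ("^hello", Data) then
      Put_Line ("Data starts with hello");
   end if;
end Position_Demo;
```

When the same expression is matched repeatedly, compiling it once into a Pattern_Matcher avoids recompiling it on every call.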


function  Match
  (Self : Pattern_Matcher;
   Data : String)
   return Natural;
Return the position where Data matches, or (Data'First - 1) if there is no match. Raises Expression_Error if Expression is not a legal regular expression.

pragma Inline (Match);
This applies to all the Match subprograms except the last one below.

procedure Match
  (Self    : Pattern_Matcher;
   Data    : String;
   Matches : out Match_Array);
Match Data using the given pattern matcher and store the result in Matches. Raises Expression_Error if Expression is not a legal regular expression. The expression matches if Matches (0) /= No_Match.

At most Matches'Length parenthesis pairs are returned.


procedure Dump (Self : Pattern_Matcher);
Dump the compiled version of the regular expression matched by Self.

private

   --  Implementation-defined ...
end GNAT.Regpat;