The Regexp::Grammars documentation states the problem the module addresses:
...Perl 5.10 makes it possible to use regexes to recognize complex, hierarchical--and even recursive--textual structures. The problem is that Perl 5.10 doesn’t provide any support for extracting that hierarchical data into nested data structures. In other words, using Perl 5.10 you can match complex data, but not parse it into an internally useful form.
An additional problem when using Perl 5.10 regexes to match complex data formats is that you have to make sure you remember to insert whitespace-matching constructs (such as \s*) at every possible position where the data might contain ignorable whitespace. This reduces the readability of such patterns, and increases the chance of errors (typically caused by overlooking a location where whitespace might appear).
The Regexp::Grammars module solves both those problems.
If you import the module into a particular lexical scope, it preprocesses any regex in that scope, so as to implement a number of extensions to the standard Perl 5.10 regex syntax. These extensions simplify the task of defining and calling subrules within a grammar, and allow those subrule calls to capture and retain the components they match in a proper hierarchical manner.
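To make the first problem in the quoted passage concrete, the following sketch uses only core Perl 5.10 features (a named group and the recursive call (?&pp)) and no module. The sample input and the variable names are illustrative assumptions. The pattern recognizes nested brackets, but all Perl hands back is the flat string matched by the outermost group:

```perl
use strict;
use warnings;
use 5.010;

# Core Perl 5.10 only: a recursive pattern for balanced parentheses.
my $balanced = qr{
    (?<pp>                          # named group, re-entered via (?&pp)
        \(
        (?: [^()]++ | (?&pp) )*     # plain characters or a nested group
        \)
    )
}x;

my $input = '(2*(3+5))*4+(2-3)';

my @matches;
while ($input =~ m{$balanced}g) {
    push @matches, $+{pp};          # only the outermost match is available
}

say for @matches;
```

The inner (3+5) is matched during the recursion, but nothing records it as a child of the outer match. That nesting is what Regexp::Grammars adds.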
Regexp::Grammars regular expressions extend Perl 5.10 regexes. The syntax is both expanded and modified:
A Regexp::Grammars specification consists of a pattern (which may include both standard Perl 5.10 regex syntax, as well as special Regexp::Grammars directives), followed by one or more rule or token definitions.
An example follows:
    pl@nereida:~/Lregexpgrammars/demo$ cat balanced_brackets.pl
    use strict;
    use warnings;
    use 5.010;
    use Data::Dumper;

    my $rbb = do {
        use Regexp::Grammars;
        qr{
          (<pp>)

          <rule: pp>   \( (?: [^()]*+ | <escape> | <pp> )* \)

          <token: escape> \\.

        }xs;
    };

    while (my $input = <>) {
        while ($input =~ m{$rbb}g) {
            say("matches: <$&>");
            say Dumper \%/;
        }
    }
Note that there is no need to explicitly place \s* subpatterns throughout the rules; that is taken care of automatically.
...
The initial pattern ((<pp>)) acts like the top rule of the grammar, and must be matched completely for the grammar to match.
The rules and tokens are declarations only and they are not directly matched. Instead, they act like subroutines, and are invoked by name from the initial pattern (or from within a rule or token).
Each rule or token extends from the directive that introduces it up to either the next rule or token directive, or (in the case of the final rule or token) to the end of the grammar.
Given the input (2*(3+5))*4+(2-3), the program produces:
    pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 balanced_brackets.pl
    (2*(3+5))*4+(2-3)
    matches: <(2*(3+5))>
    $VAR1 = {
              '' => '(2*(3+5))',
              'pp' => {
                        '' => '(2*(3+5))',
                        'pp' => '(3+5)'
                      }
            };
    matches: <(2-3)>
    $VAR1 = {
              '' => '(2-3)',
              'pp' => '(2-3)'
            };
Each rule calls the subrules specified within it, and then returns a hash containing whatever result each of those subrules returned, with each result indexed by the subrule’s name.
In this way, each level of the hierarchical regex can generate hashes recording everything its own subrules matched, so when the entire pattern matches, it produces a tree of nested hashes that represent the structured data the pattern matched.
...
In addition each result-hash has one extra key: the empty string. The value for this key is whatever string the entire subrule call matched.
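As a sketch of how that tree can be read back, the snippet below reuses the balanced-brackets grammar shown earlier and copies %/ into an ordinary hash. It assumes Regexp::Grammars is installed; %tree and the fixed input string are illustrative choices, not part of the original program:

```perl
use strict;
use warnings;
use 5.010;

my $rbb = do {
    use Regexp::Grammars;
    qr{
        (<pp>)

        <rule: pp>   \( (?: [^()]*+ | <escape> | <pp> )* \)

        <token: escape> \\.
    }xs;
};

my %tree;
if ('(2*(3+5))' =~ $rbb) {
    %tree = %/;     # copy at once: a later grammar match overwrites %/
}

say "whole match: $tree{''}";       # the '' key holds the matched text
say "outer pp:    $tree{pp}{''}";   # nested hash returned by <pp>
say "inner pp:    $tree{pp}{pp}";   # innermost <pp> result
```

Each level is reached by indexing with the subrule name, and the empty-string key at any level gives the text that level matched.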
The difference between a token and a rule is that a token treats any whitespace within it exactly as a normal Perl regular expression would. That is, a sequence of whitespace in a token is ignored if the /x modifier is in effect, or else matches the same literal sequence of whitespace characters (if /x is not in effect).
In the previous example the behaviour is the same if the definition of the escape token is rewritten as a rule:
    <rule: escape> \\.
This next example shows that the difference between token and rule can be significant:
    pl@nereida:~/Lregexpgrammars/demo$ cat tokenvsrule.pl
    use strict;
    use warnings;
    use 5.010;
    use Data::Dumper;

    my $rbb = do {
        use Regexp::Grammars;
        qr{
          <s>

          <rule: s> <a> <c>

          <rule: c> c d

          <token: a> a b

        }xs;
    };

    while (my $input = <>) {
        if ($input =~ m{$rbb}) {
            say("matches: <$&>");
            say Dumper \%/;
        }
        else {
            say "Does not match";
        }
    }
Running this program shows the difference in how whitespace is interpreted:
    pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 tokenvsrule.pl
    ab c d
    matches: <ab c d>
    $VAR1 = {
              '' => 'ab c d',
              's' => {
                       '' => 'ab c d',
                       'c' => 'c d',
                       'a' => 'ab'
                     }
            };
    a b c d
    Does not match
    ab cd
    matches: <ab cd>
    $VAR1 = {
              '' => 'ab cd',
              's' => {
                       '' => 'ab cd',
                       'c' => 'cd',
                       'a' => 'ab'
                     }
            };
Note how the input a b c d is rejected while the input ab c d is accepted.
In a rule, any sequence of whitespace (except those at the very start and the very end of the rule) is treated as matching the implicit subrule <.ws>, which is automatically predefined to match optional whitespace (i.e. \s*).
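As a rough mental model of this behaviour (a hand-written approximation in core Perl, not the code the module actually generates), a token body under /x behaves like the plain regex, while a rule body behaves as if \s* had been written wherever whitespace appears. The names $token_like, $rule_like and $s_like are hypothetical:

```perl
use strict;
use warnings;
use 5.010;

# <token: a> a b   under /x: the space is ignored, so this is just 'ab'.
my $token_like = qr{ a b }x;

# <rule: c> c d    approximated by hand: the space becomes optional whitespace.
my $rule_like = qr{ c \s* d }x;

# <rule: s> <a> <c>   approximated the same way, anchored to the whole input.
my $s_like = qr{ \A $token_like \s* $rule_like \s* \z }x;

my %verdict;
for my $input ('ab c d', 'a b c d', 'ab cd') {
    $verdict{$input} = $input =~ $s_like ? 'matches' : 'Does not match';
    say "$input: $verdict{$input}";
}
```

The three inputs are the ones used in the tokenvsrule.pl run, and the approximation accepts and rejects the same ones.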
You can explicitly define a <ws> token to change that default behaviour. For example, you could alter the definition of whitespace to include Perlish comments, by adding an explicit <token: ws>:
<token: ws> (?: \s+ | #[^\n]* )*
But be careful not to define <ws> as a rule, as this will lead to all kinds of infinitely recursive unpleasantness.
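A whitespace definition of this kind can be tried out in plain Perl before it goes into a grammar. The sketch below is a stand-in under stated assumptions: $ws is an ordinary regex rather than a <token: ws>, the sample text is invented, and the # is backslash-escaped because under /x an unescaped # outside a character class starts a regex comment:

```perl
use strict;
use warnings;
use 5.010;

# Whitespace that may also contain Perl-style comments running to end of line.
my $ws = qr{ (?: \s+ | \# [^\n]* )*+ }x;

my $text = "ab   # a comment\n  cd";

# 'ab' and 'cd' separated by anything $ws accepts.
my $ok = $text =~ m{ \A ab $ws cd \z }x;

say $ok ? 'matches' : 'Does not match';
```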
The following example illustrates how to redefine <ws>:
    pl@nereida:~/Lregexpgrammars/demo$ cat tokenvsruleandws.pl
    use strict;
    use warnings;
    use 5.010;
    use Data::Dumper;

    my $rbb = do {
        use Regexp::Grammars;
        no warnings 'uninitialized';
        qr{
          <s>

          <token: ws> (?: \s+ | /\* .*? \*/)*+

          <rule: s> <a> <c>

          <rule: c> c d

          <token: a> a b

        }xs;
    };

    while (my $input = <>) {
        if ($input =~ m{$rbb}) {
            say Dumper \%/;
        }
        else {
            say "Does not match";
        }
    }
Now we can put comments in the input:
    pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 -w tokenvsruleandws.pl
    ab /* 1 */ c d
    $VAR1 = {
              '' => 'ab /* 1 */ c d',
              's' => {
                       '' => 'ab /* 1 */ c d',
                       'c' => 'c d',
                       'a' => 'ab'
                     }
            };
To invoke a rule to match at any point, just enclose the rule’s name in angle brackets (like in Perl 6). There must be no space between the opening bracket and the rulename. For example:
    qr{
        file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
        <name>            # Call <rule: name>
        <options>?        # Call <rule: options> (it's okay if it fails)

        <rule: name>
            # etc.
    }x;
If you need to match a literal pattern that would otherwise look like a subrule call, just backslash-escape the leading angle:
    qr{
        file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
        \<name>           # Match literal sequence '<' 'n' 'a' 'm' 'e' '>'
        <options>?        # Call <rule: options> (it's okay if it fails)

        <rule: name>
            # etc.
    }x;
The following program illustrates some of the points discussed in the quotation above:
    casiano@millo:~/src/perl/regexp-grammar-examples$ cat badbracket.pl
    use strict;
    use warnings;
    use 5.010;
    use Data::Dumper;

    my $rbb = do {
        use Regexp::Grammars;
        qr{
          (<pp>)

          <rule: pp>   \( (?: <b > | \< | < escape> | <pp> )* \)

          <token: b > b

          <token: escape> \\.

        }xs;
    };

    while (my $input = <>) {
        while ($input =~ m{$rbb}g) {
            say("matches: <$&>");
            say Dumper \%/;
        }
    }
Note the whitespace in < escape> and in <token: b > b. Despite this, the program works:
    casiano@millo:~/src/perl/regexp-grammar-examples$ perl5.10.1 badbracket.pl
    (\(\))
    matches: <(\(\))>
    $VAR1 = {
              '' => '(\\(\\))',
              'pp' => {
                        '' => '(\\(\\))',
                        'escape' => '\\)'
                      }
            };
    (b)
    matches: <(b)>
    $VAR1 = {
              '' => '(b)',
              'pp' => {
                        '' => '(b)',
                        'b' => 'b'
                      }
            };
    (<)
    matches: <(<)>
    $VAR1 = {
              '' => '(<)',
              'pp' => '(<)'
            };
    (c)
    casiano@millo:
...Note, however, that if the result-hash at any level contains only the empty-string key (i.e. the subrule did not call any sub-subrules or save any of their nested result-hashes), then the hash is unpacked and just the matched substring itself is returned.
For example, if <rule: sentence> had been defined:
    <rule: sentence> I see dead people
then a successful call to the rule would only add:
    sentence => 'I see dead people'
to the current result-hash.
This is a useful feature because it prevents a series of nested subrule calls from producing very unwieldy data structures. For example, without this automatic unpacking, even the simple earlier example:
<rule: sentence> <noun> <verb> <object>
would produce something needlessly complex, such as:
    sentence => {
        ""     => 'I saw a dog',
        noun   => {
            "" => 'I',
        },
        verb   => {
            "" => 'saw',
        },
        object => {
            ""      => 'a dog',
            article => {
                "" => 'a',
            },
            noun    => {
                "" => 'dog',
            },
        },
    }
The following example illustrates this point:
    pl@nereida:~/Lregexpgrammars/demo$ cat unaryproductions.pl
    use strict;
    use warnings;
    use 5.010;
    use Data::Dumper;

    my $rbb = do {
        use Regexp::Grammars;
        qr{
          <s>

          <rule: s> <noun> <verb> <object>

          <token: noun> he | she | Peter | Jane

          <token: verb> saw | sees

          <token: object> a\s+dog | a\s+cat

        }x;
    };

    while (my $input = <>) {
        while ($input =~ m{$rbb}g) {
            say("matches: <$&>");
            say Dumper \%/;
        }
    }
A run of the previous program follows:
    pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 unaryproductions.pl
    he saw a dog
    matches: <he saw a dog>
    $VAR1 = {
              '' => 'he saw a dog',
              's' => {
                       '' => 'he saw a dog',
                       'object' => 'a dog',
                       'verb' => 'saw',
                       'noun' => 'he'
                     }
            };
    Jane sees a cat
    matches: <Jane sees a cat>
    $VAR1 = {
              '' => 'Jane sees a cat',
              's' => {
                       '' => 'Jane sees a cat',
                       'object' => 'a cat',
                       'verb' => 'sees',
                       'noun' => 'Jane'
                     }
            };
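One consequence of this unpacking is that code walking a result tree has to accept both hash nodes and plain strings. The sketch below needs no module: it copies the first result from the run above into a literal hash and collects its leaves. The helper name leaves is hypothetical:

```perl
use strict;
use warnings;
use 5.010;

# Result tree copied from the run above (input: he saw a dog).
my $result = {
    ''  => 'he saw a dog',
    's' => {
        ''       => 'he saw a dog',
        'object' => 'a dog',
        'verb'   => 'saw',
        'noun'   => 'he',
    },
};

# Collect the leaf strings, skipping the '' key that repeats the whole match.
sub leaves {
    my ($node) = @_;
    return $node unless ref $node eq 'HASH';    # unpacked subrule: plain string
    return map  { leaves($node->{$_}) }
           grep { $_ ne '' }
           sort keys %$node;                    # sorted for a stable order
}

my @leaves = leaves($result);
say join ' | ', @leaves;
```

Keys are visited in sorted order here, so the leaves come out as noun, object, verb rather than in input order; a hash does not remember the order in which subrules matched.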
When Regexp::Grammars is used as part of a program that also uses other regexes, Regexp::Grammars must be kept from processing them. Regexp::Grammars rewrites regular expressions during a preprocessing phase, so it has the same limitations as any other form of source filtering (see perlfilter). It is therefore a good idea to declare the grammar inside a do block, which restricts the scope in which the module acts.
    my $calculator = do{
        use Regexp::Grammars;
        qr{
        . ........
        }xms
    };
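A small sketch of this scoping, assuming Regexp::Grammars is installed; the grammar and the names $grammar and $plain are made up for illustration:

```perl
use strict;
use warnings;
use 5.010;

my $grammar = do {
    use Regexp::Grammars;           # in effect only inside this do block
    qr{
        <greeting>

        <token: greeting> hello | hi
    }x;
};

# Outside the block the module is not in effect, so this is an ordinary
# regex that matches the literal text '<greeting>'.
my $plain = qr{<greeting>};

my $grammar_ok = 'hi' =~ $grammar;
my $greeting   = $grammar_ok ? $/{greeting} : undef;   # read %/ right away
my $plain_ok   = '<greeting>' =~ $plain;

say 'grammar: ', ($grammar_ok ? "matched '$greeting'" : 'no match');
say 'plain:   ', ($plain_ok ? 'matched literal text' : 'no match');
```

Inside the block <greeting> is a subrule call; outside it the same characters are just text to match.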