Lists

The positive closure operator

If a subrule call is quantified with a repetition specifier:

           <rule: file_sequence>
               <file>+

then each repeated match overwrites the corresponding entry in the surrounding rule’s result-hash, so only the result of the final repetition will be retained. That is, if the above example matched the string foo.pl bar.py baz.php, then the result-hash would contain:

           file_sequence {
               ""   => 'foo.pl bar.py baz.php',
               file => 'baz.php',
           }
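
Perl's built-in capture groups behave the same way under repetition, which may make this easier to picture. The following is only a sketch using a core Perl regex, not Regexp::Grammars; the variable names are illustrative.

```perl
use strict;
use warnings;

my $input = 'foo.pl bar.py baz.php';

# A quantified capture group keeps only its last iteration,
# much as <file>+ keeps only the final file entry.
my ($last) = $input =~ /^(?:(\S+)\s*)+$/;

print "last capture: $last\n";
```

Here the group matches three times, but only the text of the third iteration survives in the capture.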

List operators and whitespace

There is a caveat concerning repetition operators and whitespace handling. Consider the following program:

pl@nereida:~/Lregexpgrammars/demo$ cat numbers3.pl
use strict;
use warnings;
use 5.010;
use Data::Dumper;

# Load Regexp::Grammars inside a do-block to confine its effect to this regex
my $rbb = do {
    use Regexp::Grammars;

    qr{
      <numbers>

      <rule: numbers>
        (<number>)+

      <token: number> \s*\d+
    }xms;
};

# Try to match each input line and dump the result-hash %/
while (my $input = <>) {
    if ($input =~ m{$rbb}) {
        say("matches: <$&>");
        say Dumper \%/;
    }
}

Note the explicit whitespace, \s*\d+, in the definition of number.

A sample run follows:

pl@nereida:~/Lregexpgrammars/demo$ perl5_10_1 numbers3.pl
1 2 3 4
matches: <1 2 3 4>
$VAR1 = {
          '' => '1 2 3 4',
          'numbers' => {
                         '' => '1 2 3 4',
                         'number' => ' 4'
                       }
        };

If the whitespace is removed from the definition of number:

pl@nereida:~/Lregexpgrammars/demo$ cat numbers.pl
use strict;
use warnings;
use 5.010;
use Data::Dumper;

# Same as numbers3.pl, except that the number token no longer matches leading whitespace
my $rbb = do {
    use Regexp::Grammars;

    qr{
      <numbers>

      <rule: numbers>
        (<number>)+

      <token: number> \d+
    }xms;
};

while (my $input = <>) {
    if ($input =~ m{$rbb}) {
        say("matches: <$&>");
        say Dumper \%/;
    }
}

the resulting behaviour may be surprising:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 numbers.pl 
12 34 56
matches: <12>
$VAR1 = {
          '' => '12',
          'numbers' => {
                         '' => '12',
                         'number' => '12'
                       }
        };

The explanation is in the documentation; see the Grammar Syntax section:

<rule: IDENTIFIER>

Define a rule whose name is specified by the supplied identifier.

Everything following the <rule:...> directive (up to the next <rule:...> or <token:...> directive) is treated as part of the rule being defined.

Any whitespace in the rule is replaced by a call to the <.ws> subrule (which defaults to matching \s*, but may be explicitly redefined).

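In numbers.pl there is no whitespace between <number> and the closing parenthesis, so no <.ws> call is inserted inside the repetition; and since number is a token, it does not skip whitespace itself. The match therefore stops after 12. Besides putting \s* in the token, as numbers3.pl does, the problem can also be solved by writing an explicit space inside the positive closure: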

      <rule: numbers>
        (<number> )+

      <token: number> \d+
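
The effect of the missing whitespace can be approximated with hand-written core regexes. This is only a sketch: the two patterns below are assumed stand-ins for the two grammars, not what Regexp::Grammars actually generates.

```perl
use strict;
use warnings;

my $input = "12 34 56\n";

# Roughly (<number>)+ with <token: number> \d+ : nothing skips the blanks
my ($without_ws) = $input =~ /((?:\d+)+)/;

# Roughly (<number> )+ : the blank inside the group acts like \s*
my ($with_ws) = $input =~ /((?:\d+\s*)+)/;

print "without: <$without_ws>\n";
print "with:    <$with_ws>\n";
```

The first pattern stops at 12, while the second consumes the whole line, including the trailing newline.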

A solution to the problem of remembering the results of a list: using brackets

Usually, retaining only the final repetition is not the desired outcome, so Regexp::Grammars provides another mechanism by which to call a subrule; one that saves all repetitions of its results.

A regular subrule call consists of the rule’s name surrounded by angle brackets. If, instead, you surround the rule’s name with <[...]> (angle and square brackets) like so:

           <rule: file_sequence>
               <[file]>+

then the rule is invoked in exactly the same way, but the result of that submatch is pushed onto an array nested inside the appropriate result-hash entry. In other words, if the above example matched the same foo.pl bar.py baz.php string, the result-hash would contain:

           file_sequence {
               ""   => 'foo.pl bar.py baz.php',
               file => [ 'foo.pl', 'bar.py', 'baz.php' ],
           }

Given what was said above about whitespace inside quantifiers, whitespace must be placed inside the repetition operator:

pl@nereida:~/Lregexpgrammars/demo$ cat numbers4.pl
use strict;
use warnings;
use 5.010;
use Data::Dumper;

# <[number]> collects every repetition; the blanks inside (?: ... ) become <.ws> calls
my $rbb = do {
    use Regexp::Grammars;

    qr{
      <numbers>

      <rule: numbers>
        (?:  <[number]> )+

      <token: number> \d+
    }xms;
};

while (my $input = <>) {
    if ($input =~ m{$rbb}) {
        say("matches: <$&>");
        say Dumper \%/;
    }
}

Running this program gives:

pl@nereida:~/Lregexpgrammars/demo$ perl5_10_1 numbers4.pl
1 2 3 4
matches: <1 2 3 4
>
$VAR1 = {
          '' => '1 2 3 4
',
          'numbers' => {
                         '' => '1 2 3 4
',
                         'number' => [ '1', '2', '3', '4' ]
                       }
        };
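
Once the repetitions are kept in an array, they can be processed like any other Perl list. A minimal sketch, assuming a result-hash shaped like the dump above: the %result literal below is a hand-copied stand-in for %/, not produced by running the grammar.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hand-copied stand-in for the %/ dump shown above (an assumption of this sketch)
my %result = (
    numbers => {
        ''     => "1 2 3 4\n",
        number => [ '1', '2', '3', '4' ],
    },
);

# The number entry is an array reference holding every repetition
my @numbers = @{ $result{numbers}{number} };
printf "%d numbers, total %d\n", scalar @numbers, sum(@numbers);
```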

Another way to resolve name collisions: saving them in a list

This listifying subrule call can also be useful for non-repeated subrule calls, if the same subrule is invoked in several places in a grammar. For example, if a command-line option could be given either one or two values, you might parse it like this:

    <rule: size_option>   
        -size <[size]> (?: x <[size]> )?

The result-hash entry for size would then always contain an array, with either one or two elements, depending on the input being parsed.

An example follows:

pl@nereida:~/Lregexpgrammars/demo$ cat sizes.pl
use strict;
use warnings;
use 5.010;
use Data::Dumper;

# Both calls to <[size]> push onto the same array in the size_option result
my $rbb = do {
    use Regexp::Grammars;

    qr{
      <command>

      <rule: command> ls <size_option>

      <rule: size_option>
          -size <[size]> (?: x <[size]> )?

      <token: size> \d+
    }x;
};

while (my $input = <>) {
    while ($input =~ m{$rbb}g) {
        say("matches: <$&>");
        say Dumper \%/;
    }
}

Let us see how it behaves with different inputs:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 sizes.pl
ls -size 4
matches: <ls -size 4
>
$VAR1 = {
          '' => 'ls -size 4
',
          'command' => {
                         'size_option' => {
                                            '' => '-size 4
',
                                            'size' => [ '4' ]
                                          },
                         '' => 'ls -size 4
'
                       }
        };

ls -size 2x8
matches: <ls -size 2x8
>
$VAR1 = {
          '' => 'ls -size 2x8
',
          'command' => {
                         'size_option' => {
                                            '' => '-size 2x8
',
                                            'size' => [ '2', '8' ]
                                          },
                         '' => 'ls -size 2x8
'
                       }
        };
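
Because size is always an array reference, calling code does not need a special case for the one-value form. A minimal sketch, assuming result entries shaped like the dumps above; describe_sizes is a hypothetical helper, not part of Regexp::Grammars.

```perl
use strict;
use warnings;

# Hypothetical helper: formats the size entry of a size_option result
sub describe_sizes {
    my ($size_option) = @_;
    my @sizes = @{ $size_option->{size} };   # always an array reference
    return scalar(@sizes) . ' value(s): ' . join(' x ', @sizes);
}

# Hand-written stand-ins for the size_option entries dumped above
print describe_sizes({ size => ['4'] }), "\n";
print describe_sizes({ size => ['2', '8'] }), "\n";
```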

Aliasing lists

Listifying subrules can also be given aliases, just like ordinary subrules. The alias is always specified inside the square brackets:

    <rule: size_option>   
        -size <[size=pos_integer]> (?: x <[size=pos_integer]> )?

Here, the sizes are parsed using the pos_integer rule, but saved in the result-hash in an array under the key size.

An example follows:

pl@nereida:~/Lregexpgrammars/demo$ cat aliasedsizes.pl
use strict;
use warnings;
use 5.010;
use Data::Dumper;

# The values are parsed with the int token but stored under the key size
my $rbb = do {
    use Regexp::Grammars;

    qr{
      <command>

      <rule: command> ls <size_option>

      <rule: size_option>
          -size <[size=int]> (?: x <[size=int]> )?

      <token: int> \d+
    }x;
};

while (my $input = <>) {
    while ($input =~ m{$rbb}g) {
        say("matches: <$&>");
        say Dumper \%/;
    }
}

Here is the result of a run:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 aliasedsizes.pl
ls -size 2x4
matches: <ls -size 2x4
>
$VAR1 = {
          '' => 'ls -size 2x4
',
          'command' => {
                         'size_option' => {
                                            '' => '-size 2x4
',
                                            'size' => [
                                                        '2',
                                                        '4'
                                                      ]
                                          },
                         '' => 'ls -size 2x4
'
                       }
        };

Caveat: closures and warnings

In this example <number>+ appears without square brackets or parentheses:

pl@nereida:~/Lregexpgrammars/demo$ cat numbers5.pl
use strict;
use warnings;
use 5.010;
use Data::Dumper;

# The bare <number>+ below is what triggers the warning
my $rbb = do {
    use Regexp::Grammars;

    qr{
      <numbers>

      <rule: numbers>
        <number>+

      <token: number> \d+
    }xms;
};

while (my $input = <>) {
    if ($input =~ m{$rbb}) {
        say("matches: <$&>");
        say Dumper \%/;
    }
}

This program produces a warning message:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 numbers5.pl 
  warn | Repeated subrule <number>+ will only capture its final match
       | (Did you mean <[number]>+ instead?)
       |

To avoid the message, if you are willing to lose the values associated with all but the last element of the list, put the operand in parentheses (capturing or non-capturing).

This is what the documentation says about this warning:

Repeated subrule <rule> will only capture its final match

You specified a subrule call with a repetition qualifier, such as:

        <ListElem>*

or:

        <ListElem>+

Because each subrule call saves its result in a hash entry of the same name, each repeated match will overwrite the previous ones, so only the last match will ultimately be saved. If you want to save all the matches, you need to tell Regexp::Grammars to save the sequence of results as a nested array within the hash entry, like so:

        <[ListElem]>*

or:

        <[ListElem]>+

If you really did intend to throw away every result but the final one, you can silence the warning by placing the subrule call inside any kind of parentheses. For example:

        (<ListElem>)*

or:

        (?: <ListElem> )+

Casiano Rodríguez León
2012-05-22