Building the AST with Perl 5.10 Regular Expressions

In this section we build an infix-to-postfix translator using a general approach: we construct a representation of the Abstract Syntax Tree, or AST (see section 4.9, Abstract Syntax Tree, for a detailed definition of what a syntax tree is).

Because the application is somewhat more complex, we have split it into several files. This is the structure:

.
|-- ASTandtrans3.pl    # main program
|-- BinaryOp.pm        # classes for handling the AST nodes
|-- testreegxpparen.pl # test for Regexp::Paren
`-- Regexp
    `-- Paren.pm       # module extending $^N

The program's output can be divided into three parts. The first shows a rightmost derivation in reverse order:

pl@nereida:~/Lperltesting$ ./ASTandtrans3.pl
2*(3-4)
factor -> NUM(2)
factor -> NUM(3)
rt -> empty
term-> factor rt
factor -> NUM(4)
rt -> empty
term-> factor rt
re -> empty
re -> [+-] term re
exp -> term re
factor -> ( exp )
rt -> empty
rt -> [*/] factor rt
term-> factor rt
re -> empty
exp -> term re
matches: 2*(3-4)

Read from bottom to top, it gives a rightmost derivation of the string 2*(3-4):

exp => term re => term => factor rt => 
factor [*/](*) factor rt => factor [*/](*) factor => 
factor [*/](*) ( exp ) => factor [*/](*) ( term re ) =>  
factor [*/](*) ( term [+-](-) term re ) =>  
factor [*/](*) ( term [+-](-) term ) => 
factor [*/](*) ( term [+-](-) factor rt ) =>
factor [*/](*) ( term [+-](-) factor ) => 
factor [*/](*) ( term [+-](-) NUM(4) ) =>
factor [*/](*) ( factor rt [+-](-) NUM(4) ) => 
factor [*/](*) ( factor [+-](-) NUM(4) ) =>
factor [*/](*) ( NUM(3) [+-](-) NUM(4) )  => 
NUM(2) [*/](*) ( NUM(3) [+-](-) NUM(4) )

The second part shows the AST representation for the given input (2*(3-4)):

AST:
$VAR1 = bless( {
  'left' => bless( { 'val' => '2' }, 'NUM' ),
  'right' => bless( {
    'left' => bless( { 'val' => '3' }, 'NUM' ),
    'right' => bless( { 'val' => '4' }, 'NUM' ),
    'op' => '-'
  }, 'ADD' ),
  'op' => '*'
}, 'MULT' );

The last part of the output shows the postfix translation of the infix expression supplied as input (2*(3-4)):

2 3 4 - *

Main Program: Using the Attribute Stack

The original grammar we consider is left-recursive:

 exp    ->   exp [-+] term
           | term
 term   ->   term [*/] factor
           | factor
 factor ->  \( exp \)
           | \d+

Applying the techniques explained in section 4.8.2, the grammar can be transformed into one that is not left-recursive:

 exp       ->   term restoexp
 restoexp  ->   [-+] term restoexp
              | # empty
 term      ->   factor restoterm
 restoterm ->   [*/] factor restoterm
              | # empty
 factor    ->   \( exp \)
              | \d+
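
As a rough check that the transformed grammar can be handed to the Perl 5.10 regex engine, the following minimal sketch writes it as a recognizer only, with no semantic actions and no AST. The `\A ... \z` anchors, the optional groups used for the empty alternatives, the whitespace handling, and the helper name `is_expression` are assumptions made for this illustration.

```perl
use v5.10;
use strict;
use warnings;

# Recognizer only: each named subpattern mirrors one nonterminal
# of the non-left-recursive grammar. No attributes are computed.
my $recognizer = qr{
    \A (?&exp) \s* \z

    (?(DEFINE)
        (?<exp>       (?&term) (?&restoexp) )
        (?<restoexp>  (?: \s* [-+] (?&term) (?&restoexp) )? )
        (?<term>      (?&factor) (?&restoterm) )
        (?<restoterm> (?: \s* [*/] (?&factor) (?&restoterm) )? )
        (?<factor>    \s* (?: \d+ | \( (?&exp) \s* \) ) )
    )
}xms;

# Hypothetical helper: true when the whole string is an expression
sub is_expression {
    my ($text) = @_;
    return $text =~ $recognizer ? 1 : 0;
}

say "$_: ", (is_expression($_) ? "ok" : "rejected")
    for '2*(3-4)', '2 * ( 3 - 4 )', '2*(3-', '*2';
```

Every recursive call happens only after some input has been consumed, which is exactly what removing the left recursion buys us: the left-recursive version would recurse without advancing.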

However, transforming the grammar into an equivalent one is not enough. Our starting point is not a grammar but a translation scheme (see section 4.7) that builds the AST associated with the expression. Conceptually, our translation scheme looks like this:

 exp    ->   exp ([-+]) term       { ADD->new(left => $exp, right => $term, op => $1) }
           | term                  { $term }
 term   ->   term ([*/]) factor    { MULT->new(left => $term, right => $factor, op => $1) }
           | factor                { $factor }
 factor ->  \( exp \)              { $exp }
           | (\d+)                 { NUM->new(val => $1) }

What we want is a set of semantic actions for the non-left-recursive grammar that is equivalent to this scheme.

This is the resulting program once the transformations have been applied. We implement the association between symbols and attributes by hand, using an attribute stack:

pl@nereida:~/Lperltesting$ cat ./ASTandtrans3.pl
#!/usr/local/lib/perl/5.10.1/bin//perl5.10.1
use v5.10;
use strict;
use Regexp::Paren qw{g};
use BinaryOp;

use Data::Dumper;
$Data::Dumper::Indent = 1;

# Builds AST
my @stack;
my $regexp = qr{
    (?&exp)

    (?(DEFINE)
        (?<exp>    (?&term) (?&re)
                     (?{ say "exp -> term re" })
        )

        (?<re>     \s* ([+-]) (?&term)
                      (?{  # intermediate action
                          local our ($ch1, $term) = splice @stack, -2;

                          push @stack, ADD->new( {left => $ch1, right => $term, op => g(1)});
                      })
                   (?&re)
                     (?{ say "re -> [+-] term re" })
                 | # empty
                     (?{ say "re -> empty" })
        )

        (?<term>   ((?&factor)) (?&rt)
                      (?{
                          say "term-> factor rt";
                      })
        )

        (?<rt>     \s*([*/]) (?&factor)
                       (?{  # intermediate action
                            local our ($ch1, $ch2) = splice @stack, -2;

                            push @stack, MULT->new({left => $ch1, right => $ch2, op => g(1)});
                        })
                   (?&rt) # end of <rt> definition
                       (?{
                            say "rt -> [*/] factor rt"
                        })
                 | # empty
                       (?{ say "rt -> empty" })
        )

        (?<factor> \s* (\d+)
                        (?{
                           say "factor -> NUM($^N)";
                           push @stack, bless { 'val' => g(1) }, 'NUM';
                        })
                   | \s* \( (?&exp) \s* \)
                        (?{ say "factor -> ( exp )" })
        )
    )
}xms;

my $input = <>;
chomp($input);
if ($input =~ $regexp) {
  say "matches: $&";
  my $ast = pop @stack;
  say "AST:\n", Dumper $ast;

  say $ast->translate;
}
else {
  say "does not match";
}
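
The intermediate actions in `re` and `rt` run before the recursive call, so each operator is combined with what is already on the stack. This is what appears to make the resulting trees left-associative. The sketch below replays that stack discipline by hand for the input 8-3-2, outside any regex. The helper names `shift_num`, `reduce_binary` and `to_string`, and the use of plain hashes instead of the ADD and NUM classes, are assumptions made for this illustration.

```perl
use v5.10;
use strict;
use warnings;

my @stack;   # attribute stack, as in ASTandtrans3.pl

# Hypothetical stand-in for the action of  factor -> NUM
sub shift_num {
    my ($value) = @_;
    push @stack, { val => $value };
}

# Hypothetical stand-in for the intermediate action of  re -> [+-] term re
sub reduce_binary {
    my ($op) = @_;
    my ($left, $right) = splice @stack, -2;   # the two topmost attributes
    push @stack, { left => $left, right => $right, op => $op };
}

# Fully parenthesized rendering, to make the tree shape visible
sub to_string {
    my ($node) = @_;
    return $node->{val} if exists $node->{val};
    return '(' . to_string($node->{left}) . " $node->{op} " . to_string($node->{right}) . ')';
}

# Order in which the actions fire while matching "8-3-2"
shift_num(8);
shift_num(3);
reduce_binary('-');   # runs before the recursive call to re
shift_num(2);
reduce_binary('-');

my $tree = pop @stack;
say to_string($tree);   # ((8 - 3) - 2)
```

Had the reduction been postponed until after the recursive call, the second subtraction would have been combined first, giving 8 - (3 - 2).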

The Classes Representing the AST

Each node of the AST is an object. The class of the node tells us what kind of node it is. Thus nodes of class MULT group the multiplication and division nodes, and nodes of class ADD group the addition and subtraction nodes. The general procedure is to associate a translate method with each node class. This gives the polymorphism we need: each node class knows how to translate itself, and the translate method of each class can be written in two steps:

1. Obtain the results of calling $child->translate for each child node $child. For example, if the node were an IF_ELSE node of a hypothetical programming language, the translate method would be called on its three children boolexpr, ifstatement and elsestatement.
2. Combine the results to produce the appropriate translation of the current node.

It is this combination step that varies most with the type of node. Thus, for the IF_ELSE node, the pseudocode for the translation would look something like this:

my $self = shift;
my $etiqueta1 = generar_nueva_etiqueta;
my $etiqueta2 = generar_nueva_etiqueta;

my $boolexpr      = $self->boolexpr->translate;
my $ifstatement   = $self->ifstatement->translate;
my $elsestatement = $self->elsestatement->translate;
return << "ENDTRANS";
    $boolexpr
    JUMPZERO $etiqueta1:
    $ifstatement
    JUMP     $etiqueta2:
  $etiqueta1:
    $elsestatement
  $etiqueta2:
ENDTRANS
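
To make the two steps concrete, here is a minimal runnable sketch of the IF_ELSE case. Everything beyond the names used in the pseudocode is hypothetical: the `CODE` leaf class, the label format `L1`, `L2`, ..., and the target instructions `LOAD` and `PRINT` are invented for this illustration, and jump targets are written without the trailing colon.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical label generator: returns L1, L2, ...
my $label_count = 0;
sub generar_nueva_etiqueta { return 'L' . ++$label_count }

# Hypothetical leaf node whose translation is a fixed line of target code
package CODE;
sub new       { my ($class, $text) = @_; return bless { text => $text }, $class }
sub translate { return $_[0]{text} }

package IF_ELSE;
sub new { my ($class, %kids) = @_; return bless { %kids }, $class }

sub translate {
    my $self = shift;
    my $etiqueta1 = main::generar_nueva_etiqueta();
    my $etiqueta2 = main::generar_nueva_etiqueta();

    # Step 1: translate each child
    my ($boolexpr, $ifstatement, $elsestatement) =
        map { $self->{$_}->translate } qw(boolexpr ifstatement elsestatement);

    # Step 2: combine the results for this kind of node
    return join "\n",
        $boolexpr,
        "JUMPZERO $etiqueta1",
        $ifstatement,
        "JUMP $etiqueta2",
        "${etiqueta1}:",
        $elsestatement,
        "${etiqueta2}:";
}

package main;

my $node = IF_ELSE->new(
    boolexpr      => CODE->new('LOAD x'),
    ifstatement   => CODE->new('PRINT 1'),
    elsestatement => CODE->new('PRINT 0'),
);
say $node->translate;
```

Only the second step is specific to IF_ELSE; the first step is the same recursive call on the children that every node class performs.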

Following these observations, the code of BinaryOp.pm looks like this:

pl@nereida:~/Lperltesting$ cat BinaryOp.pm
package BinaryOp;
use strict;
use base qw(Class::Accessor);

BinaryOp->mk_accessors(qw{left right op});

sub translate {
  my $self = shift;

  return $self->left->translate." ".$self->right->translate." ".$self->op;
}

package ADD;
use base qw{BinaryOp};

package MULT;
use base qw{BinaryOp};

package NUM;

sub translate {
  my $self = shift;

  return $self->{val};
}

1;

See also:

Class::Accessor
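
To try the translation in isolation from the parser, the AST for 2*(3-4) can be built by hand and translated. The sketch below is not the BinaryOp.pm listing: so that it runs without installing Class::Accessor, it assumes hand-written constructors and accessors that behave like the generated ones (a `new` taking a hash reference, plus `left`, `right` and `op`).

```perl
use v5.10;
use strict;
use warnings;

package BinaryOp;
# Hand-written replacements for what Class::Accessor would generate
sub new   { my ($class, $args) = @_; return bless { %$args }, $class }
sub left  { return $_[0]{left} }
sub right { return $_[0]{right} }
sub op    { return $_[0]{op} }

sub translate {
    my $self = shift;
    # Postfix: left operand, right operand, then the operator
    return join ' ', $self->left->translate, $self->right->translate, $self->op;
}

package NUM;
sub new       { my ($class, $args) = @_; return bless { %$args }, $class }
sub translate { return $_[0]{val} }

package main;

# ADD and MULT inherit everything from BinaryOp
@ADD::ISA  = ('BinaryOp');
@MULT::ISA = ('BinaryOp');

my $ast = MULT->new({
    left  => NUM->new({ val => 2 }),
    right => ADD->new({
        left  => NUM->new({ val => 3 }),
        right => NUM->new({ val => 4 }),
        op    => '-',
    }),
    op => '*',
});

say $ast->translate;   # 2 3 4 - *
```

The tree has the same shape as the Data::Dumper output shown earlier, and the result matches the postfix line printed by the main program.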

Accessing Distant Parentheses: The `Regexp::Paren` Module

In this solution we use the variables @- and @+ to build a function that gives access to what matched the most recent capturing parentheses:

Since Perl 5.6.1 the special variables @- and @+ can functionally replace $`, $& and $'. These arrays contain pointers to the beginning and end of each match (see perlvar for the full story), so they give you essentially the same information, but without the risk of excessive string copying.

See the paragraphs devoted to @- and @+ elsewhere in these notes for more information.
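
As a small illustration of how the offsets in @- and @+ recover matched text, the following sketch extracts the whole match and each group with substr after an ordinary successful match, outside any `(?{ ... })` block. The helper name `groups_via_offsets` and the sample input are assumptions made for this example, and it assumes every group took part in the match.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: returns the whole match followed by each group,
# computed only from the start offsets (@-) and end offsets (@+).
sub groups_via_offsets {
    my ($text, $re) = @_;
    return unless $text =~ $re;

    # Copy the offsets right away: a later match would overwrite them
    my @start = @-;
    my @end   = @+;

    return map { substr($text, $start[$_], $end[$_] - $start[$_]) } 0 .. $#start;
}

my ($whole, $key, $value) = groups_via_offsets('key=value', qr/(\w+)=(\w+)/);
say "match:   $whole";
say "group 1: $key";
say "group 2: $value";
```

Index 0 describes the whole match and index n describes group n, so the substr call with index 0 yields the same text as $&.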

g(1) returns what matched the last parenthesis, g(2) what matched the second-to-last one, and so on.

pl@nereida:~/Lperltesting$ cat Regexp/Paren.pm
package Regexp::Paren;
use strict;

use base qw{Exporter};

our @EXPORT_OK = qw{g};

sub g {
  die "Error in 'Regexp::Paren::g'. Not used inside (?{ code }) construct\n" unless defined($_);
  my $ofs = - shift;

  # Number of parenthesis that matched
  my $np = @-;
  die "Error. Illegal 'Regexp::Paren::g' ref inside (?{ code }) construct\n" unless ($np > - $ofs && $ofs < 0);
  # $_ contains the string being matched
  substr($_, $-[$ofs], $+[$np+$ofs] - $-[$ofs])
}

1;

=head1 NAME

Regexp::Paren - Extends $^N inside (?{ ... }) constructs

=head1 SYNOPSIS

  use Regexp::Paren qw{g};

  'abcde' =~ qr{(.)(.)(.)
                       (?{ print g(1)." ".g(2)." ".g(3)."\n" })                   # c b a
               (.)     (?{ print g(1)." ".g(2)." ".g(3)." ".g(4)."\n" })          # d c b a
               (.)     (?{ print g(1)." ".g(2)." ".g(3)." ".g(4)." ".g(5)."\n" }) # e d c b a
              }x;

  print g(1)." ".g(2)." ".g(3)." ".g(4)." ".g(5)."\n"; # error!

=head1 DESCRIPTION

Inside a C<(?{ ... })> construct, C<g(1)> refers to what matched the last parenthesis
(like C<$^N>), C<g(2)> refers to the string that matched with the parenthesis before
the last, C<g(3)> refers to the string that matched with the parenthesis at distance 3,
etc.

=head1 SEE ALSO

=over 2

=item * L<perlre>

=item * L<perlretut>

=item * PerlMonks node I<Strange behavior o> C<@-> I<and> C<@+> I<in perl5.10 regexps> L<http://www.perlmonks.org/?node_id=794736>

=item * PerlMonks node I<Backreference variables in code embedded inside Perl 5.10 regexps> L<http://www.perlmonks.org/?node_id=794424>

=back

=head1 AUTHOR

Casiano Rodriguez-Leon (casiano@ull.es)

=head1 ACKNOWLEDGMENTS

This work has been supported by CEE (FEDER) and the Spanish Ministry of
I<Educacion y Ciencia> through I<Plan Nacional I+D+I> number TIN2005-08818-C04-04
(ULL::OPLINK project L<http://www.oplink.ull.es/>).
Support from Gobierno de Canarias was through GC02210601
(I<Grupos Consolidados>).
The University of La Laguna has also supported my work in many ways
and for many years.

=head1 LICENCE AND COPYRIGHT

Copyright (c) 2009- Casiano Rodriguez-Leon (casiano@ull.es). All rights reserved.

These modules are free software; you can redistribute it and/or
modify it under the same terms as Perl itself. See L<perlartistic>.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Running perldoc Regexp::Paren displays the embedded documentation (see the perlpod and perlpodspec documentation, as well as the section Documentation in Perl, for more details):

NAME
    Regexp::Paren - Extends $^N inside (?{ ... }) constructs

SYNOPSIS
      use Regexp::Paren qw{g};

      'abcde' =~ qr{(.)(.)(.)
                           (?{ print g(1)." ".g(2)." ".g(3)."\n" })                   # c b a
                   (.)     (?{ print g(1)." ".g(2)." ".g(3)." ".g(4)."\n" })          # d c b a
                   (.)     (?{ print g(1)." ".g(2)." ".g(3)." ".g(4)." ".g(5)."\n" }) # e d c b a
                  }x;

      print g(1)." ".g(2)." ".g(3)." ".g(4)." ".g(5)."\n"; # error!

DESCRIPTION
    Inside a "(?{ ... })" construct, g(1) refers to what matched the last
    parenthesis (like $^N), g(2) refers to the string that matched with the
    parenthesis before the last, g(3) refers to the string that matched with
    the parenthesis at distance 3, etc.

SEE ALSO
    * perlre
    * perlretut
    * PerlMonks node *Strange behavior o* "@-" *and* "@+" *in perl5.10
      regexps* <http://www.perlmonks.org/?node_id=794736>
    * PerlMonks node *Backreference variables in code embedded inside Perl
      5.10 regexps* <http://www.perlmonks.org/?node_id=794424>

AUTHOR
    Casiano Rodriguez-Leon (casiano@ull.es)

ACKNOWLEDGMENTS
    This work has been supported by CEE (FEDER) and the Spanish Ministry of
    *Educacion y Ciencia* through *Plan Nacional I+D+I* number
    TIN2005-08818-C04-04 (ULL::OPLINK project <http://www.oplink.ull.es/>).
    Support from Gobierno de Canarias was through GC02210601 (*Grupos
    Consolidados*). The University of La Laguna has also supported my work
    in many ways and for many years.

LICENCE AND COPYRIGHT
    Copyright (c) 2009- Casiano Rodriguez-Leon (casiano@ull.es). All rights


Casiano Rodríguez León
2012-05-22
