Back to PyBison Homepage

PyBison Walkthrough

0. Introduction

This document aims to get you up to speed with PyBison in the fastest possible time, by walking you through the motions of using it, and supporting the explanations with an example.

NOTE - recent versions of flex violate the ANSI standards.

If any of the pyBison examples fail to build, remove the following line from the lex code portion of your scripts:
int yylineno = 0;
Also, make sure your system is capable of looking in the local directory when trying to load .so files. If you see any errors like failed to load somefilename.so, just add "." to LD_LIBRARY_PATH, or execute the command:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:.

1. Procure Grammar and Scanner Scripts

The best place to start in building your PyBison parser is to write a grammar (.y) script, and a scanner (.l) script.

(Or, if possible, source these scripts from other (open source) projects).

Once you're familiar with the layout of PyBison parser python modules, you can skip this step, and start building each new parser from scratch, or fram a template (refer to the examples/template directory).

1.1. Introducing the bison2py Utility

bison2py munges your new (or legacy) grammar (.y) and scanner (.l) files, and generates a new Python file containing classes and unit test code for your PyBison parser.

To see bison2py in action, go into the examples/java directory, read the README file, and generate a javaparser.py file from javaparser.y and javaparser.l scripts.

Study the generated javaparser.py file - it's especially useful from a point of view of seeing what's good to put into a pybison python parser file, especially when writing your own.

In fact, when starting a new parser project, you might like to start by writing .y and .l files yourself, and repeatedly:
  1. Edit these files
  2. Generate a parser from them with bison2py
  3. Test the parser rigorously against a whole range of inputs
  4. Remove the grammar and scanner errors as you find them
  5. Repeat these steps as often as needed till you have a bug-free parser and scanner
We suggest you may have a far easier time if you ensure you have a bug-free parser script before even beginning to edit your target handler and parse node methods.

Once you've got a stable parser, you'll have a structure to work from. You'll then be free to discard or archive your .y and .l files, and tweak the grammar and scanner by editing the target handler docstrings and scanner script attributes, respectively.

1.2. Preparing Your Grammar File

If you're using an existing .y file (perhaps sourced from another project), you'll need to massage it a bit to get it into a state where you can process it automatically with bison2py.

In summary, you'll need to:

1.2.1. Strip Out Comments and Actions

With your grammar (.y) file, you'll need to strip out all action statements and comments from the rules section.

For instance, if you're using a legacy grammar file, you'll need to convert rules like:
        expr: expr PLUS expr
                { $$ = $1 + $3; } /* add the numbers now */
             |expr MINUS expr
                { $$ = $1 - $3; }
             ;
to:
        expr : expr PLUS expr
           | expr MINUS expr
           ;
or:
        expr
        : expr PLUS expr
        | expr MINUS expr
        ;
depending on what style meets your taste.

The reason for this is that your pybison script will receive callbacks every time a parse target is reached, which is done by automatically appending special action code to each rule clause. If you don't remove all action statements, the conversion will fail.

1.2.2. Replace All Character Literals in Rules

Within the PyBison-generated parser, all targets and tokens are rendered as Python objects (for people familiar with the Python/C API, type PyObject *)

Therefore, you unfortunately lose the convenience of being able to deal in C character literals in your rules.

For instance, with a rule like:
        expr : expr '+' expr
           | expr '-' expr
           ;
you'll have to replace the '+' and '-' char literals to abstract tokens, and ensure that your scanner script returns Python-wrapped tokens for these operators. You should end up with a rule like:
        expr : expr PLUS expr
           | expr MINUS expr
           ;
And you'll need to ensure your scanner script does a returntoken(PLUS); and returntoken(MINUS); for '+' and '-' respectively.

1.2.3. Enclose Rule Delimiters in Whitespace

You need to ensure that the delimiters :, | and ; delimiters used in your rules have at least one whitespace character on either side. Sorry about this, but this version of PyBison has some quirks in the regular expressions used for extracting/dissecting the rules, and bison2py (or the resultant parser) may fail if you don't follow this step.
And also, you'll need to have a:
%{ ... %}
section in the prologue (before the first %%).

1.3. Preparing Your Tokeniser File

In addition to parse targets callbacks, PyBison has an input callback, so your Parser object will have control over the input that is sent to the lexer.

You'll have to set up your tokeniser to use this callback mechanism, and also to wrap tokens as Python objects.

To set this up, ensure the following lines are in the C declarations sections of your lex/flex script:
    %{
    #include <stdio.h>
    #include <string.h>
    #include "Python.h"
    #define YYSTYPE void *
    #include "tokens.h"
    extern void *py_parser;
    extern void (*py_input)(PyObject *parser, char *buf, int *result, int max_size);
    #define returntoken(tok) yylval = PyString_FromString(strdup(yytext)); return (tok);
    #define YY_INPUT(buf,result,max_size) {(*py_input)(py_parser, buf, &result, max_size);}
    }%
<quick-diversion>
Let's explain each of these lines now:
    #include <stdio.h>
    #include <string.h>
    #include "Python.h"
Include the standard stdio.h and string.h headers, as well as the Python-C API file Python.h.
    #define YYSTYPE void *
All parse targets and tokens are actually of type PyObject *, or 'pointer to Python object', but neither bison nor flex-generated code need to know this. We'll just give them opaque pointers, and void * will suffice just fine.
    #include "tokens.h"
When PyBison first instantiates any given parser class (and auto-generates, processes, compiles, links the grammar/scanner files into a dynamic lib), the bison program generates a header file of token definitions, which gets renamed to tokens.h. Your scanner script will need this file, so the token macros will be defined and resolved to the correct token numbers.
    extern void (*py_input)(PyObject *parser, char *buf, int *result, int max_size);
    extern void *py_parser;
    #define YY_INPUT(buf,result,max_size) {(*py_input)(py_parser, buf, &result, max_size);}
These lines activate the input callback mechanism. Whenever the scanner needs more input, it will call a global function called py_input(), which forwards the callback to your Python Parser's .read(nbytes) method.
Note that if you want your scanner to use a different source of input (eg, a live TCP socket connection), you can override this method in your parser class, or pass a read=myreadfunction keyword argument when instantiating your parser (myreadfunction should be a callable accepting a single argument nbytes, being the maximum number of bytes to retrieve, and returning a string).
    #define returntoken(tok) yylval = PyString_FromString(strdup(yytext)); return (tok);
A macro which wraps all tokens values as Python strings, so your parser target handlers can uplift the original input text which constitutes that token.
</quick-diversion>

Defining the YY_INPUT C macro tells flex to invoke a callback every time it needs input, so your Parser class' .read() method will have control over what the lexer receives.

Now, you'll need to change all the return statements in your token targets to use returntoken() instead. For example, change:
"(" { return LPAREN; }
to:
"(" { returntoken(LPAREN); }
Lastly, in the epilogue of your lexer file (ie, after the second '%%' line), you'll need to add a line like:
    yywrap() { return(1); }
    

1.4. Doing The Conversion

When you're sure you've got your .y and .l files prepared properly, you can generate the .py file, which will contain your pyBison Parser class.

To do this conversion, run the command:
bison2py mybisonfile.y myflexfile.l mypythonfile.py
where mybisonfile.y is your grammar file, with bison/yacc declarations, myflexfile.l is your tokeniser script, with flex/lex declarations, and mypythonfile.py is the name of the python file you want to generate.

You should now see a file mypythonfile.py which contains a couple of import statements, plus a declaration of a class called Parser.
If your grammar is large and complex, you should consider adding a -c argument to the bison2py command.

This will cause the mypythonfile.py file to be generated with a bunch of parse node subclasses, one per parse target, and with each grammar target handler method instantiating its respective parse node class, rather than the default pybison.BisonNode class.

Also, it'll generate a ParseNode class (derived from pybison.BisonNode, from which all these target-specific node classes are derived.
This can be extremely handy, because you can add a bunch of methods to the ParseNode class, and optionally override these in your per-target node classes. Also, override the constructor and/or the existing .dump() method in this class or the per-target classes.

2. Prepare Your Parser Class

Now, we focus on creation of a working parser.

Note here that we will be creating the parser .py file by hand from scratch - not the preferred approach, but chosen here as an alternative to deriving a parser module boilerplate as discussed in the previous chapter.

To make this easy, we will use a simple calculator example.

Create a new python file, perhaps mycalc.py, and follow these steps:

2.1. Required Imports

You will need at least the following imports:
from bison import BisonParser, BisonNode
BisonParser is the base class from which you derive your own Parser class.

BisonNode is a convenient wrapper for containing the contents of parse targets, and can assist you in building your parse tree.


2.2. Devise Your Grammar

We'll base our example on the Calculator example from the standard bison/yacc manual. Note that we won't use exactly the same token names:
    %token NUM
    %left '-' '+'
    %left '*' '/'
    %left NEG     /* negation--unary minus */
    %right '^'    /* exponentiation        */
    
    /* Grammar follows */
    %%
    input:    /* empty string */
            | input line
    ;
    
    line:     '\n'
            | exp '\n'  { printf ("\t%.10g\n", $1); }
    ;
    
    exp:      NUM                { $$ = $1;         }
            | exp '+' exp        { $$ = $1 + $3;    }
            | exp '-' exp        { $$ = $1 - $3;    }
            | exp '*' exp        { $$ = $1 * $3;    }
            | exp '/' exp        { $$ = $1 / $3;    }
            | '-' exp  %prec NEG { $$ = -$2;        }
            | exp '^' exp        { $$ = pow ($1, $3); }
            | '(' exp ')'        { $$ = $2;         }
    ;
    
However, in PyBison, you don't dump all this into a script - you declare the grammar items one by one in methods of your class.

2.3. Create Skeleton Parser Class

In your calc.py file, you've already done the required imports, so now you can create your skeleton class declaration:
    class Parser(BisonParser):
    
        pass
    

2.4. Declare the Tokens

Now, it's time to declare our tokens. To do this, we add to our class an attribute called tokens which contains a list of our tokens.

Our class now looks like this:
    class Parser(BisonParser):
        tokens = ['NUMBER',
                  'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'POW',
                  'LPAREN', 'RPAREN',
                  'NEWLINE', 'QUIT',
                  ]
    

2.5. Declare the Precedences

To resolve ambiguities in our grammar, we need to declare which entities have precedence, and the associativity (left/right) of these entities.

We adapt this from the example, and add it as an attribute precedences.

Our class now looks like this:
    class Parser(BisonParser):
    
        tokens = ['NUMBER',
                  'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'POW',
                  'LPAREN', 'RPAREN',
                  'NEWLINE', 'QUIT',
                  ]
        precedences = (
            ('left', ('MINUS', 'PLUS')),
            ('left', ('TIMES', 'DIVIDE')),
            ('left', ('NEG', )),
            ('right', ('POW', )),
            )

2.6. Declare the Start Symbol

As you can see from studying the grammar above, the topmost entity is line. We need to tell PyBison to use this, by adding an attribute called start.
Our class now looks like this:
    class Parser(BisonParser):
    
        tokens = ['NUMBER',
                  'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'POW',
                  'LPAREN', 'RPAREN',
                  'NEWLINE', 'QUIT',
                  ]
    
        precedences = (
            ('left', ('MINUS', 'PLUS')),
            ('left', ('TIMES', 'DIVIDE')),
            ('left', ('NEG', )),
            ('right', ('POW', )),
            )
        start = 'input'

2.7. Add Rules Callbacks

This is the fun part. We add a method to our class for each of the parse targets.

For each parse target sometarget, we need to provide a method called:
on_sometarget(self, target, option, names, items)
Each such callback method accepts the arguments:
  • target - string - the name of the target - passed in mainly as a convenience for when you're debugging your grammar.
  • option - int - a numerical index indicating which 'clause' matched the target. For example, given a rule:
                exp : NUMBER
                    | exp PLUS exp
                    | exp MINUS exp
    If we have matched the expression 3 + 6, the option argument will be 1, because the clause exp PLUS exp occurs at position 1 in the list of rule clauses.
  • names - list of strings, being names of the terms in the matching clause. For example, with the above rule, the expression 3 + 6 would produce a names list ['exp', 'PLUS', 'exp']
  • items - list - a list of objects, being the values of the items in the matching clause. Each item of this list will (in the case of token matches), be a literal string of the token, or (in the case of previously handled parse targets), whatever your parse target handler happened to return previously. For instance, in the 3 + 6 example, assuming your on_exp() handler returns a float value, this list would be [3.0, '+', 6.0]
We must now note a major difference from traditional yacc/bison. In yacc/bison, we provide { action-stmts;... } action blocks after each rule clause. But with pyBison, the one parse target callback handles all possible clauses for that target. The option argument indicates which clause actually matched.

Now, with this explanation out of the way, we can get down to the business of actually writing our callbacks.

Our class now looks like:
    class Parser(BisonParser):
    
        tokens = ['NUMBER',
                  'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'POW',
                  'LPAREN', 'RPAREN',
                  'NEWLINE', 'QUIT',
                  ]
    
        precedences = (
            ('left', ('MINUS', 'PLUS')),
            ('left', ('TIMES', 'DIVIDE')),
            ('left', ('NEG', )),
            ('right', ('POW', )),
            )
    
        start = 'input'
        def on_input(self, target, option, names, items):
            """
            input :
                  | input line
            """
            return
    
        def on_line(self, target, option, names, items):
            """
            line : NEWLINE
                 | exp NEWLINE
            """
            if option == 1:
                print "on_line: got exp %s" % items[0]
    
        def on_exp(self, target, option, names, items):
            """
            exp : NUMBER
                | exp PLUS exp
                | exp MINUS exp
                | exp TIMES exp
                | exp DIVIDE exp
                | MINUS exp %prec NEG
                | exp POW exp
                | LPAREN exp RPAREN
            """
            if option == 0:
                return float(items[0])
            elif option == 1:
                return items[0] + items[2]
            elif option == 2:
                return items[0] - items[2]
            elif option == 3:
                return items[0] * items[2]
            elif option == 4:
                return items[0] / items[2]
            elif option == 5:
                return - items[1]
            elif option == 6:
                return items[0] ** items[2]
            elif option == 7:
                return items[1]
    
Note one important thing here - the rules, declared in our docstrings, are not terminated by a semicolon. This is not needed (as in traditional yacc), because the rules are separated into separate handler method docstrings, rather than being lumped in together.

So don't put a semicolon in your grammar rule docstrings, or Bad Things might happen.

2.8. Add Flex Script

Finally, we must tell pyBison how to carve up the input into tokens.

Instead of having a separate flex or lex script, we embed the script verbatim as attribute lexscript.

NOTE - you should provide this script as a Python raw string (r""")

We'll use here a simple flex script which simply recognises numbers, the '+', '-', '*', '/', '**' operators, and parentheses.

For our lexer to work, it will need a C declarations section with the magic lines:
    %{
    int yylineno = 0;
    #include <stdio.h>
    #include <string.h>
    #include "Python.h"
    #define YYSTYPE void *
    #include "tokens.h"
    extern void *py_parser;
    extern void (*py_input)(PyObject *parser, char *buf, int *result, int max_size);
    #define returntoken(tok) yylval = PyString_FromString(strdup(yytext)); return (tok);
    #define YY_INPUT(buf,result,max_size) { (*py_input)(py_parser, buf, &result, max_size); }
    %}
    
(refer to Section 1.3 above for an explanation of these declarations).

Our completed Parser class declaration now looks like this:
    class Parser(BisonParser):
    
        tokens = ['NUMBER',
                  'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'POW',
                  'LPAREN', 'RPAREN',
                  'NEWLINE', 'QUIT']
    
        precedences = (
            ('left', ('MINUS', 'PLUS')),
            ('left', ('TIMES', 'DIVIDE')),
            ('left', ('NEG', )),
            ('right', ('POW', )),
            )
    
        def read(self, nbytes):
            try:
                return raw_input("> ")
            except EOFError:
                return ''
    
        # Declare the start target here (by name)
        start = "input"
    
        def on_input(self, target, option, names, items):
            """
            input :
                  | input line
            """
            return
    
        def on_line(self, target, option, names, items):
            """
            line : NEWLINE
                 | exp NEWLINE
            """
            if option == 1:
                print items[0].value
    
        def on_exp(self, target, option, names, items):
            """
            exp : NUMBER
                | exp PLUS exp
                | exp MINUS exp
                | exp TIMES exp
                | exp DIVIDE exp
                | MINUS exp %prec NEG
                | exp POW exp
                | LPAREN exp RPAREN
            """
            if option == 0:
                return float(items[0])
            elif option == 1:
                return items[0] + items[2]
            elif option == 2:
                return items[0] - items[2]
            elif option == 3:
                return items[0] * items[2]
            elif option == 4:
                return items[0] / items[2]
            elif option == 5:
                return - items[1]
            elif option == 6:
                return items[0] ** items[2]
            elif option == 7:
                return items[1]
    
        lexscript = r"""
    %{
    int yylineno = 0;
    #include <stdio.h>
    #include <string.h>
    #include "Python.h"
    #define YYSTYPE void *
    #include "tokens.h"
    extern void *py_parser;
    extern void (*py_input)(PyObject *parser, char *buf, int *result, int max_size);
    #define returntoken(tok) yylval = PyString_FromString(strdup(yytext)); return (tok);
    #define YY_INPUT(buf,result,max_size) { (*py_input)(py_parser, buf, &result, max_size); }
    %}
    
    %%
    
    [0-9]+ { returntoken(NUMBER); }
    "("    { returntoken(LPAREN); }
    ")"    { returntoken(RPAREN); }
    "+"    { returntoken(PLUS); }
    "-"    { returntoken(MINUS); }
    "*"    { returntoken(TIMES); }
    "**"   { returntoken(POW); }
    "/"    { returntoken(DIVIDE); }
    "quit" { printf("lex: got QUIT\n"); yyterminate(); returntoken(QUIT); }
    
    [ \t\v\f]             {}
    [\n]		{yylineno++; returntoken(NEWLINE); }
    .       { printf("unknown char %c ignored\n", yytext[0]); /* ignore bad chars */}
    
    %%
    
    int yywrap() { return(1); }
    """
    
NOTE - if you are using recent versions of flex (ie, the ones which violate the ANSI standards for lex/flex), you'll have to change the lexing code above; removing the line int yylineno = 0;


Note that we've sneaked in an additional method, .read(self, nbytes). This is another callback that gets invoked by the lexer whenever it needs more input. (quick tip - in your mycalc.py file, do an 'import readline', so you get line editing and recall when the parser runs).

This gives a lot of flexibility, because our Parser class gets to control exactly where its input comes from - file, or a string, socket, whatever.


2.9. Write Runner Script

One quick last thing to do here - we just need a tiny script (say, 'runcalc.py'), to import our Parser class and run it:
    #!/usr/bin/env python
    import mycalc
    p = mycalc.Parser()
    p.run()
    
There's a specific reason why we do this - if we made our mycalc.py script executable, then when we first instantiate our Parser class, PyBison will guess a name for the dynamic library to create. If running mycalc.py directly, then self.__class__.__module__ will be '__main__', and our dynamic library would be created with the name __main__-parser.so, which is pretty ugly. You could force a name for the library file by declaring an additional attribute in the Parser class:
    bisonEngineLibName = "mycalc-parser"
    
Oh, and don't forget to chmod the script to be executable.

3. Run The Parser

We're now ready to run our completed parser.

Given that you have created the files mycalc.py and runcalc.py in the current directory, and that you've already installed PyBison (refer INSTALL file), you'll be set to go.

From your shell, just type:
      $ ./runcalc.py
    
The first time you run this parser, it might make a lot of compilation-type noises. For example, my aging Debian-based system produces:
    In file included from /usr/include/python2.3/Python.h:8,
                     from tmp.l:6:
    /usr/include/python2.3/pyconfig.h:847:1: warning: "_POSIX_C_SOURCE" redefined
    In file included from /usr/include/stdio.h:28,
                     from tmp.lex.c:11:
    /usr/include/features.h:171:1: warning: this is the location of the previous definition
    
All this relates to a bit of black magic which is happening in the background.

The first time you instantiate your mycalc.Parser class, the bison.BisonParser base class tries to load the dynamic library mycalc-parser.so (or, on windows, mycalc-parser.dll).

If the library file is not present (or if it is out of date, determined from hashing handler docstrings and pertinent attributes in the class), PyBison attempts to build it.

To build this library, PyBison:

4. Miscellaneous Remarks

Just a few quick notes, to cover some of the possible gotchas.

4.1. Plurality

In the present version of PyBison, you may only have one instance of any given Parser class actually running at any one time. This is because the present version of PyBison makes use of a couple of global C variables to store hooks into your Parser instance.

However, you can have multiple instances existing at the same time.

Also, you can have several parsers running at the same time, as long as they are each instantiated from different Parser classes.

4.2. Building A Parse Tree

The .run() method of your parser object returns whatever your handler for the top-level target returned.

Building a whole parse tree is pretty simple.

Within each parse target handler callback (your .on_whateverTarget() methods), you need to create a new BisonNode instance, and store the component items (the items argument, or whatever you want to extract from items) as attributes, then return this BisonNode object.

Then, with the BisonNode object returned from your parser's .run() method, you'll be able to traverse the tree of the entire parse run.

5. Conclusion

Through this document, we have started from scratch, and created and used a complete, working parser.

We have presented the options of starting with existing .y and .l scripts and converting them to a boilerplate PyBison .py file, versus writing your own python parser file from scratch.

We have covered the requirements for building Parser classes, the attributes and methods you need to declare.

We have discussed the callback model, whereby instances of your Parser class receive callbacks from PyBison whenever input is required, and whenver a parse target has been unambiguously reached.

We have briefly discussed how PyBison derives grammar and tokeniser scripts from the contents of our Parser class, and how PyBison runs bison and flex on these scripts, compiles the output, and links the result into a shared library, which can be used in subsequent uses of the Parser to get almost the full speed of C-based code, from the comfort and convenience of the Python environment.

And also, we have briefly mentioned how to use PyBison to build up a parse tree as an easily-traversed Python data structure.

We hope this document has got you up to speed without undue head-scratching, and that you're now starting to get a feel for designing and building your own parsers.

David McNab