gemlang.rst 10.7 KB
Newer Older
Chris Jewell's avatar
Chris Jewell committed
1
2
.. _gemlang_doc:

Chris Jewell's avatar
Chris Jewell committed
3
4
5
6
7
8
9
10
11
12
gemlang: The GEM modelling language
===================================

The GEM modelling language, **gemlang**, is defined using
`extended Backus-Naur Form <https://en.wikipedia.org/wiki/Extended_BackusNaur_form>`_ (eBNF) notation and parsed
using the Python parser generator package `Lark <https://github.com/lark-parser/lark>`_.  The basic architecture of the
GEM parsing framework closely follows the source-to-source translation ideas outlined in [Par2010]_ and shown
in outlined in :ref:`Figure 1 <parsechain>`.

.. _parsechain:
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
.. mermaid::
    :alt: GEM parse chain diagram
    :align: center
    :caption: Figure 1: The GEM parse chain showing the main stages of the source-to-source translation pipeline.

    graph LR;
        gemlang --> lex(Lexical<br />analysis);
        lex --parse tree--> syntax(Syntactic<br />analysis);
        syntax --AST--> semantic(Semantic<br />analysis);
        semantic --AST--> code(Code<br />generator);
        symboltable[Symbol<br />Table] --> semantic
        symboltable --> code
        code --> Python
        Lark --> lex
        Lark --> syntax
Chris Jewell's avatar
Chris Jewell committed
28
29


Chris Jewell's avatar
Chris Jewell committed
30
31
Lexical and Syntactic Analysis
------------------------------
Chris Jewell's avatar
Chris Jewell committed
32
33
34
35
36
37
The GEM program syntax is defined using an eBNF-like grammar, which can be read by the
`Lark <https://github.com/lark-parser/lark>`_ parsing engine.  eBNF is a syntax that describes a language syntax in
terms of *production rules*, which may be arranged in a hierarchy.

Given a gemlang program, lexical analysis and the first stage of syntactic analysis are performed by
`Lark <https://github.com/lark-parser/lark>`_.  On invocation, Lark reads the gemlang eBNF grammar definition
Chris Jewell's avatar
Chris Jewell committed
38
in `gem/gemlang/gem_grammar.cfgr`, and lexes and parses an input gemlang program to a *parse tree* as described in
Chris Jewell's avatar
Chris Jewell committed
39
the Lark documentation -- one tree node per production rule in the grammar.
40
41
Parsing is done using Lark's implementation of the `Earley <https://en.wikipedia.org/wiki/Earley_parser>`_ algorithm,
providing a robust and powerful method of parsing against the gemlang grammar.
Chris Jewell's avatar
Chris Jewell committed
42
43

In the second stage of syntactic analysis, the parse tree is then *transformed* into an
44
45
Abstract Syntax Tree (AST) representing the GEM program, with objects of (base) type
:class:`ASTNode <gem.gemlang.ast.ast_base.ASTNode>` representing nodes
Chris Jewell's avatar
Chris Jewell committed
46
47
in the tree.  The reason for doing this is that the parse tree is entirely homogeneous, using Lark's built in `Tree`
class to represent nodes.  For purposes of semantic analysis, we find it easier to work with specialisations of
48
49
50
51
52
:class:`ASTNode <gem.gemlang.ast.ast_base.ASTNode>` (:class:`Number <gem.gemlang.ast.ast_expression.Number>`,
:class:`MulExpr <gem.gemlang.ast.ast_expression.MulExpr>`, :class:`Call <gem.gemlang.ast.ast_expression.Call>`, etc.) so that
we can use Python's type system to identify operations and objects
represented within the AST.  This has the important advantage of decoupling parse tree generation from subsequent
semantic analysis, leading to a more modularised software architecture.
Chris Jewell's avatar
Chris Jewell committed
53
54
55

Where in the source code does this happen?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
56
57
58
59
60
61
62
63
64
Lark is invoked via the :class:`GEM <gem.interface.GEM>` class, which in turn calls  the
:func:`gemparse <gem.gemlang.parse_gemlang.gemparse>` function.
:func:`gemparse <gem.gemlang.parse_gemlang.gemparse>` returns an AST, a branching tree composed of objects of type
:class:`ASTNode <gem.gemlang.ast.ast_base.ASTNode>`.  :class:`ASTNode <gem.gemlang.ast.ast_base.ASTNode>` is specialised
into subclasses representing language features and concepts within gemlang, as
specified in the type hierarchy in :mod:`gem.gemlang.ast`.

Importantly, the transformation of the Lark parse tree to AST is implemented by
:class:`GEMParser <gem.gemlang.parse_gemlang.GEMParser>`.
Chris Jewell's avatar
Chris Jewell committed
65
A unit test (`tests/unit/test_parse_completeness`) scans the GEM grammar for production rules, and makes sure a
66
67
complementary method exists in GEMParser.  In addition, :class:`GEMParser <gem.gemlang.parse_gemlang.GEMParser>` will
throw an exception if a production rule is invoked in the grammar for which no method is implemented.
Chris Jewell's avatar
Chris Jewell committed
68

Chris Jewell's avatar
Chris Jewell committed
69
Abstract syntax tree
70
^^^^^^^^^^^^^^^^^^^^
Chris Jewell's avatar
Chris Jewell committed
71
The `AST <https://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ is a tree representation of a GEM program.  Nodes within
72
73
74
75
the tree are objects of subclasses of :class:`ASTNode <gem.gemlang.ast.ast_base.ASTNode>` representing statements, atoms, and expressions within gemlang.  The
tree may be traversed using depth-first or breadth-first using specialisations of the
:class:`ASTWalker <gem.gemlang.ast_walker.ASTWalker>`
class (see developer API documentation for the :mod:`gem.gemlang.ast` module).
Chris Jewell's avatar
Chris Jewell committed
76
77
78
79
80
81

Semantic Analysis
-----------------
The topic of general semantic analysis is broad, and the reader is encouraged to read at least [Par2010]_.  This
documentation will describe the semantic analysis steps currently performed in GEM.  As the GEM project develops, we
expect to add more steps in semantic analysis so the following description should be regarded as non-exhaustive!
Chris Jewell's avatar
Chris Jewell committed
82

Chris Jewell's avatar
Chris Jewell committed
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
To illustrate our description of semantic analysis, it is useful to consider a GEM code fragment:

.. code-block::
    :linenos:
    :name: gemprog
    :caption: Example GEM program implementing an SI model

    mu = 0.0
    beta ~ Normal(mu, 1.0)
    Epidemic SIModel() {
        S = State(init=999)
        I = State(init=1)
        [S -> I] = beta * I
    }
    epi ~ SIModel()

Semantic analysis currently consists of 3 stages which are executed sequentially:

101
102
1. Symbol Declaration (:class:`ParseDeclarations <gem.gemlang.symbol_resolve.ParseDeclarations>`)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Chris Jewell's avatar
Chris Jewell committed
103
In this stage, a symbol table is built containing builtin GEM symbols as well as symbols declared in the GEM program.
Chris Jewell's avatar
Chris Jewell committed
104
Since gemlang is implicitly typed, symbols representing variables are declared when they are first assigned.
105
In :ref:`Example 1 <gemprog>`, symbols `mu`, `beta`, `S`, `I`, and `epi` are pushed into the symbol table on
Chris Jewell's avatar
Chris Jewell committed
106
107
their first assignment (lines 1, 2, 4, 5, 8 respectively).

Chris Jewell's avatar
Chris Jewell committed
108
109
110
In GEM, symbols are represented by the :class:`Symbol <gem.gemlang.symbol.Symbol>` class hierarchy.  Symbol tables are
represented by the :class:`Scope <gem.gemlang.symbol.Scope>` class hierarchy, which is essentially a wrapper around a
Python `dict` object storing *name: symbol* pairs.  Symbols can also have scopes themselves, representing declarations
111
112
such as that for `SIModel` in the :ref:`example <gemprog>` code.  The symbol hierarchy for
:ref:`Example 1 <gemprog>` is shown in :ref:`Figure 2 <symtab>`.
Chris Jewell's avatar
Chris Jewell committed
113

114
115
116
Notes
`````
1. During symbol declaration, each symbol (:class:`IdRef <gem.gemlang.ast.ast_expression.IdRef>`) is annotated with a reference to the AST node representing the declaration.
Chris Jewell's avatar
Chris Jewell committed
117
118
2. Since gemlang is declarative, symbols may not be declared (or even assigned to) more than once.  An exception is
   raised if duplicate declarations are encountered.
119
120
121
3. Scopes are pushed onto a stack, starting with the global scope.  When a new (child) :class:`ScopedSymbol <gem.gemlang.symbol.ScopedSymbol>`
   is encountered, it is pushed onto the stack. When leaving the :class:`ScopedSymbol <gem.gemlang.symbol.ScopedSymbol>`,
   it is popped off the stack, returning to the parent scope.  Developers are referred to [Par2010]_ for further reading.
Chris Jewell's avatar
Chris Jewell committed
122

Chris Jewell's avatar
Chris Jewell committed
123
124
.. _symtab:
.. figure:: symtab.svg
125
    :scale: 10%
Chris Jewell's avatar
Chris Jewell committed
126
127
128
129
130
131
132
    :alt: Example symbol table

    Figure 2: The symbol table for the :ref:`example <gemprog>` GEM program.


2. Symbol Resolution
^^^^^^^^^^^^^^^^^^^^
133
134
135
136
137
At the symbol resolution stage, the AST is traversed and each encountered symbol
is looked up in the current scope's symbol table. If the symbol
exists, the symbol node (:class:`IdRef <gem.gemlang.ast.ast_expression.IdRef>`) is annotated with a reference to the
symbol in the symbol table.  If the symbol is not found, a syntax error exception is raised notifying the user that
an undefined variable exists in the code.
Chris Jewell's avatar
Chris Jewell committed
138
139
140

3. Data Injection
^^^^^^^^^^^^^^^^^
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
The GEM language allows the user to inject static (i.e. constant) data into a model at compile time.  This is done
much like the concept of placeholders in Tensorflow.  For example, in the linear model defined in :ref:`Example 2 <data-injection>`
covariate data is defined as a matrix placeholder, with data passed in when the GEM program is compiled.

.. code-block::
    :linenos:
    :caption: Example of data injection
    :name: data-injection

    prog = """
    X = Vector()
    alpha ~ Normal(0, 1000)
    beta ~ Normal(0, 1000)
    sigma ~ Gamma(2, 0.1)
    y ~ Normal(alpha + beta * X, sigma)
    """
    model = GEM(prog, const_data={'X': x_numpy})

Data injection is performed by the :class:`DataInjector <gem.gemlang.inject_data.InjectData>` class.  The algorithm
traverses the AST performing four actions:

1. Declaring assignments of placeholders to variables are replaced in the AST with a
   :class:`NullNode <gem.gemlang.ast.ast_base.NullNode>` so that no output code is generated.
   This is done because the data already exists in the user's host language environment.
2. Declarations of random variables with symbol names matching data object names in the user's
   `const_data` dictionary are re-written as static data objects.
3. Data structures in the user's `const_data` dictionary are converted into the maths layer's required
   format (see :func:`convert_to_maths_layer <gem.gemlang.tf_output.convert_to_maths_layer>`).
4. References to each constant data object are written into the AST in locations corresponding
   to the global scope.

The resulting AST contains :class:`AssignExpr <gem.gemlang.ast.ast_statement.AssignExpr>` nodes for each
piece of data injected *in the global scope*.


Code Generation
---------------
Chris Jewell's avatar
Chris Jewell committed
178
179
180
181
182
183
184
185
186
187
188
189
190
Code generation is performed by walking the AST and building a hierarchical data structure of objects of type
:class:`Outputter <gem.gemlang.tf_output.Outputter>` defined in the :mod:`gem.gemlang.tf_output` module.  The
:mod:`gem.gemlang.tf_output` module has one class (derived from :class:`Outputter <gem.gemlang.tf_output.Outputter>`)
per GEM language feature which then generates the target Python/Tensorflow generated code.

Tree walking is performed by the :class:`CodeGenerator <gem.gemlang.model_generator.CodeGenerator>` class, which
has a method for each node in the AST which assembles the appropriate collection of :class:`Outputter <gem.gemlang.tf_output.Outputter>`
objects.  In this sense, **source-to-source translation is intended to be performed by the
:class:`CodeGenerator <gem.gemlang.model_generator.CodeGenerator>` class** and not the :class:`Outputter <gem.gemlang.tf_output.Outputter>`
objects.

Note, code generation is not yet a perfect beast -- significant streamlining with possibly more levels of abstraction
are expected in future.
191
192


Chris Jewell's avatar
Chris Jewell committed
193

Chris Jewell's avatar
Chris Jewell committed
194
195
196


.. [Par2010]  Parr, T. Language Implementation Patterns.  The Pragmatic Programmers LLC. 2010.