Commit f4f7140f authored by Chris Jewell

More doc! Took the chance to factor out InjectData semantic analysis stage into its own file.

parent c0839319
Pipeline #246 passed with stage
in 4 minutes and 13 seconds
......@@ -36,7 +36,8 @@ master_doc = 'index'
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.autodoc',
'sphinx_rtd_theme'
'sphinx.ext.inheritance_diagram',
'sphinx_rtd_theme',
]
# Add any paths that contain templates here, relative to this directory.
......@@ -54,9 +55,11 @@ exclude_patterns = []
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
pygments_style = 'sphinx'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_static_path = []
nitpicky = True
......@@ -26,35 +26,42 @@ Given a gemlang program, lexical analysis and the first stage of syntactic analy
`Lark <https://github.com/lark-parser/lark>`_. On invocation, Lark reads the gemlang eBNF grammar definition
in `gem/gemlang/gem_grammar.cfgr`, and lexes and parses an input gemlang program to a *parse tree* as described in
the Lark documentation -- one tree node per production rule in the grammar.
Parsing is done using Lark's implementation of the Earley algorithm, providing a robust and
powerful method of parsing against the gemlang grammar.
Parsing is done using Lark's implementation of the `Earley <https://en.wikipedia.org/wiki/Earley_parser>`_ algorithm,
providing a robust and powerful method of parsing against the gemlang grammar.
In the second stage of syntactic analysis, the parse tree is then *transformed* into an
Abstract Syntax Tree (AST) representing the GEM program, with objects of (base) type `ASTNode` representing nodes
Abstract Syntax Tree (AST) representing the GEM program, with objects of (base) type
:class:`ASTNode <gem.gemlang.ast.ast_base.ASTNode>` representing nodes
in the tree. The reason for doing this is that the parse tree is entirely homogeneous, using Lark's built-in `Tree`
class to represent nodes. For purposes of semantic analysis, we find it easier to work with specialisations of
`ASTNode`s (`Constant`, `State`, `Mul`, etc.) so that we can use Python's type system to identify operations and objects
represented within the AST. This has the advantage of decoupling parse tree generation from subsequent semantic
analysis, leading to a more modularised software architecture.
:class:`ASTNode <gem.gemlang.ast.ast_base.ASTNode>` (:class:`Number <gem.gemlang.ast.ast_expression.Number>`,
:class:`MulExpr <gem.gemlang.ast.ast_expression.MulExpr>`, :class:`Call <gem.gemlang.ast.ast_expression.Call>`, etc.) so that
we can use Python's type system to identify operations and objects
represented within the AST. This has the important advantage of decoupling parse tree generation from subsequent
semantic analysis, leading to a more modularised software architecture.
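The transformation from a homogeneous parse tree to typed AST nodes can be sketched as follows. This is a minimal illustration rather than the GEM implementation: the ``Tree`` dataclass only stands in for Lark's tree class, the rule names ``number`` and ``mul_expr`` are assumed, and the node classes are simplified versions of those named above.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    """Stand-in for a homogeneous parse-tree node: a rule name plus children."""
    data: str
    children: list = field(default_factory=list)


class ASTNode:
    """Simplified base class for typed AST nodes."""

    def __init__(self, *children):
        self.children = list(children)


class Number(ASTNode):
    def __init__(self, value):
        super().__init__()
        self.value = value


class MulExpr(ASTNode):
    pass


def to_ast(node):
    """Recursively map each (assumed) rule name to a specialised ASTNode."""
    if node.data == "number":
        return Number(float(node.children[0]))
    if node.data == "mul_expr":
        return MulExpr(*(to_ast(child) for child in node.children))
    raise NotImplementedError(f"No AST node for rule '{node.data}'")


parse_tree = Tree("mul_expr", [Tree("number", ["2"]), Tree("number", ["3"])])
ast = to_ast(parse_tree)
```

Once the tree is typed, later stages can test ``isinstance(node, MulExpr)`` instead of comparing rule-name strings.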
Where in the source code does this happen?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Lark is invoked via the `gem.interface.GEM.__parse__` method. This method calls the `gemparse` wrapper function
defined in `gem.gemlang.parse_gemlang`. `gemparse` returns an AST, a branching tree composed of objects of type
`ASTNode`. `ASTNode` is specialised into subclasses representing language features and concepts within gemlang, as
specified in the type hierarchy in `gem.gemlang.ast`.
Importantly, the transformation of the Lark parse tree to AST is implemented by `gem.gemlang.parse_gemlang.GEMParser`.
Lark is invoked via the :class:`GEM <gem.interface.GEM>` class, which in turn calls the
:func:`gemparse <gem.gemlang.parse_gemlang.gemparse>` function.
:func:`gemparse <gem.gemlang.parse_gemlang.gemparse>` returns an AST, a branching tree composed of objects of type
:class:`ASTNode <gem.gemlang.ast.ast_base.ASTNode>`. :class:`ASTNode <gem.gemlang.ast.ast_base.ASTNode>` is specialised
into subclasses representing language features and concepts within gemlang, as
specified in the type hierarchy in :mod:`gem.gemlang.ast`.
Importantly, the transformation of the Lark parse tree to AST is implemented by
:class:`GEMParser <gem.gemlang.parse_gemlang.GEMParser>`.
A unit test (`tests/unit/test_parse_completeness`) scans the GEM grammar for production rules, and makes sure a
complementary method exists in GEMParser. In addition, `GEMParser` will throw an exception if a production rule is
invoked in the grammar for which no method is implemented.
complementary method exists in GEMParser. In addition, :class:`GEMParser <gem.gemlang.parse_gemlang.GEMParser>` will
throw an exception if a production rule is invoked in the grammar for which no method is implemented.
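A completeness check of this kind might look roughly like the sketch below. The grammar fragment, the ``ToyParser`` class, and both helper functions are hypothetical; the real test and grammar file may be organised differently.

```python
import re

# Hypothetical Lark-style grammar fragment, not the real gemlang grammar.
GRAMMAR = """
start: statement+
statement: assign_expr
assign_expr: NAME "=" NUMBER
NAME: /[a-z]+/
NUMBER: /[0-9]+/
"""


class ToyParser:
    """Stand-in transformer: one method per production rule it supports."""

    def start(self, children):
        return children

    def statement(self, children):
        return children


def production_rules(grammar):
    """Return lower-case rule names; upper-case names are terminals."""
    pattern = r"^\??([a-z_][a-z0-9_]*)\s*:"
    return re.findall(pattern, grammar, flags=re.MULTILINE)


def missing_methods(grammar, parser_cls):
    """List production rules that have no matching method on parser_cls."""
    return [rule for rule in production_rules(grammar)
            if not callable(getattr(parser_cls, rule, None))]


missing = missing_methods(GRAMMAR, ToyParser)
```

A unit test would then simply assert that the returned list is empty.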
Abstract syntax tree
--------------------
^^^^^^^^^^^^^^^^^^^^
The `AST <https://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ is a tree representation of a GEM program. Nodes within
the tree are objects of subclasses of `ASTNode` representing statements, atoms, and expressions within gemlang. The
tree may be traversed using depth-first or breadth-first using specialisations of the `gem.gemlang.ast_walker.ASTWalker`
class (see developer API documentation for the `gem.gemlang.ast`).
the tree are objects of subclasses of :class:`ASTNode <gem.gemlang.ast.ast_base.ASTNode>` representing statements, atoms, and expressions within gemlang. The
tree may be traversed depth-first or breadth-first using specialisations of the
:class:`ASTWalker <gem.gemlang.ast_walker.ASTWalker>`
class (see the developer API documentation for the :mod:`gem.gemlang.ast` module).
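As an illustration of the walker pattern, the sketch below performs a depth-first traversal and dispatches to per-class hooks. The ``onExit_<ClassName>`` naming follows the methods seen in the GEM source; the ``onEnter_`` hooks, the ``Walker`` class, and the node classes here are assumptions made for the example.

```python
class ASTNode:
    def __init__(self, *children):
        self.children = list(children)


class Number(ASTNode):
    def __init__(self, value):
        super().__init__()
        self.value = value


class MulExpr(ASTNode):
    pass


class Walker:
    """Depth-first walker calling onEnter_<Class> / onExit_<Class> hooks."""

    def walk(self, node):
        self._dispatch("onEnter", node)
        for child in node.children:
            self.walk(child)
        self._dispatch("onExit", node)

    def _dispatch(self, prefix, node):
        # Hooks are optional: nodes without a matching method are skipped.
        hook = getattr(self, f"{prefix}_{type(node).__name__}", None)
        if hook is not None:
            hook(node)


class Tracer(Walker):
    """Example specialisation that records the order of visits."""

    def __init__(self):
        self.events = []

    def onEnter_MulExpr(self, node):
        self.events.append("enter MulExpr")

    def onExit_Number(self, node):
        self.events.append(f"exit Number({node.value})")

    def onExit_MulExpr(self, node):
        self.events.append("exit MulExpr")


tracer = Tracer()
tracer.walk(MulExpr(Number(2), Number(3)))
```

Because the exit hook of a parent fires after all of its children, exit hooks are a natural place for bottom-up rewrites.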
Semantic Analysis
-----------------
......@@ -80,28 +87,31 @@ To illustrate our description of semantic analysis, it is useful to consider a G
Semantic analysis currently consists of 3 stages which are executed sequentially:
1. Symbol Declaration (:class:`ParseDeclarations <gem.gemlang.symbol\_resolve.ParseDeclarations>`)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. Symbol Declaration (:class:`ParseDeclarations <gem.gemlang.symbol_resolve.ParseDeclarations>`)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In this stage, a symbol table is built containing builtin GEM symbols as well as symbols declared in the GEM program.
Since gemlang is implicitly typed, symbols representing variables are declared when they are first assigned.
In the :ref:`example <gemprog>` code, symbols `mu`, `beta`, `S`, `I`, and `epi` are pushed into the symbol table on
In :ref:`Example 1 <gemprog>`, symbols `mu`, `beta`, `S`, `I`, and `epi` are pushed into the symbol table on
their first assignment (lines 1, 2, 4, 5, 8 respectively).
In GEM, symbols are represented by the :class:`Symbol <gem.gemlang.symbol.Symbol>` class hierarchy. Symbol tables are
represented by the :class:`Scope <gem.gemlang.symbol.Scope>` class hierarchy, which is essentially a wrapper around a
Python `dict` object storing *name: symbol* pairs. Symbols can also have scopes themselves, representing declarations
such as that for `SIModel` in the :ref:`example <gemprog>` code. The symbol hierarchy for the
:ref:`example <gemprog>` is shown in :ref:`Figure 2 <symtab>`.
such as that for `SIModel` in the :ref:`example <gemprog>` code. The symbol hierarchy for
:ref:`Example 1 <gemprog>` is shown in :ref:`Figure 2 <symtab>`.
**Notes**
1. During symbol declaration, each symbol is annotated with a reference to the AST node representing the declaration.
Notes
`````
1. During symbol declaration, each symbol (:class:`IdRef <gem.gemlang.ast.ast_expression.IdRef>`) is annotated with a reference to the AST node representing the declaration.
2. Since gemlang is declarative, symbols may not be declared (or even assigned to) more than once. An exception is
raised if duplicate declarations are encountered.
3. Scopes are pushed onto a stack, starting with the global scope. When a new (child) :class:`ScopedSymbol <gem.gemlang.symbol.ScopedSymbol>`
is encountered, it is pushed onto the stack. When leaving the :class:`ScopedSymbol <gem.gemlang.symbol.ScopedSymbol>`,
it is popped off the stack, returning to the parent scope. Developers are referred to [Par2010]_ for further reading.
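The scope stack described in the notes can be sketched as below. The ``Scope`` and ``ScopeStack`` classes are simplified stand-ins written for this example, not the classes in :mod:`gem.gemlang.symbol`, and plain strings are used in place of real symbol objects.

```python
class Scope:
    """Symbol table: a dict of name -> symbol, linked to its parent scope."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.symbols = {}

    def declare(self, name, symbol):
        # gemlang is declarative, so a second declaration is an error.
        if name in self.symbols:
            raise SyntaxError(
                f"Duplicate declaration of '{name}' in scope '{self.name}'")
        self.symbols[name] = symbol


class ScopeStack:
    def __init__(self):
        self._stack = [Scope("global")]

    @property
    def current(self):
        return self._stack[-1]

    def push(self, name):
        scope = Scope(name, parent=self.current)
        self.current.declare(name, scope)  # scoped symbol lives in its parent
        self._stack.append(scope)
        return scope

    def pop(self):
        if len(self._stack) == 1:
            raise RuntimeError("Cannot pop the global scope")
        return self._stack.pop()


scopes = ScopeStack()
scopes.current.declare("mu", "symbol for mu")
scopes.push("SIModel")                       # entering a scoped declaration
scopes.current.declare("S", "symbol for S")
scopes.pop()                                 # back in the global scope
```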
.. _symtab:
.. figure:: symtab.svg
:scale: 60%
:scale: 10%
:alt: Example symbol table
Figure 2: The symbol table for the :ref:`example <gemprog>` GEM program.
......@@ -109,10 +119,53 @@ such as that for `SIModel` in the :ref:`example <gemprog>` code. The symbol hie
2. Symbol Resolution
^^^^^^^^^^^^^^^^^^^^
At the symbol resolution stage, the AST is traversed and each encountered symbol is
At the symbol resolution stage, the AST is traversed and each encountered symbol
is looked up in the current scope's symbol table. If the symbol
exists, the symbol node (:class:`IdRef <gem.gemlang.ast.ast_expression.IdRef>`) is annotated with a reference to the
symbol in the symbol table. If the symbol is not found, a syntax error exception is raised notifying the user that
an undefined variable exists in the code.
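A minimal sketch of the lookup is given below. It assumes that a failed lookup in the current scope falls back to enclosing scopes, which the text above does not state explicitly; the ``Scope`` and ``IdRef`` classes are simplified stand-ins and strings replace real symbol objects.

```python
class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}

    def resolve(self, name):
        """Look in this scope first, then walk outwards through parents."""
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        raise SyntaxError(f"Undefined variable '{name}'")


class IdRef:
    """Stand-in identifier node; `symbol` is filled in during resolution."""

    def __init__(self, value):
        self.value = value
        self.symbol = None


global_scope = Scope()
global_scope.symbols["beta"] = "symbol for beta"
model_scope = Scope(parent=global_scope)
model_scope.symbols["S"] = "symbol for S"

# Resolving `beta` from inside the model scope finds the global symbol.
ref = IdRef("beta")
ref.symbol = model_scope.resolve(ref.value)
```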
3. Data Injection
^^^^^^^^^^^^^^^^^
The GEM language allows the user to inject static (i.e. constant) data into a model at compile time. This works
much like placeholders in Tensorflow. For example, in the linear model defined in :ref:`Example 2 <data-injection>`,
the covariate data `X` is declared as a placeholder, with the data passed in when the GEM program is compiled.
.. code-block::
   :linenos:
   :caption: Example of data injection
   :name: data-injection

   prog = """
   X = Vector()
   alpha ~ Normal(0, 1000)
   beta ~ Normal(0, 1000)
   sigma ~ Gamma(2, 0.1)
   y ~ Normal(alpha + beta * X, sigma)
   """
   model = GEM(prog, const_data={'X': x_numpy})
Data injection is performed by the :class:`InjectData <gem.gemlang.inject_data.InjectData>` class. The algorithm
traverses the AST performing four actions:
1. Assignments that declare placeholder variables are replaced in the AST with a
:class:`NullNode <gem.gemlang.ast.ast_base.NullNode>` so that no output code is generated.
This is done because the data already exists in the user's host language environment.
2. Declarations of random variables with symbol names matching data object names in the user's
`const_data` dictionary are re-written as static data objects.
3. Data structures in the user's `const_data` dictionary are converted into the maths layer's required
format (see :func:`convert_to_maths_layer <gem.gemlang.tf_output.convert_to_maths_layer>`).
4. References to each constant data object are written into the AST in locations corresponding
to the global scope.
The resulting AST contains :class:`AssignExpr <gem.gemlang.ast.ast_statement.AssignExpr>` nodes for each
piece of data injected *in the global scope*.
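The effect of actions 1 and 4 on the AST can be illustrated with the toy sketch below. The node classes and the ``inject`` function are simplified stand-ins written for this example; only the ``gem_external['...']`` reference format is taken from the GEM source, and the conversion to the maths layer is omitted.

```python
class Node:
    def __init__(self, *children):
        self.children = list(children)


class IdRef(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value


class Placeholder(Node):
    """Stands in for a placeholder declaration such as Vector()."""


class AssignExpr(Node):
    pass


class NullNode(Node):
    pass


class Program(Node):
    pass


def inject(program, const_data):
    statements = []
    for stmt in program.children:
        if (isinstance(stmt, AssignExpr)
                and isinstance(stmt.children[1], Placeholder)):
            name = stmt.children[0].value
            if name not in const_data:
                raise KeyError(f"Data for placeholder '{name}' not supplied")
            stmt = NullNode()  # no output code for the placeholder itself
        statements.append(stmt)
    # Reference each injected data object at the top of the global scope.
    refs = [AssignExpr(IdRef(k), IdRef(f"gem_external['{k}']"))
            for k in const_data]
    program.children = refs + statements
    return program


prog = Program(AssignExpr(IdRef("X"), Placeholder()),
               AssignExpr(IdRef("a"), IdRef("b")))
out = inject(prog, {"X": [1.0, 2.0]})
```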
Code Generation
---------------
......
......@@ -7,7 +7,7 @@ computational and statistical methodology. The package is written in Python, an
1. The GEM model description language, *gemlang*;
2. Maths functions encoding the probabilistic GEM model;
3. A back-end extensible algorithms layer for simulation and inference;
4. A scalable and hardware-independent maths layer, currently [Tensorflow](https://www.tensorflow.org)
4. A scalable and hardware-independent maths layer, currently `Tensorflow <https://www.tensorflow.org>`_
GEM provides both a user interface and a developer interface, allowing use by epidemics scientists analysing outbreaks
as well as methods researchers (statisticians, computer scientists, etc) to add new functionality to the package.
......
gem.gemlang.ast package
=======================
gem.gemlang.ast module
======================
Submodules
----------
gem.gemlang.ast.ast\_base
-------------------------
gem.gemlang.ast.ast\_base module
--------------------------------
.. inheritance-diagram:: gem.gemlang.ast.ast_base
:parts: 1
.. automodule:: gem.gemlang.ast.ast_base
:members:
:undoc-members:
:show-inheritance:
gem.gemlang.ast.ast\_expression module
--------------------------------------
gem.gemlang.ast.ast\_expression
-------------------------------
.. inheritance-diagram:: gem.gemlang.ast.ast_expression
:parts: 1
.. automodule:: gem.gemlang.ast.ast_expression
:members:
:undoc-members:
:show-inheritance:
gem.gemlang.ast.ast\_statement module
-------------------------------------
gem.gemlang.ast.ast\_statement
-------------------------------
.. inheritance-diagram:: gem.gemlang.ast.ast_statement
:parts: 1
.. automodule:: gem.gemlang.ast.ast_statement
:members:
......
......@@ -43,6 +43,21 @@ gem.gemlang.symbol module
:undoc-members:
:show-inheritance:
gem.gemlang.symbol\_resolve module
----------------------------------
.. automodule:: gem.gemlang.symbol_resolve
:members:
:undoc-members:
:show-inheritance:
gem.gemlang.inject\_data module
-------------------------------
.. automodule:: gem.gemlang.inject_data
:members:
:show-inheritance:
gem.gemlang.tf\_output module
-----------------------------
......
......@@ -22,6 +22,7 @@ import os
from lark import Lark
from gem import gemlang
from .ast_base import ASTNode
def make_ast_parser():
glpath = os.path.dirname(inspect.getfile(gemlang))
......
from gem.gemlang import ASTWalker
from gem.gemlang.ast.ast_base import NullNode
from gem.gemlang.ast.ast_expression import IdRef
from gem.gemlang.ast.ast_statement import AssignExpr
from gem.gemlang.symbol import PlaceholderSymbol
from gem.gemlang.tf_output import convert_to_maths_layer
class InjectData(ASTWalker):
    """InjectData does three things:
    1. It converts raw data to tf.Tensor
    2. It looks up data placeholders in `data`, removing declaration nodes from AST
    3. It looks up random variables in the `data` dictionary, adding them as a 'value' field if observations exist
    :param symtab: a SymbolTable instance
    :param data: a dictionary of data to inject into the model
    :return a tuple of (ASTNode, SymbolTable)
    """

    def __init__(self, data):
        super().__init__()
        self.__data = convert_to_maths_layer(data)

    def onExit_GEMProgram(self, ast_node):
        # N.B. global scope is pushed by __init__
        children = ast_node.children
        for k, v in self.__data.items():
            children.insert(0, AssignExpr(IdRef(k),
                                          IdRef(f"gem_external['{k}']")))
        ast_node.children = children
        return ast_node

    def onExit_AssignExpr(self, assign):
        lhs = assign.children[0]
        rhs = assign.children[1]
        if isinstance(rhs.symbol, PlaceholderSymbol):
            # ensure lhs is represented in the data
            if not lhs.value in self.__data.keys():
                raise KeyError(
                    f"Data for Placeholder '{lhs.value}' not supplied at line {lhs.meta.line} column {lhs.meta.column}")
            # TODO ensure types match
            return NullNode("Placeholder removed")

    def onExit_StochasticAssignExpr(self, assign):
        lhs = assign.children[0]
        rhs = assign.children[1]
        if lhs.value in self.__data.keys():
            raise SyntaxError(
                f"Redefinition of symbol '{lhs.value}' as constant data")
\ No newline at end of file
......@@ -3,11 +3,8 @@
from collections import namedtuple
from gem.gemlang import ASTWalker
from gem.gemlang.ast.ast_base import NullNode
from gem.gemlang.ast.ast_statement import AssignExpr
from gem.gemlang.ast.ast_expression import Call, KwArg, IdRef
from gem.gemlang.ast.ast_expression import Call, KwArg
from gem.gemlang.symbol import *
from gem.gemlang.tf_output import convert_to_maths_layer
class SemanticAnalysis(ASTWalker):
......@@ -163,49 +160,6 @@ class ResolveSymbols(SemanticAnalysis):
ast_node.symbol = sym
class InjectData(ASTWalker):
    """InjectData does three things:
    1. It converts raw data to tf.Tensor
    2. It looks up data placeholders in `data`, removing declaration nodes from AST
    3. It looks up random variables in the `data` dictionary, adding them as a 'value' field if observations exist
    :param symtab: a SymbolTable instance
    :param data: a dictionary of data to inject into the model
    :return a tuple of (ASTNode, SymbolTable)
    """

    def __init__(self, data):
        super().__init__()
        self.__data = convert_to_maths_layer(data)

    def onExit_GEMProgram(self, ast_node):
        # N.B. global scope is pushed by __init__
        children = ast_node.children
        for k, v in self.__data.items():
            children.insert(0, AssignExpr(IdRef(k),
                                          IdRef(f"gem_external['{k}']")))
        ast_node.children = children
        return ast_node

    def onExit_AssignExpr(self, assign):
        lhs = assign.children[0]
        rhs = assign.children[1]
        if isinstance(rhs.symbol, PlaceholderSymbol):
            # ensure lhs is represented in the data
            if not lhs.value in self.__data.keys():
                raise KeyError(
                    f"Data for Placeholder '{lhs.value}' not supplied at line {lhs.meta.line} column {lhs.meta.column}")
            # TODO ensure types match
            return NullNode("Placeholder removed")

    def onExit_StochasticAssignExpr(self, assign):
        lhs = assign.children[0]
        rhs = assign.children[1]
        if lhs.value in self.__data.keys():
            raise SyntaxError(
                f"Redefinition of symbol '{lhs.value}' as constant data")
RandomVariableList = namedtuple("RandomVariableList", ['free', 'observed'])
......
......@@ -29,7 +29,8 @@ from tensorflow_probability import edward2 as ed
from gem.gemlang import gemparse
from gem.gemlang.model_generator import CodeGenerator
from gem.gemlang.symbol_resolve import ParseDeclarations, ResolveSymbols, \
InjectData, get_random_vars
get_random_vars
from gem.gemlang.inject_data import InjectData
from gem.model.edward2_extn import make_value_setter, TransformedRVBijector
......