23rd November 2020
Let's build a working template language from scratch in less than 300 lines of Python. Our finished language will look like Jinja, although it will be a lot simpler. Here's a sample snippet:
<ul>
    {% for post in posts %}
        <li><a href="{{ post.url }}">{{ post.title }}</a></li>
    {% endfor %}
</ul>
Jinja is styled on Django so let's complete the circle and call our new language Djindjo. You can find a full working implementation on GitHub.
We should start by describing the language we're going to build. Our language will support three kinds of template markup: comment tags, print tags, and instruction tags.
Comment tags are the simplest kind; they look like this:
{# This is a comment. #}
Print tags print the value of a variable, using dot-syntax to drill into nested dictionaries and objects. They look like this:
{{ post.title }}
Instruction tags are the most complex; they implement looping and conditional logic.
Loops use a {% for %} tag and look like this:
{% for post in posts %}
    <p>{{ post.title }}</p>
{% endfor %}
Conditions use an {% if %} tag and look like this:
{% if post.title %}
    <h1>{{ post.title }}</h1>
{% else %}
    <h1>Missing Title</h1>
{% endif %}
(I've assumed in all these examples that we're using our template language to generate HTML but we could use it to generate any kind of text output — it isn't limited to producing web pages.)
So how does our language work? From the user's point of view, very simply. The user creates a Template object from a string, passes in a dictionary of input data, and gets back a rendered output string, just like this:
>>> template = Template('{{foo}} and {{bar}}')
>>> template.render({'foo': 'ham', 'bar': 'eggs'})
'ham and eggs'
>>> template.render({'foo': 1, 'bar': 2})
'1 and 2'
Here's a top-level overview of what's happening under the hood.
A Template object is initialized with a template string — a string containing template tags, possibly loaded from a template file. This input string is passed to a Lexer which transforms it into a list of tokens. These tokens are passed to a Parser which compiles them into a tree of nodes. The Template object stores this node tree for future rendering.

Each node in the tree has a .render() method which takes a Context object containing input data and returns a string. The entire compiled node tree can be rendered by calling .render() on the root node.
Compiling and rendering the node tree are two distinct operations. The template only needs to be compiled once; it can then be rendered multiple times with different context objects.
Time to get coding — here's our Template class:
class Template:
    def __init__(self, template_string):
        self.root_node = Parser(template_string).parse()

    def render(self, data_dict):
        return self.root_node.render(Context(data_dict))
The Parser class is doing most of the heavy lifting here and we'll look at it soon, but before we move on to the compilation stage we need to write the code for two important helper classes. The first is the Context class:
class Context:
    def __init__(self, data_dict):
        self.stack = [data_dict]

    def __setitem__(self, key, value):
        self.stack[-1][key] = value

    def __getitem__(self, key):
        for stack_dict in reversed(self.stack):
            if key in stack_dict:
                return stack_dict[key]
        raise KeyError(key)

    def push(self):
        self.stack.append({})

    def pop(self):
        self.stack.pop()

    def lookup(self, varstring):
        # Resolve a dotted variable name, e.g. "post.title", by trying
        # dictionary-style access first, then attribute access.
        result = self
        for token in varstring.split('.'):
            try:
                result = result[token]
            except:
                try:
                    result = getattr(result, token)
                except:
                    result = None
                    break
        return result
A Context object is a wrapper around our input data — it's basically a stack of dictionaries with the input dictionary at the bottom. We can .push() new dictionaries onto this stack or .pop() the most recent dictionary off the top. (We won't see it in action for a while yet but this stack will allow our templates to safely nest for-loops inside for-loops inside for-loops without any danger of overwriting the loop variables.)
We can write to and read from a context object as if it were a dictionary. When we write to a context object, e.g.

context["foo"] = "bar"

the __setitem__() method writes to the most recent dictionary on the stack. When we read from a context object, e.g.

foo = context["bar"]

the __getitem__() method checks each dictionary on the stack for a matching key, working backwards from the most recent.
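Here's the stack in action, a quick illustrative session using the class above. Pushing a new dictionary lets us shadow a variable without overwriting it:

>>> context = Context({'foo': 'bottom'})
>>> context.push()
>>> context['foo'] = 'top'
>>> context['foo']
'top'
>>> context.pop()
>>> context['foo']
'bottom'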
The .lookup() method contains the lookup-logic for resolving dotted variable names. If we pass it the string "post.title" it will split the input into two tokens, "post" and "title". It will then search its own dictionary stack for an entry with the key "post". If it finds a matching entry it will check that object for either a dictionary key ["title"] or an attribute .title. If it succeeds it will return the associated value; if this process fails at any point it will instead return the value None.
It's worth taking some time to carefully think your way through this method as it's quite dense!
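For example, in an illustrative session using the class above:

>>> context = Context({'post': {'title': 'Hello World'}})
>>> context.lookup('post.title')
'Hello World'
>>> context.lookup('post.missing') is None
True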
Our second helper class is a lot simpler:
class TemplateError(Exception):
    pass
We'll use this exception type to report compilation errors.
Lexing is the first step of the compilation process. A Lexer takes a template string as input and chops it into a stream of tokens which can be consumed by a Parser.
Each token has a type and our lexer will output three different types of token: text tokens for runs of ordinary text, {{ print }} tags, and {% instruction %} tags. We don't need a special token for {# comment #} tags as our lexer will simply discard comments as it finds them.
Here's an example of a tokenized template string, lovingly rendered in ASCII art:
"foo bar {% if varname %}{{ varname }}{% endif %} bar foo" |------||--------------||-----------||---------||------| ^ ^ ^ ^ ^ | | | | | text | print | text instruction instruction
If we take this sample string and pass it into a lexer we can view the token list directly:
>>> string = "foo bar {% if varname %}{{ varname }}{% endif %} bar foo"
>>> for token in Lexer(string).tokenize():
...     print(token)
('text', 'foo bar ')
('instruction', 'if varname')
('print', 'varname')
('instruction', 'endif')
('text', ' bar foo')
You'll notice that the lexer is stripping the tag delimiters from {% instruction %} and {{ print }} tags and keeping only their content.
Here's the code for the Token class:
class Token:
    def __init__(self, token_type, text):
        self.type = token_type
        self.text = text

    def __str__(self):
        return f"({repr(self.type)}, {repr(self.text)})"

    @property
    def keyword(self):
        return self.text.split()[0]
We could have used a simple tuple for tokens instead of a custom class but I wanted to add a .keyword property as a convenience for accessing the first word of an {% instruction %} tag.
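For example, in an illustrative session:

>>> token = Token("instruction", "for post in posts")
>>> token.keyword
'for'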
The Lexer class is the single bulkiest piece of code in the whole library so we'll take it in sections. Here's the most important part:
class Lexer:
    def __init__(self, template_string):
        self.template_string = template_string
        self.tokens = []
        self.index = 0

    def tokenize(self):
        while self.index < len(self.template_string):
            if self.match("{#"):
                self.read_comment_tag()
            elif self.match("{{"):
                self.read_print_tag()
            elif self.match("{%"):
                self.read_instruction_tag()
            else:
                self.read_text()
        return self.tokens

    def match(self, target):
        if self.template_string.startswith(target, self.index):
            return True
        return False
The .tokenize() method is going to step through the template string one character at a time, reading off tokens as it goes and appending them to the .tokens list. Let's look at the comment-processing method first:
    def read_comment_tag(self):
        self.index += 2
        while self.index < len(self.template_string) - 1:
            if self.match("#}"):
                self.index += 2
                return
            self.index += 1
        raise TemplateError("unclosed comment tag")
Once we've matched on an opening comment delimiter {# we step through the string character-by-character looking for the matching closing delimiter #}. We don't add a token to the .tokens list for comments; instead we simply 'step over' them and keep moving forward. If we reach the end of the string without finding a matching closing delimiter we raise a TemplateError exception.
Our error message should really include the line number of the opening delimiter to make debugging easier for the user — I'm omitting that kind of bookkeeping code here for simplicity but you can add it yourself if you intend to use your template language for any serious work.
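As a rough sketch of what that bookkeeping might look like (this line_number() helper is hypothetical, not part of the library), you could convert a string offset into a line number by counting the newlines that precede it:

    # Hypothetical helper for the Lexer class: convert a string offset
    # into a 1-indexed line number by counting preceding newlines.
    def line_number(self, index):
        return self.template_string.count("\n", 0, index) + 1

You'd record the position of the opening delimiter before scanning (e.g. in a start_index variable, as the print and instruction readers below already do) and then report something like f"unclosed comment tag on line {self.line_number(start_index)}".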
The code for reading {{ print }} tags is very similar:
    def read_print_tag(self):
        self.index += 2
        start_index = self.index
        while self.index < len(self.template_string) - 1:
            if self.match("}}"):
                text = self.template_string[start_index:self.index].strip()
                self.tokens.append(Token("print", text))
                self.index += 2
                return
            self.index += 1
        raise TemplateError("unclosed print tag")
This time we create a Token and append it to the .tokens list. (The token's .type attribute is a simple string, "print". You might think about using an enum here instead, which would add extra type-safety at the cost of increasing the complexity of the code.)
The code for reading {% instruction %} tags is almost identical to the code above:
    def read_instruction_tag(self):
        self.index += 2
        start_index = self.index
        while self.index < len(self.template_string) - 1:
            if self.match("%}"):
                text = self.template_string[start_index:self.index].strip()
                self.tokens.append(Token("instruction", text))
                self.index += 2
                return
            self.index += 1
        raise TemplateError("unclosed instruction tag")
This time we create an "instruction" token and append it to the .tokens list.
Only one more method to go. We want to create a "text" token for each sequence of ordinary text, i.e. text that isn't wrapped in tag delimiters:
    def read_text(self):
        start_index = self.index
        while self.index < len(self.template_string):
            if self.match("{#") or self.match("{{") or self.match("{%"):
                break
            self.index += 1
        text = self.template_string[start_index:self.index]
        self.tokens.append(Token("text", text))
That's it, our lexer is complete. We can now pass our list of tokens over to the parser for compilation.
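As a quick sanity check, here's an illustrative session showing a comment vanishing from the token stream entirely:

>>> for token in Lexer("foo {# a comment #} bar").tokenize():
...     print(token)
('text', 'foo ')
('text', ' bar')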
A Parser object takes a stream of tokens as input and compiles them into an AST or abstract syntax tree. This tree of nodes is the compiled template — it can be rendered into an output string by calling the root node's .render() method and passing in a Context object of data.
We'll look at the code for the Parser class first and then at the individual Node classes which make up the compiled tree.
class Parser:
    keywords = {
        "if": (IfNode, "endif"),
        "else": (ElseNode, None),
        "for": (ForNode, "endfor"),
    }
    endwords = ["endif", "endfor"]

    def __init__(self, template_string):
        self.template_string = template_string

    def parse(self):
        stack = [Node()]
        expecting = []
        for token in Lexer(self.template_string).tokenize():
            if token.type == "text":
                stack[-1].children.append(TextNode(token))
            elif token.type == "print":
                stack[-1].children.append(PrintNode(token))
            elif token.keyword in self.keywords:
                node_class, endword = self.keywords[token.keyword]
                node = node_class(token)
                stack[-1].children.append(node)
                if endword:
                    stack.append(node)
                    expecting.append(endword)
            elif token.keyword in self.endwords:
                if len(expecting) == 0:
                    raise TemplateError(f"unexpected {token.keyword}")
                elif expecting[-1] != token.keyword:
                    raise TemplateError(f"expected {expecting[-1]}, found {token.keyword}")
                else:
                    stack[-1].process_children()
                    stack.pop()
                    expecting.pop()
            else:
                raise TemplateError(f"illegal instruction '{token.keyword}'")
        if expecting:
            raise TemplateError(f"expecting {expecting[-1]}")
        return stack.pop()
The .parse() method is the most complex code in the library so take your time stepping through it.
The parse stack is the magic ingredient that lets our template language support nested {% instruction %} tags. When we create the stack it contains a single node, the root node. We haven't met the Node class yet but each node maintains a list of child nodes.
As the parser steps through the token stream it creates new nodes and appends them as children to the node currently on top of the stack — this is the node referenced by the stack[-1] expression.
Text tokens and print tokens are easy to deal with — the parser simply creates a new TextNode or PrintNode and adds it as a child to the node on top of the stack.
Instruction tokens are more complex.
When the parser finds an instruction token it creates an instruction node and adds it as a child to the node on top of the stack as before. But it also needs to check whether the instruction is atomic or has block scope, i.e. whether it can contain nested content.
If the instruction has block scope (indicated by the existence of an associated end word, e.g. endif for if), the parser pushes the instruction node onto the stack where it becomes the new top node, ready to accept future children.
The parser also needs to look out for tokens containing end words. When it finds an end word it pops the top node off the stack as the end of its scope has been reached. The expecting stack keeps track of all the nodes that are currently in scope so the parser can raise a TemplateError if it finds an instruction token with an unexpected end word.
When the parser reaches the end of the token stream the expecting stack should be empty. If it is, the parser pops the root node from the parse stack and returns it. The template has been successfully compiled.
Here's the code for our base Node class:
class Node:
    def __init__(self, token=None):
        self.token = token
        self.children = []

    def render(self, context):
        return "".join(child.render(context) for child in self.children)

    def process_children(self):
        pass
The root node uses this class directly while our other node classes inherit from it. (The version of this class in the demonstration code contains some extra methods for pretty-printing the node tree which I'm omitting here for brevity.)
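As a minimal sketch of what such a pretty-printing method might look like (this to_str() helper is my own illustration, not necessarily what the demonstration code does):

    def to_str(self, depth=0):
        # Indent each node's class name by its depth in the tree.
        lines = ["    " * depth + self.__class__.__name__]
        for child in self.children:
            lines.append(child.to_str(depth + 1))
        return "\n".join(lines)

Calling print(Parser('{% if x %}{{ x }}{% endif %}').parse().to_str()) would then display the compiled tree as indented class names.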
The parser calls each block-scoped node's .process_children() method just before popping it off the stack. We'll override this stub method in the IfNode class to implement if/else branching.
The TextNode subclass is extremely simple:
class TextNode(Node):
    def render(self, context):
        return self.token.text
As is the PrintNode subclass:
class PrintNode(Node):
    def render(self, context):
        value = context.lookup(self.token.text)
        if value is None:
            return ""
        return str(value)
The IfNode subclass is more complex:
import re

class IfNode(Node):
    regex = re.compile(r"^if\s+([\w.]+)$")

    def __init__(self, token):
        super().__init__(token)
        if (match := self.regex.match(token.text)):
            self.arg_string = match.group(1)
        else:
            raise TemplateError("malformed {% if %} tag")
        self.true_branch = Node()
        self.false_branch = Node()

    def render(self, context):
        value = context.lookup(self.arg_string)
        if value:
            return self.true_branch.render(context)
        else:
            return self.false_branch.render(context)

    def process_children(self):
        branch = self.true_branch
        for child in self.children:
            if isinstance(child, ElseNode):
                branch = self.false_branch
            else:
                branch.children.append(child)
To keep this tutorial to a manageable length our {% if %} tag will only support a simple truthiness test, resolving a variable name and checking if Python considers the resulting object to be true or false.
We use a regular expression in the constructor to extract this variable name while simultaneously verifying that the tag's content is well-formed. If it isn't, we raise a TemplateError.
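Here's the regex in action, in an illustrative session:

>>> IfNode.regex.match("if post.title").group(1)
'post.title'
>>> IfNode.regex.match("if post title") is None
True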
We override the base class's stub .process_children() method to implement if/else branching, splitting the node's children into a .true_branch and a .false_branch by checking for the presence of an ElseNode corresponding to an (optional) {% else %} tag.
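Putting it all together, in an illustrative session:

>>> template = Template('{% if name %}Hello, {{ name }}!{% else %}Hello, stranger!{% endif %}')
>>> template.render({'name': 'Alice'})
'Hello, Alice!'
>>> template.render({})
'Hello, stranger!'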
The ElseNode itself is an empty subclass:

class ElseNode(Node):
    pass
We could expand this technique to add support for chained {% elif %} tags, although I'm going to leave that task as an exercise for the interested reader.
Our final node subclass is the ForNode:
class ForNode(Node):
    regex = re.compile(r"^for\s+(\w+)\s+in\s+([\w.]+)$")

    def __init__(self, token):
        super().__init__(token)
        if (match := self.regex.match(token.text)):
            self.var_string = match.group(1)
            self.arg_string = match.group(2)
        else:
            raise TemplateError("malformed {% for %} tag")

    def render(self, context):
        output = []
        collection = context.lookup(self.arg_string)
        if hasattr(collection, '__iter__'):
            context.push()
            for item in collection:
                context[self.var_string] = item
                output.append("".join(child.render(context) for child in self.children))
            context.pop()
        return "".join(output)
Once again we use a regular expression in the constructor to verify that the tag is well-formed and to extract our variable names.
In the .render() method we finally get to use our Context object's dictionary stack! We push a new dictionary onto the stack before running the loop and pop it off again immediately after. This ensures that our loop variable will only be visible within the loop itself and that any existing variables with the same name will be shadowed for the duration of the loop but not overwritten.
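Here's that shadowing behaviour in an illustrative session. Inside the loop the variable x masks the input value of the same name, and afterwards the original value is still intact:

>>> template = Template('{% for x in items %}{{ x }} {% endfor %}{{ x }}')
>>> template.render({'items': ['a', 'b'], 'x': 'untouched'})
'a b untouched'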
That's it, our template language is complete! It's simple but fully functional and potentially quite useful. Sometimes a simple language that you can customize to suit your own specific use case is more valuable than a sophisticated but general-purpose tool.
I'm sure you can think of lots of ways to expand and improve our language. Making its error messages more descriptive by adding line numbers would be a great start, but here are a couple of other ideas:
You could think about adding support for Django-style filters in print tags, e.g. {{ post.title|escape }}.

You could think about adding an {% include %} instruction for loading nested templates, e.g. {% include sidebar %}.
When I built my own template language, Ibis, the most useful construct I added was a chaining fallback operator for print tags which prints the first expression to evaluate as truthy, e.g.
{{ post.meta_title || post.title || "Missing Title" }}
If you decide to build your own template language you'll probably find Python's ast.literal_eval() function extremely useful. It can parse simple literal expressions like strings and numbers, or even more complex structures like tuples and lists.
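For example:

>>> import ast
>>> ast.literal_eval('(1, 2.5, "three")')
(1, 2.5, 'three')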