# Pattern matching

## Introduction

Pattern matching is an integral part of a computation toolkit. It allows to isolate and modify parts of an expression with replacement rules. A properly broad pattern matcher is effectively its own Turing complete programming language!

For example, Conway’s game of life can be completely emulated by pattern matching, where each cell together with its surrounding cells determined the future value of the cell.

Other common uses of pattern matching is in regular expression (regex), widely used for matching patterns in text. The following pattern

`^[a-zA-Z0-9]+@gmail\.com$`

matches e-mail addresses with at least one alphanumerical character and ending in `@gmail.com`

.

To faithfully parse an e-mail address, a horrendous pattern match is required!

Pattern matching is also ubiquitous in mathematics. For example, the recursive definition of the factorial function

\[ \begin{align} f(0) &= 1\\ f(n) &= n f(n - 1);\quad n > 0 \end{align} \]

is a pattern match. First, the pattern \(f(0)\) is tested. If it is matched, it is replaced by 1. Then the pattern \(f(n)\) is matched, where \(n\) needs to be positive and it is replaced by \(n f(n-1)\), where the \(n\) takes the value of the \(n\) matched in the pattern.

Symbolica has rich support for pattern matching. Let us see some examples where we match and replace all literal parts of the expression.

In this part we use the Python API, with shorthand notation. We match \(x\) in the expression \(x+f(x) + 5\) and replace it by 6, using use replace_all.

```
from symbolica import *
= S('x', 'f')
x, f = x + f(x) + 5
e = e.replace_all(x, 6)
e print(e)
```

We get `11 + f(6)`

To replace `f(x)`

by `6`

, we change the pattern:

`= e.replace_all(f(x), 6) e `

which yields `11 + x`

.

To replace multiple patterns at the same time, use replace_all_multiple. For example:

```
= S('x', 'y', 'f')
x, y, f = f(x, y).replace_all_multiple([Replacement(x, y), Replacement(y, x)])
k print(k)
```

replaces `f(x,y)`

by `f(y,x)`

.

### Wildcards

Let us now try to replicate the factorial function pattern match mentioned above. For that we need to be able to match to any numerical value in the function `f`

and we do not know that value in advance. Writing out an explicit rule for every desired number, e.g., `f(40) = 40*f(39)`

is possible, but tedious. Instead, we use a *wildcard* that can match any subexpression. The specific rules will be explained later.

A wildcard is defined as a variable ending in a `_`

(note that the name of the Python variable does not matter). We define a wildcard `n_`

and write the rule:

```
from symbolica import *
= S('n_', 'f')
n_, f = f(4)
e = e.replace_all(f(0), 1)
e = e.replace_all(f(n_), n_*f(n_-1))
e print(e)
```

This is almost perfect; the pattern still matches to negative numbers, and non-numbers. We can restrict the pattern by adding a third argument to `replace_all`

:

`= e.replace_all(f(n_), n_*f(n_-1), n_.req_gt(0)) e `

## Rules

We now set out all the rules for when patterns match.

### Syntactic matching

Patterns match syntactically / structurally, which means that the patten must appear explicitly in the tree representation of the expression. For example, the pattern \(x^2\) is not found in the expression \((x+1)^2\), even though it appears in the mathematically equivalent (but structurally different) expression \(x^2+2x+1\).

The pattern does not necessarily have to appear verbatim in the tree structure though. Symbolica uses mathematical properties such as the symmetry of the multiplication and addition operator, and of symmetric functions, to match atoms in unordered fashion. For example \(x + z\) matches \(x+y+z\), even though \(y\) is “in between” in the normal form.

Since matches on products, sums and symmetric functions are unordered, in order to establish that there is no match, in the worst case all permutations or subsets have to be tested. This could be expensive. Providing constraints on the pattern, especially length constraints, will help reduce the cost.

### Wildcards

Wildcards are variables ending in one or more underscores `_`

. The number of underscores signify various restrictions on the wildcard length. Wildcards can match any subexpression and will always include the operator it matches. For example, matching `x___`

(no length restrictions) on `a*b*c`

yields the following options:

Note that the empty set is a valid option. Because of the rule of syntactic matching, the pattern `y + x___`

is not present in the expression `y`

even though `x___`

can be the empty set.

The cases where one or zero sub-atoms are matched is subject to *downcasting*. For example,

is transformed to

which means that the atom type of `x___`

is no longer `Mul`

but the type of `a`

. The atom type of the empty match is `Empty`

. Downcasting ensures that repeated patterns such as `x_*f(x_)`

match to `x*f(x)`

even though the first `x_`

matches `Mul Var x`

and the second one `Arg Var x`

.

Function arguments behave analogously to the case of `Mul`

described above, except that the order is preserved if the function is non-symmetric. The pattern matcher interprets function arguments as an operator of type `Arg`

. For example, when matching `f(x___)`

to `f(a,b,c)`

the pattern matcher views the arguments as

When a wildcard that matches multiple arguments is used outside of a function context on the right-hand side of a replacement, the matched arguments will be wrapped in the internal function `arg`

. For example:

`= f(1,2).replace_all(f(x___), x___) e `

yields `arg(1,2)`

. An `arg`

function that is placed into a function context again, is flattened during normalization, e.g., `f(arg(1,2)) = f(1,2)`

.

### Repeated wildcards

A repeated wildcard means that the matched value must evaluate to exactly the same atoms. For example:

```
= x*y*f(x*y)
e = e.replace_all(x___*f(x___), 1) e
```

matches because `x___`

matches to `x*y`

outside of the function and inside the function. See Examples for more complicated examples.

### Precedence rules

The expression that is tested first is the root node with all its children. If that does not match, the subset of the root node with one sub-atoms removed is tested. If no match is found, two sub-atoms are removed, etc. Once all subsets are tested, the pattern matcher repeats the process for each of the sub-atoms as the new root.

For example, the pattern `x_`

will match any subexpression in any expression, so it will match the root first:

```
= x*y*z
e = e.replace_all(x_, 1) e
```

will yield `1`

, since `x_`

matches `x*y*z`

.

In the following case the pattern matches at the root level, so a match is returned, even though the child would have matched too:

```
= f(f(x))
e = e.replace_all(f(x_), f(1)) e
```

will yield `f(1)`

, since `x_`

matches `f(x)`

.

In the following case the root did not match, so the sub-atom is matched:

```
= g(f(x))
e = e.replace_all(f(x_), f(1)) e
```

will yield `g(f(1))`

, since `x_`

matches `f(x)`

.

`replace_all`

will always replace the first match found for every non-overlapping match. To iterate matches, see Match iteration.

### Ordering

A wildcard will match to a given set of atoms only once and it will have the order of the normal form. This means that for a symmetric function `f(1,2,3)`

, `f(x___)`

will generate one match, `x___ = 1,2,3`

, and it will never permute through all options (e.g. `x___ = 2,3,1`

). Consequently,

```
= S('f')
f = S('fsym', is_symmetric=True)
f_sym = f(3,2,1)*fsym(1,2,3)
e = e.replace_all(f(x___)*fsym(x___), f(1)) e
```

will not match.

## Conditions

Wildcards can be restricted based on arbitrary conditions. Below are some built-in options.

#### Wildcard length

It is possible to filter on the length of the wildcard **before** downcasting. The length is defined as the number of sub-atoms for `Add`

, `Mul`

, and `Arg`

and is 1 for any other atom.

A function has a length of 1 (it counts as one atom) regardless of the number of arguments. This way `x_`

with a wildcard length of 1 matches `f(...)`

regardless of the number arguments.

The number of underscores at the end of the wildcard name determines pre-set length restrictions. We have:

`x_`

: must match a single atom`x__`

: must match one or more atoms`x___`

: must match zero or more atoms

The length of `x__`

and `x___`

can be further restricted. For example, let us restrict the wildcard `x__`

to a length between 2 and 3

```
= f(1).replace_all(f(x__), 1, x__.req_len(2, 3))
e1 = f(1,2).replace_all(f(x__), 1, x__.req_len(2, 3))
e2 = f(1,2,3).replace_all(f(x__), 1, x__.req_len(2, 3)) e3
```

then only `e2`

and `e3`

will yield `1`

.

Even though the wildcard length of `x_`

is restricted to 1 and `x*y`

has a length of 2,

`= f(x*y).replace_all(f(x_, 1)) e1 `

we still get the match `x_ = x*y`

. This is because the length constraint is applied *before* downcasting.

A wildcard length restriction check is applied at any point that a wildcard is used and therefore

`*y*f(x*y).replace_all(x__*f(x__), 1, x__.req_len(2, 2)) x`

does not match, since `x__`

has wildcard length 1 inside the function argument. To restrict the length of the atom matched to `x__`

after downcasting, use a filter function.

##### Type restriction

A wildcard can be restricted based on the atom type.

For example, `x_`

can only match a variable:

`= f(x).replace_all(f(x_), 1, x_.req_type(AtomType.Var)) e1 `

##### Filter function

Custom filters can be added to restrict any wildcard. This is a function receives the wildcard name and its match expression as an argument and should return a boolean. If true, the match is accepted.

```
from symbolica import *
= S('x_', 'f')
x_, f = f(1)*f(2)*f(3)
e = e.replace_all(f(x_), 1, x_.req(lambda m: m == 2 or m == 3)) e
```

Since a Python function is called, this operation has significant overhead. For high-performance code, consider implementing the filter in Rust, or use one of the built-in filters, such as `req_gt`

and `reg_cmp_gt`

that does numerical comparisons in Rust.

##### Comparison function

A more complicated class of custom filters is based on using the values of two wildcards.

```
from symbolica import *
= S('x_', 'y_', 'f')
x_, y_, f = f(1)*f(2)*f(3)
e = e.replace_all(f(x_)*f(y_), 1, x_.req_cmp(y_, lambda m1, m2: m1 + m2 == 4)) e
```

##### Match stack filter

It is also possible to filter patterns based on the currently matched wildcards. For example, here is a filter for an ascending order of 3 variables:

```
= S('f', 'x_', 'y_', 'z_')
f, x_, y_, z_
def filter(m: dict[Expression, Expression]) -> int:
if x_ in m and y_ in m:
if m[x_] > m[y_]:
return -1 # no match
if z_ in m:
if m[y_] > m[z_]:
return -1
return 1 # match
return 0 # inconclusive
= f(1, 2, 3).replace_all(f(x_, y_, z_), 1,
e filter)) PatternRestriction.req_matches(
```

The `filter`

function is called after a change occurs to the currently matched wildcards, so the user has to treat cases where wildcards they want to act on are not yet matched. The function must return an `int`

instead of a `bool`

, as it needs to distinguish the cases of certainly no match (`<0`

), potentially a match (`0`

), or certainly a match (`>0`

). Failing to properly account for the inconclusive case may lead to unexpected pattern matches.

#### Logical conditions

Multiply wildcard conditions can be chained using the **and** (`&`

), **or** (`|`

) and **not** (`~`

in Python and `!`

in Rust) operators. For example:

```
= S('x', 'x_', 'x___', 'y___')
x, x___, x___, y___ = S('f')
f = f(3, x, 2).replace_all(f(x___, x___, y___),
e 3) | x___.req_type(AtomType.Var)) f(x___), x___.req_lt(
```

matches if `x___`

is smaller than 3 or is a variable. The example yields `f(2)`

.

#### Level restrictions

Sometimes a pattern should only match in a function or at a certain depth of the expression tree. This can be set with `level_range`

, which defines the inclusive range for which the pattern should match. For example:

```
= S('x', 'f')
x, f = x+x**2*f(x**2, x, f(x))
r print(r.replace_all(x, 1, level_range=(1, 1)))
```

yields `x+x^2*f(1,1,f(x))`

.

By default, the level signifies the function depth. It can be set to signify the depth in the expression tree by the optional argument `level_is_tree_depth=True`

.

#### Additional settings

Wildcards are *greedy* by default, i.e. they try to be as large as possible. In the example of the previous section we see that the result is `f(2)`

instead of `f(x)`

, because `x___`

is greedy. By passing the setting `non_greedy_wildcards`

we can change this behaviour:

```
= f(3, x, 2).replace_all(f(x___, x_, y___),
e 3) | x_.req_type(AtomType.Var),
f(x_), x_.req_lt(=[x___]) non_greedy_wildcards
```

This code yields `f(x)`

.

## Match iteration

It is possible to iterate through all matches without modifying the expression.

For example, here `f(x_)`

is matched:

```
from symbolica import *
= S('x_', 'f')
x_, f = f(1)+f(2)+f(3)
e
for m in e.match(f(x_)):
print(m[x_])
```

which yields

```
1
2
3
```

The `match`

function returns a dictionary with the wildcard as keys. The `match`

function also yields nested matches:

```
from symbolica import *
= S('x_', 'f')
x_, f = f(f(1))
e
for m in e.match(f(x_)):
print(m[x_])
```

which yields

```
f(1)
1
```

## Replacements

We have already seen many examples of replacing all occurrences of a pattern in this section. An iterator over *single* replacements can be generated as well:

```
from symbolica import *
= S('x_', 'f')
x_, f = f(1)*f(2)*f(3)
e for r in e.replace(f(x_), f(x_ + 1)):
print(r)
```

which yields

```
f(2)*f(2)*f(3)
f(1)*f(3)*f(3)
f(1)*f(2)*f(4)
```

## Transformers

Often the goal of a pattern match and replacement is to isolate a part of the expression and to modify it to something else. Rather than simply allowing one atom to be replaced by another, functions can be executed for the replacement instead. These are called *transformers* (see here for more information). A transformer takes a pattern with wildcards as an argument and executes an operation on it after the wildcards have been substituted.

For example:

```
from symbolica import *
= S('x', 'x_', 'f')
x, x_, f = f((x+1)**2)
e print(e.replace_all(f(x_), f(x_.transform().expand())))
```

yields

`f(x^2+2*x+1)`

## Replace function

The right-hand-side of a replacement can also be a function that maps wildcards to any expression. This can be used to do transformations that are not possible by just using `Expression`

, for example string manipulations. Here is an example:

```
= S('k', 'p', 'id1_', 'v1_', 'idx1_', 'id2_', 'v2_')
k, p, id1_, v1_, idx1_, id2_, v2_
= (k(1, 2)*p(3, 2)).replace_all(id1_(v1_, idx1_)*id2_(v2_, idx1_),
r lambda m: E("{}(mu{})*{}(mu{})".format(m[id1_], m[v1_], m[id2_], m[v2_])))
```

This maps `k(1, 2)*p(3, 2)`

into `k(mu1)*p(mu3)`

## Examples

#### Multiple matches

```
from symbolica import *
= S('x__', 'y__', 'z__', 'w__', 'a__', 'f', 'g')
x_, y_, z_, w_, a_, f, g = f(1,2,3,4)*g(5,2,3,6,7)
e for m in e.match(f(x_,y_,z_)*g(w_,y_,a_)):
print('{}\t{}\t{}\t{}\t{}'.format(m[x_], m[y_], m[z_], m[w_], m[a_]))
```

yields:

```
1 2 arg(3,4) 5 arg(3,6,7)
1 arg(2,3) 4 5 arg(6,7)
arg(1,2) 3 4 arg(5,2) arg(6,7)
```

#### Coefficient extraction

The coefficients of the polynomial `x^2*(y+z) + x^3*(y+z^2)`

in `x`

are matched:

```
from symbolica import *
= S('x', 'y', 'z', 'n_', 'x_')
x, y, z, n_, x_ = x**2*(y+z) + x**3*(y+z**2)
e
# match returns a dictionary that maps every wildcard to its value
= [(m[n_], m[x_]) for m in e.match(x**n_*x_)]
coeff_list for (pow, content) in coeff_list:
print(x**pow, content)
```

#### Unbound wildcards on the right-hand side

Symbolica supports wildcards appearing on the right-hand side, that do not appear in the pattern. This can be used to construct new patterns that depend on – for example – structural information in the expression.

In the example below, we want to find all automorphisms of a small graph, with vertices `v`

and edges `e_i`

. We replace all edges `e[i]`

with a unique wildcard `e[i]_`

and apply the new pattern to the graph we started out with.

```
= S('v', is_symmetric=True)
v = S('e0', 'e1', 'e2', 'e3', 'e4', 'e5', 'e6')
e = S('e0_', 'e1_', 'e2_', 'e3_', 'e4_', 'e5_', 'e6_')
e_ = S('x_')
x_ = v(e[0], e[1], e[4])*v(e[1], e[2], e[5], e[6]) * \
graph 0], e[2], e[3])*v(e[3], e[4], e[5])
v(e[
= graph
graph_pat for edge, edge_pat in zip(e, e_):
= graph_pat.replace_all(
graph_pat =True)
edge, edge_pat, allow_new_wildcards_on_rhs
print('Pattern: ', graph_pat)
for x in graph.match(graph_pat):
for ee in e_:
print('{}={}'.format(ee, x[ee]), end=' ')
print()
```

We obtain 6 automorphisms:

```
Pattern: v(e0_,e1_,e4_)*v(e0_,e2_,e3_)*
v(e3_,e4_,e5_)*v(e1_,e2_,e5_,e6_)
e0_=e0 e1_=e1 e2_=e2 e3_=e3 e4_=e4 e5_=e5 e6_=e6
e0_=e4 e1_=e1 e2_=e5 e3_=e3 e4_=e0 e5_=e2 e6_=e6
e0_=e0 e1_=e2 e2_=e1 e3_=e4 e4_=e3 e5_=e5 e6_=e6
e0_=e3 e1_=e2 e2_=e5 e3_=e4 e4_=e0 e5_=e1 e6_=e6
e0_=e3 e1_=e5 e2_=e2 e3_=e0 e4_=e4 e5_=e1 e6_=e6
e0_=e4 e1_=e5 e2_=e1 e3_=e0 e4_=e3 e5_=e2 e6_=e6
```