02.12.2022 - Parsing/Backus-Naur Form and variants

All posts

Posted On 02.12.2022

Backus-Naur Form (BNF) is a metasyntax for context-free grammars, is used to describe the syntax of languages used in computing such as programming languages, instruction sets,…

The BNF syntax

The specification of BNF is quite simple, it’s a set of derivation rules written as:

<symbol> ::= expression1 | expression2

Where:

  • <symbol> is the nonterminal, they are always enclosed between the pair <>
  • expressions consists of one or more sequences of either terminals or nonterminals.
  • ::= mean the symbol on the left must be replaced by the expression on the right.
  • If a nonterminal can be replaced by multiple expressions, they can be separated by the vertical bar “|”

For example, the following grammar define a signed integer number:

<signed integer> ::= <integer> | <sign> <integer>
<integer> ::= <digit> | <integer> <digit>
<digit>  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<sign>   ::= -

The Extended BNF syntax

There are many variants of BNF, one of the most popular variants is Extended BNF (EBNF). It added some other symbols so we can explain the grammar better, like the sequence symbol (,), repeated symbol ({}), optional symbol ([]),…

In EBNF, nonterminal does not necessarily enclose between the pair <>, and the definition symbol ::= can be written as =. Terminals are strictly enclosed within the quotation marks ("" or ''). Each production rule is terminated by the semicolon ;, and they can contain comments ((* *)) too:

(* this is a comment *)
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

A concatenation symbol is used to express the sequence of terminals, each separated by a comma (,), for example:

twelve = "1", "2" ;
ten = "1", "0" ;
one ten = "1", ten ;

If a symbol needed to be repeated (zero or more times), we can use {} symbols. For example, the following definition matches all powers of ten, including 10, 100, 1000, 10000,…

powers of ten = one, zero, { zero } ;
one  = "1" ;
zero = "0" ;

Optional can be expressed by [], everything that is optional can be present just once or not at all, for example:

signed digit = [ "-" ], digit ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;