05.08.2022 - Parsing/Recursive Descent Parser

All posts

Posted On 05.08.2022

A parser is a program that usually takes a stream of lexical tokens and transforms them into another data structure, usually in the form of a parse tree that satisfies the language’s grammar rules.

What is a Recursive Descent Parser?

Recursive Descent Parser is a top-down parser where every non-terminal in the BNF grammar is a subroutine. The parser works by recursively calling each subroutine to construct the parsed output. It’s not the only algorithm to implement a parser, but it’s one of the most simple ones that are very easy to understand and implement.

For example, let’s say we have a grammar to parse money amount in USD, GBP, and EUR. The money amount should be written in the form of <currency_symbol> <amount>, like $100:

money           = currency_symbol amount ;
currency_symbol = '$' | '£' | '€' ;
amount          = INTEGER ;

The grammar has three non-terminals: money, currency_symbol, and amount. When implemented, we should also implement three parsing methods: parse_money(), parse_currency_symbol(), and parse_amount(). Each of the parsing methods will call each other just like how they’re related in the grammar rules:

type ParseResult<T> = Result<T, ParseError>;
 
impl<'a> Parser<'a> {
    /* amount = INTEGER ; */
    fn parse_amount(&mut self) -> ParseResult<i32> {
        ...
    }
 
    /* 	currency_symbol = '$' | '£' | '€' ; */
    fn parse_currency_symbol(&mut self) -> ParseResult<Currency> {
        ...
    }
 
    /* money = currency_symbol amount ; */
    fn parse_money(&mut self) -> ParseResult<MoneyNode> {
        let currency = self.parse_currency_symbol()?;
        let amount = self.parse_amount()?;
        return Ok(MoneyNode { currency, amount });
    }
}

Implementing a Money Parser

Let’s dig deeper into the above example. We will focus on the parser. Let’s assume that we already have a lexer that converts an input string like "$100" into a list of tokens.

Data Structures

In this program, we have two types of tokens: the CurrencySymbol token and the Number token:

#[derive(Debug, PartialEq, Clone, Copy)]
enum TokenType {
    CurrencySymbol,
    Number
}
 
#[derive(Debug)]
struct Token<'a> {
    token_type: TokenType,
    content: &'a str
}
 
impl<'a> Token<'a> {
    pub fn new(token_type: TokenType, content: &'a str) -> Self {
        Self {
            token_type,
            content
        }
    }
}

The input for our parser is a Vec<Token> that looks like this:

// List of tokens for the string "£128"
let tokens = vec![
    Token::new(TokenType::CurrencySymbol, "£"),
    Token::new(TokenType::Number, "128")
];

The output of our parser is a data structure called MoneyNode:

#[derive(Debug, PartialEq)]
enum Currency {
    USD,
    GBP,
    EUR
}
 
#[derive(Debug, PartialEq)]
struct MoneyNode {
    currency: Currency,
    amount: i32
}

Error Handling

For this grammar, there are two types of errors that could happen during parsing:

  1. Unexpected token error: when the parser found a token that was misplaced.
  2. Invalid amount error: when the parser could not parse the amount number.

We will create a custom error type and a Result<T> type to handle these two errors:

#[derive(Debug, PartialEq)]
enum ParseError {
    UnexpectedToken(TokenType, TokenType),
    InvalidAmount
}
 
impl Display for ParseError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::UnexpectedToken(expected, found) =>
                write!(f, "Unexpected Token: Expected {:?}. Found {:?}.", expected, found),
            Self::InvalidAmount =>
                write!(f, "Invalid Amount!"),
        }
    }
}
 
type ParseResult<T> = Result<T, ParseError>;

The Parser and some utility methods

Now, let’s create the parser. It takes a list of input tokens and uses a pos variable to keep track of the current token.

struct Parser<'a> {
    tokens: Vec<Token<'a>>,
    pos: usize
}

We will implement some utility methods to control the input token stream, like:

  1. is_eof(): to check if we’re at the end of the token stream or not
  2. peek(): to get the current token
  3. is_match(): to check if the current token matched the expected type or not
  4. advance(): to consume the current token and move on to the next

These methods are not exclusive to a recursive descent parser, but they’re very helpful, as they keep the actual parsing code looks clean:

impl<'a> Parser<'a> {
    pub fn new(tokens: Vec<Token<'a>>) -> Self {
        Self {
            tokens,
            pos: 0
        }
    }
 
    fn is_eof(&self) -> bool {
        self.pos >= self.tokens.len()
    }
 
    fn peek(&self) -> &Token {
        &self.tokens[self.pos]
    }
 
    fn is_match(&self, token_type: TokenType) -> bool {
        !self.is_eof() && self.peek().token_type == token_type
    }
 
    fn advance(&mut self) {
        self.pos += 1;
    }
}

The Recursive Descent Parser

Now, the main part of the parser that we are waiting for. First, let’s write a parser for the amount rule of the grammar:

// Grammar:
//   amount = INTEGER ;
fn parse_amount(&mut self) -> ParseResult<i32> {
    let token = self.peek();
    if self.is_match(TokenType::Number) {
        let result = token.content.parse::<i32>()
                          .map_err(|_| ParseError::InvalidAmount);
        self.advance();
        return result;
    }
    Err(ParseError::UnexpectedToken(
            TokenType::Number,
            token.token_type
        )
    )
}

We simply check if the current token is a Number token or not and parse the content of this token into an i32 number. In this method, we can see the usage of both the InvalidAmount and the UnexpectedToken errors.

Next, we will write the parser for the currency_symbol rule:

// Grammar:
//   currency_symbol = '$' | '£' | '€' ;
fn parse_currency_symbol(&mut self) -> ParseResult<Currency> {
    let token = self.peek();
    if self.is_match(TokenType::CurrencySymbol) {
        let currency_symbol = match token.content {
            "$" => Currency::USD,
            "£" => Currency::GBP,
            _ => Currency::EUR
        };
        self.advance();
        return Ok(currency_symbol);
    }
    Err(ParseError::UnexpectedToken(
            TokenType::CurrencySymbol,
            token.token_type
        )
    )
}

Now that we have the parser for both the currency_symbol and amount rules. The last step is to write the parser for the money rule, it is implemented the same way the money rule is written. We will call the currency_symbol parser, then call the amount parser.

None of the above parsers will return any error value for valid input. Their return value can be combined to create the output MoneyNode object:

// Grammar:
//   money = currency_symbol amount ;
fn parse_money(&mut self) -> ParseResult<MoneyNode> {
    let currency = self.parse_currency_symbol()?;
    let amount = self.parse_amount()?;
    return Ok(MoneyNode {
        currency,
        amount
    });
}

And that’s it! We have already finished our parser!

Test the parser

Now, let’s write some tests to see if the parser works or not. First, in a happy path, we will pass a valid token list and expect to see a valid output:

#[test]
fn test_parse_usd() {
    let tokens = vec![
        Token::new(TokenType::CurrencySymbol, "$"),
        Token::new(TokenType::Number, "512")
    ];
    let mut parser = Parser::new(tokens);
    assert_eq!(parser.parse_money(), Ok(MoneyNode {
        currency: Currency::USD,
        amount: 512
    }))
}

Of course, if the input currency is EUR instead of USD, the parser should return the correct value:

#[test]
fn test_parse_eur() {
    let tokens = vec![
        Token::new(TokenType::CurrencySymbol, "€"),
        Token::new(TokenType::Number, "9372")
    ];
    let mut parser = Parser::new(tokens);
    assert_eq!(parser.parse_money(), Ok(MoneyNode {
        currency: Currency::EUR,
        amount: 9372
    }))
}

Don’t forget some unhappy paths, the parser should return an Err value if any of the parsing steps fails:

#[test]
fn test_parse_unexpected_token() {
    let tokens = vec![
        Token::new(TokenType::Number, "512"),
        Token::new(TokenType::CurrencySymbol, "$"),
    ];
    let mut parser = Parser::new(tokens);
    assert_eq!(parser.parse_money(), Err(
        ParseError::UnexpectedToken(
            TokenType::CurrencySymbol,
            TokenType::Number
        )
    ))
}
 
#[test]
fn test_parse_invalid_amount() {
    let tokens = vec![
        Token::new(TokenType::CurrencySymbol, "$"),
        Token::new(TokenType::Number, "3rr0r"),
    ];
    let mut parser = Parser::new(tokens);
    assert_eq!(parser.parse_money(), Err(ParseError::InvalidAmount))
}

Run the test with the cargo test command, and you should see all tests are passed:

running 5 tests
test test_parse_eur ... ok
test test_parse_gbp ... ok
test test_parse_invalid_amount ... ok
test test_parse_usd ... ok
test test_parse_unexpected_token ... ok
 
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

You can see the complete source code of the parser in this gist.


In this article, we learned what a Recursive Descent Parser is and how to implement the parser for each grammar rule, which serves as building blocks for each other. In the next article, we will look at a more complex parser for solving arithmetic expressions, which will give us a better look at the recursive characteristics of the Recursive Descent Parsing technique.