Writing a simple lexical analyzer for SQL - Part 2

In this post, we explore the theoretical concepts essential for a precise understanding of the underlying mechanisms of a lexical analyzer.

What is a lexical analyzer?

To start, an example is more illustrative than a lengthy explanation. Therefore, let's consider the following three sentences.

  • Tha ader _guipoi je ?dftre !!klm.
  • Question be not address book this.
  • The cow is writing balloons.

The first sentence is incomprehensible, at least in English and likely in every other language. Moreover, some of its words begin with illegal characters, such as question marks or exclamation points.

The second sentence uses valid English words, but the order in which they are arranged violates English grammar and renders it completely nonsensical.

The third sentence employs valid English words, and their arrangement is grammatically correct. Yet the statement it makes is meaningless: a cow cannot write balloons.

In more formal terms, the first sentence is lexically incorrect, the second is lexically correct but syntactically incorrect, and the last is lexically and syntactically correct but semantically incorrect.

The same principles that apply to a human language extend to programming languages, and the process is identical: a program must use permitted words (lexical analysis), organize them in a proper sequence according to a predefined grammar (syntax analysis), and avoid illogical constructs (semantic analysis). To transpose the analogy to SQL: 'SELECT ¤ FROM Customers' contains an illegal character (lexical error), 'FROM Customers SELECT *' scrambles the word order (syntax error), and a query selecting a column that does not exist is meaningless (semantic error).

In computer science, the three passes outlined above constitute the core components of a compiler and comprise a significant portion of a compilation course. Throughout this series, our focus will be solely on examining the lexical analyzer.

To be remembered

The lexical analyzer is a process that examines the source code of a program and breaks it down into a sequence of tokens. These tokens represent the fundamental building blocks of the language, such as keywords, identifiers, literals, and operators.

In essence, the lexical analyzer plays a crucial role by transforming raw source code into a structured stream of tokens that can be further processed by subsequent stages of a compiler.

How does one perform lexical analysis?

A lexical analyzer starts from a string representing the raw source code written by a programmer (or, in our analogy above, the sentence spoken by a human). The objective is to determine the sequence of corresponding tokens. Throughout this section, we will use the query 'SELECT * FROM Customers' as a running example.

What is a token?

  • Each individual component of the input string, typically separated by whitespace, is commonly referred to as a lexeme. In this query, for instance, 'SELECT', '*', and 'Customers' are distinct lexemes.

  • Each lexeme belongs to a lexical category of the language in question, such as keyword, identifier, operator, or literal. To illustrate, 'SELECT' is a keyword, 'Customers' is an identifier, '=' is an operator, and so forth.

  • A token is the combination of these two pieces of information: the lexeme and its associated category. Consequently, examples of tokens would be (KEYWORD|SELECT) or (IDENTIFIER|Customers). One possible representation in code is sketched right after this list.
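
Here is a minimal sketch of such a representation, assuming a plain string for the category (an enum would work just as well):

class Token
{
    public string Lexeme { get; set; }
    public string Category { get; set; }
}

// (KEYWORD|SELECT) and (IDENTIFIER|Customers) then become:
var select = new Token { Lexeme = "SELECT", Category = "KEYWORD" };
var customers = new Token { Lexeme = "Customers", Category = "IDENTIFIER" };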

It would seem relatively straightforward at first glance to detect keywords, given that their number is finite in each language. The code snippet below, for instance, could accomplish this task, reusing the Token type defined above.

var keywords = new[] { "SELECT", "FROM", "WHERE" };
var input = "SELECT * FROM Customers";

// Split the raw input on whitespace to obtain the lexemes.
var lexemes = input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

var tokens = new List<Token>();
foreach (var lexeme in lexemes)
{
    // A lexeme matching one of the known keywords becomes a KEYWORD token.
    if (keywords.Contains(lexeme))
    {
        tokens.Add(new Token { Lexeme = lexeme, Category = "KEYWORD" });
    }
}
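
On the sample input, this yields only (KEYWORD|SELECT) and (KEYWORD|FROM); the lexemes '*' and 'Customers' are silently dropped, which is precisely the limitation discussed next.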

However, challenges arise when it comes to detecting identifiers. Since an identifier can be almost any sequence of characters, it is impossible to enumerate in advance every identifier a user might create, unlike the finite keyword list used in the code above. This is where a specialized data structure comes into play: the finite automaton.
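
As a foretaste of the next part, here is a minimal sketch of such an automaton, under the simplifying (and hypothetical) assumption that an identifier is a letter or underscore followed by any number of letters, digits, or underscores:

// Two-state automaton: state 0 reads the first character,
// state 1 is the accepting state for every subsequent character.
static bool IsIdentifier(string lexeme)
{
    var state = 0;
    foreach (var c in lexeme)
    {
        if (state == 0 && (char.IsLetter(c) || c == '_'))
            state = 1;          // valid first character
        else if (state == 1 && (char.IsLetterOrDigit(c) || c == '_'))
            state = 1;          // stay in the accepting state
        else
            return false;       // no transition available: reject
    }
    return state == 1;          // accept only if we ended in the accepting state
}

Note that 'SELECT' itself would pass this test, which is why a lexer typically checks the keyword list first and falls back to the identifier automaton afterwards.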

Writing a simple lexical analyzer for SQL - Part 3