Writing a simple lexical analyzer for SQL - Part 4

In this concluding post, we will put into practice what was discussed earlier by implementing it in C#.

Defining tokens, categories and more

Having defined earlier what a category is, we now have to code it explicitly in C#. We will simply use an enum for that.

public enum Category { OPERATOR, KEYWORD, IDENTIFIER, PUNCTUATION, NUMBER, LITERAL, ALL }

We follow a similar process for tokens.

public class Token
{
    public string Lexeme { get; set; }

    public Category Category { get; set; }

    public override string ToString()
    {
        return $"({Category}|{Lexeme})";
    }
}
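As a quick check, a token prints in the (CATEGORY|lexeme) form that appears in the sample runs later in this post. The enum and class are repeated here so the snippet compiles on its own:

```csharp
using System;

public enum Category { OPERATOR, KEYWORD, IDENTIFIER, PUNCTUATION, NUMBER, LITERAL, ALL }

public class Token
{
    public string Lexeme { get; set; }
    public Category Category { get; set; }
    public override string ToString() => $"({Category}|{Lexeme})";
}

public static class TokenDemo
{
    public static void Main()
    {
        var token = new Token { Lexeme = "SELECT", Category = Category.KEYWORD };
        Console.WriteLine(token); // prints (KEYWORD|SELECT)
    }
}
```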

That is quite straightforward, and with these definitions, we can now proceed to implement finite automata.

public class FSM
{
    public List<char> Sigma { get; set; }

    public List<FSMState> States { get; set; }

    public List<FSMTransition> Transitions { get; set; }

    public FSM()
    {
        States = new List<FSMState>();
        Transitions = new List<FSMTransition>();
    }

    public bool Simulate(string pattern)
    {
        var currentState = States.FirstOrDefault(x => x.IsStart);
        foreach (var c in pattern.ToLower())
        {
            // Look for a transition that consumes c from the current state.
            FSMState nextState = null;
            foreach (var transition in Transitions.Where(t => t.Start.Id == currentState.Id))
            {
                nextState = transition.Transition(c);
                if (nextState != null) break;
            }

            // No transition consumes c: the automaton rejects the input.
            // This also covers states with no outgoing transitions at all.
            if (nextState == null) return false;
            currentState = nextState;
        }

        return currentState.IsEnd;
    }
}

public class FSMState
{
    public string Id { get; set; }

    public bool IsStart { get; set; }

    public bool IsEnd { get; set; }
}

public class FSMTransition
{
    public FSMState Start { get; set; }

    public FSMState End { get; set; }

    public char Item { get; set; }

    public FSMState Transition(char c)
    {
        if (Item == c) return End;
        return null;
    }
}
Information

Note that the FSM class exposes a Simulate method: it takes a string as an argument and returns true or false depending on whether the automaton recognizes that string.
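To make the mechanics concrete, here is a tiny hand-built automaton that recognizes exactly the keyword OR. It uses a compact, self-contained restatement of the FSM classes (Sigma is omitted since Simulate never reads it, and Simulate is written so that a state with no matching transition rejects the input):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class FSMState { public string Id { get; set; } public bool IsStart { get; set; } public bool IsEnd { get; set; } }

public class FSMTransition
{
    public FSMState Start { get; set; }
    public FSMState End { get; set; }
    public char Item { get; set; }
    public FSMState Transition(char c) => Item == c ? End : null;
}

public class FSM
{
    public List<FSMState> States { get; } = new List<FSMState>();
    public List<FSMTransition> Transitions { get; } = new List<FSMTransition>();

    public bool Simulate(string pattern)
    {
        var currentState = States.FirstOrDefault(x => x.IsStart);
        foreach (var c in pattern.ToLower())
        {
            FSMState next = null;
            foreach (var t in Transitions.Where(t => t.Start.Id == currentState.Id))
            {
                next = t.Transition(c);
                if (next != null) break;
            }
            if (next == null) return false; // no transition consumes c: reject
            currentState = next;
        }
        return currentState.IsEnd;
    }
}

public static class FsmDemo
{
    // q0 --o--> q1 --r--> q2 (accepting): recognizes exactly "or"
    public static FSM BuildOrFsm()
    {
        var q0 = new FSMState { Id = "q0", IsStart = true };
        var q1 = new FSMState { Id = "q1" };
        var q2 = new FSMState { Id = "q2", IsEnd = true };
        var fsm = new FSM();
        fsm.States.AddRange(new[] { q0, q1, q2 });
        fsm.Transitions.Add(new FSMTransition { Start = q0, End = q1, Item = 'o' });
        fsm.Transitions.Add(new FSMTransition { Start = q1, End = q2, Item = 'r' });
        return fsm;
    }

    public static void Main()
    {
        var fsm = BuildOrFsm();
        Console.WriteLine(fsm.Simulate("OR")); // True: input is lowercased first
        Console.WriteLine(fsm.Simulate("o"));  // False: q1 is not accepting
        Console.WriteLine(fsm.Simulate("ox")); // False: no transition on 'x'
    }
}
```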

Defining the lexical analyzer

First and foremost, we need to define a general interface for a lexical analyzer.

public interface ILexicalAnalyzer
{
    bool TryAnalyze(string codeToScan, out List<Token> tokens, out List<string> errors);
}

This method is quite intuitive: a lexical analyzer takes a raw string as an argument and attempts to break it down into tokens. It also returns errors if there are any.

In our scenario, where we are implementing a lexical analyzer for SQL, we need to create a concrete class for that.

public class SqlLexicalAnalyzer : ILexicalAnalyzer
{
    private Dictionary<string, FSM> _fsms;
    private Dictionary<string, IFSMBuilder> _builders;

    public SqlLexicalAnalyzer()
    {
        _fsms = new Dictionary<string, FSM>();

        _builders = new Dictionary<string, IFSMBuilder>
        {
            { Category.OPERATOR.ToString(), new OperatorFSMBuilder() },
            { Category.KEYWORD.ToString(), new KeywordFSMBuilder() },
            { Category.IDENTIFIER.ToString(), new IdentifierFSMBuilder() },
            { Category.NUMBER.ToString(), new NumberFSMBuilder() },
            { Category.LITERAL.ToString(), new LiteralFSMBuilder() },
            { Category.PUNCTUATION.ToString(), new PunctuationFSMBuilder() },
            { Category.ALL.ToString(), new AllFSMBuilder() }
        };

        foreach (var category in Enum.GetNames(typeof(Category)))
        {
            var fsmBuilder = _builders[category];
            _fsms.Add(category, fsmBuilder.Build());
        }
    }

    public bool TryAnalyze(string codeToScan, out List<Token> tokens, out List<string> errors)
    {
        tokens = new List<Token>();
        errors = new List<string>();

        var partsSql = ParsePart(codeToScan.Trim());
        foreach (var split in partsSql)
        {
            var found = false;
            foreach (var category in _fsms.Keys)
            {
                var fsm = _fsms[category];
                if (fsm.Simulate(split))
                {
                    tokens.Add(new Token() { Lexeme = split, Category = (Category)Enum.Parse(typeof(Category), category) });
                    found = true;
                    break;
                }
            }

            if (!found) errors.Add($"The lexeme {split} is not recognized by the language.");
        }

        return !errors.Any();
    }

    private string[] ParsePart(string part)
    {
        var specials = new List<string>()
        {
            " ", ",", ".", "<=", ">=", "<>", "="
        };

        var tempParts = new List<string>() { part };
        foreach (var special in specials)
        {
            var temps = tempParts.ToArray();
            tempParts = new List<string>();
            foreach (var t in temps)
            {
                // A part that is already a separator must not be split again;
                // otherwise "<=" would be broken apart by the later "=" pass.
                if (specials.Contains(t)) { tempParts.Add(t); continue; }

                var splits = t.Trim().Split(special, StringSplitOptions.None);
                if (splits.Length == 1) tempParts.Add(splits[0]);
                else
                {
                    foreach (var split in splits)
                    {
                        tempParts.Add(split);
                        tempParts.Add(special);
                    }

                    tempParts.RemoveAt(tempParts.Count - 1);
                }

                tempParts = tempParts.Select(x => x.Trim()).Where(x => !string.IsNullOrWhiteSpace(x)).ToList();
            }
        }

        return tempParts.ToArray();
    }
}

This class builds an automaton for each category and implements the TryAnalyze method by simulating the automata one by one until one of them recognizes the lexeme at hand. If none of them succeeds, we record an error for this lexeme.
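ParsePart is worth a closer look: it repeatedly splits the input on each separator, re-inserting the separators themselves as lexemes and discarding whitespace. A self-contained sketch of that behavior (including a guard so that multi-character operators such as <= are not broken apart by the later "=" pass):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParseDemo
{
    // Separators: whitespace is dropped, the others become lexemes of their own.
    static readonly List<string> Specials = new List<string> { " ", ",", ".", "<=", ">=", "<>", "=" };

    public static string[] ParsePart(string part)
    {
        var tempParts = new List<string> { part };
        foreach (var special in Specials)
        {
            var temps = tempParts.ToArray();
            tempParts = new List<string>();
            foreach (var t in temps)
            {
                // Never split a part that is itself a separator.
                if (Specials.Contains(t)) { tempParts.Add(t); continue; }

                var splits = t.Trim().Split(new[] { special }, StringSplitOptions.None);
                if (splits.Length == 1) tempParts.Add(splits[0]);
                else
                {
                    // Re-insert the separator between the pieces it produced.
                    foreach (var split in splits)
                    {
                        tempParts.Add(split);
                        tempParts.Add(special);
                    }
                    tempParts.RemoveAt(tempParts.Count - 1);
                }
            }
            tempParts = tempParts.Select(x => x.Trim()).Where(x => !string.IsNullOrWhiteSpace(x)).ToList();
        }
        return tempParts.ToArray();
    }

    public static void Main()
    {
        var parts = ParsePart("SELECT c.Name FROM Customers WHERE Age <= 35");
        Console.WriteLine(string.Join(" ", parts));
        // SELECT c . Name FROM Customers WHERE Age <= 35
    }
}
```

Note how "." and "<=" survive as standalone lexemes while the spaces are filtered out.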

Running the program

  • SELECT * FROM Customers c WHERE c.Id = 10100
(KEYWORD|SELECT)
(ALL|*)
(KEYWORD|FROM)
(IDENTIFIER|Customers)
(IDENTIFIER|c)
(KEYWORD|WHERE)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Id)
(OPERATOR|=)
(NUMBER|10100)
  • SELECT c.Name, c.Age FROM Customers c WHERE c.Id = 10100 AND c.Age <= 35
(KEYWORD|SELECT)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Name)
(PUNCTUATION|,)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Age)
(KEYWORD|FROM)
(IDENTIFIER|Customers)
(IDENTIFIER|c)
(KEYWORD|WHERE)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Id)
(OPERATOR|=)
(NUMBER|10100)
(KEYWORD|AND)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Age)
(OPERATOR|<=)
(NUMBER|35)
  • Customers SELECT WHERE Id FROM
(IDENTIFIER|Customers)
(KEYWORD|SELECT)
(KEYWORD|WHERE)
(IDENTIFIER|Id)
(KEYWORD|FROM)
Information

In SQL, this query is entirely nonsensical, yet it is lexically correct. It will be the parser's responsibility to flag the error.

  • SELECT * FROM Customers04
The lexeme Customers04 is not recognized by the language.
Information

This example violates the lexical rules defined for identifiers as it includes numeric characters.

Final thoughts

The lexical analyzer represents only the initial phase in the compilation process. It is succeeded by the syntax analyzer (or parser), which ensures that the output produced adheres to the grammar rules. Grammars and context-free languages constitute another intriguing field, but we will reserve that for a future article.

If you wish to delve deeper into this topic, consider the following books, which cover all the concepts highlighted in this series and explore more advanced ones.

Introduction to Automata Theory, Languages, and Computation (Hopcroft, Motwani, Ullman)

Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom)

Do not hesitate to contact me should you require further information.