Writing a simple lexical analyzer for SQL - Part 1

Starting today, we begin a short series on lexical analysis, with a particular focus on SQL. The goal is to show how the query compiler inside an RDBMS works: how it understands the meaning of a query and determines what the user intends.

In these articles, we will revisit some fundamentals of computer science, with an emphasis on theoretical concepts such as finite automata and compilation. Our objective is twofold: first, to explain the inner workings of a lexical analyzer, and second, to apply these principles to a concrete task, namely writing a small scanner for SQL.

  • Lexical analysis, also called scanning or tokenization, is the first phase of compilation. It examines a sequence of characters (the source code) and produces meaningful units called tokens: the smallest units of a programming language with well-defined meanings (see the sketch after this list).

  • SQL is a domain-specific language for managing and manipulating relational databases. It is the standard language for interacting with data stored in a relational database management system (RDBMS).

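To make this concrete, here is a minimal sketch of a regular-expression-based tokenizer for a tiny subset of SQL, written in Python for illustration. The token categories and patterns below are simplifying assumptions, not a complete SQL lexical grammar; they merely show what "turning characters into tokens" looks like in practice.

```python
import re

# Illustrative token categories for a tiny subset of SQL.
# Order matters: keywords must be tried before identifiers.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:SELECT|FROM|WHERE|AND|OR)\b"),
    ("NUMBER",     r"\d+"),
    ("STRING",     r"'[^']*'"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",   r"<>|<=|>=|[=<>*,;()]"),
    ("WHITESPACE", r"\s+"),
    ("MISMATCH",   r"."),  # any other character is a lexical error
]

# Combine all patterns into a single regex with one named group per category.
MASTER_PATTERN = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)

def tokenize(sql: str):
    """Yield (token_type, lexeme) pairs for the input string."""
    for match in MASTER_PATTERN.finditer(sql):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "WHITESPACE":
            continue  # whitespace separates tokens but is not one itself
        if kind == "MISMATCH":
            raise ValueError(f"Unexpected character {lexeme!r}")
        yield kind, lexeme

for token in tokenize("SELECT name FROM users WHERE age > 30;"):
    print(token)
```

Running this yields ('KEYWORD', 'SELECT'), ('IDENTIFIER', 'name'), ('KEYWORD', 'FROM'), and so on: exactly the "meaningful units" described above. A production scanner would track positions for error reporting and handle comments, quoting rules, and the full keyword set, but the principle is the same.
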
In this discussion, we look at how a query compiler processes a SQL string and produces a meaningful result, and at the challenges inherent in that process.

The following books were used as references for some of the concepts covered:

  • Introduction to Automata Theory, Languages, and Computation (Hopcroft, Motwani, Ullman)

  • Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom)

Without further ado and as usual, let's begin with a few prerequisites needed to understand the underlying concepts.