Writing a simple lexical analyzer for SQL - Part 1

Starting today, we begin a short series on lexical analysis, with a particular focus on SQL. The goal is to show how the query compiler inside an RDBMS works: how it understands the meaning of a query and determines what the user intends.

In these articles, we will revisit some fundamentals of computer science, with an emphasis on theoretical concepts such as finite automata and compilation. Our objective is twofold: first, to explain the inner workings of a lexical analyzer, and second, to apply these principles to a concrete task, namely writing a small scanner for SQL.

  • Lexical analysis, also called scanning or tokenization, is the first phase of compilation. It examines a sequence of characters (the source code) and produces meaningful units called tokens: the smallest units of a programming language with well-defined meanings (see the sketch after this list).

  • SQL is a domain-specific language for managing and manipulating relational databases. It is the standard language for interacting with data stored in a relational database management system (RDBMS).

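To make this concrete, here is a minimal sketch of a regular-expression-based tokenizer for a tiny subset of SQL, written in Python for illustration. The token categories and patterns below are simplifying assumptions, not a complete SQL lexical grammar; they merely show what "turning characters into tokens" looks like in practice.

```python
import re

# Illustrative token categories for a tiny subset of SQL.
# Order matters: keywords must be tried before identifiers.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:SELECT|FROM|WHERE|AND|OR)\b"),
    ("NUMBER",     r"\d+"),
    ("STRING",     r"'[^']*'"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",   r"<>|<=|>=|[=<>*,;()]"),
    ("WHITESPACE", r"\s+"),
    ("MISMATCH",   r"."),  # any other character is a lexical error
]

# Combine all patterns into a single regex with one named group per category.
MASTER_PATTERN = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)

def tokenize(sql: str):
    """Yield (token_type, lexeme) pairs for the input string."""
    for match in MASTER_PATTERN.finditer(sql):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "WHITESPACE":
            continue  # whitespace separates tokens but is not one itself
        if kind == "MISMATCH":
            raise ValueError(f"Unexpected character {lexeme!r}")
        yield kind, lexeme

for token in tokenize("SELECT name FROM users WHERE age > 30;"):
    print(token)
```

Running this yields ('KEYWORD', 'SELECT'), ('IDENTIFIER', 'name'), ('KEYWORD', 'FROM'), and so on: exactly the "meaningful units" described above. A production scanner would track positions for error reporting and handle comments, quoting rules, and the full keyword set, but the principle is the same.
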
In this discussion, we look at how a query compiler processes a SQL string and produces a meaningful result, and at the challenges inherent in that process.

The following books were used as references for some of the concepts covered:

  • Introduction to Automata Theory, Languages, and Computation (Hopcroft, Motwani, Ullman)

  • Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom)

Without further ado and as usual, let's begin with a few prerequisites needed to understand the underlying concepts.