How to implement support for regular expressions in Azure Data Factory?
Mar 15, 2024 · 7 min read
In this article, we look at how to add regular expression support to Azure Data Factory, going beyond its standard string functions. The solution, in short: with some extra effort to set up an Azure Function, we obtain a more refined and flexible approach.
What is Azure Data Factory?
Azure Data Factory (ADF) is a cloud-based data integration service provided by Microsoft Azure. It allows users to create, schedule, and orchestrate data workflows, enabling the extraction, transformation, and loading (ETL) of data from various sources into destinations such as databases, data lakes, and data warehouses.
Azure Data Factory is a robust platform that simplifies the development of workflows involving datastores. Our focus, though, is not to explore the platform in depth but to find ways to extend its capabilities.
Information
For a comprehensive introduction to Azure Data Factory and an exploration of the extensive possibilities it offers, we recommend the following book. It covers fundamental concepts as well as more advanced topics, all illustrated with concrete examples.
How can string matching be performed in Azure Data Factory?
Despite its capabilities, Azure Data Factory has limitations in string manipulation that make certain scenarios difficult to implement accurately. For instance, consider a use case where specific files need to be transferred from one blob storage to another: say, files that begin with "IT", include the term "BANK", and end with "TXT".
In the subsequent discussion, we will assume that the file name is provided, which can be obtained through means such as a storage event trigger. In the context of Azure Data Factory (ADF), this implies that the file name is treated as a parameter within the pipeline.
Hence, we can utilize a Filter activity and attempt to apply the aforementioned specific rules using built-in string functions.
In such a scenario, this approach may suffice as long as the requirement remains straightforward. However, suppose we also need files ending with "csv" or "CSV", matched case-insensitively.
The complexity starts to show: as the requirements grow, translating them with built-in functions alone becomes increasingly cumbersome. Each change requested by the business analyst, such as an additional condition, makes the expression harder to maintain, and we may eventually reach a point where meeting expectations becomes unfeasible.
Yes, this example may be somewhat exaggerated. However, real-world scenarios do arise where the built-in capabilities are insufficient. The issue lies not in the complexity of the conditions themselves, but in the difficulty of expressing them within Azure Data Factory.
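To make the growth in conditions concrete, here is a rough stand-in written with plain C# string methods (the language used for the examples later in this article) rather than ADF expression syntax. The class and method names are hypothetical, and the exact case-sensitivity behavior of ADF's own string functions is not modelled here; the sketch only shows how each new rule adds another clause.

```csharp
using System;

public static class BuiltInStyleCheck
{
    // Hypothetical helper: each business rule becomes one more clause.
    public static bool IsWanted(string fileName) =>
        fileName.StartsWith("IT", StringComparison.Ordinal)      // begins with "IT"
        && fileName.Contains("BANK")                             // includes "BANK" (ordinal)
        && (fileName.EndsWith("TXT", StringComparison.Ordinal)   // ends with "TXT"
            // added requirement: "csv" in any casing
            || fileName.EndsWith("csv", StringComparison.OrdinalIgnoreCase));
}
```

Every further rule (an optional date segment, a second allowed prefix, and so on) means another nested clause, which is the kind of expression that quickly becomes hard to read and maintain.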
What are regular expressions?
Regular expressions are sequences of characters that form a search pattern. They are used to match strings or substrings within text, enabling flexible search, replace, and validation operations. Regular expressions allow us to specify complex text patterns using special characters and symbols. They are powerful tools commonly used in programming, data processing, validation, and text editing. For instance, regex patterns can verify email formats, locate specific patterns in logs, or extract phone numbers from text files.
Information
Regular expressions are powerful tools, though they can seem intimidating at first. In this post, we will assume that readers are already familiar with how to use them and know how to implement them in their preferred programming language.
As an example, implementing the previous conditions using regular expressions would be relatively straightforward. Here's an example written in C# code.
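A minimal sketch of such a check, assuming the rules described earlier (starts with "IT", contains "BANK", ends with "TXT" or with "csv" in any casing). The pattern, the class name, and the one-second match timeout are illustrative assumptions rather than the article's original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileNameRules
{
    // ^IT          : starts with "IT"
    // .*BANK.*     : contains "BANK" somewhere after the prefix
    // TXT|(?i:csv) : ends with "TXT", or with "csv" matched case-insensitively
    private static readonly Regex Pattern = new Regex(
        @"^IT.*BANK.*(TXT|(?i:csv))$",
        RegexOptions.CultureInvariant,
        TimeSpan.FromSeconds(1)); // assumed timeout, guards against slow patterns

    public static bool IsWanted(string fileName) => Pattern.IsMatch(fileName);
}
```

With this pattern, `FileNameRules.IsWanted("IT_BANK_2024.TXT")` and `FileNameRules.IsWanted("IT_BANK_2024.Csv")` return true, while `FileNameRules.IsWanted("FR_BANK_2024.TXT")` returns false. A new business rule becomes a change to the pattern string rather than another nested condition.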
As demonstrated, the process is straightforward (even though the provided example is overly simplistic), and this feature is commonly offered by most programming languages for convenient string processing. However, Azure Data Factory does not support the use of regular expressions, necessitating the exploration of alternative strategies to achieve the desired outcome.
Using Azure Functions to implement regular expressions
In the scenario above, we used C# code to illustrate how easily strings can be handled with regular expressions. A natural question arises: could we incorporate C# code (and therefore regular expressions) into Azure Data Factory and use it for string manipulation? To put it another way, is it possible to have the best of both worlds?
The answer is affirmative, achieved through Azure Functions.
What is an Azure Function?
Azure Functions is a serverless compute service provided by Microsoft Azure. It allows developers to build and deploy small pieces of code, known as functions, that can be triggered by various events such as HTTP requests, timer schedules, or changes in data within Azure services. Azure Functions supports multiple programming languages including C#, JavaScript, Python, and Java, enabling developers to choose the language they are most comfortable with.
Functions can be quickly developed, tested, and deployed without worrying about managing the underlying infrastructure.
"This freedom releases you from a need to create a special infrastructure to host this development environment. (...) With the addition of Azure Functions, ADF's potential greatly increases." (Azure Data Factory Cookbook)
Azure Data Factory natively supports the integration of Azure Functions within a pipeline through a dedicated Azure Function activity.
Conclusion: albeit at the cost of adding an Azure Function to the infrastructure, we can handle regular expressions effectively. But how, concretely, can we use this functionality to accomplish our objective?
Creating an Azure Function
The initial step is to add an Azure Function. While we won't delve into the specifics here, this can be accomplished either manually or, preferably, through the use of tools such as Bicep, Terraform, or another form of infrastructure as code. For the purpose of this discussion, we will assume that the function has been created.
Information
We will demonstrate with C#, although similar functionality can be implemented in the other languages supported by Azure Functions, such as Java or Python.
Adding the MatchNameWithPattern function
We'll incorporate the following code into the Azure Function. This code performs a simple task: it verifies whether a string conforms to a specified regular expression pattern. Additionally, it's important to mention that this function is invoked by an HTTP request using the POST method.
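As a hedged sketch of what the core of `MatchNameWithPattern` could look like: the request shape (`name`, `pattern`), the response field `isMatch`, the `MatchRequest` type, and the `Handle` method are all assumptions made for illustration, not the article's original code. The HTTP trigger wiring is deliberately left out because it depends on the hosting model chosen for the Function App; the trigger would read the POST body, pass it to `Handle`, and return the resulting JSON.

```csharp
using System;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical request body: {"name": "<file name>", "pattern": "<regex>"}
public sealed class MatchRequest
{
    public string? Name { get; set; }
    public string? Pattern { get; set; }
}

public static class MatchNameWithPatternCore
{
    private static readonly JsonSerializerOptions Options =
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

    // Takes the raw POST body and returns the JSON response body.
    public static string Handle(string requestBody)
    {
        var request = JsonSerializer.Deserialize<MatchRequest>(requestBody, Options);
        if (request?.Name is null || request.Pattern is null)
            throw new ArgumentException("Both 'name' and 'pattern' are required.");

        // Assumed one-second timeout so a pathological pattern cannot hang the call.
        bool isMatch = Regex.IsMatch(
            request.Name, request.Pattern, RegexOptions.None, TimeSpan.FromSeconds(1));

        // A JSON object, so the pipeline can read a named property from the result.
        return JsonSerializer.Serialize(new { isMatch });
    }
}
```

Keeping the pattern in the request, rather than hard-coding it in the function, is what lets the same function serve any pipeline: each caller supplies its own rule.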
Now, we can integrate this Azure Function into the pipeline.
Add a linked service and choose Azure Function
Configure the linked service
Add an Azure Function activity in the pipeline
In this context, we make a POST request to our Azure Function, providing the necessary parameters. We ensure that the request aligns with the signature of the function we previously created.
Important
In this activity, we specify the pattern for string matching. Essentially, we offload the task of validating the expression from ADF to a more powerful external tool (Azure Function).
Add an If Condition activity
The If Condition activity will allow us to proceed with the pipeline or halt its execution based on the result obtained from the Azure Function.
Information
The If Condition activity provides the same functionality that an if statement provides in programming languages. It executes a set of activities when the condition evaluates to true and another set of activities when the condition evaluates to false.
In the Expression field, we use the outcome of our Azure Function to trigger the appropriate workflow branch.
And that's it! We can now validate arbitrarily complex conditions in Azure Data Factory. Additionally, this Azure Function can be reused across any other pipelines we need to develop. In this sense, the added complexity we mentioned earlier is well worth it.
Final thoughts
In this post, we outlined the process of configuring regular expressions in Azure Data Factory using an Azure Function. Our solution effectively addresses the issue, though it requires a more intricate infrastructure setup. However, it's well worth the effort, especially since an Azure Function is relatively easy to implement. This development can be coded once and then reused by other teams.