How to implement support for regular expressions in Azure Data Factory?

In this article, we'll look at how to enable regular expression support in Azure Data Factory, going beyond the string functions the platform provides. With the additional effort of implementing an Azure Function, we'll show that a flexible, adaptable solution is within reach.

What is Azure Data Factory?

Azure Data Factory (ADF) is a cloud-based data integration service provided by Microsoft Azure. It allows users to create, schedule, and orchestrate data workflows, enabling the extraction, transformation, and loading (ETL) of data from various sources into destinations such as databases, data lakes, and data warehouses.

ADF extracts, transforms, and loads data from one datastore to another.

Azure Data Factory is a robust platform that simplifies the development of workflows involving datastores. Our focus, though, is not to cover the platform in depth but to look at ways to extend its capabilities.

What is the connection between ADF and regular expressions?

Despite its capabilities, Azure Data Factory offers only limited string manipulation, which makes certain scenarios hard to express accurately. For instance, imagine a use case where specific files need to be transferred from one blob storage to another: say, files that begin with "IT", include the term "BANK", and end with ".txt".

Only specific files must be processed.

In what follows, we assume that the file name is provided, for example by a storage event trigger. In Azure Data Factory (ADF) terms, this means the file name is a pipeline parameter.

We can therefore use a Filter activity and try to express these rules with built-in string functions such as startswith, contains, and endswith.

This approach may suffice while the requirement remains relatively simple. However, suppose we also need files ending with "csv", matched case-insensitively (so "CSV" qualifies too).

As the requirements grow more complex, translating them into built-in functions becomes increasingly cumbersome. And if the business analyst later adds further conditions, we may reach a point where the rules can no longer reasonably be expressed.

The conditions themselves are not complex; they are simply hard to implement within Azure Data Factory. With regular expressions, they would be straightforward. Here is an example in C#.

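using System.Text.RegularExpressions;
// Starts with "IT", contains "BANK", ends with ".txt" (anchored, with the dot escaped so it is literal).
var r = new Regex(@"^IT.*BANK.*\.txt$");
var res = r.IsMatch("IT_123f_BANK_abcd.txt"); // returns true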

As shown, the check is straightforward (even though this example is deliberately simple), and most programming languages offer regular expressions for convenient string processing. However, Azure Data Factory pipeline expressions do not support regular expressions, so we need an alternative strategy to achieve the desired outcome.
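
To make the extended requirement concrete, here is a minimal sketch of how the "txt or csv, case-insensitive" rule might look as a single pattern. It assumes that "IT" and "BANK" stay case-sensitive and that only the extension is matched case-insensitively; the helper name IsWanted and the sample file names are illustrative, not part of the original scenario.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: "IT" and "BANK" are case-sensitive; only the extension is not.
// (?i: ... ) switches on case-insensitive matching for that group alone.
var wanted = new Regex(@"^IT.*BANK.*\.(?i:txt|csv)$");

// Hypothetical helper: true when the file name satisfies all the rules.
bool IsWanted(string fileName) => wanted.IsMatch(fileName);

var samples = new[]
{
    "IT_123f_BANK_abcd.txt", // matches
    "IT_2024_BANK_eu.CSV",   // matches: extension is case-insensitive
    "FR_123f_BANK_abcd.txt", // rejected: wrong prefix
    "IT_123f_BANK_abcd_txt", // rejected: no ".txt" extension
};

foreach (var name in samples)
{
    Console.WriteLine($"{name}: {IsWanted(name)}");
}
```

Adding another extension or another keyword is a one-line change to the pattern, which is exactly the flexibility that is hard to get from nested string functions.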

Using Azure Functions to implement regular expressions

In the example above, we used C# to show how easily strings can be handled with regular expressions. A natural question follows: could we bring C# code into the pipeline and use it for string manipulation? The answer is yes, through Azure Functions.

What is an Azure Function?

Azure Functions is a serverless compute service provided by Microsoft Azure. It allows developers to build and deploy small pieces of code, known as functions, that can be triggered by various events such as HTTP requests, timer schedules, or changes in data within Azure services. Azure Functions supports multiple programming languages, including C#, JavaScript, Python, and Java, so developers can choose the language they are most comfortable with.

Functions can be quickly developed, tested, and deployed without worrying about managing the underlying infrastructure.

So, at the cost of adding an Azure Function to the infrastructure, we can handle regular expressions effectively. How, concretely, can we use it to accomplish our objective?

Creating an Azure Function

The initial step is to add an Azure Function. While we won't delve into the specifics here, this can be accomplished either manually or, preferably, through the use of tools such as Bicep, Terraform, or another form of infrastructure as code. For the purpose of this discussion, we will assume that the function has been created.

Information

We will demonstrate with C#, although similar functionality can be implemented in other languages such as Java or Python.

Adding the MatchNameWithPattern function

We'll add the following code to the Azure Function. It performs a simple task: it checks whether a string matches a given regular expression pattern. Note that the function is triggered by an HTTP POST request.

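using System.IO;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;

public class RegexFunctions // the class name is arbitrary
{
    [FunctionName("MatchNameWithPattern")]
    public async Task<IActionResult> MatchNameWithPattern(
        [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = null)] HttpRequest req, ILogger log)
    {
        // Read the JSON request body: {"Pattern": "...", "Name": "..."}
        var data = await new StreamReader(req.Body).ReadToEndAsync();
        var regexModel = JsonConvert.DeserializeObject<RegexModel>(data);

        // Test the name against the supplied pattern
        var r = new Regex(regexModel.Pattern);
        var res = r.IsMatch(regexModel.Name);

        return new OkObjectResult(res);
    }
}

public class RegexModel
{
    public string Pattern { get; set; }

    public string Name { get; set; }
}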
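
The function above trusts its input. As a hedged sketch, the core check could be hardened along the following lines: it rejects a malformed body, reports an invalid pattern instead of throwing, and bounds the matching time. The helper names Evaluate and Error, the response property names isMatch and error, the one-second timeout, and the use of System.Text.Json instead of Json.NET are all assumptions made for this illustration. The response is wrapped in a JSON object because the ADF documentation for the Azure Function activity states that the function must return a valid JSON object.

```csharp
using System;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical helper: takes the raw request body, returns the JSON response body.
string Evaluate(string requestBody)
{
    string? pattern = null, name = null;
    try
    {
        using var doc = JsonDocument.Parse(requestBody ?? "");
        var root = doc.RootElement;
        if (root.ValueKind == JsonValueKind.Object)
        {
            // Property names are matched case-sensitively in this sketch.
            if (root.TryGetProperty("Pattern", out var p) && p.ValueKind == JsonValueKind.String)
                pattern = p.GetString();
            if (root.TryGetProperty("Name", out var n) && n.ValueKind == JsonValueKind.String)
                name = n.GetString();
        }
    }
    catch (JsonException)
    {
        // Not valid JSON: pattern and name stay null and are rejected below.
    }

    if (pattern is null || name is null)
        return Error("Body must be a JSON object with string fields Pattern and Name.");

    try
    {
        // The timeout bounds the cost of patterns that backtrack heavily.
        var isMatch = Regex.IsMatch(name, pattern, RegexOptions.None, TimeSpan.FromSeconds(1));
        return JsonSerializer.Serialize(new { isMatch });
    }
    catch (ArgumentException)
    {
        return Error("Pattern is not a valid regular expression.");
    }
    catch (RegexMatchTimeoutException)
    {
        return Error("Pattern took too long to evaluate.");
    }
}

// Hypothetical helper: wraps an error message in a JSON object.
string Error(string message) => JsonSerializer.Serialize(new { error = message });

// The request body has the same shape as RegexModel.
var body = JsonSerializer.Serialize(new { Pattern = @"^IT.*BANK.*\.txt$", Name = "IT_123f_BANK_abcd.txt" });
Console.WriteLine(Evaluate(body)); // {"isMatch":true}
```

Inside the HTTP-triggered function, the returned string could be sent back with something like new ContentResult { Content = Evaluate(data), ContentType = "application/json" }, ideally with a 400 status code for the error cases.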

Using the Azure Function in the pipeline

Now, we can integrate this Azure Function into the pipeline.

  • Add a linked service and choose Azure Function

  • Configure the linked service

  • Add an Azure Function activity in the pipeline

    Here, we make a POST request to our Azure Function and pass the pattern and the file name in the request body, so that the request matches the RegexModel the function expects.

  • Add an If Condition activity

The If Condition activity will allow us to proceed with the pipeline or halt its execution based on the result obtained from the Azure Function.

Information

The If Condition activity provides the same functionality that an if statement provides in programming languages. It executes a set of activities when the condition evaluates to true and another set of activities when the condition evaluates to false.

In the Expression field, we utilize the outcome of our Azure Function to trigger the appropriate workflow branch.

Final thoughts

In this post, we outlined the process of configuring regular expressions in Azure Data Factory using an Azure Function. While our solution effectively addresses the issue, it does require a more intricate infrastructure setup.