Data modeling in Azure CosmosDb - Part 3

In this article, we initiate the transition from a relational schema to a document-oriented one (Azure Cosmos DB). In this context, we delve into the theoretical concepts, postponing the discussion of practical implementation details to the subsequent post.

What is a document-oriented database ?

A document-oriented database is a type of NoSQL database that is designed to store, retrieve, and manage semi-structured or unstructured data, typically in the form of documents. Unlike traditional relational databases that store data in tables with predefined schemas, document-oriented databases store data in flexible, schema-less documents.

What is Azure Cosmos DB ?

  • Azure Cosmos DB is a multi-model, globally distributed database service provided by Microsoft. It is part of Azure database offerings and is designed to enable developers to build highly responsive and scalable applications with low-latency access to data worldwide. Azure Cosmos DB supports multiple data models, including document, key-value, graph, column-family, and wide-column store, making it a multi-model database. Here, we intend to utilize only the document-oriented functionality of this database.

  • Azure Cosmos DB supports various data models, it is built to scale both throughput and storage horizontally, allowing you to scale resources up or down based on demand, automatically indexes all the data, providing efficient and high-performance query capabilities.

This is the business subject presented by Microsoft (certainly, it is not unfounded, and we will explore later the significant importance of these features and why Microsoft places a strong emphasis on them). However, in this article, we will depend on the crucial concept of partitioning, which will empower us to build a highly scalable and efficient application.

What is partitioning ?

In Azure Cosmos DB, partitioning is a fundamental concept that involves distributing our data across multiple physical partitions to achieve scalability and performance.

  • A physical partition is the underlying storage structure in the Cosmos DB service. Each physical partition is a unit of physical resource allocation (compute, storage, and throughput). The goal is to evenly distribute the provisioned throughput and data storage across these physical partitions. As our data and throughput requirements increase, Azure Cosmos DB will automatically add more physical partitions to distribute the load.

  • A logical Partition is a unit of data and a container for a set of items that share the same partition key value. The partition key is a property chosen by the user when creating a container. It is used to determine the logical partition to which an item belongs. The choice of partition key is critical as it affects the distribution of data across physical partitions and can impact query performance.

Very important

All items within a logical partition are guaranteed to be collocated on the same physical partition.

Very important

Queries that span multiple partitions (cross-partition queries) may result in higher latency and consume more throughput because they require coordination across multiple physical partitions.

Give me an example !

All of that is somewhat cumbersome, tedious, and verbose, so let's simplify with a concise example. Imagine a situation where we need to store information about car models.

Our objective is to manage the storage of cars and their various features.

Cars will be stored within a container in Azure Cosmos DB, and it is imperative that we proactively specify the partition key during the container creation process. One viable option is to capitalize on attributes such as the car model for this purpose.

The data are distributed among distinct logical partitions based on the designated partition key.

One or more logical partitions can be mapped to a single physical partition. A single physical partition can accommodate one or more logical partitions, and conversely, one or more logical partitions can be associated with a single physical partition. However, each individual logical partition will consistently correspond to one and only one physical partition.

One or more logical partitions can be mapped to a single physical partition.

Query by car model

Consider a scenario where our objective is to retrieve a specific car model. The query is straightforward, and our model was crafted to efficiently support it. In this case, only one logical partition will be accessed, ensuring high efficiency.

Only one logical partition is accessed.

Query by country

Now, envision a situation where we aim to retrieve all the cars associated with a particular country, such as Japan. Given that the country attribute is distributed across multiple partitions, this scenario exemplifies a cross-partition query.

Several partitions are accessed: cross-partition query

Cross-partition queries are not inherently detrimental, and there may be situations where resorting to them is unavoidable. However, their frequent execution or involvement of numerous accessed partitions can adversely impact performance. It's crucial to exercise caution, especially considering that Azure Cosmos DB operates on a pricing model tied to CPU utilization—increased usage of computational resources correlates with higher costs.

These considerations lead us to the following general guideline.

Very important

When designing our Cosmos DB database, we should carefully choose the right partition key based on our access patterns and query requirements. A good partition key distributes data evenly across partitions and minimizes cross-partition queries. Partitioning is a key feature that enables Azure Cosmos DB to provide elastic scalability and high-performance access to our data, making it suitable for globally distributed and highly demanding applications.

As developers, our responsibility will ultimately entail meticulously dividing data into well-considered partitions, thereby necessitating thoughtful design of containers.

But how to do that ?

Information

Certainly, as a developer, it's not always feasible to predict every potential query in advance. The dynamic nature of user interactions and requirements can introduce unforeseen queries.

Indeed, while it's impossible to anticipate every conceivable query or unforeseen need, proactive consideration of the most significant and frequently occurring ones is a pragmatic approach. Acknowledging that some less frequent queries may not be as efficient, users often understand this trade-off. In engineering, such trade-offs are recognized as compromises, reflecting the balancing act between optimizing for common use cases and maintaining flexibility for less frequent scenarios.

Absolutely, we touch here upon a fundamental aspect of document-oriented databases, and the NoSQL paradigm in general. Unlike relational databases where the structure is more rigid and schema-focused, in NoSQL databases, especially document-oriented ones, there's a need to think about queries upfront. The flexibility offered by NoSQL databases comes with the responsibility of designing data structures and choosing partition keys that align with the anticipated queries and usage patterns. This proactive consideration allows for better performance and scalability in handling diverse data access needs.

In summary

It's important to design our data model and queries with the partitioning strategy in mind to optimize performance.

And concretely ?

When contemplating the earlier-discussed scenarios, what should be our approach to partitioning? The decision hinges on identifying the most frequent query, which, in reality, is contingent on the specific application's requirements. Ultimately, the solution may involve creating two containers, each employing its own partition strategy—one for the car model and the other for the country. The application will be tasked with determining the appropriate container, streamlining the process without necessitating prolonged deliberation on our part.

Now, envision the scenario where we need to compile a list of all the dealers associated with each car model—such as identifying which dealer is selling a Renault Clio V or a Ford Mondeo. In this instance, it would be advantageous to utilize a container with a model-centric strategy. The good news is that we already have such a container in place! All that remains is to incorporate our dealers into this existing container.

Now, within the container partitioned by model, we find two types of entities: the specified cars and various dealers. However, the challenge lies in distinguishing between them. The solution is straightforward: by introducing a new property into each item.

Important

This is where the Microsoft commercial pitch becomes instrumental and completes the loop: enabling automatic indexing for ALL attributes facilitates swift querying within a specific partition (as in our case with the type property), while simultaneously offering the flexibility to store schemaless data in the same container, encompassing both dealers and cars.

Can we do better ?

When incorporating dealers into our application, an alternative approach could involve directly associating each dealer with the corresponding car item.

This approach aligns with the document-oriented paradigm and can be considered a form of denormalization. While not inherently problematic, caution is warranted. If the number of dealers is substantial (which is likely to be the case in our example), adopting this method results in oversized items for each car, demanding significant bandwidth. Consequently, this approach is best suited for scenarios where the list of associated objects remains reasonably sized.

It's time to implement these concepts in practice, as discussed in the final article.