Schema Design Optimization / Scale Your Data Model
While one data model may work for data stored in a single replica set, the same model may cause issues when migrated to a sharded cluster.
As your business scales your operations to include sharded deployments, it's important to understand how to adapt your schema to this change to ensure performance and scalability.
In this lesson, we'll discuss some data modeling considerations you should make when scaling horizontally.
Then we'll walk you through a scenario that demonstrates how impactful schema design patterns can be in sharded environments.
MongoDB's golden rule, that data that is accessed together should be stored together, is especially important when working with sharded clusters.
This is because, in a sharded cluster, data that is accessed together could otherwise end up on different shards.
The more efficiently a distributed system can route our most important queries and retrieve their results, the better performance will be. We can use data modeling and schema design patterns to keep data that is accessed together stored together, even when documents from one or more collections are spread across multiple shards.
Ultimately, the best schema design for your use case will depend entirely on the needs of your business and application.
Let's take a look at a real-world scenario to learn more about how schema design patterns can be applied to horizontal scaling. In this scenario, an ecommerce platform that specializes in electronics and consumer gadgets is growing rapidly.
With the current schema design, we have a products collection that stores essential details like names, categories, and prices, while reviews are maintained in a separate collection, each review referencing its associated product via a product ID field.
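To make the starting point concrete, here is a minimal sketch of what one document in each collection might look like, written as Python dictionaries. The field names `product_id`, `rating`, `text`, and `created_at`, along with all sample values, are assumptions made for illustration; the lesson only states that products hold names, categories, and prices and that each review references its product.

```python
from datetime import datetime, timezone

# Hypothetical document in the products collection.
product = {
    "_id": "p1001",
    "name": "Wireless Earbuds",
    "category": "audio",
    "price": 79.99,
}

# Hypothetical document in the reviews collection.
# product_id holds the _id of the product being reviewed.
review = {
    "_id": "r5001",
    "product_id": "p1001",
    "rating": 5,
    "text": "Great sound for the price.",
    "created_at": datetime(2024, 5, 1, tzinfo=timezone.utc),
}
```

Note that the product document carries no review data at all, so any view that needs both has to read from two collections.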
As our ecommerce platform gains popularity, we've seen a big increase in product listings, and customers are leaving more and more reviews.
Consequently, both the products and reviews collections are growing rapidly, exceeding the capacity of a single replica set, and horizontal scaling seems like the best solution.
Now that we plan to take advantage of MongoDB's native sharding capabilities to address the document growth in both collections, we are reviewing key workloads and reevaluating the data model as part of the schema life cycle management.
One of those key workloads includes retrieving all products in a specific category and displaying them with their five most recent reviews.
Originally, this was achieved with the aggregation framework using a $lookup stage.
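The original pipeline is not shown here, so the following is only a sketch of how such a pipeline could look, built as plain Python data. The helper name `products_with_recent_reviews_pipeline` and the fields `product_id`, `created_at`, and `recent_reviews` are assumptions.

```python
def products_with_recent_reviews_pipeline(category, limit=5):
    """Build an aggregation pipeline for the products collection (illustrative sketch)."""
    return [
        # Keep only the products in the requested category.
        {"$match": {"category": category}},
        # Join each product with its reviews, newest first, capped at `limit`.
        {
            "$lookup": {
                "from": "reviews",
                "localField": "_id",
                "foreignField": "product_id",
                "pipeline": [
                    {"$sort": {"created_at": -1}},
                    {"$limit": limit},
                ],
                "as": "recent_reviews",
            }
        },
    ]


pipeline = products_with_recent_reviews_pipeline("audio")
# With PyMongo, this could be run as: db.products.aggregate(pipeline)
```

Every matched product triggers a read against the reviews collection, which is the part that becomes costly once both collections are sharded.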
$lookup can be used with sharded collections as long as we are running MongoDB 5.1 or later. Since the business is running version 8.0, we could keep the original pipeline for this workload. But don't forget that data modeling is key to optimizing performance, and $lookup is not a magic tool for every use case. In our scenario, we tried keeping the pipeline when we introduced sharding, and we got the results we expected.
But our test showed that the latency for this query is no longer ideal. This is because reviews and products are stored in separate collections, each of which is distributed with its own shard key. The data may not be co-located on the same shard, meaning that a product document on one shard could have its reviews stored on a different shard. So retrieving a product with its five most recent reviews will probably require cross-shard data movement.
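The effect can be illustrated with a toy placement model. This sketch is not MongoDB's actual hashing or chunk distribution; it uses a simple MD5-based function, a hypothetical three-shard cluster, and assumed shard keys (each collection sharded on its own `_id`) just to show why a product and its reviews have no reason to land on the same shard.

```python
import hashlib

NUM_SHARDS = 3  # hypothetical cluster size


def toy_shard_for(shard_key_value, num_shards=NUM_SHARDS):
    """Map a shard key value to a shard index (toy model, not MongoDB's algorithm)."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Assumed shard keys: products on their _id, reviews on their own _id.
product = {"_id": "p1001"}
reviews = [{"_id": f"r{i}", "product_id": "p1001"} for i in range(1, 6)]

product_shard = toy_shard_for(product["_id"])
review_shards = {toy_shard_for(r["_id"]) for r in reviews}

# Every shard that must be contacted to assemble one product with its reviews.
shards_touched = review_shards | {product_shard}
print(f"product on shard {product_shard}, reviews on shards {sorted(review_shards)}")
```

Because each review is placed by its own key, nothing ties its location to the product it belongs to, so `shards_touched` can contain more than one shard.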
Data modeling is key to making the most of all the features offered by MongoDB.
With the subset pattern, we can embed the most frequently accessed or relevant subset of data within the main document and store less frequently accessed data in a separate collection with a reference to the parent document.
In this case, introducing the subset pattern to store a subset of recent reviews in the product document will improve performance and simplify queries in our workload.
Let's take a look at how we can update the schema.
Now a product document stores a subset of reviews.
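As a rough sketch, a product document following the subset pattern might look like the dictionary below. The embedded array name `recent_reviews`, the `review_id` back-reference, and all sample values are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Hypothetical product document with an embedded subset of its newest reviews.
product = {
    "_id": "p1001",
    "name": "Wireless Earbuds",
    "category": "audio",
    "price": 79.99,
    # Newest first, capped at five entries; the full history stays in reviews.
    "recent_reviews": [
        {
            "review_id": "r5003",
            "rating": 4,
            "text": "Battery could last longer.",
            "created_at": datetime(2024, 5, 3, tzinfo=timezone.utc),
        },
        {
            "review_id": "r5001",
            "rating": 5,
            "text": "Great sound for the price.",
            "created_at": datetime(2024, 5, 1, tzinfo=timezone.utc),
        },
    ],
}
```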
There is no change to the review schema, and all review documents are still stored in a separate collection. In many cases, adjusting our data model will also simplify some of our queries.
Let's look at how the original query will change. Now, instead of using an aggregation with the $lookup stage, we can retrieve all products in a specific category, including their subset of recent reviews, with a simple query on the products collection.
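A minimal sketch of the simplified read, expressed as a filter and projection in plain Python. The helper name and the `recent_reviews` field are assumptions, and the projection is optional.

```python
def products_in_category_query(category):
    """Return a (filter, projection) pair for the products collection (illustrative sketch)."""
    query_filter = {"category": category}
    # Return the product details plus the embedded subset of reviews.
    projection = {"name": 1, "category": 1, "price": 1, "recent_reviews": 1}
    return query_filter, projection


query_filter, projection = products_in_category_query("audio")
# With PyMongo, this could be run as: db.products.find(query_filter, projection)
```

The read no longer involves the reviews collection, so there is no join stage at all.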
In addition to simplifying the original query, performance will improve because we now only need to query a single collection to retrieve the information.
In this case, we can also use an index on the category field to support the query. But in other cases, we may be able to use the shard key and index to improve performance as well. All we need to do next is to figure out a strategy for updating the subset of reviews to keep them up to date.
Great job. Not only did we improve performance by using the subset pattern to ensure that data that is accessed together is stored together, but we also simplified our queries.
Let's briefly recap what we covered in this lesson.
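Before the recap, a note on the update strategy that was left open. One possible approach, offered here as an assumption and not as the lesson's chosen method, is to push each new review into the embedded array with the `$push` modifiers `$each`, `$sort`, and `$slice`, so the array always holds only the newest entries. The helper names and the `recent_reviews` and `created_at` fields are hypothetical, and `apply_locally` is only a pure-Python mirror of the update for illustration.

```python
def push_recent_review_update(review, max_reviews=5):
    """Build an update document that keeps only the newest reviews embedded."""
    return {
        "$push": {
            "recent_reviews": {
                "$each": [review],             # add the new review
                "$sort": {"created_at": -1},   # newest first
                "$slice": max_reviews,         # keep the first max_reviews entries
            }
        }
    }


def apply_locally(product, review, max_reviews=5):
    """Pure-Python mirror of the update's effect, without touching a database."""
    merged = product.get("recent_reviews", []) + [review]
    merged.sort(key=lambda r: r["created_at"], reverse=True)
    return {**product, "recent_reviews": merged[:max_reviews]}


# With PyMongo, a new review could be written in two steps:
#   db.reviews.insert_one(review)
#   db.products.update_one({"_id": review["product_id"]},
#                          push_recent_review_update(review))
```

This keeps the reviews collection as the complete record, while the product document is adjusted on every new review, which trades a slightly heavier write for the cheaper read.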
First, we learned that the golden rule, data that is accessed together should be stored together, also applies to sharded environments.
Next, we saw that we can use data modeling and schema design patterns to keep data that is accessed together stored together.
Finally, we walked through a real-world business scenario to demonstrate how the subset pattern could be applied in that case to store frequently accessed data together, improve performance, and simplify queries in a sharded environment.
This isn't an exhaustive list of data modeling techniques for sharded environments. There are other strategies that you can try. Make sure to visit MongoDB's documentation to learn more about horizontal scaling and data modeling solutions.
