Schema Design Optimization / Scale Your Data Model
When modeling data with MongoDB, data that is accessed together should be stored together.
But what does this look like in a sharded cluster where one or more collections are distributed across multiple shards?
In this lesson, we will briefly review some key concepts related to sharding to demonstrate why careful data modeling is essential to scaling horizontally.
After that, we'll look at how the principles of embedding and referencing apply to a sharded environment.
Before we begin, let's quickly revisit some basic sharding concepts that will help us understand why data modeling matters in a sharded environment. First, why shard? Sharding is usually a good solution when your dataset grows too large, your read and write operations become too intensive for a single server, or you require better distribution and performance across multiple machines.
Sharding, otherwise known as horizontal scaling, is the process of distributing data across multiple servers or shards.
A sharded collection is split into chunks, each containing a different set of documents from the collection. The range of data that is included in each chunk is determined by the shard key. A shard key consists of a single indexed field or multiple fields covered by a compound index.
We choose the shard key, and then MongoDB uses the shard key to create chunks.
Once MongoDB creates these chunks, they are distributed as evenly as possible across shards.
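As a rough sketch, choosing a shard key and sharding a collection might look like this in mongosh. The database name, collection name, and field name here are illustrative, not from the lesson:

```javascript
// mongosh — cluster configuration commands (illustrative names)
// Enable sharding for a hypothetical "library" database.
sh.enableSharding("library")

// Create an index to back the shard key. This example uses a single
// field; a compound index could back a multi-field shard key instead.
db.books.createIndex({ author_id: 1 })

// Shard the collection on that key. MongoDB then splits the collection
// into chunks based on ranges of author_id and balances them across shards.
sh.shardCollection("library.books", { author_id: 1 })
```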
Because data is partitioned this way, queries can perform differently than they would on a single replica set.
Let's look at the life cycle of a query in a sharded cluster to understand why this impacts performance.
As you can see, we have the client, the mongos that routes client queries, the config server that stores metadata about where data is located in the cluster, and the shards themselves.
Say we run a simple find operation on a sharded collection.
The client submits a request to the mongos, then the mongos uses cached metadata from the config server to determine which shard stores the data we need.
Then the mongos routes the request to the correct shard. That shard returns the data to the mongos, and the mongos returns the data to the client.
Performance can be impacted by how our data is partitioned across shards and by the size of our shards. If we need to retrieve multiple documents, as we do with a range query, and those documents are spread over multiple shards, our query can take longer to execute.
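To make this concrete, here is a small self-contained JavaScript sketch. It is not a MongoDB API; plain objects stand in for chunk metadata, and the shard names and ranges are assumptions for illustration. It shows how a range query on the shard key can target one shard or fan out to several, depending on how chunks are distributed:

```javascript
// Hypothetical chunk metadata: each chunk owns a range of shard key
// values [min, max) and lives on one shard. All names are illustrative.
const chunks = [
  { min: 0,  max: 10, shard: "shardA" },
  { min: 10, max: 20, shard: "shardB" },
  { min: 20, max: 30, shard: "shardC" },
];

// Determine which shards a range query on the shard key must visit:
// any chunk whose range overlaps the query range may hold matching docs.
function targetedShards(queryMin, queryMax) {
  const shards = chunks
    .filter((c) => c.min < queryMax && c.max > queryMin)
    .map((c) => c.shard);
  return [...new Set(shards)];
}

// A narrow range stays on one shard; a wide range fans out to several.
console.log(targetedShards(2, 5));  // ["shardA"]
console.log(targetedShards(5, 25)); // ["shardA", "shardB", "shardC"]
```

The more shards a query must visit, the more round trips and result merging the mongos has to perform, which is why data distribution matters.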
So how we distribute the data is really important, and we determine this with a sharding strategy.
Choosing a good shard key and sharding strategy impacts the performance of your key queries and workloads.
We can reshard a collection if necessary, but it is a process that requires careful planning, especially if you're working with a version before MongoDB 8.0.
Even if data is optimally distributed and the workload is evenly balanced across shards, it's still important to model our data correctly to ensure data is retrieved as efficiently as possible.
Remember, when modeling our data in MongoDB, we have the choice to embed data within the parent document or reference data in another collection.
When modeling the relationships for our workloads, referencing and embedding will have different implications on how data is distributed across shards.
If not carefully planned, this distribution can add unnecessary complexity to how queries are routed and executed.
Let's investigate embedding and referencing in a sharded cluster by looking at a very simple example that involves books and authors.
Let's first imagine that books and authors are stored in separate collections in a sharded cluster where both collections are sharded.
In this scenario, each author document will include a reference to the ID of its parent book document.
Because each collection has its own shard key and ranges, it is very possible that book data and its referenced author data will be located on different shards.
In other words, author data might be located on shard A, while book data is located on shard B.
If we query for a book and its author or authors using the book ID, our query will be routed to both shard A and shard B in order to retrieve all of the data. We will either have to run two queries and handle the logic to combine book and author data on the client side, or we will have to use a lookup operation, both of which can cause latency issues that are usually aggravated in a distributed system.

Next, imagine that author data is embedded in a book document.
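As a sketch of the first option, combining the results of two queries on the client, the snippet below uses plain JavaScript objects to stand in for the documents each query returns. Field names like bookId are assumptions for illustration, not from the lesson:

```javascript
// Results of two separate queries, one per collection (and potentially
// one per shard): book data from one shard, author data from another.
const books = [
  { bookId: 1, title: "MongoDB Basics" },
];
const authors = [
  { authorId: 10, name: "A. Writer", bookId: 1 },
  { authorId: 11, name: "B. Scribe", bookId: 1 },
];

// Client-side join: attach each book's authors to the book document.
// This merge logic is work the application must write and maintain.
function joinBooksWithAuthors(books, authors) {
  return books.map((book) => ({
    ...book,
    authors: authors.filter((a) => a.bookId === book.bookId),
  }));
}

const joined = joinBooksWithAuthors(books, authors);
console.log(joined[0].authors.length); // 2
```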
Since author and book data are stored together, our query can be routed to one shard in order to retrieve book and author data for one book.
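With embedding, the same data could live in a single document, so one targeted query retrieves everything. The field names below remain illustrative:

```javascript
// A single book document with author data embedded. One query, routed
// to one shard by the book's shard key, returns all of this at once.
const book = {
  bookId: 1,
  title: "MongoDB Basics",
  authors: [
    { name: "A. Writer" },
    { name: "B. Scribe" },
  ],
};

// No client-side merge logic needed; the related data arrives together.
console.log(book.authors.map((a) => a.name).join(", "));
```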
In general, you should take advantage of MongoDB's document model by embedding related data in the parent document.
However, when choosing to embed or reference in a sharded cluster, this decision should be entirely based on your data, access patterns, and individual business needs. Always start by understanding your access patterns and key workloads.
If data is frequently accessed together, embedding can reduce the need for complex joins and improve read performance since all related data is stored in a single document.
If related data is accessed independently, referencing might be a better choice.
Referencing can also help you avoid bloated documents and unbounded arrays.

Let's briefly recap what we covered in this lesson.
Sharding, otherwise known as horizontal scaling, is the process of distributing data across multiple servers or shards.
In a sharded cluster, both your sharding strategy and how your data is modeled can impact query performance.
Embed related data whenever possible in order to store data together that is frequently accessed together.
Referencing relationships in a sharded environment can have the effect of physically dividing data that logically belongs together.
To join this data, we will have to use logic on the client side or lookup operations.
To avoid this, only reference relationships when data is accessed independently or when bloated documents and unbounded arrays are a concern.
