Schema Design Optimization / Optimize Your Schema
The most popular books in our bookstore have thousands of customer reviews, which makes those book documents very large. Since we only display three reviews per book, the reviews that never appear on a book page still take up memory and can unnecessarily degrade the performance of our app. In this video, we'll show you how to use the subset schema design pattern to improve performance.
Before we learn about the pattern, let's first understand why large documents can become a problem. To keep queries running quickly, WiredTiger, MongoDB's default storage engine, keeps data that is accessed frequently in memory. This in-memory set is known as the WiredTiger internal cache. Ideally, the working set, or the portion of the indexes and documents frequently used by the application, should fit into WiredTiger's internal cache.
Database performance suffers when the working set exceeds the internal cache size. For cases like these, we can consider the subset pattern. The subset pattern reduces document size by relocating infrequently accessed data out of the document. Making documents smaller means we can fit more of them in the internal cache, which improves overall database performance.
This pattern is useful when we have a document with a large number of embedded subdocuments--
such as reviews or comments stored in an array--
but only a small subset of those embedded documents are regularly used by our application. It's time to apply the subset pattern to our bookstore app. For our app, we have an array of reviews embedded in a book document. But only three of these embedded review subdocuments need to be accessed frequently.
Let's extract all review subdocuments and store them in a separate reviews collection. We will keep a copy of the first three review documents embedded in the original book document. This duplicates those three review documents, but their contents are unlikely to change often, so keeping the two collections in sync is simple in this case.
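The restructuring described above can be sketched in plain JavaScript. The document shape and field names here (title, reviews, product_id) are assumptions for illustration, not taken from the actual app:

```javascript
// A minimal sketch of the subset pattern. Field names are assumed.
const bookBefore = {
  product_id: 34538756,
  title: "A Sample Book",
  reviews: [
    { reviewer: "Ana",  rating: 5, text: "Loved it" },
    { reviewer: "Ben",  rating: 4, text: "Good read" },
    { reviewer: "Cara", rating: 4, text: "Solid" },
    { reviewer: "Dan",  rating: 2, text: "Not for me" }
    // ...imagine thousands more
  ]
};

// Every review moves to a separate reviews collection, tagged with
// the product_id so it can be traced back to its book.
const reviewsCollection = bookBefore.reviews.map(r => ({
  ...r,
  product_id: bookBefore.product_id
}));

// The book document keeps only the first three reviews embedded.
const bookAfter = {
  ...bookBefore,
  reviews: bookBefore.reviews.slice(0, 3)
};
```

The three embedded reviews in `bookAfter` are duplicates of documents in `reviewsCollection`, which is the trade-off discussed next.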
While we do want to exercise caution when introducing data duplication, in this case, it's a trade-off we're willing to make. It helps us meet our goal of reducing the size of a book document. Smaller documents improve cache usage and increase query performance. In the end, this helps keep data that is accessed together stored together.
Now that you understand how to apply the pattern, let's use MongoDB's aggregation framework to apply it to our books collection. To do this, we are going to run two pipelines on the books collection. The first pipeline will take all documents stored in the reviews array and put them into a new reviews collection. Then the second pipeline will ensure that the reviews array in each book document contains no more than three reviews.
In our first pipeline, we will use the unwind stage to deconstruct the reviews array field and create a new document for each review. However, each of these new documents will contain all of the data from the original book document. Since we only need to keep the product ID to associate the new review document with the book, we will use the set stage to add the product ID to the review document. Then we use the replace root stage to promote the embedded review document to the top level.
Now each of the new review documents will only contain review data and a product ID. Finally, we use the out stage to take all of the new review documents and write them to a new reviews collection. If a reviews collection already exists, the out stage will completely overwrite it with the output of the pipeline. So we should always be careful when using this stage and confirm that we won't lose any data as a result.
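Putting those stages together, the first pipeline might look like this in mongosh. The `product_id` field name and the collection names are assumptions for illustration:

```javascript
// Sketch of the first pipeline; in mongosh you would run
// db.books.aggregate(extractReviews). Field/collection names are assumed.
const extractReviews = [
  // Deconstruct the reviews array: one output document per review.
  { $unwind: "$reviews" },
  // Carry the book's product_id into each review subdocument.
  { $set: { "reviews.product_id": "$product_id" } },
  // Promote the embedded review subdocument to the top level.
  { $replaceRoot: { newRoot: "$reviews" } },
  // Write the results to a reviews collection, overwriting it if it exists.
  { $out: "reviews" }
];
```

Because `$out` replaces the target collection entirely, it is worth running the pipeline without the final stage first to inspect the output.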
Great. So now we have all the reviews stored in a new reviews collection. But we still need to modify the reviews array in the book document so that it only includes three reviews. Our second pipeline is simple.
We use the set stage to overwrite the existing reviews field. To keep things simple, we'll use the slice operator to keep only the first three reviews from the original reviews array. In a real-world scenario, we would use more sophisticated criteria for selecting which review documents to keep--
for example, top reviews, most relevant, et cetera. The reviews array in this book document contains seven reviews; after it passes through this stage, only the first three remain. Now, if we run this pipeline with the updateMany method, the resulting document's reviews array will only contain the three reviews that we need to display. Now, let's recap what we covered in this video.
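The second pipeline described above is a single set stage; a sketch in mongosh syntax, with collection and field names assumed, might look like this:

```javascript
// Sketch of the second pipeline; in mongosh you would apply it with
// db.books.updateMany({}, trimReviews). $slice keeps the first 3 elements.
const trimReviews = [
  { $set: { reviews: { $slice: ["$reviews", 3] } } }
];

// Plain-JS illustration of what $slice does to a 7-review array:
const sampleReviews = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"];
const kept = sampleReviews.slice(0, 3);
```

Passing a pipeline (an array) as the second argument of updateMany runs it as an aggregation-style update, which is what lets us compute the new reviews value from the old one.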
The subset pattern helps us reduce the overall size of documents when they contain a large number of embedded documents and only a small portion of those embedded documents are frequently used by the application. By applying this pattern, we can improve cache usage and query performance. You can apply the subset pattern to an existing collection by using the aggregation framework. Nice work.
See you in the next lesson.
