Schema Design Patterns and Antipatterns / Identify Antipatterns

Our app continues to grow in popularity. When we first modeled our book documents, we stored all related book data in a single document. As the app evolves, we find ourselves adding more and more fields to the book document instead of refactoring our schema. This is called the bloated documents antipattern, and in this video we'll discuss how to address it.

You may have heard us say that data that is accessed together should be stored together. This is a great rule to follow when modeling data in MongoDB. However, it doesn't mean that all related data should be stored together. When all data related to an entity is stored in a single document regardless of its access pattern, we are bloating the document. This is a problem because bloated documents increase the size of the working set and will eventually impact performance.

To keep queries running quickly, WiredTiger, MongoDB's default storage engine, keeps frequently accessed data in memory. This in-memory set is known as the WiredTiger internal cache. Ideally, the working set, the portion of indexes and documents frequently used by the application, should fit into the WiredTiger internal cache. Database performance suffers when the working set exceeds the internal cache size.

Let's see how this all fits together with an example from a bookstore app. Our app's Home page displays at least 10 random books. When a user clicks on a book title, they land on a new page containing the book's details. Our book documents contain all the data required to support both pages. Lately, the Home page has started taking a long time to load, and we suspect that these book documents may be to blame.

Let's review some relevant details. The database for our app is running with 2 gigabytes of RAM. The WiredTiger internal cache defaults to the larger of 50% of (RAM minus 1 gigabyte) or 256 megabytes, whichever is larger. For our cluster, that works out to 50% of (2 GB minus 1 GB), or 512 megabytes.

Now that we know our WiredTiger internal cache size, let's see whether our working set fits. To do this, let's look at the logical data size of the documents in our Books collection. The logical data size of a collection is the number of documents in the collection multiplied by the average document size. Using the Data Explorer, we see that our bookstore has hundreds of thousands of books, and that our documents are 1.12 kilobytes on average. Since the app's Home page loads random book documents, all the documents in the Books collection are part of the working set. Therefore, we estimate the logical data size of the collection to be over 500 megabytes.

We can also use the stats method in mongosh to retrieve the number of documents and the average document size, and then multiply these two numbers to determine the logical data size of our collection, as in the sketch below. In our case, the logical data size of the collection is almost equal to the available WiredTiger internal cache for our database. And let's not forget that the working set also includes indexes and other data. Therefore, we can be almost certain that the working set will exceed the available memory, leading to significant performance degradation.

To address this issue, we have two options. We can provision more memory on the cluster, at a cost. Or we can update our data model to use the existing memory more efficiently. Let's take a closer look at the second option. By inspecting a portion of our book document, we see that only a couple of fields are used on the Home page: the title and the author.
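To make the arithmetic concrete, here is a minimal mongosh sketch of that check. The collection name books and the 2 GB RAM figure come from the example in this lesson; count and avgObjSize are fields actually returned by the collection's stats method.

```javascript
// mongosh sketch: estimate the logical data size of the books collection
// and compare it against the WiredTiger internal cache budget.

const stats = db.books.stats();                       // collection statistics
const logicalBytes = stats.count * stats.avgObjSize;  // documents × average size

// WiredTiger internal cache: the larger of 50% of (RAM - 1 GB) or 256 MB.
const ramBytes = 2 * 1024 ** 3;                       // 2 GB cluster from the example
const cacheBytes = Math.max(0.5 * (ramBytes - 1024 ** 3), 256 * 1024 ** 2);

print(`Logical data size: ${(logicalBytes / 1024 ** 2).toFixed(0)} MB`);
print(`WiredTiger cache:  ${(cacheBytes / 1024 ** 2).toFixed(0)} MB`);
// If the logical data size approaches the cache size, the working set
// (which also includes indexes) is likely to exceed available memory.
```

Note that the stats output also includes a size field reporting the logical data size directly; multiplying count by avgObjSize simply mirrors the calculation described in the video.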
Most of the other information in the book document is used only on the Details page. This is a clear case of the bloated documents antipattern, because we are storing data that is accessed separately in one document. To fix the problem, we will split the data into two separate documents. Let's call them summary and details.

Next, let's look at our logical data size after the split. We see that the average document size in the Summary collection is 79 bytes, which makes its logical data size roughly 35 megabytes. The majority of the data is now stored in the Details collection, which is accessed only when book details need to be displayed. As a result, our working set fits in memory, and application performance is fully restored. A sketch of what this split might look like appears at the end of the lesson.

Let's quickly recap what we've learned in this video. Bloated documents store data that is accessed separately in a single document. This unnecessary data increases our working set size, which decreases database performance. We can fix the problem by estimating our working set size and updating our data model to reduce document size and improve performance. Nice job. See you in the next lesson.
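As referenced above, here is a minimal mongosh sketch of the summary/details split. The sample field values, the extra detail fields, and the book_id linking field are illustrative assumptions, not details taken from the video.

```javascript
// Summary collection: only the fields the Home page needs.
const { insertedId: bookId } = db.summary.insertOne({
  title: "Example Book Title",   // illustrative values
  author: "Jane Doe"
});

// Details collection: the bulk of the data, linked back to the summary
// document (the book_id reference is an assumed linking strategy).
db.details.insertOne({
  book_id: bookId,
  description: "Long synopsis shown only on the Details page...",
  publisher: "Example Press",
  year: 2021
});

// Home page: sample 10 random books, touching only the small summary documents.
db.summary.aggregate([{ $sample: { size: 10 } }]);

// Details page: load the large document only when a title is clicked.
db.details.findOne({ book_id: bookId });
```

Because the Home page now reads only the 79-byte summary documents, the frequently accessed portion of the data shrinks from over 500 megabytes to roughly 35 megabytes, comfortably inside the WiredTiger internal cache.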