Fundamentals of Data Transformation / Introduction to Aggregation

If you work with large quantities of data, you understand how vital it is to be able to transform, combine, and filter that data. This helps you get real-time insights and analytics, generate reports, and run calculations at scale. MongoDB's aggregation framework makes it easy and efficient to handle these workloads, allowing you to query your data in a highly expressive way while reducing application complexity.

In this video, we'll introduce you to MongoDB's aggregation framework and show you how it's used to query, process, and transform data. We'll cover some commonly used aggregation stages, discuss the importance of correctly ordering your stages, and explain when and how to use the aggregation framework.

MongoDB's aggregation framework is a powerful tool for querying data and for performing data processing and analysis operations on collections. An aggregation pipeline consists of one or more stages, and these stages query and process documents. Let's look at how it works.

After an aggregation pipeline is initiated and optimized by MongoDB's query optimizer, an initial cursor is automatically created to handle the flow of documents through the pipeline. The execution engine begins retrieving documents from the collection and adding them to the cursor. If an index is available, the execution engine will use it to increase efficiency. The cursor then sends the documents into the pipeline, where each stage performs an operation on its input documents, such as calculating values or traversing an array. The documents output from one stage are passed on to the next, and this continues until the end of the pipeline, where the final output is delivered.

MongoDB offers many different aggregation stages. Let's take a quick look at some of the most common ones to give you a sense of what's possible.

The match stage filters documents in the pipeline to pass only those that meet specified criteria to the next stage.
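To make this concrete, here's a minimal sketch of a pipeline and what its first stage does to a handful of documents. The collection shape and the field names (`year`, `rating`) are hypothetical, and the list comprehension is just a plain-Python stand-in for what the stage computes, not the framework itself:

```python
# A pipeline is an ordered list of stage documents, in the shape a driver
# such as PyMongo expects. Field names here are illustrative.
pipeline = [
    {"$match": {"year": {"$gte": 2000}}},  # filter: only documents from 2000 on
]

# A few sample documents standing in for a collection:
docs = [
    {"title": "A", "year": 1999, "rating": 7.1},
    {"title": "B", "year": 2005, "rating": 8.4},
    {"title": "C", "year": 2010, "rating": 6.9},
]

# Plain-Python equivalent of what that $match stage passes along:
matched = [d for d in docs if d["year"] >= 2000]

# With a live connection, the real call would be: db.movies.aggregate(pipeline)
```

Here, only documents "B" and "C" survive; everything else never reaches the rest of the pipeline. All of the filtering work in this sketch is done by the match stage.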
Placing the match stage at the beginning of your pipeline can optimize performance by reducing the amount of data passed to subsequent stages. When placed at the beginning of the pipeline, match can also leverage existing indexes to improve performance.

The sort stage orders the documents in the pipeline based on one or more specified fields, in ascending or descending order. A sort stage can also leverage an index when it's placed toward the beginning of a pipeline.

The limit stage restricts the number of documents passed to the next stage, or to the final output, to a number that we specify. This stage is especially helpful when it appears after a sort stage. Whenever possible, the query optimizer will combine the two stages, which allows the sort operation to maintain fewer documents in memory as it works and can improve performance.

The group stage combines documents into groups based on a group key. This stage is useful when you need to perform operations or calculations such as sum, average, or count on the grouped data.

The lookup stage incorporates related data from another collection into the documents, similar to a left outer join in SQL.

The unwind stage deconstructs an array field from the input documents, outputting one document for each element of the array. In other words, it flattens the array.

Finally, the project stage reshapes each document so that specific fields are included in, excluded from, or added to the output.

This isn't an exhaustive list of stages; it's just the beginning. Visit the MongoDB docs to learn about more stages.

When we build an aggregation pipeline, the sequence of the stages matters. This is because the order in which each stage's operations are performed on the documents in your pipeline will impact the accuracy of the results. Additionally, only certain stages can leverage indexes to make execution more efficient.
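As a rough sketch of what two of these stages compute, here are plain-Python equivalents of a group stage and an unwind stage run against in-memory documents. The collection and field names (`orders`, `customer`, `amount`, `genres`) are made up for illustration; this mimics the stages' logic rather than invoking the framework:

```python
from collections import defaultdict

# Equivalent of: {"$group": {"_id": "$customer",
#                            "total": {"$sum": "$amount"},
#                            "count": {"$sum": 1}}}
orders = [
    {"customer": "ada", "amount": 30},
    {"customer": "ada", "amount": 20},
    {"customer": "bob", "amount": 15},
]
groups = defaultdict(lambda: {"total": 0, "count": 0})
for o in orders:
    acc = groups[o["customer"]]  # group key, like "$customer"
    acc["total"] += o["amount"]  # $sum over a field
    acc["count"] += 1            # {"$sum": 1} counts documents per group

# Equivalent of: {"$unwind": "$genres"} — one output document per array element
movie = {"title": "B", "genres": ["drama", "thriller"]}
unwound = [{**movie, "genres": g} for g in movie["genres"]]
```

With these inputs, `groups["ada"]` ends up as `{"total": 50, "count": 2}`, and `unwound` holds two documents, one per genre, each with `genres` flattened to a single string.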
Here are some best practices you can follow to sequence stages. Filter documents early with stages like match to decrease the size of the data, so that we're only retrieving and processing what we need. Next, try to place stages that can leverage indexes, like match and sort, closer to the beginning of a pipeline. This helps the execution engine use relevant indexes to retrieve documents more efficiently. Finally, use project toward the end of a pipeline. This helps us avoid losing fields that are needed for optimization or for calculating accurate results.

Luckily, the query optimization phase happens automatically before pipeline execution and can reshape a pipeline to improve performance. For example, the query optimizer might combine or move stages to leverage an index, as long as the final output isn't impacted. This optimization is built into the framework; we don't have to do anything to enable it. It's just helpful to know that it's there.

So now you know a little bit more about how MongoDB's aggregation framework processes data. But how do we actually build an aggregation pipeline? MongoDB gives us a couple of options. We can build an aggregation pipeline programmatically, using syntax from a MongoDB driver, and include it in our application code that way. Or we can build, run, and export pipelines using MongoDB Compass, which helps us visualize documents as they're processed by each stage. Either approach makes it easy to get started using the aggregation framework to process and transform data.

Your next question may be: when is it a good idea to use aggregation? Well, just like anything else, it depends on your use case. However, the simplest solution is often the best. For simple data retrieval tasks, it's usually best to use find operations.
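The sequencing guidelines above can be sketched as a single pipeline. Everything here (the collection name `users`, the fields, the filter value) is hypothetical; the point is the order of the stages:

```python
# A pipeline sequenced per the best practices: filter first,
# index-friendly stages early, reshape last.
pipeline = [
    # 1. Filter early: shrink the data, and let $match use an index.
    {"$match": {"status": "active"}},
    # 2. Index-friendly stages near the front: an early $sort can use an index.
    {"$sort": {"created_at": -1}},
    # 3. The optimizer can coalesce $limit with the preceding $sort,
    #    so the sort keeps fewer documents in memory.
    {"$limit": 10},
    # 4. Reshape last, so earlier stages still see every field they need.
    {"$project": {"_id": 0, "name": 1, "created_at": 1}},
]

# Run programmatically with a driver (requires a live connection):
#   results = db.users.aggregate(pipeline)
```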
When you need to perform complex data processing that involves multiple transformations or data restructuring steps, the aggregation framework is a great tool.

Nice work. Let's recap what we covered in this lesson. MongoDB's aggregation framework is a powerful tool for performing data processing and analysis operations on collections. An aggregation pipeline consists of one or more stages that process documents. There are a number of stages you can use, but common ones include match, group, sort, project, limit, lookup, and unwind. The sequencing of these stages matters and will impact both results and performance. Great work. See you in the next video.