Fundamentals of Data Transformation / Common Aggregation Pipelines: $match, $group, $unwind, $project
MongoDB's aggregation framework provides many stages to query, process, and transform data. While you can use and combine stages as you like, there are several common pipelines you can use as blueprints to accomplish common tasks.
In this video, we'll walk you through one common aggregation pipeline sequence using the match, group, and project stages.
Match is used to filter for specific documents. The group stage groups documents by a specified value.
Finally, the project stage is used so the pipeline returns only the fields we need. Before we get into the pipeline, let's briefly introduce you to the dataset.
We'll be working with data from an online bookstore app in a sales collection.
We wanna generate a report showing the total revenue generated from book sales for each genre of book during twenty twenty four. We can use this information to make informed decisions about inventory and marketing. We'll use MongoDB's aggregation framework to accomplish this task. Let's get started.
First, let's look at an example document from the sales collection.
As you can see, it contains documents which record all of the books sold to a specific customer on a particular date. The customer is recorded as an object ID, which references our customer's collection.
Note that the books field is an array where each book in the array contains the genre and price. We need to group all books sold by genre. To do this, we'll use the match and group stages.
Since the genres are stored in an array, we also need an unwind stage after the match stage in our pipeline.
The unwind stage deconstructs an array field from the input documents.
Unwind outputs a document for each element of the array, effectively flattening the array. By unwinding the books array, each book sold in every sales document will be represented by its own document in the output of this stage.
This is crucial for setting up the next stage where we group them by genre.
Since we want our report to aggregate data from the year twenty twenty four, the first thing we'll do is ensure that we're only working with sales documents from that year.
To accomplish this, we'll write a match stage using the greater than or equal to and less than operators to specify that the date should be on or after January first twenty twenty four and before January first twenty twenty five. As we discussed earlier, it's usually best to place a match stage at the beginning of a pipeline. Not only will you reduce the size of the dataset being passed along to the next stage, but you can also make use of existing indexes to improve performance.
If you only ran the match stage on your data, the pipeline would return all documents that have a date field with a value in the year twenty twenty four. If we only wanted to do this, we could have used a find operation instead.
However, we want to perform calculations and transform this data, so we'll need to add more stages to our pipeline.
So now we have sales by year. Next, to track each book sold, we'll need to unwind the books array from each sales document so we can group them by genre. Now that we've used unwind, we have one document for each book sold in twenty twenty four.
The next step is grouping these documents by the genre and calculating how much revenue they generated.
We'll do this by adding the group stage to our pipeline and specifying the group key as the genre of each book.
Since the genre field is contained within the embedded books document, we'll use dot notation to specify our group key as books dot genre. We'll also use the sum operator to add up all the sales for each book and create a field named total revenue. The The total revenue field will show us how much revenue each genre has generated.
Now we've transformed our data from sales by customer to revenue by genre.
We have one document for each genre that lists the total revenue for that genre. Now our output documents contain all the information we need, but they could be a little cleaner and easier to read. We can use the project stage to exclude, include, rename, or even create new fields in our output documents to clean up the final dataset. Let's tidy things up by changing the group ID field name to genre so it's easy for other stakeholders to understand what each document represents.
To do this, we'll start by removing the group ID field by setting its value to zero.
We'll add a field named genre and tell MongoDB to print the value of each group ID field in the new genre field. Finally, we'll be sure to include the total revenue field we calculated during the group stage by setting its value to one.
After running our aggregation pipeline, our output documents look like this. Now we can see the total revenue generated by each of our genres.
Alright. Let's take a moment to recap what we covered in this video. In this video, we discussed the match, group, project aggregation pipeline stage sequence and created an example pipeline to illustrate how you can make use of these aggregation stages to transform your data and get exactly the information you need. We also made use of the unwind stage because the value we are using to group our documents was contained within an array.
