Fundamentals of Data Transformation / Working with multiple collections using $lookup
At MongoDB, we have a guiding principle: data that is accessed together should be stored together. However, sometimes we may have to store frequently accessed data in separate collections.
For example, you may choose to divide data to optimize read and write operations, or data privacy and security policies might require certain types of data to be stored separately.
So how can you combine data from multiple collections and keep your queries and application logic simple?
Luckily, MongoDB aggregation pipelines can help you do this. In this video, we'll discuss how you can use the lookup stage to combine data from multiple collections. We'll also discuss how to use the set stage to transform and structure data.
The lookup stage in MongoDB lets you combine data from the aggregation pipeline's input collection with data from a second collection. This results in documents that contain combined data from the input collection and the newly joined collection.
These documents are then passed along to the next aggregation stage. To better understand how the lookup stage works, let's look at it in action. As you may recall, we've been building an aggregation pipeline which tracks orders from a collection named sales with documents like this.
These documents record all of the books sold to a specific customer on a particular date.
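The slide itself isn't reproduced here, but a sales document along these lines would fit that description. The field names (customer, saleDate, books) and values are illustrative assumptions, not the course's actual schema:

```js
// Hypothetical sales document; field names and values are assumptions for illustration only.
{
  _id: ObjectId("..."),
  customer: ObjectId("..."),   // references the _id of a document in the customers collection
  saleDate: ISODate("2024-03-01T00:00:00Z"),
  books: [
    { title: "The Little Prince", genre: "Children's Literature", price: 9.99 },
    { title: "Dune", genre: "Science Fiction", price: 14.99 }
  ]
}
```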
Imagine we want to include a recommendations feature. To do this, we need to identify customers who have purchased books within a specific genre. In this example, we'll look for customers who have purchased children's books from us. We'll also identify what books they've purchased in the past within that genre so we can send them recommendations for new ones via email.
To do this, we'll join customer names and email addresses from a second collection called customers with our sales collection data. The documents in the customers collection look like this. These documents contain the customer's first and last names, email address, city and state, and any purchases the customer has made. Each purchase is referenced by the ObjectId of its corresponding document in the sales collection.
The ObjectId of the customer document is, in turn, referenced in those same sales documents.
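Again, the exact schema isn't shown in the transcript; a customers document might look roughly like this, with field names such as first_name, last_name, and email used purely as assumptions:

```js
// Hypothetical customers document; field names and values are assumptions for illustration only.
{
  _id: ObjectId("..."),        // referenced by the customer field in sales documents
  first_name: "Ana",
  last_name: "García",
  email: "ana.garcia@example.com",
  city: "Austin",
  state: "TX",
  sales: [ ObjectId("..."), ObjectId("...") ]   // ObjectIds of this customer's sales documents
}
```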
Let's use the lookup stage to combine data from the two different collections.
We'll look up the customer details from our customers collection, using the value of the customer field in our input documents. This value corresponds to the unique _id field in the documents of the customers collection.
Finally, we'll join that data to our input documents as an array named customer details.
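Put together, that lookup stage might be written in mongosh roughly like this; the field names customer and customerDetails are assumptions carried over from the narration:

```js
// A minimal sketch of the lookup stage described above (mongosh syntax).
// "customer" and "customerDetails" are assumed field names.
{
  $lookup: {
    from: "customers",        // the second collection to join
    localField: "customer",   // field in the input (sales) documents
    foreignField: "_id",      // the unique ID field in the customers collection
    as: "customerDetails"     // the joined data is added to each document as this array
  }
}
```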
If we ran the pipeline with only the lookup stage, the documents we received would look like this.
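With only that lookup stage in the pipeline, each output document would take roughly this shape (same illustrative field names as above):

```js
// Roughly what a document looks like after only the lookup stage (illustrative shape).
{
  _id: ObjectId("..."),
  customer: ObjectId("..."),
  saleDate: ISODate("2024-03-01T00:00:00Z"),
  books: [ /* every book in the sale */ ],
  customerDetails: [
    {
      _id: ObjectId("..."),
      first_name: "Ana",
      last_name: "García",
      email: "ana.garcia@example.com",
      city: "Austin",
      state: "TX",
      sales: [ ObjectId("...") ]
    }
  ]
}
```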
The customer details array adds a lot of information to our output, which already contains much more data than we need. We should avoid sending unnecessary data across the pipeline because it can complicate logic and impact performance.
Remember, we want to identify customers who have purchased children's books. Once we've identified these customers, we need their name, email address, and what children's books they've already purchased. Clearly, we need to further transform our data. To help us transform the data, we'll unwind the newly added customer details array so we can directly access the fields. This will allow for a simpler project stage at the end of our pipeline. Now that we've got that done, we need to filter for documents that have a genre of children's literature.
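As a sketch, that unwind step could look like this, assuming the joined array is named customerDetails:

```js
// Unwind the joined array so its fields can be referenced directly in later stages.
{ $unwind: "$customerDetails" }
```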
To do this, we'll use the set stage. The set stage allows you to transform your data by adding new fields or updating existing fields within a pipeline.
In our set stage, we'll create a new field named children's books, using the filter operator to extract books with the genre children's literature from the books array.
This field will be created for all of our output documents, but only those that contain children's books will have data in the field. The rest will have a value of null.
The filter operator uses the following parameters: input, which describes what we're filtering (in our case, the books array); as, which lets us create a variable for our filter, which we'll call book; and cond, which specifies a condition that each element of the array must meet. In this case, we use an equality expression: for each book variable created, we'll evaluate whether the book's genre matches children's literature.
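A minimal sketch of that set stage, assuming the illustrative field names used earlier (childrensBooks, books, and the genre string are all assumptions), might look like this:

```js
// A sketch of the set stage; "childrensBooks" and the genre string are assumed names.
{
  $set: {
    childrensBooks: {
      $filter: {
        input: "$books",   // the array we're filtering
        as: "book",        // variable name for each element of the array
        cond: { $eq: ["$$book.genre", "Children's Literature"] }   // keep only children's books
      }
    }
  }
}
```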
When we run this pipeline, we get documents which include our sales records, the customer details from the customer associated with the sale, and the newly created children's books field. This field is only populated if the sale includes a children's book. This document does contain a children's book, and it has data in the children's books field. However, this pipeline will also output sales documents that contain no children's books, which isn't very useful for us.
Let's add in a simple match stage requiring that any documents output after this stage must not have a value of null for the children's books field. Now our pipeline only shows us sales records where the customer purchased a children's book.
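As a sketch, that match stage might look like this:

```js
// Keep only documents where the new field actually contains data.
{ $match: { childrensBooks: { $ne: null } } }
// Depending on your data, you may also want to drop empty arrays, e.g.
// { $match: { childrensBooks: { $nin: [null, []] } } }
```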
Finally, we'll use a project stage. This stage will remove the _id field from the output documents, combine the customer's first and last names into a single string, and include the customer's email and the children's books field that we created with our set stage.
By including only the three fields we want, everything else will be excluded by the project stage.
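A sketch of that project stage, again assuming the illustrative customer field names, might look like this:

```js
// A sketch of the project stage; the customer field names are assumptions.
{
  $project: {
    _id: 0,                                       // remove the _id field
    name: {
      $concat: ["$customerDetails.first_name", " ", "$customerDetails.last_name"]
    },
    email: "$customerDetails.email",
    childrensBooks: 1                             // keep the field we built in the set stage
  }
}
```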
Our aggregation pipeline is finished. Let's see what we get when we run it. And look at that. We have exactly what we wanted.
There's now one document for each customer who has purchased children's books from us. This document includes their name, email address, and what children's books they've already purchased. Great work. Let's review what we've learned in this video.
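Using the illustrative field names from the sketches above, one final output document might take a shape like this (all values are made up):

```js
// Illustrative shape of one final output document.
{
  name: "Ana García",
  email: "ana.garcia@example.com",
  childrensBooks: [
    { title: "The Little Prince", genre: "Children's Literature", price: 9.99 }
  ]
}
```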
The lookup stage joins two different collections and adds the matching documents as an array. The set stage adds new fields or updates existing fields in your pipeline documents. We also created an aggregation pipeline using these stages. See you in the next one.
