Sharding Strategies - Training

What You'll Learn

Identify the Optimal Shard Key for Your Application: Implement the optimal shard key and data distribution option for your application's requirements.

Shard Keys

When partitioning a collection, the shard key is the core mechanism MongoDB uses to distribute a collection across shards, ensuring data is organized efficiently.

The purpose of a shard key is to distribute read and write operations evenly across all shards, even at scale. Additionally, read operations should target as few shards as possible, which reduces latency for important queries.

To do this, be sure to carefully analyze critical query patterns and potential shard key candidates.

To achieve an optimal shard key, let’s look at the different parts of a shard key and how they affect the sharding of a collection.

Breaking Down Shard Keys

As we mentioned before, you partition a collection with the shardCollection command, which shards the collection on your selected shard key. Two parts of the command determine the design of a shard key: the shard key field(s) and the distribution option.
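To make the shape of the command concrete, here is a minimal Python sketch that builds a shardCollection command document as a plain dictionary. The helper name, the namespace, and the field names customerId and messageId are assumptions made for illustration; they do not come from the lesson.

```python
def build_shard_collection_command(namespace, key_fields, hashed_field=None):
    """Build a document for the shardCollection admin command.

    namespace    -- "<database>.<collection>"
    key_fields   -- ordered list of shard key field names
    hashed_field -- optional field to hash; every other field is ranged (1)
    """
    if not key_fields:
        raise ValueError("a shard key needs at least one field")
    if hashed_field is not None and hashed_field not in key_fields:
        raise ValueError("hashed_field must be one of key_fields")
    # The value attached to each field is the distribution option.
    key = {f: ("hashed" if f == hashed_field else 1) for f in key_fields}
    return {"shardCollection": namespace, "key": key}


# Hypothetical namespace and field names, for illustration only.
cmd = build_shard_collection_command(
    "leafybank.messages", ["customerId", "messageId"]
)
print(cmd)
```

With a driver such as PyMongo, a document like this would typically be sent to the admin database of a sharded cluster; the sketch stops at building the document so that it runs without a cluster.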

Note

Every sharded collection must have a shard key defined, and the shard key must be indexed. With an index supporting the shard key, MongoDB can efficiently process queries, sort data, and enforce constraints. This significantly enhances performance, especially for large datasets.

Let’s first look at the shard key field.

Shard Key Field

When you use the shardCollection command, you designate one or more fields, represented as <field1>, <field2>, and so on, to act as the shard key. These fields are pivotal in the distribution process. Every document in the collection gets mapped based on the values of these fields. Therefore, a good shard key must be granular enough to ensure effective distribution.

In the following short video, we’ll examine a document from the LeafyBank messages collection to determine the shard key field.

Choosing a shard key field can seem complicated. Fortunately, there are key characteristics and other metrics you can use to assess the effectiveness of a shard key.

Key Characteristics of an Effective Shard Key

We’ll focus on a few key characteristics for assessing shard keys: cardinality, frequency, monotonicity, read distribution, and write distribution.

The cardinality of a shard key is the number of distinct values it can take, which determines how many pieces, or ranges, of data can be created. This affects how well the collection can grow. It's best to choose shard keys with many unique values (high cardinality), as keys with fewer unique values (lower cardinality) can limit growth and performance.

A shard key's frequency shows how often each of its values appears in the data. If a small number of values accounts for most documents, the ranges that hold those values become bottlenecks, which limits scalability.
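As a rough illustration of cardinality and frequency, the sketch below measures both for two candidate fields on a small in-memory sample. The sample documents, the field names, and the key_stats helper are hypothetical; on a real cluster these figures would come from MongoDB's own analysis rather than from hand-written code.

```python
from collections import Counter


def key_stats(docs, field):
    """Return (cardinality, share of documents held by the most common value)."""
    counts = Counter(doc[field] for doc in docs)
    cardinality = len(counts)
    top_share = max(counts.values()) / len(docs)
    return cardinality, top_share


# Hypothetical sample: 1,000 rides, with 70% of them in one region.
regions = ["north"] * 7 + ["south", "east", "west"]
docs = [{"rideId": i, "region": regions[i % 10]} for i in range(1000)]

print("region:", key_stats(docs, "region"))
print("rideId:", key_stats(docs, "rideId"))
```

In this sample, region has low cardinality and one very frequent value, while rideId is unique per document, so rideId scores better on both characteristics.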

In the context of sharding, monotonicity refers to a shard key whose values consistently increase or decrease over time. If a key is monotonic, new inserts tend to accumulate in a single chunk (the one that covers the highest or lowest key values), potentially causing uneven distribution and performance issues.
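The effect of a monotonic key can be sketched with a toy model of ranged chunks. The split points and key values below are invented for illustration, and the chunk_index helper is a simplification that is not part of MongoDB.

```python
import bisect
from collections import Counter


def chunk_index(split_points, key):
    """Index of the range that holds `key`, given sorted split points.

    Chunk 0 covers keys below split_points[0]; the last chunk covers keys at
    or above split_points[-1] and has no upper bound.
    """
    return bisect.bisect_right(split_points, key)


split_points = [1000, 2000, 3000]   # four chunks over the existing data
new_keys = range(3000, 3500)        # monotonically increasing inserts

# Count how many of the new inserts land in each chunk.
hits = Counter(chunk_index(split_points, k) for k in new_keys)
print(hits)
```

Every new key is larger than the last split point, so all 500 inserts land in the final chunk while the other three receive none.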

Read distribution provides insights into how read operations are distributed across the shards in a cluster. When running the analyzeShardKey command, which we will cover later, the readWriteDistribution metric includes helpful sub-metrics, specifically:

percentageOfSingleShardReads: Shows the percentage of reads that target a single shard. This is the fastest way to read data because no results need to be merged across shards.
percentageOfMultiShardReads: Shows the percentage of reads that target multiple shards. These reads take more time and resources because the mongos has to combine the results.
percentageOfScatterGatherReads: Shows the percentage of reads that are sent to every shard. These are usually the slowest reads and use the most resources.
numReadsByRange: Shows the number of times each range is targeted. Avoid a shard key where the distribution of numReadsByRange is very skewed, since that implies one or more hot shards for reads.

Ideally, percentageOfMultiShardReads and percentageOfScatterGatherReads combined should not exceed 5-10%.
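The guideline above can be expressed as a small check. This is a sketch only: the two dictionaries hold made-up numbers that borrow the sub-metric names from the list above, and they are not real analyzeShardKey output.

```python
def non_targeted_read_share(read_distribution):
    """Combined percentage of multi-shard and scatter-gather reads."""
    return (
        read_distribution["percentageOfMultiShardReads"]
        + read_distribution["percentageOfScatterGatherReads"]
    )


def meets_read_guideline(read_distribution, limit=10.0):
    """True if non-targeted reads stay at or below `limit` percent."""
    return non_targeted_read_share(read_distribution) <= limit


# Hypothetical results for two candidate shard keys.
candidate_a = {
    "percentageOfSingleShardReads": 94.0,
    "percentageOfMultiShardReads": 4.0,
    "percentageOfScatterGatherReads": 2.0,
}
candidate_b = {
    "percentageOfSingleShardReads": 60.0,
    "percentageOfMultiShardReads": 15.0,
    "percentageOfScatterGatherReads": 25.0,
}

for name, dist in [("A", candidate_a), ("B", candidate_b)]:
    print(name, non_targeted_read_share(dist), meets_read_guideline(dist))
```

Under these assumed numbers, candidate A stays within the guideline and candidate B does not.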

Write distribution provides insights into how write operations are distributed across the shards in a cluster. When running the analyzeShardKey command, which we will cover later, the readWriteDistribution metric includes helpful sub-metrics, specifically:

percentageOfSingleShardWrites: Shows the percentage of writes that target a single shard. This is the fastest way to write data.
numWritesByRange: Shows the number of times each range is targeted. Avoid a shard key where the distribution of numWritesByRange is very skewed, since that implies one or more hot shards for writes.
percentageOfShardKeyUpdates: Shows the percentage of updates that modify shard key values. These updates can be costly because a document may have to move to another shard.
percentageOfSingleWritesWithoutShardKey and percentageOfMultiWritesWithoutShardKey: Show the percentage of single-document and multi-document writes that do not filter on the shard key and therefore run as scatter-gather operations. Find a shard key that lets most of your mission-critical queries target a single shard.
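One simple way to judge whether a per-range count such as numWritesByRange is skewed is to compare the busiest range with the average. The skew_ratio helper and the counts below are illustrative assumptions; MongoDB does not report this ratio itself.

```python
def skew_ratio(counts_by_range):
    """Ratio of the busiest range to the mean; 1.0 means perfectly even."""
    total = sum(counts_by_range)
    if total == 0:
        raise ValueError("no operations recorded")
    mean = total / len(counts_by_range)
    return max(counts_by_range) / mean


# Hypothetical per-range write counts for two candidate shard keys.
even_writes = [250, 260, 240, 250]
skewed_writes = [20, 30, 900, 50]

print(round(skew_ratio(even_writes), 2))
print(round(skew_ratio(skewed_writes), 2))
```

A ratio close to 1 suggests writes are spread evenly across ranges, while a much larger ratio points to a hot range and therefore a likely hot shard.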

Distribution Options

The second aspect of designing a shard key is the distribution option. These options include ranged, hashed, and zoned sharding. The most common option is ranged, but there are use cases for hashed. You can use zoned sharding with either distribution option.

Ranged sharding in MongoDB splits data across shards by using a range of shard key values. This makes range queries on the shard key efficient, but requires careful shard key selection to avoid hotspots due to predictable data patterns.

To select this strategy, complete the key field as follows: key: { <field>: 1}.

Example: Imagine a ride-sharing application that needs to manage and efficiently process ride data. Each ride document has a region field and a rideId field.

The application can benefit from ranged sharding by using { region: 1, rideId: 1 } as the shard key. Reads and writes that query by rideId also need to include region to target the correct shard. This choice makes range queries on rides per region very efficient. For instance, when generating reports or analyzing patterns such as "number of rides requested in a given region in the last month," the queries can be routed to the shards that hold that region's range of { region: 1, rideId: 1 } values.
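The routing idea behind the ride-sharing example can be sketched with a toy lookup table. The chunk boundaries, the shard names, and the assumption that rideId is a non-negative number are all invented for illustration; a real cluster keeps this metadata on its config servers.

```python
import bisect

# Hypothetical chunk boundaries on (region, rideId). Chunk i covers keys from
# boundaries[i - 1] (inclusive) up to boundaries[i] (exclusive).
boundaries = [("east", 5000), ("north", 0), ("south", 0)]
chunk_to_shard = ["shardA", "shardB", "shardB", "shardC"]


def shard_for(region, ride_id):
    """Shard that owns one specific (region, rideId) key."""
    return chunk_to_shard[bisect.bisect_right(boundaries, (region, ride_id))]


def shards_for_region(region):
    """Shards that may hold rides for `region` (a shard key prefix query).

    Assumes rideId values are non-negative numbers.
    """
    lo = bisect.bisect_right(boundaries, (region, 0))
    hi = bisect.bisect_right(boundaries, (region, float("inf")))
    return sorted(set(chunk_to_shard[lo:hi + 1]))


print(shard_for("east", 4999))
print(shards_for_region("north"))
print(shards_for_region("east"))
```

A query that supplies region only still narrows the search to the shards that hold that region's range, which is why the region prefix matters for targeting.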

Hashed sharding in MongoDB uses hashed shard key values to distribute data evenly across shards, which prevents uneven inserts. Specifically, hashed sharding computes a hash of the natural shard key value and then assigns each document to a range of hashed values. It is ideal for special cases where the natural shard key is monotonically increasing or decreasing and all inserts would otherwise go to one chunk. However, hashed sharding does not support efficient range queries on the shard key, because hashing discards the natural order of the values.

To select this strategy, complete the key field as follows: key: { <field>: "hashed"}.

Example: Consider an e-commerce application where each document represents an order and orderID is monotonically increasing. By using hashed sharding on orderID, { orderID: "hashed" }, you can balance write operations across shards. This distribution is beneficial for workloads with rapid order inserts, because it ensures that no single shard becomes a bottleneck due to a high volume of sequential inserts.
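To see why hashing helps with a monotonic key, the sketch below hashes a run of increasing order IDs and counts how many land in each of four equal ranges of the hash space. It uses hashlib.md5 purely as a stand-in hash function and is not meant to reproduce the values MongoDB computes for a hashed index; the order IDs and the number of chunks are also assumptions.

```python
import hashlib
from collections import Counter

NUM_CHUNKS = 4
HASH_SPACE = 2 ** 64


def hashed_chunk(order_id):
    """Chunk index after hashing; the hash space is cut into equal ranges."""
    digest = hashlib.md5(str(order_id).encode()).digest()
    hashed = int.from_bytes(digest[:8], "big")   # 64-bit stand-in hash value
    return hashed * NUM_CHUNKS // HASH_SPACE


order_ids = range(100_000, 104_000)   # monotonically increasing, hypothetical
by_chunk = Counter(hashed_chunk(o) for o in order_ids)
print(sorted(by_chunk.items()))
```

Although the order IDs are sequential, their hashed values scatter across all four ranges, so inserts are spread roughly evenly instead of piling into one chunk.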

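To shard a collection that already contains data using a hashed shard key, you'll need to create a hashed index on the shard key first. For an empty collection, MongoDB creates this index when you shard the collection. This index is essential as it enables the balancer to distribute data across the shards.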

Zoned sharding lets global apps store ranges of user data in specific areas (zones), like a continent. This reduces latency by storing and accessing data close to its origin and can help meet data residency requirements.

This strategy can be combined with ranged or hashed sharding by using the addShardToZone and updateZoneKeyRange commands. To learn how to implement zoned sharding, refer to the MongoDB documentation on Zoning.
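As a rough sketch of what those two commands look like, the Python below builds their command documents as plain dictionaries. The shard name, zone name, namespace, and field names are hypothetical, and the MIN_KEY and MAX_KEY strings are placeholders that stand in for the BSON MinKey and MaxKey values a real driver would provide. Check the Zoning documentation for the exact options your server version supports.

```python
# Placeholders only: a real driver supplies BSON MinKey/MaxKey objects.
MIN_KEY = "<MinKey>"
MAX_KEY = "<MaxKey>"


def add_shard_to_zone(shard, zone):
    """Document that associates a shard with a zone."""
    return {"addShardToZone": shard, "zone": zone}


def update_zone_key_range(namespace, min_key, max_key, zone):
    """Document that assigns a range of shard key values to a zone."""
    return {
        "updateZoneKeyRange": namespace,
        "min": min_key,   # lower bound of the range
        "max": max_key,   # upper bound of the range
        "zone": zone,
    }


# Hypothetical setup: keep all rides for region "EU" on shards in zone "EU".
commands = [
    add_shard_to_zone("shard0", "EU"),
    update_zone_key_range(
        "rides.trips",
        {"region": "EU", "rideId": MIN_KEY},
        {"region": "EU", "rideId": MAX_KEY},
        "EU",
    ),
]
for c in commands:
    print(c)
```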

Common Issues to Avoid

A poorly selected shard key could result in these three issues: uneven load distribution, scatter-gather queries, and jumbo chunks.

Uneven load distribution occurs when data or query load is not evenly spread across shards, leading to bottlenecks and hotspots.

If a shard key leads to data being grouped in a way that is not well balanced, certain shards may end up handling significantly more requests or storing disproportionately more data than others.

Scatter-gather queries involve sending a query to multiple shards, followed by collecting and consolidating results from those shards to return a complete result set, resulting in increased overhead and latency.

The choice of a shard key affects how often scatter-gather queries occur. A shard key that aligns with critical queries enables targeted queries, which are routed to specific shards instead of being broadcast to all of them.
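The link between query shape and routing can be sketched as a simple rule: a query can be targeted when it constrains the shard key or a leading prefix of it. The helper names and field names below are assumptions, and the rule ignores details such as the kind of condition used on each field.

```python
def matched_prefix(query_fields, shard_key_fields):
    """Leading shard key fields that the query constrains, in key order."""
    prefix = []
    for field in shard_key_fields:
        if field not in query_fields:
            break
        prefix.append(field)
    return prefix


def routing(query_fields, shard_key_fields):
    """Classify a query as targeted or scatter-gather (simplified)."""
    if matched_prefix(query_fields, shard_key_fields):
        return "targeted"
    return "scatter-gather"


shard_key = ["region", "rideId"]   # hypothetical compound shard key
for fields in [{"region", "rideId"}, {"region"}, {"rideId"}, {"driverId"}]:
    print(sorted(fields), "->", routing(fields, shard_key))
```

Queries on rideId alone skip the leading region field, so in this model they are broadcast to every shard even though rideId is part of the shard key.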

Jumbo chunks occur when a chunk exceeds the maximum size and cannot be split. This is often due to selecting a low-cardinality shard key, where so many documents share one shard key value that the chunk holding them cannot be divided, which leads to uneven data distribution.

The balancer can only split jumbo chunks if the shard key is refined. Jumbo chunks can only be moved with user intervention. Choosing a shard key with a high cardinality is important to prevent these situations.
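The connection between low cardinality and jumbo chunks can be illustrated with a toy check. It uses a document count as a stand-in for chunk size, and the sample data, field names, and limit are invented for illustration.

```python
from collections import Counter


def unsplittable_values(docs, field, max_docs_per_chunk):
    """Key values whose documents alone exceed the chunk limit.

    A chunk can only be split between distinct shard key values, so one value
    with too many documents produces a chunk that cannot be divided.
    """
    counts = Counter(doc[field] for doc in docs)
    return sorted(v for v, n in counts.items() if n > max_docs_per_chunk)


# Hypothetical orders: 900 share the status "delivered", 100 are "pending".
docs = [
    {"orderId": i, "status": "delivered" if i < 900 else "pending"}
    for i in range(1000)
]

print(unsplittable_values(docs, "status", max_docs_per_chunk=500))
print(unsplittable_values(docs, "orderId", max_docs_per_chunk=500))
```

With status as the shard key, the "delivered" value alone exceeds the limit, while the high-cardinality orderId field never does.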

Selecting the Best Shard Key

To analyze shard keys and choose the best one for your application’s workloads, you can use the configureQueryAnalyzer and analyzeShardKey commands. These commands gather metrics to assess the balance, performance, and scalability of potential shard keys based on your application’s most important and frequent queries.
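As a hedged sketch of how these two commands fit together, the Python below builds their command documents as plain dictionaries. The mode and samplesPerSecond options reflect our understanding of the documented command shape and should be checked against the reference for your server version; the namespace and the candidate key are hypothetical.

```python
def configure_query_analyzer(namespace, mode="full", samples_per_second=1):
    """Document that turns query sampling on ("full") or off ("off")."""
    cmd = {"configureQueryAnalyzer": namespace, "mode": mode}
    if mode == "full":
        # Sampling rate only applies while sampling is enabled.
        cmd["samplesPerSecond"] = samples_per_second
    return cmd


def analyze_shard_key(namespace, key, key_characteristics=True,
                      read_write_distribution=True):
    """Document that requests metrics for one candidate shard key."""
    return {
        "analyzeShardKey": namespace,
        "key": key,
        "keyCharacteristics": key_characteristics,
        "readWriteDistribution": read_write_distribution,
    }


# Hypothetical namespace and candidate key.
start_sampling = configure_query_analyzer("rides.trips", samples_per_second=5)
analyze = analyze_shard_key("rides.trips", {"region": 1, "rideId": 1})
stop_sampling = configure_query_analyzer("rides.trips", mode="off")

for c in (start_sampling, analyze, stop_sampling):
    print(c)
```

The order matters in practice: sampling has to run for a while against real traffic before the read and write distribution metrics have queries to analyze.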

In the following video, we’ll show you how to use these commands to analyze shard keys. We’ll also discuss key characteristics and performance metrics associated with shard keys.

Key Points to Remember

Awesome work! Here are some key points to remember as you design an optimal shard key and sharding strategy:

Understand the Role of a Shard Key: Choose a shard key that evenly distributes data and operations across shards to improve performance for all database operations.
Plan Ahead for Shard Key Selection: Picking the right shard key is crucial for your database's performance and ability to scale. Factor in your application's demands and query styles to avoid uneven data distribution.
Select the Right Distribution Options: Ranged sharding is the most common option. Use hashed keys in special cases, like when you have a monotonic shard key and need to evenly distribute writes. You may also zone ranges of data.
Use the configureQueryAnalyzer and analyzeShardKey commands: Use these commands to review query patterns and assess shard key options. Check keyCharacteristics and readWriteDistribution to make an informed choice.

Note

Sometimes the choice of a shard key to achieve scale is hard and requires application changes. Consider consulting with MongoDB’s Professional Services.

If the performance of your shard key ever degrades or your requirements change, don’t worry! MongoDB lets you change your shard key in a few steps. We’ll cover this next.