Sharding Strategies - Training

You are currently acting as a learner.

Sharding Strategies / Training

What You'll Learn

Distribute Data in a Sharded Cluster: Explain how to partition data across shards and manage unsharded collections within a sharded cluster.

Distributing Data in a Sharded Cluster

When distributing data in a sharded cluster, you should have a strategic mindset focused on optimal performance and scalability.

The primary method to distribute data is by partitioning single, large collections across multiple shards, using a designated shard key. This method is implemented using the shardCollection command, as shown in the following example:

sh.shardCollection(
   "leafybank.messages", 
    {userId: 1}
);

Let’s visualize this strategy.

In this visual, first you can see collection B get partitioned based on the shard key { userId: 1 } using the shardCollection command. The data is then spread across shard1 and shard2. Collection D is partitioned across shards in the same manner. Collections A and C remain unsharded.

Partitioning collections is essential for handling already substantial and growing datasets that exceed certain scale thresholds. It is recommended to partition collections across shards when they start to approach 3 TB in size.

When sharding a collection, a shard key is critical for data and workload distribution and thus, query performance. Before sharding a collection, you must choose a well-designed shard key that aligns with your app’s query patterns and growth. We’ll discuss shard keys in more depth later.

How to Shard a Collection

In the video, we’ll show how to partition a collection across shards, focusing again on our LeafyBank scenario.

Select Play to learn more.

2:29

In MongoDB 8.0, users have two advanced mechanisms for managing collections in a sharded cluster: moving unsharded collections to dedicated shards and using resharding to distribute data faster than chunk migrations orchestrated by the balancer.

Select each mechanism to learn more.

If a user has multiple collections on one shared replica set, these collections compete for limited resources, often leading to performance bottlenecks. As data grows, increased disk I/O causes latency and strains system resources, degrading overall application performance.

The recommendation for these situations is to move unsharded collections to dedicated shards in your cluster by using the moveCollection command. This is a new feature in MongoDB 8.0.

This strategy is visualized below. There are four unsharded collections in this sharded cluster: A, B, C, and D. As mentioned before, not all collections in a sharded cluster need to be partitioned across shards. Sharded collections can coexist in a sharded cluster with unsharded collections. Collections B and D can be moved to their own shard, providing the benefits of horizontal scaling without actually sharding the collection.

The benefits of moving collections on dedicated shards are:

Workload Separation: Prevent resource contention by assigning collections to specific shards. You can use it to serve a multi-tenant architecture or for geographic data location requirements.
Hardware Optimization: Configure asymmetric shards with hardware tailored to specific collection requirements.
Independent Scaling: Scale collections individually according to workload needs.
Improved Resilience: Reduce recovery time by isolating potential failures.

This operation is online. You can continue reads and writes on the collection being moved while it is being written on shard1.

Say if LeafyBank’s accounts collection was growing, and the collection was competing for resources with other collections on shard0. It would be best to move the collection to its own shard to leverage dedicated resources. Here’s an example of how they’d use moveCollection command:

sh.moveCollection(
   "leafybank.accounts", 
    "shard02"
);

Resharding is an effective strategy for moving data with no downtime, ensuring no workload impact. By leveraging the reshard-to-shard mechanism, you can effectively distribute your collection's data across all shards. Unlike chunk migrations, resharding writes in parallel to all shards and drops the old collection at the end of the process, ensuring no orphan documents remain.

To learn more, consult MongoDB’s documentation on the reshard-to-shard mechanism. We will cover resharding in greater depth later.

Key Points to Remember

Great progress! So far, you’ve learned how to deploy a sharded cluster and now distribute collections within that architecture. Here are some key points to remember:

Partition Data: Use the shardCollection command to distribute a collection across shards.
Create Indexes for Shard Keys: Collections with existing data require indexes for shard keys. Use the createIndex command to create one. If there is no data in the collection, the shardCollection command creates the index automatically.
Manage Unsharded Collections: As of MongoDB 8.0, users may also use the moveCollection command to move unsharded collections onto dedicated shards.

Next, we’ll explore how to design an optimal shard key for your workload when partitioning a collection across shards.

Select Next to continue.