Sharding in stateful storage with millions of tenants

Hi. How do people approach sharding in state full/storage systems where there are hundreds of millions of tenants to shard and their size range from tiny to large enough that can’t be handled by a single shard/server? The naive approach that comes to mind is to make shards larger than a single physical machine, but the downside of that is that small tenants will get spread to more machines that they need to thus losing some efficiency.

I am not sure what you mean by making shards larger than a single physical machine. Shards are always smaller than a physical machine’s capacity.

Here I meant sort of overlay/virtual shards. Eg you group machines in groups of 3 and call each of these group a shard. Now for writes and reads you need to go to all of them. But at least they can house the data for very big tenants.

> Eg you group machines in groups of 3 and call each of these group a shard
> …
> But at least they can house the data for very big tenants.
Usually, those three nodes represent a single Paxos/consensus group and are there for replication, not to make shards larger than a single node.
Shards close to node size have some negative operations characteristics, e.g. slow/risky recovery on data loss.

So how do you shard then? Say these are time series like events and there isn’t a natural key to shard by except tenant id.

Replication is not directly related to sharding.

In the case you described, you have name of time series stream, e.g. “/foo/bart/baz” which could be translated to stream_id by the metadata layer (or used directly).

Both tenant_id + stream_id will be a shard key. You can use consistent hashing to shards this composite key.

In this case single shard will be the size of a single stream (keep in mind retention policies) or could be further split on time range.

Some of these streams get more throughout than a single node can handle. Also, consistent sharding leads to very unbalanced distribution due to the high skewness in the throughput of these streams. And an additional constraint, these streams might get unpredictable spikes in throughout.

Interesting. Can you tell more about the use case?

There are a few use cases in mind. But the most illustrative one would be google analytics/yandex metrika, cloudflare analytics etc.

I think those systems use user_id as part of the shard key

We went full circle :slight_smile: