Sharding in stateful storage with millions of tenants

allanKingsley · July 22, 2022, 1:05am

Hi. How do people approach sharding in state full/storage systems where there are hundreds of millions of tenants to shard and their size range from tiny to large enough that can’t be handled by a single shard/server? The naive approach that comes to mind is to make shards larger than a single physical machine, but the downside of that is that small tenants will get spread to more machines that they need to thus losing some efficiency.

horaceH · July 22, 2022, 1:58am

I am not sure what you mean by making shards larger than a single physical machine. Shards are always smaller than a physical machine’s capacity.

allanKingsley · July 22, 2022, 2:33am

Here I meant sort of overlay/virtual shards. Eg you group machines in groups of 3 and call each of these group a shard. Now for writes and reads you need to go to all of them. But at least they can house the data for very big tenants.

carrie · July 22, 2022, 3:14am

> Eg you group machines in groups of 3 and call each of these group a shard
> …
> But at least they can house the data for very big tenants.
Usually, those three nodes represent a single Paxos/consensus group and are there for replication, not to make shards larger than a single node.
Shards close to node size have some negative operations characteristics, e.g. slow/risky recovery on data loss.

allanKingsley · July 22, 2022, 4:58am

So how do you shard then? Say these are time series like events and there isn’t a natural key to shard by except tenant id.

carrie · July 22, 2022, 5:58am

Replication is not directly related to sharding.

In the case you described, you have name of time series stream, e.g. “/foo/bart/baz” which could be translated to stream_id by the metadata layer (or used directly).

Both tenant_id + stream_id will be a shard key. You can use consistent hashing to shards this composite key.

In this case single shard will be the size of a single stream (keep in mind retention policies) or could be further split on time range.

allanKingsley · July 22, 2022, 6:40am

Some of these streams get more throughout than a single node can handle. Also, consistent sharding leads to very unbalanced distribution due to the high skewness in the throughput of these streams. And an additional constraint, these streams might get unpredictable spikes in throughout.

carrie · July 22, 2022, 7:43am

Interesting. Can you tell more about the use case?

allanKingsley · July 22, 2022, 9:14am

There are a few use cases in mind. But the most illustrative one would be google analytics/yandex metrika, cloudflare analytics etc.

carrie · July 22, 2022, 10:25am

I think those systems use user_id as part of the shard key

allanKingsley · July 22, 2022, 10:52am

We went full circle