Resources to evaluate ClickHouse

kennyHarold · July 14, 2023, 7:36pm

Any good posts/talks about ClickHouse and how to operate it?
I'm exploring the idea of setting it up as a mostly write-once, read-a-lot datastore. We’re currently using a MySQL DB and hitting some limits regarding table size and, especially, cost. It seems the use case lends itself to ClickHouse, but before going forward I want to better understand what it could look like.

From the docs I see that you can probably put it in front of a database and it should just work, but there’s also the option of leveraging S3 storage for colder data.

carrie · July 14, 2023, 9:04pm

Are you aware of the difference between OLTP and OLAP workloads?

carrie · July 14, 2023, 9:19pm

> From the docs I see that you can probably put it in front of a database and it should just work, but there’s also the option of leveraging S3 storage for colder data
From what I know about it, that shouldn't be the case. The whole power of CH is the performance you can get out of it by heavily exploiting data compression and vectorization, and that usually requires a lot of specialized adaptation.

carrie · July 14, 2023, 11:00pm

The simplest good post you can find is about how CH can handle time-series data from millions of sensors on a single local notebook.

kennyHarold · July 15, 2023, 12:31am

So the workload is pretty OLAP: we’re storing payment events and then aggregating them into monthly/daily figures based on a model.
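
To make the shape of that workload concrete, here is a minimal sketch of a daily/monthly roll-up in plain Python. It is only an illustration: the event fields (`timestamp`, `account`, `amount`) and the sample values are hypothetical, and the "model" mentioned above is not represented; the sketch just sums amounts per period and account.

```python
from collections import defaultdict
from datetime import datetime
from decimal import Decimal

# Hypothetical payment events: (timestamp, account, amount).
EVENTS = [
    (datetime(2023, 7, 1, 9, 30), "acct-1", Decimal("10.00")),
    (datetime(2023, 7, 1, 17, 5), "acct-1", Decimal("2.50")),
    (datetime(2023, 7, 2, 8, 0), "acct-2", Decimal("7.25")),
    (datetime(2023, 8, 3, 12, 0), "acct-1", Decimal("4.00")),
]


def rollup(events, period):
    """Sum amounts per (bucket, account); period is 'day' or 'month'."""
    fmt = {"day": "%Y-%m-%d", "month": "%Y-%m"}[period]
    totals = defaultdict(Decimal)  # missing keys start at Decimal('0')
    for ts, account, amount in events:
        totals[(ts.strftime(fmt), account)] += amount
    return dict(totals)


daily = rollup(EVENTS, "day")
monthly = rollup(EVENTS, "month")

for key, total in sorted(monthly.items()):
    print(key, total)
```

Events are written once and only ever read back through aggregations like these, which is the access pattern an OLAP store is built for.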

> From what I know about it, that shouldn't be the case.
I understood that it was possible to set up connection access (this), but I'm not sure how to move to more native CH storage (which I think is the good long-term solution).

Is this https://clickhouse.com/blog/working-with-time-series-data-and-functions-ClickHouse the post you mean?

carrie · July 15, 2023, 1:28am

I meant this one https://altinity.com/blog/2020/1/1/clickhouse-cost-efficiency-in-action-analyzing-500-billion-rows-on-an-intel-nuc

carrie · July 15, 2023, 2:53am

Those guys are good; there are many articles about operating it around the net.

carrie · July 15, 2023, 3:24am

https://www.slideshare.net/Altinity/clickhouse-data-warehouse-101-the-first-billion-rows-by-alexander-zaitsev-and-robert-hodges-altinity

carrie · July 15, 2023, 4:38am

They even have a k8s operator if you need one.

kennyHarold · July 15, 2023, 5:06am

Yeah, I’ve been looking at that. From what I’ve been told it just works, so that sounds very promising.

johnRichardson · July 15, 2023, 5:47am

The MySQL thing you are linking to would be pretty useless, from what I understand. Its use case is more about joining some ClickHouse data with MySQL data. Say you run some analytics on data in ClickHouse and you want to enrich the results after aggregation. To get ClickHouse performance you need to find a way to move the data into ClickHouse MergeTree-family storage.
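
As a rough illustration of what "MergeTree-family storage" means in practice, here is a minimal sketch that renders a `CREATE TABLE ... ENGINE = MergeTree` statement from Python. The table name, columns, sort key, and partition expression are all hypothetical and would need to be chosen from the real query patterns; `merge_tree_ddl` is a helper made up for this example, not part of any ClickHouse client.

```python
def merge_tree_ddl(table, columns, order_by, partition_by=None):
    """Render a ClickHouse CREATE TABLE statement using the MergeTree engine.

    columns is a list of (name, clickhouse_type) pairs; order_by is the
    list of sort-key columns, which MergeTree tables require.
    """
    cols = ",\n".join(f"    {name} {ctype}" for name, ctype in columns)
    parts = [f"CREATE TABLE {table}\n(\n{cols}\n)", "ENGINE = MergeTree"]
    if partition_by:
        parts.append(f"PARTITION BY {partition_by}")
    parts.append(f"ORDER BY ({', '.join(order_by)})")
    return "\n".join(parts)


# Hypothetical payment-events table, partitioned by month.
ddl = merge_tree_ddl(
    "payment_events",
    [
        ("event_time", "DateTime"),
        ("account_id", "UInt64"),
        ("amount", "Decimal(18, 2)"),
    ],
    order_by=["account_id", "event_time"],
    partition_by="toYYYYMM(event_time)",
)
print(ddl)
```

The `ORDER BY` key largely determines how well the data compresses and how fast range scans are, so it is usually the part of the data model that takes the most rework when coming from MySQL.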

johnRichardson · July 15, 2023, 6:10am

There is an engine that does this for you; it’s experimental and I can’t say much about it: https://clickhouse.com/docs/en/engines/database-engines/materialized-mysql

johnRichardson · July 15, 2023, 6:43am

Alternatively, CDC to Kafka then Kafka to ClickHouse (via https://clickhouse.com/blog/kafka-connect-connector-clickhouse-with-exactly-once). That would be something I would be comfortable with running in production.

johnRichardson · July 15, 2023, 7:11am

Operating ClickHouse beyond a few tables and a few nodes becomes a bit of a pain, with or without the operator, and changes are ops-intensive. It was initially designed as a single-node thing; then replication, distributed query execution, and distributed schema changes were added/slapped on top, and the current ops experience isn’t too pretty. So if you are embarking on this journey, be prepared to invest time in keeping it running smoothly.

kennyHarold · July 15, 2023, 8:31am

So the operator is key for minimizing ops/toil?

For the CH <-> MySQL thing, I didn’t have too many expectations. I understand that there might be a lot of data model reworking before getting to an operational state.

johnRichardson · July 15, 2023, 10:32am

I’d say so, but don’t expect magic.

I’m not too familiar with its state now. But the last time I checked, 2y ago (that’s like a lifetime ago in the CH ecosystem), a lot of things were done on a “best effort” basis. ClickHouse didn’t expose good primitives for automating cluster management (e.g. removing replicas, adding replicas, changing tables in a distributed manner).

kennyHarold · July 15, 2023, 12:01pm

I’ve heard from folks at Chronosphere that it’s a pretty boring system and they're using it for TB-size loads, with traces being stored there.

johnRichardson · July 15, 2023, 1:01pm

My experience is from Cloudflare. It wasn’t boring at all there

kennyHarold · July 15, 2023, 1:41pm

Maybe it’s ok for TB scale but struggles with PB scale now