Aggregating each node's data on the primary node

hazel · January 8, 2022, 1:55am

Hello guys
I’m new to Kafka, so sorry for any misunderstanding.
My use case: a 9-server cluster (3 nodes quorum-replicated, the other 6 just parallel nodes). Every node writes some data to Postgres locally (separate Postgres clusters). I’d like to aggregate each node’s data on the primary node, taking a summary of each node’s data.

Is this a reasonable use case for Kafka/CDC/Debezium/whatever, or are they overkill here? If they are, why? And what tools would be better then?

Thanks in advance

dJones · January 8, 2022, 2:40am

What do you mean by “3 nodes quorum-replicated, 6 others just parallel nodes”?

hazel · January 8, 2022, 3:04am

1 primary node, 2 replicas of it
The other 6 nodes are basically just on the same network, doing the same work as the ‘main’ 3, but are not technically part of the quorum cluster

dJones · January 8, 2022, 3:42am

You might want to read up a bit on how Kafka works.

dJones · January 8, 2022, 4:49am

You set up Kafka clusters. It is not a master/replica kind of thing like databases.

hazel · January 8, 2022, 4:54am

Can’t I just use a single-node Kafka cluster on the primary node?

dJones · January 8, 2022, 6:13am

Like I said, there is no primary node. You talk to the cluster.

dJones · January 8, 2022, 6:31am

You can have a single-broker cluster, but I would definitely not recommend that for production.

hazel · January 8, 2022, 8:21am

Well, I’m bound to a single primary node…
But anyway, thanks for the suggestions

dJones · January 8, 2022, 9:37am

Once you create topics, their replicas are created on other brokers in that cluster, assuming you are using replication-factor 3 and have 3 or more brokers in that cluster.
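
To make that constraint concrete, here is a minimal Python sketch of placing partition replicas on distinct brokers. It is only an illustration: `assign_replicas` is a hypothetical helper, and the round-robin placement is a simplification, not Kafka's actual assignment algorithm. The one real rule it mirrors is that a topic's replication factor cannot exceed the number of brokers available.

```python
def assign_replicas(brokers, partitions, replication_factor):
    """Place each partition's replicas on distinct brokers, round-robin.

    Hypothetical helper for illustration; not Kafka's real placement logic.
    """
    if replication_factor > len(brokers):
        # A replica set needs distinct brokers, so this cannot be satisfied.
        raise ValueError(
            f"replication factor {replication_factor} exceeds "
            f"{len(brokers)} available broker(s)"
        )
    n = len(brokers)
    return {
        p: [brokers[(p + i) % n] for i in range(replication_factor)]
        for p in range(partitions)
    }


# Three brokers, replication factor 3: every partition lives on all brokers.
three = assign_replicas([1, 2, 3], partitions=3, replication_factor=3)

# One broker: only replication factor 1 is possible, so there are no spare copies.
single = assign_replicas([1], partitions=3, replication_factor=1)
```

With a single broker every partition has exactly one copy, which is why losing that machine means losing the data.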

hazel · January 8, 2022, 11:09am

I see that for it to work, it’s best to have several nodes for Kafka, several for the target store (data warehouse etc.), and maybe several for the source

hazel · January 8, 2022, 11:39am

Apparently I have no room for a separate Kafka cluster

dJones · January 8, 2022, 12:23pm

What is your objective? Is this a PoC?

dJones · January 8, 2022, 12:26pm

If it’s a PoC, you can do everything on one machine. Just don’t expect data resiliency if something goes wrong.

hazel · January 8, 2022, 1:50pm

I have a storage system that is pretty much a rack of 9 nodes; 3 of them are management nodes, with 1 of those primary at a time
It has usage statistics on each node
I’d like to aggregate those stats in real-time to the primary (and then from there send them somewhere else or show in UI or …)
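
Whatever transport ends up carrying the data, the aggregation step itself can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the node names and the `bytes`, `bytes_written`, and `ops` fields are hypothetical, and each node is assumed to reduce its own rows to a small summary before sending it to the primary.

```python
from collections import Counter


def summarize(rows):
    """Reduce one node's raw rows to a small summary (hypothetical fields)."""
    return {
        "bytes_written": sum(r["bytes"] for r in rows),
        "ops": len(rows),
    }


def aggregate(summaries):
    """Combine per-node summaries into one cluster-wide total on the primary."""
    total = Counter()
    for summary in summaries.values():
        total.update(summary)  # adds each field to the running total
    return dict(total)


# Made-up sample data standing in for each node's local usage statistics.
node_rows = {
    "node1": [{"bytes": 100}, {"bytes": 50}],
    "node2": [{"bytes": 10}],
}

summaries = {node: summarize(rows) for node, rows in node_rows.items()}
cluster_total = aggregate(summaries)
```

Sending summaries rather than raw rows keeps the traffic to the primary small, which matters if the primary is a single node.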

dJones · January 8, 2022, 2:38pm

Why not use something like Prometheus? You don’t need Kafka in that case.

dJones · January 8, 2022, 4:31pm

I should correct myself. I have no clue what storage system you are using.

michaleObrien · January 8, 2022, 6:20pm

If it’s application logging you’re after, I once used Logstash to get logs into Kafka. Not sure that’s still the best option. There might be a Kafka Connect connector for it these days. Something like https://docs.confluent.io/kafka-connect-spooldir/current/connectors/elf_source_connector.html