Is Kafka suitable for replicating multiple dependent tables with foreign key constraints?

> Hello everyone, I’m wondering if Kafka is suitable for continuously replicating multiple dependent tables from one database to another. Our goal is to transfer data from a legacy Informix database to PostgreSQL for OLAP purposes. The source database consists of around 100 highly normalised relational tables with numerous foreign key relationships, and we have a few requirements for the replication process.
Anyone with any ideas or experience with transaction consistency across topics?

Well, this is a tricky one, because if there are foreign-key constraints you can’t guarantee in-order delivery of messages across topics.

So one pattern you could use is to put the messages for all of the tables in a schema into the same Kafka topic, then be smart about knowing “OK, this message came from table A, that one from table B”.

Then be sure to partition them according to their “business object” so that everything related by foreign keys ends up on the same partition. That way you can guarantee ordering, so you don’t violate any foreign key constraints when inserting into a replicated table.
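As a very rough sketch of the producer side (the topic name, the business-object key, and the JSON payloads here are all made up just to show the keying idea, not anything a particular CDC tool emits):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SingleTopicCdcProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One topic for every table in the schema; the value records which
            // table the change came from.
            String topic = "legacy-db.changes";   // hypothetical topic name

            // The key is the "business object" id that the related rows hang off
            // (e.g. the order that its order_lines reference), so everything
            // joined by those FKs lands on the same partition, in order.
            String businessKey = "order-42";      // hypothetical business-object id

            producer.send(new ProducerRecord<>(topic, businessKey,
                    "{\"table\":\"orders\",\"op\":\"INSERT\",\"id\":42}"));
            producer.send(new ProducerRecord<>(topic, businessKey,
                    "{\"table\":\"order_lines\",\"op\":\"INSERT\",\"order_id\":42,\"line\":1}"));
        }
    }
}
```

Because Kafka only guarantees ordering within a partition, the whole trick is choosing a key that keeps FK-related rows together.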

Thank you so much for the help. This sounds like an interesting idea. We discussed internally that our ideal end state is to refactor the app to produce business objects, but that is going to take some time (old legacy database with 500 tables and 500,000 lines of stored procs!)

I don’t know Debezium that well, but maybe it supports putting all items into the same topic. Because if you have FKs on your destination tables and the messages are in different topics, you’re going to run into FK errors, I think. The people who wrote Debezium are probably quite smart, so they probably have some solution for this, but I’m not sure if it’s the same one I described.

In our situation the destination database doesn’t have the FKs installed (for performance reasons), so the sequence of the events within a transaction being committed is not important - we only care that when the commit occurs the database is in a consistent state.

Is eventual consistency acceptable?

If so, I think the defaults will work given that you don’t have FK constraints on the destination.

Eventual consistency (depending on what that actually means) won’t be good for us, as we will periodically run extract queries on the destination PG database which involve joins across multiple tables. So if, at the moment a query runs, some table is missing the related FK table’s data update, we will get incorrect results.

So yeah basically we need the destination database to be in a consistent state. Lag on that consistency is ok though.

Got it. I don’t know Debezium, so I can’t tell you whether it’s sufficient for you, but at a high level you probably want all events for a single transaction to be:
• in the same topic
• with the transaction id as the message key, to ensure they land on the same partition.
Then you can have some consumer application (maybe Debezium can do this, but I don’t know) that reads the entirety of each transaction and writes it to the destination, roughly like the sketch below.
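Something along these lines, purely as a sketch (the topic name, the commit-marker format, and the assumption that the source transaction id is the message key are all mine, not anything a particular connector guarantees):

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TransactionBufferingConsumer {

    // Change events buffered per source-database transaction id.
    private static final Map<String, List<String>> pending = new HashMap<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pg-replicator");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("legacy-db.changes"));   // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    String txId = rec.key();   // assumes the source tx id is the key
                    if (isCommitMarker(rec.value())) {
                        // Whole source transaction received: apply all of its rows
                        // to Postgres inside one destination transaction.
                        List<String> events = pending.remove(txId);
                        if (events != null) {
                            applyToPostgresAtomically(events);
                        }
                    } else {
                        pending.computeIfAbsent(txId, k -> new ArrayList<>()).add(rec.value());
                    }
                }
                // Note: a production version should only commit offsets up to the
                // oldest still-incomplete transaction, otherwise buffered events
                // could be lost on restart.
                consumer.commitSync();
            }
        }
    }

    private static boolean isCommitMarker(String value) {
        return value.contains("\"op\":\"COMMIT\"");   // made-up marker format
    }

    private static void applyToPostgresAtomically(List<String> events) {
        // Placeholder: wrap these writes in a single JDBC transaction on the
        // destination so readers never see half a source transaction.
        events.forEach(System.out::println);
    }
}
```

The point is just that the destination only ever sees whole source transactions; how you detect the end of a transaction depends entirely on what your CDC tool emits.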

I’m not sure if Kafka Connect out of the box supports that either.

OK thank you. Good to know about the transaction id key. Makes sense.

I see there is a Kafka transactions API, but it’s pretty hard to understand its utility. Can that help in this case? Or is that for a different purpose?

It’s a different purpose. To make a long story short, it lets you write multiple messages to multiple partitions atomically (i.e. either all of the messages are written or none of them are).

It doesn’t really have anything to do with ensuring that messages are read together. That’s because consumers of different topics can have vastly different lags, and coordinating between them would be a nightmare.

On the producer side (getting data out of your source database), you definitely want the CDC producer to use the transactions API though: even though you only have one partition involved, you still want to make sure the writes are atomic.
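For what it’s worth, the producer side of that looks roughly like this (a minimal sketch; the topic, key, and payloads are placeholders, and a real CDC connector would manage this for you):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalCdcWriter {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // A stable transactional id is what enables the transactions API
        // (and idempotent writes) for this producer.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "cdc-writer-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                // All change events from one source-database commit go out inside
                // one Kafka transaction: either they all become visible to
                // read_committed consumers, or none of them do.
                producer.send(new ProducerRecord<>("legacy-db.changes", "tx-1001",
                        "{\"table\":\"orders\",\"op\":\"INSERT\",\"id\":42}"));
                producer.send(new ProducerRecord<>("legacy-db.changes", "tx-1001",
                        "{\"table\":\"order_lines\",\"op\":\"INSERT\",\"order_id\":42}"));
                producer.send(new ProducerRecord<>("legacy-db.changes", "tx-1001",
                        "{\"op\":\"COMMIT\"}"));
                producer.commitTransaction();
            } catch (RuntimeException e) {
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```

Note the consumer only benefits if it reads with isolation.level=read_committed, otherwise it will still see aborted writes.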

OK, thanks again. Just so I’m clear: the transactions API should be used in the producer connector, to write each group of received change event messages (i.e. from CDC) per transaction to the topic?

Thanks again. One issue we have is that this database is very complex (500 tables) and there are a lot of transactions being committed. It will be interesting to try and determine which tables get committed together. At the highest level just about all tables connect to each other by indirect FKs!

Just think about a user table that is referenced in many root entities.