Handling large JSON objects in Kafka event streaming architecture for ElasticSearch

Large Message/Best Practice Question
We are moving to Kafka from mostly Postgres, so we’re still on the learning path regarding Kafka. The messages are fairly small, except that one property in the message can be a JSON object of any size (could be tens of MB). We are currently making changes to store those large JSON objects in GCS rather than the DB. With the move to Kafka and an event streaming architecture, the messages will go to Kafka instead of Postgres and on to Elasticsearch. With respect to the large JSON object property, I want those to end up in Elasticsearch as well. I’m assuming streaming those through Kafka is probably not the best idea given the potential size, and that I should find another approach and move them directly from GCS to Elasticsearch?

It can be done, but it’s certainly not recommended.

ref. https://stackoverflow.com/questions/21020347/how-can-i-send-large-messages-with-kafka-over-15mb

LinkedIn also created a solution for large messages that splits them across batches, but you may have issues using Kafka Connect with that.

https://github.com/linkedin/li-apache-kafka-clients#features

Thanks, I like the idea of continuing to put it in GCS and pushing a reference to it in Kafka, rather than pushing the large payloads through Kafka or adding the complexity of splitting them into batches.
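
For that reference-in-Kafka approach (often called the claim-check pattern), here is a minimal sketch in Python, assuming the confluent-kafka and google-cloud-storage clients; the bucket, topic, and field names are placeholders:

```python
# Claim-check sketch: upload the large JSON object to GCS, then publish a
# small event to Kafka that carries only a reference to it.
import json
import uuid

from confluent_kafka import Producer
from google.cloud import storage

BUCKET = "example-large-payloads"   # placeholder bucket name
TOPIC = "example-events"            # placeholder Kafka topic

gcs = storage.Client()
producer = Producer({"bootstrap.servers": "localhost:9092"})


def publish_event(event: dict, large_payload: dict) -> None:
    # Store the oversized JSON in GCS under a unique object name.
    object_name = f"payloads/{uuid.uuid4()}.json"
    blob = gcs.bucket(BUCKET).blob(object_name)
    blob.upload_from_string(json.dumps(large_payload), content_type="application/json")

    # The Kafka message stays small: metadata plus a pointer to the payload.
    event["payload_ref"] = f"gs://{BUCKET}/{object_name}"
    producer.produce(TOPIC, key=str(event.get("id", "")), value=json.dumps(event))
    producer.flush()
```

Downstream, whatever indexes into Elasticsearch can resolve payload_ref and fetch the object from GCS only if it actually needs the full body.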

In GCS, you can use triggers and Cloud Functions to send file events to a Kafka topic.

https://cloud.google.com/functions/docs/calling/storage
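
A rough sketch of what such a function could look like, assuming a 2nd-gen Cloud Function using functions-framework with a GCS "object finalized" trigger and a Kafka broker reachable from the function; the broker address and topic name are placeholders:

```python
# Cloud Function sketch: on each new GCS object, forward a small file event
# (not the object itself) to a Kafka topic.
import json

import functions_framework
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker.example.internal:9092"})  # placeholder


@functions_framework.cloud_event
def gcs_to_kafka(cloud_event):
    data = cloud_event.data  # GCS object metadata supplied by the trigger
    file_event = {
        "bucket": data["bucket"],
        "name": data["name"],
        "gcs_uri": f"gs://{data['bucket']}/{data['name']}",
    }
    # Only the reference travels through Kafka; consumers fetch the object
    # from GCS (or index it into Elasticsearch) on their side.
    producer.produce("example-file-events", value=json.dumps(file_event))
    producer.flush()
```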

(Or Pub/Sub; Kafka Connect can then communicate with Pub/Sub. Depends on your goals.)
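
If you go the Pub/Sub route instead, a Pub/Sub source connector is typically registered through the Kafka Connect REST API. This is a hypothetical sketch; the connector class and cps.* property names follow the Google Cloud Pub/Sub Kafka connector documentation, so check them against the connector version you actually deploy:

```python
# Sketch: register a Pub/Sub source connector with the Kafka Connect REST API.
# Project, subscription, topic, and the Connect URL are placeholders.
import json

import requests

connector = {
    "name": "pubsub-source-example",
    "config": {
        "connector.class": "com.google.pubsub.kafka.source.CloudPubSubSourceConnector",
        "tasks.max": "1",
        "kafka.topic": "example-file-events",   # Kafka topic to write into
        "cps.project": "my-gcp-project",        # placeholder GCP project
        "cps.subscription": "file-events-sub",  # placeholder subscription
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",  # Kafka Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```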

Oh, we use pubsub extensively now. All very interesting. Definitely reading through all that. Thanks so much.

You might also be able to (partially) replace Postgres/Elasticsearch with BigQuery / Looker, if you want to fully use GCP services.

In our PoC we are sinking to BigQuery for long-term storage and internal usage right now.

Never heard of Looker before, will take a look. :grinning: