Best practices for managing Data Lineage in BigQuery and Cloud Composer pipelines

rLanding · December 3, 2023, 1:35am

I seek insights and best practices for managing Data Lineage in BigQuery data pipelines. Our setup primarily involves Cloud Composer jobs executing BigQuery tasks, with occasional Dataflow jobs pulling data from upstream sources. Could you share your experiences or recommendations for effectively tracking and maintaining data lineage in this environment? Thank you!

timothyP · December 3, 2023, 2:20am

You should look into dbt. I find that the way it connects data models to each other helps a lot with “lineage”, and you can model the loads from Dataflow as dbt sources.

timothyP · December 3, 2023, 2:40am

You can also eliminate a lot of the Dataflow work by taking more of an ELT approach, but how raw you decide to store the data can affect cost.

timothyP · December 3, 2023, 3:16am

GCP has Data Catalog, but I found dbt to be a better practical solution.

rLanding · December 3, 2023, 3:28am

Thank you!

rLanding · December 3, 2023, 4:03am

I was thinking of dbt. However, our infrastructure is already in place, so I am leaning towards Log Sinks and modeling tables to reflect the data flow.
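
A rough sketch of the Log Sink idea, in case it helps anyone reading later: assuming the sink exports BigQuery audit log entries in the `BigQueryAuditMetadata` format, each completed query job lists the tables it read and the table it wrote, which is enough to derive source-to-destination edges. The function name `lineage_edges` and the project, dataset, and table names in the sample are hypothetical, and the exact layout of the exported entry (especially when the sink writes to a BigQuery table) may differ, so check it against your own exported schema.

```python
from typing import Any


def lineage_edges(entry: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (source_table, destination_table) pairs for one audit log entry.

    Assumes the entry is a dict shaped like a BigQueryAuditMetadata
    jobChange event; entries without a destination table yield no edges.
    """
    job = (
        entry.get("protoPayload", {})
        .get("metadata", {})
        .get("jobChange", {})
        .get("job", {})
    )
    destination = (
        job.get("jobConfig", {}).get("queryConfig", {}).get("destinationTable")
    )
    sources = job.get("jobStats", {}).get("queryStats", {}).get("referencedTables", [])
    if not destination:
        return []
    # Skip self-references so a table is never recorded as its own upstream.
    return [(src, destination) for src in sources if src != destination]


# Hypothetical sample entry; all resource names are made up for illustration.
sample_entry = {
    "protoPayload": {
        "metadata": {
            "jobChange": {
                "job": {
                    "jobConfig": {
                        "queryConfig": {
                            "destinationTable": "projects/demo/datasets/mart/tables/orders_daily"
                        }
                    },
                    "jobStats": {
                        "queryStats": {
                            "referencedTables": [
                                "projects/demo/datasets/raw/tables/orders",
                                "projects/demo/datasets/raw/tables/customers",
                            ]
                        }
                    },
                }
            }
        }
    }
}

edges = lineage_edges(sample_entry)
for source, destination in edges:
    print(f"{source} -> {destination}")
```

The resulting pairs could be loaded into a small lineage table (one row per source and destination), which is one way of modeling tables to reflect the data flow without adding new tooling.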