I seek insights and best practices for managing Data Lineage in BigQuery data pipelines. Our setup primarily involves Cloud Composer jobs executing BigQuery tasks, with occasional Dataflow jobs pulling data from upstream sources. Could you share your experiences or recommendations for effectively tracking and maintaining data lineage in this environment? Thank you!
You should look into dbt. I find that the way it connects data models to each other helps a lot with lineage, and you can model the loads from Dataflow as sources.
You can also eliminate a lot of the Dataflow work by taking more of an ELT approach, though how raw you decide to store the data can affect cost.
GCP has Data Catalog, but I found dbt to be a better practical solution.
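To illustrate why dbt gives you lineage more or less for free: every model declares its inputs through `ref()` and `source()` calls, so the dependency graph can be read straight from the SQL. Here is a rough Python sketch of that idea. dbt itself renders the Jinja rather than using regexes, and the model and source names below (`stg_orders`, `dataflow_raw`, etc.) are made up for the example.

```python
import re

# Matches {{ ref('model_name') }} (illustrative only; dbt parses Jinja properly).
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")
# Matches {{ source('source_name', 'table_name') }}.
SOURCE_PATTERN = re.compile(
    r"\{\{\s*source\(\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}"
)


def upstream_of(model_sql: str) -> list[str]:
    """Return the sorted upstream models and sources referenced by one model."""
    refs = set(REF_PATTERN.findall(model_sql))
    sources = {f"{src}.{table}" for src, table in SOURCE_PATTERN.findall(model_sql)}
    return sorted(refs | sources)


# Hypothetical models: a staging model reading a Dataflow-loaded table,
# and a fact model built on top of two staging models.
models = {
    "stg_orders": "select * from {{ source('dataflow_raw', 'orders') }}",
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} "
        "join {{ ref('stg_customers') }} using (customer_id)"
    ),
}

# Map each model to its direct upstream dependencies.
lineage = {name: upstream_of(sql) for name, sql in models.items()}
```

In a real project you would not write this yourself; `dbt docs generate` builds the graph for you. The point is only that the Dataflow-loaded table shows up as a named source at the root of the graph.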
Thank you!
I was considering dbt. However, our infrastructure is already in place, so I am leaning towards Log Sinks and modeling tables to reflect the data flow.
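For anyone going the same route, here is a minimal sketch of what I have in mind for the log-based approach. It assumes the sink exports BigQuery audit logs in the newer `BigQueryAuditMetadata` format, where each job-completion entry carries a JSON metadata payload with the destination table and the referenced tables. The field paths below are my reading of that format and should be checked against your own exported logs; the function names and the sample entry are made up.

```python
import json
from collections import defaultdict


def extract_lineage_edges(metadata_json: str) -> list[tuple[str, str]]:
    """Return (source_table, destination_table) pairs from one audit-log entry.

    Assumes the payload looks like BigQueryAuditMetadata serialized as JSON.
    """
    metadata = json.loads(metadata_json)
    job = metadata.get("jobChange", {}).get("job", {})
    destination = job.get("jobConfig", {}).get("queryConfig", {}).get("destinationTable")
    sources = job.get("jobStats", {}).get("queryStats", {}).get("referencedTables", [])
    if not destination:
        # Not a query job that wrote to a table, so there is no edge to record.
        return []
    # Skip self-references, e.g. a table rebuilt from its own previous contents.
    return [(src, destination) for src in sources if src != destination]


def build_lineage(entries: list[str]) -> dict[str, set[str]]:
    """Aggregate edges into a map of destination table -> upstream tables."""
    upstream = defaultdict(set)
    for entry in entries:
        for src, dst in extract_lineage_edges(entry):
            upstream[dst].add(src)
    return dict(upstream)


# Hypothetical entry resembling what a Composer-triggered query job might log.
sample_entry = json.dumps({
    "jobChange": {
        "job": {
            "jobConfig": {
                "queryConfig": {
                    "destinationTable": "projects/p/datasets/mart/tables/fct_orders"
                }
            },
            "jobStats": {
                "queryStats": {
                    "referencedTables": [
                        "projects/p/datasets/staging/tables/orders",
                        "projects/p/datasets/staging/tables/customers",
                    ]
                }
            },
        }
    }
})

lineage = build_lineage([sample_entry])
```

The resulting map could then be written to a lineage table in BigQuery. Dataflow loads would not appear as query jobs, so those edges would likely need to be recorded separately.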