How, exactly, does BigQuery’s PK/FK setup work on the backend? I’m specifically interested in knowing where the performance is or isn’t worth it for defining key columns if you’re creating and destroying tables. When is defining the keys worth it vs. not worth it: if the table exists for 1 hour? 2? 10? Etc.
I don’t think there’s any case where it’s not worth defining it; it’s just metadata.
My primary goal is to take advantage of the performance improvements described at https://cloud.google.com/blog/products/data-analytics/join-optimizations-with-bigquery-primary-and-foreign-keys; I just don’t know if the indexing is immediate or done on a schedule. Hence my concern about how quickly I can create and destroy tables while still getting the most out of key-based performance.
There are no indexes involved. As the article says, they only impact query plans.
The trade-off is that you have to be sure the PKs and FKs are correct. If they’re not, you’ll silently get incorrect query results. BQ doesn’t enforce them the way a row store would.
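For reference, this is roughly what declaring them looks like. BigQuery requires the `NOT ENFORCED` clause, which is exactly the point above: you’re asserting the keys are valid, not asking BigQuery to check them. A sketch with hypothetical dataset, table, and column names:

```sql
-- Declare a primary key on the referenced table (hypothetical names).
ALTER TABLE mydataset.customers
  ADD PRIMARY KEY (customer_id) NOT ENFORCED;

-- Declare a primary key and a foreign key on the referencing table.
-- NOT ENFORCED means BigQuery will trust these for join optimization
-- without ever validating them.
ALTER TABLE mydataset.orders
  ADD PRIMARY KEY (order_id) NOT ENFORCED;

ALTER TABLE mydataset.orders
  ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id)
  REFERENCES mydataset.customers (customer_id) NOT ENFORCED;
```

The constraints can also be declared inline in `CREATE TABLE`, which may be more convenient for short-lived tables.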
I’ve not played with them. In AWS Redshift we found we could get an incorrect answer because the plan chose to assume the constraints were accurate, even though Redshift doesn’t enforce them either. I’ve not tried that in BigQuery, but if it affects plans, then I’d be suspicious of that “feature”. The PK/FK constraints are not enforced at all.
And now I see that was already addressed above.
Agreed. Yes, the BQ docs say the same thing: if you tell it the keys, it will rely on what you told it, and if you’re wrong you will get “wrong” query results. The theoretical improved performance and reduced cost wasn’t worth the risk of incorrect query results to us.
Yup, BigQuery is very explicit about it.
From the URL, https://cloud.google.com/blog/products/data-analytics/join-optimizations-with-bigquery-primary-and-foreign-keys
This suggests that my work on implementing DBT testing could allow us to apply these and give the optimizer more information to work with.
One trap we fell into here is these all need to be in the same dataset (makes sense when you think about it).
For us I’d say it’s helping. We have a legacy system that doesn’t use them, and it’s factors slower.
The metadata is another advantage: Gemini is much more likely to write a correct query with the keys in place than without.
If you were to have DBT tests for every single PK/FK relationship, run before the queries that rely on them kick in, then there would be no issue. The issue is that the optimizer’s use of the constraints can get you an incorrect answer; as you say, you would get that answer more quickly, but it could still be incorrect.
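A sketch of what such a guard could look like as a dbt singular test (table and column names hypothetical). dbt treats any rows returned by a test query as failures, so this fails the run if an FK value has no matching PK:

```sql
-- tests/orders_customer_fk.sql (hypothetical model names)
-- Returns orphaned rows; a non-empty result fails the dbt test,
-- so downstream queries relying on the FK constraint never run.
SELECT o.order_id, o.customer_id
FROM {{ ref('orders') }} AS o
LEFT JOIN {{ ref('customers') }} AS c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
```

dbt’s built-in `unique`, `not_null`, and `relationships` schema tests cover the same ground declaratively, if you’d rather not hand-write one per relationship.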
hah I am sure there is some Yogi Berra quote along the lines of “A wronger answer quicker is better”