Adding a new column to a large table in long-term storage

Hi everyone,
I have a very big table that has most of its data in long-term storage. I need to add a column, using something like:

```
ALTER TABLE my_dataset.my_table
ADD COLUMN new_column STRING;
```


Does anyone know if that will move all of the storage to Active? The dataset is on the physical storage billing model

I can’t find any reference to this in the docs

This page says DDL statements reset the clock (switch from long-term to active), so I think the answer is yes, this would switch the whole table to active.

https://cloud.google.com/bigquery/pricing#storage

I’m not clear enough on the underlying storage to know whether this is a real cost they pass on for updating/shuffling all partitions, or whether it’s purely a metadata change for the partitions that already exist.

TL;DR my answer is based on the docs and the docs may be misleading. So best to wait for someone who can speak with more confidence…

I agree that the docs are a bit vague

After further reading, I think it won’t change the storage tier of existing data, for a few reasons. First, BigQuery stores data in a columnar format, so if I change data in a partition for a single column, it should only move that particular set of bytes to Active, not the entire partition

Also, the docs say:

If the table is edited, the price reverts back to the regular storage pricing, and the 90-day timer starts counting from zero. Anything that modifies the data in a table resets the timer, including:
The interesting bits are
Using data manipulation language (DML)
and
Using data definition language (DDL)
DML statements are INSERT, DELETE, UPDATE, ... so that part does not apply here. Worth noting that, I think, only the data actually changed by the DML would be moved to Active.

In fact, the only DDL example the docs mention is

Using a CREATE OR REPLACE TABLE statement to replace a table.
and nothing is said about other statements

Interpreting the docs literally, it seems that only that statement changes the storage tier and not the others, so ALTER TABLE ADD COLUMN should not
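
To make that distinction concrete, here’s a rough sketch (the table, column, and partition values are made up, and the ALTER TABLE behaviour is exactly the part I’m trying to confirm):

```
-- DML: modifies existing data, so the 90-day timer should reset
-- for the data it touches
UPDATE my_dataset.my_table
SET some_column = 'fixed value'
WHERE partition_date = DATE '2024-01-01';

-- DDL: only adds a new, empty column to the schema; by this reading
-- of the docs it should not reset the timer on existing data
ALTER TABLE my_dataset.my_table
ADD COLUMN new_column STRING;
```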

Furthermore, the docs say:

All other actions do not reset the timer, including the following:
• Querying a table
• Creating a view that queries a table
• Exporting data from a table
• Copying a table (to another destination table)
• Patching or updating a table resource (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables)

Particularly relevant is “Patching or updating a table resource”. It seems to me that adding a column is a patch operation, so it should not reset the clock

we regularly add columns to schemas, and while it hadn’t crossed my mind that it could affect storage, I just went back and looked, and it does not change the storage tier

that’s been true historically for us on both logical and now physical

I ran a few tests on smaller tables (the queries I used to check the tier are sketched below the results).

  1. Just adding the column does not change the storage from Long-term to Active
  2. Modifying a single value in a row, or adding a new row to a partition, moves the entire partition to the Active tier
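
In case it’s useful, this is roughly what I ran to check the tier; dataset and table names are placeholders, and I’m assuming the INFORMATION_SCHEMA.PARTITIONS view (with its STORAGE_TIER column) and the region-level TABLE_STORAGE view are available in your region:

```
-- Per-partition tier (ACTIVE vs LONG_TERM) for one table
SELECT partition_id, total_rows, last_modified_time, storage_tier
FROM my_dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'my_table'
ORDER BY partition_id;

-- Table-level breakdown of active vs long-term bytes (physical and logical)
SELECT table_name,
       active_physical_bytes, long_term_physical_bytes,
       active_logical_bytes, long_term_logical_bytes
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'my_dataset' AND table_name = 'my_table';
```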

I didn’t expect that, as I thought only the actual changed bytes would move to Active, not the whole partition

This also means that if you change even just one row in each partition, the whole table’s storage will go back to Active. Pretty sneaky behavior.

yeah, that I’ve known for a long time

we’re looking at migrating all of our data into Apache Iceberg on GCS, so we don’t have to deal with some of those oddities anymore

Interesting, what about query performance?

by default, it’ll be cheaper storage than BQ, plus we’ve seen better compression with Snappy than what BQ provides, and with Autoclass there’s an opportunity for it to get way cheaper for rarely accessed data

Isn’t it slower to run queries on GCS-backed tables?