Adding a new column to a large table in long-term storage

Hi everyone,
I have a very big table that has most of its data in long-term storage. I need to add a column, using something like:

```
ALTER TABLE my_dataset.my_table
ADD COLUMN new_column STRING;
```


Does anyone know if that will move all of the storage to Active? The dataset is on the physical storage billing model

I can’t find any reference to this in the docs

This page says DDL statements reset the clock (switch from long-term to active), so I think the answer is yes, this would switch the whole table to active.

https://cloud.google.com/bigquery/pricing#storage

I’m not clear enough on the underlying storage to know whether this is a real cost they pass on for updating/shuffling all partitions, or whether it’s purely a metadata change for the partitions that already exist.

TL;DR my answer is based on the docs and the docs may be misleading. So best to wait for someone who can speak with more confidence…

I agree that the docs are a bit vague

After further reading, I think it won’t change the storage tier of existing data, for a few reasons. First, BigQuery stores data in a columnar format, so if I change data in a partition for a single column, it should only move that particular set of bytes to Active, not the entire partition

Also, the docs say:

If the table is edited, the price reverts back to the regular storage pricing, and the 90-day timer starts counting from zero. Anything that modifies the data in a table resets the timer, including:
The interesting bits are
Using data manipulation language (DML)
and
Using data definition language (DDL)
DML statements are INSERT, DELETE, UPDATE, ... so that part does not apply here. Worth noting that, I think, only the data actually changed by the DML would be moved to Active.

In fact, the only DDL example the docs mention is

Using a CREATE OR REPLACE TABLE statement to replace a table.
and nothing is said about other statements

Interpreting the docs literally, it seems that only that statement changes the storage tier and not the others, so ALTER TABLE ADD COLUMN should not
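
To make that distinction concrete, here’s a rough sketch (the table, column, and partition values are made up, and the ALTER TABLE behaviour is exactly the part I’m trying to confirm):

```
-- DML: modifies existing data, so the 90-day timer should reset
-- for the data it touches
UPDATE my_dataset.my_table
SET some_column = 'fixed value'
WHERE partition_date = DATE '2024-01-01';

-- DDL: only adds a new, empty column to the schema; by this reading
-- of the docs it should not reset the timer on existing data
ALTER TABLE my_dataset.my_table
ADD COLUMN new_column STRING;
```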

Furthermore, the docs say:

All other actions do not reset the timer, including the following:
• Querying a table
• Creating a view that queries a table
• Exporting data from a table
• Copying a table (to another destination table)
• Patching or updating a table resource (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables)

Particularly relevant is “Patching or updating a table resource”. It seems to me that adding a column is a patch operation, so it should not reset the clock

we regularly add columns to schemas, and while it hadn’t crossed my mind that it could affect storage, I just went back and looked, and it does not change the storage tier

that’s been true historically for us on both logical and now physical

I ran a few tests on smaller tables (the queries I used to check the tier are sketched below the results).

  1. Just adding the column does not change the storage from Long-term to Active
  2. Modifying a single value in a row, or adding a new row to a partition, moves the entire partition to the Active tier
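
In case it’s useful, this is roughly what I ran to check the tier; dataset and table names are placeholders, and I’m assuming the INFORMATION_SCHEMA.PARTITIONS view (with its STORAGE_TIER column) and the region-level TABLE_STORAGE view are available in your region:

```
-- Per-partition tier (ACTIVE vs LONG_TERM) for one table
SELECT partition_id, total_rows, last_modified_time, storage_tier
FROM my_dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'my_table'
ORDER BY partition_id;

-- Table-level breakdown of active vs long-term bytes (physical and logical)
SELECT table_name,
       active_physical_bytes, long_term_physical_bytes,
       active_logical_bytes, long_term_logical_bytes
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'my_dataset' AND table_name = 'my_table';
```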

I didn’t expect that, as I thought only the actual changed bytes would move to Active, not the whole partition

This also means that if you change even just one row in each partition, the whole table’s storage will go back to Active. Pretty sneaky behavior.

yeah, that I’ve known for a long time

we’re looking at migrating all of our data into Apache Iceberg on GCS, so we don’t have to deal with some of those oddities anymore

Interesting, what about query performance?

by default, it’ll be cheaper storage than BQ, plus we’ve seen better compression with Snappy than what BQ provides, and with Autoclass there’s an opportunity for it to get way cheaper for rarely accessed data

Isn’t it slower to run queries on GCS-backed tables?