Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Limitations in using all string columns in BigQuery

I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:

name      dob             age     info
"tom"     "11/27/2000"    "45"    "['one', 'two']"

And in the query, I'm currently doing the following

WITH
  table AS (
  SELECT
    "tom" AS name,
    "11/27/2000" AS dob,
    "45" AS age,
    "['one', 'two']" AS info )
SELECT
  EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob)) birth_year,
  ANY_value(PARSE_DATE('%m/%d/%Y', dob)) bod,
  ANY_VALUE(name) example_name,
  ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM
  table
GROUP BY
  EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob))

Additionally, I tried doing a very basic group by operation casting an item to a string vs not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster):

enter image description here

Other than it being bad practice to "keep" this all-string table and not convert it into its proper type, what are some of the limitations (either functional or performance-wise) that I would encounter by keeping a table all-string instead of storing it as their proper type. I know there would be a slight increase in size due to storing strings instead of number/date/bool/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?

Off the top of my head, the only limitations I see are:

  • Queries would become more complex (though wouldn't really matter if using a query-builder).
  • A bit more difficult to extract non-string items from array fields.
  • Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).

But these all seem like very small items that can be worked around. Are there are other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or having a huge performance hit in various cases?

like image 993
David542 Avatar asked Sep 02 '25 02:09

David542


1 Answers

First of all - I don't really see any bigger show-stoppers than those you already know and enlisted

Meantime,

though wouldn't really matter if using a query-builder ...

based on above excerpt - I wanted to touch upon some aspect of this approach (storing all as strings)

While we usually concerned about CASTing from string to native type to apply relevant functions and so on, I realized that building complex and generic query with some sort of query builder in some cases requires opposite - cast native type to string for applying function like STRING_AGG [just] as a quick example

So, my thoughts are:

When table is designed for direct user's access with trivial or even complex queries - having native types is beneficial and performance wise and being more friendly for user to understand, etc.

Meantime, if you are developing your own query builder and you design table such that it will be available to users for querying via that query builder with some generic logic being implemented - having all fields in string can be helpful in building the query builder itself.

So it is a balance - you can lose a little in performance but you can win in being able to better implement generic query builder. And such balance depend on nature of your business - both from data prospective and what kind of query you envision to support

Note: your question is quite broad and opinion based (which is btw not much respected on SO) so, obviously my answer - is totally my opinion but based on quite an experience with BigQuery

like image 131
Mikhail Berlyant Avatar answered Sep 04 '25 16:09

Mikhail Berlyant