
BigQuery full table to partition

I have 340 GB of data in one table (270 days' worth of data), and I am now planning to move this data to a partitioned table.

That means I will have 270 partitions. What is the best way to move this data into the partitioned table?

I don't want to run 270 queries, which would be a very costly operation, so I am looking for an optimized solution.

I have multiple tables like this and need to migrate all of them to partitioned tables.

Thanks,

user374374 asked Feb 07 '23


1 Answer

I see three options:

  1. Direct Extraction out of the original table (see the sketch after this list):
    Actions (how many queries to run) = Days [to extract] = 270
    Full Scans (how much data is scanned, measured in full scans of the original table) = Days = 270
    Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 270 = $459.00

  2. Hierarchical (recursive) Extraction: (described in Mosha’s answer)
    Actions = 2^ceil(log2(Days)) - 2 = 2^9 - 2 = 510
    Full Scans = 2 x ceil(log2(Days)) = 18
    Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 18 = $30.60

  3. Clustered Extraction: (described below)
    Actions = Days + 1 = 271
    Full Scans = [always] 2
    Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 2 = $3.40
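
For reference, option 1 is nothing more than one filtered query per day, each of which scans the entire original table. A minimal sketch of a single such query (assuming a hypothetical event_date DATE column; the actual filter and the destination daily table or partition depend on your schema and client):

-- one of the 270 per-day queries; every one of them scans the full original table
SELECT *
FROM theTable
WHERE event_date = '2022-05-13'  -- hypothetical date value; repeat for each of the 270 days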

Summary

Method                                  Actions    Total Full Scans    Total Cost
Direct Extraction                           270                 270       $459.00
Hierarchical (recursive) Extraction         510                  18        $30.60
Clustered Extraction                        271                   2         $3.40

Definitely, for most practical purposes Mosha’s solution is the way to go (I use it in most such cases).
It is relatively simple and straightforward.

Even though you need to run the query 510 times, the query is "relatively" simple and the orchestration logic is easy to implement with whatever client you usually use.
And the cost saving is quite visible: from $459 down to about $31, almost 15 times lower!
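
For context, the hierarchical (recursive) approach splits the table in half by date range, then splits each half again, and so on until single days remain; each level of splitting costs roughly two full-table scans' worth of data, which is where the 2 x ceil(log2(Days)) above comes from. A rough sketch of the first split, with hypothetical date boundaries:

-- level 1, first half of the days (one full scan of the original table)
SELECT * FROM theTable WHERE event_date <  '2022-05-15';
-- level 1, second half of the days (a second full scan)
SELECT * FROM theTable WHERE event_date >= '2022-05-15';
-- repeat the same split on each half until every piece covers a single day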

In case you:
a) want to lower the cost even further, by yet another factor of 9 (so roughly 135 times lower in total), and
b) like having fun and taking on more of a challenge - then take a look at the third option.

“Clustered Extraction” Explanation

Idea / Goal:
Step 1
We want to transform the original table into another [single] table with 270 columns - one column per day.
Each column will hold one serialized row from the original table for its respective day.
The total number of rows in this new table will equal the number of rows in the most "heavy" day.
This will require just one query (see the example below) with one full scan.

Step 2
After this new table is ready, we extract day by day, querying ONLY the respective column, and write the output into the final daily table (the schema of each daily table is the very same as the original table's schema, and all those tables can be pre-created).
This will require 270 queries, with scans approximately equivalent to one full scan of the original table in total (this really depends on how complex your schema is, so it can vary).
While querying a column, we need to de-serialize each row's value and parse it back into the original schema.

Very simplified example: (using BigQuery Standard SQL here)

The purpose of this example is just to give you direction, in case you find the idea interesting.
Serialization / de-serialization is extremely simplified to keep the focus on the idea rather than on the particular implementation, which can differ from case to case (it mostly depends on the schema).

So, assume the original table (theTable) looks somewhat like below:

  SELECT  1 AS id, "101" AS x, 1 AS ts UNION ALL
  SELECT  2 AS id, "102" AS x, 1 AS ts UNION ALL
  SELECT  3 AS id, "103" AS x, 1 AS ts UNION ALL
  SELECT  4 AS id, "104" AS x, 1 AS ts UNION ALL
  SELECT  5 AS id, "105" AS x, 1 AS ts UNION ALL
  SELECT  6 AS id, "106" AS x, 2 AS ts UNION ALL
  SELECT  7 AS id, "107" AS x, 2 AS ts UNION ALL
  SELECT  8 AS id, "108" AS x, 2 AS ts UNION ALL
  SELECT  9 AS id, "109" AS x, 2 AS ts UNION ALL
  SELECT 10 AS id, "110" AS x, 3 AS ts UNION ALL
  SELECT 11 AS id, "111" AS x, 3 AS ts UNION ALL
  SELECT 12 AS id, "112" AS x, 3 AS ts UNION ALL
  SELECT 13 AS id, "113" AS x, 3 AS ts UNION ALL
  SELECT 14 AS id, "114" AS x, 3 AS ts UNION ALL
  SELECT 15 AS id, "115" AS x, 3 AS ts UNION ALL
  SELECT 16 AS id, "116" AS x, 3 AS ts UNION ALL
  SELECT 17 AS id, "117" AS x, 3 AS ts UNION ALL
  SELECT 18 AS id, "118" AS x, 3 AS ts UNION ALL
  SELECT 19 AS id, "119" AS x, 4 AS ts UNION ALL
  SELECT 20 AS id, "120" AS x, 4 AS ts

Step 1 – transform the table and write the result into tempTable

SELECT 
  num, 
  MAX(IF(ts=1, ser, NULL)) AS ts_1, 
  MAX(IF(ts=2, ser, NULL)) AS ts_2, 
  MAX(IF(ts=3, ser, NULL)) AS ts_3, 
  MAX(IF(ts=4, ser, NULL)) AS ts_4
FROM (
  SELECT 
    ts, 
    CONCAT(CAST(id AS STRING), "|", x, "|", CAST(ts AS STRING)) AS ser,  -- the serialized row
    ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) num  -- row position within its day
  FROM theTable
)
GROUP BY num

tempTable will look like below:

num    ts_1        ts_2         ts_3        ts_4     
1   1|101|1     6|106|2     10|110|3    19|119|4     
2   2|102|1     7|107|2     11|111|3    20|120|4     
3   3|103|1     8|108|2     12|112|3        null     
4   4|104|1     9|109|2     13|113|3        null     
5   5|105|1        null     14|114|3        null     
6      null        null     15|115|3        null     
7      null        null     16|116|3        null     
8      null        null     17|117|3        null     
9      null        null     18|118|3        null    
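
By the way, writing 270 MAX(IF(...)) expressions by hand for the Step 1 query would be tedious. One optional helper - my own assumption here is that days are numbered 1 through 270, as in this simplified example - is to generate the column list with SQL and paste the output into the query:

-- produces the text of the 270 "MAX(IF(ts=N, ser, NULL)) AS ts_N" expressions
SELECT STRING_AGG(
  FORMAT("MAX(IF(ts=%d, ser, NULL)) AS ts_%d", day, day), ",\n")
FROM UNNEST(GENERATE_ARRAY(1, 270)) AS day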

Here, I am using simple concatenation for serialization
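
If your schema contains records or nested and repeated fields, plain CONCAT / SPLIT quickly becomes fragile. One possible alternative (just a sketch, not what is used above) is to serialize each row as JSON with TO_JSON_STRING in Step 1 and parse the fields back with the JSON functions in Step 2:

-- Step 1 variant of the inner query: serialize the whole row as a JSON string
SELECT
  ts,
  TO_JSON_STRING(t) AS ser,
  ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) num
FROM theTable t

-- Step 2 variant: parse the original fields back out of the JSON string
SELECT
  CAST(JSON_EXTRACT_SCALAR(ts_2, '$.id') AS INT64) AS id,
  JSON_EXTRACT_SCALAR(ts_2, '$.x') AS x,
  CAST(JSON_EXTRACT_SCALAR(ts_2, '$.ts') AS INT64) AS ts
FROM tempTable
WHERE ts_2 IS NOT NULL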

Step 2 – extract rows for a specific day and write the output to the respective daily table.
Please note: in the example below we are extracting rows for ts = 2; this corresponds to column ts_2.

SELECT
  r[OFFSET(0)] AS id,  -- note: SPLIT returns strings, so CAST back if you need the original types
  r[OFFSET(1)] AS x,
  r[OFFSET(2)] AS ts
FROM (
  SELECT SPLIT(ts_2, "|") AS r 
  FROM tempTable 
  WHERE ts_2 IS NOT NULL
)

The result will look like below (which is expected):

id    x    ts    
6   106     2    
7   107     2    
8   108     2    
9   109     2   
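
To land these rows in their final destination, you would typically set the query's destination to the pre-created daily table (or partition) in whatever client you use. As a rough DML sketch, with hypothetical dataset / table names and the casts back to the original types:

-- hypothetical names; the daily table is pre-created with the original schema
INSERT INTO mydataset.theTable_day2 (id, x, ts)
SELECT
  CAST(r[OFFSET(0)] AS INT64) AS id,
  r[OFFSET(1)] AS x,
  CAST(r[OFFSET(2)] AS INT64) AS ts
FROM (
  SELECT SPLIT(ts_2, "|") AS r
  FROM tempTable
  WHERE ts_2 IS NOT NULL
)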

I wish I had more time to write this down properly, so don't judge too harshly if something is missing - this is more of a directional answer. At the same time, the example is pretty reasonable, and if you have a plain, simple schema, almost no extra thinking is required. Of course, with records and nested stuff in the schema, the most challenging part is the serialization / de-serialization - but that's where the fun is, along with the extra $ saving.

Mikhail Berlyant answered Feb 23 '23