Apologies if this has been answered elsewhere; I couldn't find anything similar.
Are there any ways to split a table into multiple shards without having to use multiple queries? Here are two examples:
1) Loading a table with timestamped data (unix timestamps). I want to save the data into one table per day. The naive approach is: a) load the data; b) for each day, run a query that selects that day's rows and appends them to the appropriate shard (sketched below). This approach results in queries that together touch N x [size of the whole table] of data, where N is the number of days in the table, plus one more query to find the min and max timestamps so I can establish the range of shards I need to create.
2) Splitting a table into shards using the data in a field. For example, take a table of 1 billion rows containing a field X with 1,000 different values. If I want to split the table into 1,000 different tables, one per value of X, the naive approach would be to run SELECT * FROM table WHERE X=[value] and insert the results into shard table_value. However, that would result in 1,000 queries, each touching the data of the whole table!
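For concreteness, one iteration of the naive approach in (1) would look something like this (legacy SQL; the column ts holding unix seconds and the table names are just examples):
-- One day of the naive approach; the result is written/appended to that day's shard, e.g. mytable_20160708
SELECT *
FROM [mydataset.mytable]
WHERE DATE(SEC_TO_TIMESTAMP(ts)) = '2016-07-08'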
Surely I am missing something and there must be more efficient ways to do the above.
Table sharding is the practice of storing data in multiple tables, using a naming prefix such as [PREFIX]_YYYYMMDD. Partitioning is recommended over table sharding because partitioned tables perform better. With sharded tables, BigQuery must maintain a copy of the schema and metadata for each table.
Sharding and partitioning are both about breaking up a large data set into smaller subsets. The difference is that sharding implies the data is spread across multiple computers while partitioning does not. Partitioning is about grouping subsets of data within a single database instance.
Table partitioning is a data organization scheme in which table data is divided across multiple storage objects called data partitions according to values in one or more table columns. Each data partition is stored separately.
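As a quick illustration only (standard SQL DDL; the dataset, table, and column names are assumptions), a single day-partitioned table can stand in for a whole family of daily shards:
-- Illustration (names assumed): one day-partitioned table instead of one shard table per day
CREATE TABLE mydataset.events_partitioned
PARTITION BY DATE(event_time) AS
SELECT *, TIMESTAMP_SECONDS(unix_ts) AS event_time
FROM mydataset.events;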
A Requirements:
Let’s assume the simplified case/scenario below.
1 We have one "big" table:
TableAll
Row a b c
1 1 11 12
2 1 13 14
3 1 15 16
4 1 17 18
5 2 21 22
6 2 23 24
7 2 25 26
8 2 27 28
9 3 31 32
10 3 33 34
11 3 35 36
12 3 37 38
2 We need to split the data into separate "smaller" tables, partitioned by field "a"
TableA1
Row b c
1 11 12
2 13 14
3 15 16
4 17 18
TableA2
Row b c
1 21 22
2 23 24
3 25 26
4 27 28
TableA3
Row b c
1 31 32
2 33 34
3 35 36
4 37 38
3 Problem to address
The most straightforward way is to issue three separate statements, writing the output to TableA1, TableA2, and TableA3 respectively:
SELECT b, c FROM TableAll WHERE a = 1;
SELECT b, c FROM TableAll WHERE a = 2;
SELECT b, c FROM TableAll WHERE a = 3;
Pros: Fast and Furious!
Cons: We need as many full scans of the whole table (each at full cost) as there are distinct values of "a" (just three in this particular case, but in real life it can be, let’s say, up to N = 1K distinct values).
So the final cost is $5 * N * SizeInTB(TableAll)
Our Target Goal
We want to minimize the cost as much as possible,
ideally down to a fixed price of $5 * SizeInTB(TableAll)
B Possible Solution (Idea and simple implementation):
Logical Step 1 – transform the data into the form below (pack the columns into JSON)
Row a json
1 1 {"b":"11", "c":"12"}
2 1 {"b":"13", "c":"14"}
3 1 {"b":"15", "c":"16"}
4 1 {"b":"17", "c":"18"}
5 2 {"b":"21", "c":"22"}
6 2 {"b":"23", "c":"24"}
7 2 {"b":"25", "c":"26"}
8 2 {"b":"27", "c":"28"}
9 3 {"b":"31", "c":"32"}
10 3 {"b":"33", "c":"34"}
11 3 {"b":"35", "c":"36"}
12 3 {"b":"37", "c":"38"}
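A minimal sketch of this transform on its own (legacy SQL; equivalent to the inner query used in Step 3 below, just written with single-quoted literals):
-- Logical Step 1 as a standalone query: pack the non-key columns into one JSON string per row
SELECT a, CONCAT('{"b":"', STRING(b), '", "c":"', STRING(c), '"}') AS json
FROM TableAll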
Logical Step 2 – Pivot the table so that the values of field "a" become field names (prefixed with "a" to comply with the column naming convention)
Row a1 a2 a3
1 {"b":"11", "c":"12"} null null
2 {"b":"13", "c":"14"} null null
3 {"b":"15", "c":"16"} null null
4 {"b":"17", "c":"18"} null null
5 null {"b":"21", "c":"22"} null
6 null {"b":"23", "c":"24"} null
7 null {"b":"25", "c":"26"} null
8 null {"b":"27", "c":"28"} null
9 null null {"b":"31", "c":"32"}
10 null null {"b":"33", "c":"34"}
11 null null {"b":"35", "c":"36"}
12 null null {"b":"37", "c":"38"}
Note: the size of the above data is of the same order as the size of the original table (without column a).
It is still bigger than the original data because the data is now in a verbose JSON format vs native data types plus column names.
This can be optimized by eliminating spaces, dropping unneeded quotes, normalizing/minimizing the original column names to just one character, etc.
I think this difference becomes negligible as N goes up! (I haven’t had a chance to evaluate this, though.)
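For example, a slightly more compact encoding could drop the space and the quotes around the numeric values (an untested sketch; JSON_EXTRACT_SCALAR in Step 4 would still return the values as strings):
SELECT a, CONCAT('{"b":', STRING(b), ',"c":', STRING(c), '}') AS json
FROM TableAll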
Step 3 – Save the resulting pivot into table TableAllPivot. Implementation example:
SELECT
IF(a=1, json, NULL) as a1,
IF(a=2, json, NULL) as a2,
IF(a=3, json, NULL) as a3
FROM (
SELECT a, CONCAT("{\"b\":\"",STRING(b), "\","," \"c\":\"", STRING(c), "\"}") AS json
FROM TableAll
)
Cost of Step 3: $5 * SizeInTB(TableAll)
Based on the notes under Step 2, assume: Size(TableAllPivot) = 2 * Size(TableAll)
Step 4 – Produce the shards by querying only one column per shard
To preserve the schema/data types, the respective shard tables can be created in advance
Data extraction:
//For TableA1:
SELECT
JSON_EXTRACT_SCALAR(a1, '$.b') AS b,
JSON_EXTRACT_SCALAR(a1, '$.c') AS c
FROM TableAllPivot
WHERE NOT a1 IS NULL
//For TableA2:
SELECT
JSON_EXTRACT_SCALAR(a2, '$.b') AS b,
JSON_EXTRACT_SCALAR(a2, '$.c') AS c
FROM TableAllPivot
WHERE NOT a2 IS NULL
//For TableA3:
SELECT
JSON_EXTRACT_SCALAR(a3, '$.b') AS b,
JSON_EXTRACT_SCALAR(a3, '$.c') AS c
FROM TableAllPivot
WHERE NOT a3 IS NULL
Cost of Step 4: $5 * SizeInTB(TableAllPivot)
Total Cost: Step 3 Cost + Step 4 Cost
=$5 * SizeInTB(TableAll) + $5 * SizeInTB(TableAllPivot)
~ $5 * 3 * SizeInTB(TableAll)
Summary:
Proposed approach fixed price = $5 * 3 * SizeInTB(TableAll)
vs.
Initial linear price = $5 * N * SizeInTB(TableAll)
Please note: the 3 in the $5 * 3 * SizeInTB(TableAll) formula is not determined by the number of shards in my simplified example; rather, it is an estimated constant that mostly reflects the price of transforming the data to JSON. The number of shards doesn't matter here: the same formula holds for 100 shards, for 1K shards, and so on. The only limitation in this solution is 10K shards, as that is the hard limit on the number of columns in one table.
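For a sense of scale: with N = 1K distinct values over a 1 TB table, the naive approach costs about $5 * 1000 * 1 = $5,000, while the proposed approach stays around $5 * 3 * 1 = $15.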
C Some helper code and references:
1 Produce the pivoting query (the result is used in Step 3 of the section above)
This can be useful when the number of fields in the initial table is greater than, let's say, 10-20 and typing the query manually gets tedious; in that case you can use the script/query below
SELECT 'SELECT ' +
GROUP_CONCAT_UNQUOTED(
'IF(a=' + STRING(a) + ', json, NULL) as a' + STRING(a)
)
+ ' FROM (
SELECT a,
CONCAT("{\\\"b\\\":\\\"\",STRING(b),"\\\","," \\\"c\\\":\\\"\", STRING(c),"\\\"}") AS json
FROM TableAll
)'
FROM (
SELECT a FROM TableAll GROUP BY a
)
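In the same spirit, a similar (untested) helper in the same style could generate the Step 4 extraction queries, one per distinct value of "a":
SELECT
  'SELECT JSON_EXTRACT_SCALAR(a' + STRING(a) + ', "$.b") AS b, ' +
  'JSON_EXTRACT_SCALAR(a' + STRING(a) + ', "$.c") AS c ' +
  'FROM TableAllPivot WHERE NOT a' + STRING(a) + ' IS NULL'
FROM (
  SELECT a FROM TableAll GROUP BY a
)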
2 In case you want to explore and dive deeper into this option, see also the references below to related and potentially useful code
Pivot Repeated fields in BigQuery
How to scale Pivoting in BigQuery?
How to extract all the keys in a JSON object with BigQuery
CREATE TABLE ... AS SELECT * FROM old-table
See the following post to also benefit from clustering:
I really like Mikhail's answer, but let me give you a different one: divide and conquer:
Let's say your table has 8 numbers (think of each number as a partition): 12345678. To shard this into 8 tables, you are looking at running 8 queries over a table of size 8 (cost: 8*8=64).
What if you first divide this table into 2 halves: 1234 and 5678? The cost is 8*2=16 (2 full scans), and we now have 2 tables. To split each of these half tables in two, we only need to scan half the data twice (2*4*2=16). That leaves us with 4 tables: 12, 34, 56, 78. The cost of splitting those would be 4*2*2=16, so the total cost is 8*2 + 2*4*2 + 4*2*2 = 48. By doing halves we took the cost of partitioning a table into 8 from 64 down to 48.
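For the 8-partition example, the first halving pass could be as simple as this (legacy SQL; the column "p" holding the partition number 1..8 and the destination table names are hypothetical):
//For mytable_low:
SELECT * FROM [mydataset.mytable] WHERE p <= 4;
//For mytable_high:
SELECT * FROM [mydataset.mytable] WHERE p >= 5;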
Mathematically speaking, we are going from O(n**2) to O(n log n), and that's always a good thing.
Cost-wise Mikhail's answer is better, as it goes from O(n**2) to O(n), but writing the intermediate helper functions will introduce additional complexity to the task.