
How do I shard a BigQuery table?

Apologies if this has been answered elsewhere; I couldn't find anything similar.

Are there any ways to split a table into multiple shards without having to use multiple queries? Here are two examples:

1) Loading a table with timestamped data (unix timestamps). I want to save the data into one table per day. The naive approach is: a) load the data; b) run a query to get all the data for each day and append it to the appropriate shard. This approach results in queries that touch N x [size of the whole table] of data, where N is the number of days in the table, plus one more query to find the min and max timestamps so I can establish the range of shards I need to create.
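For illustration, here is a minimal sketch of that naive per-day query in today's standard SQL, assuming a hypothetical table mydataset.events with a unix-seconds column ts (one such query, each written to its own day shard, for every day in the range):

#standardSQL
SELECT *
FROM mydataset.events
WHERE DATE(TIMESTAMP_SECONDS(ts)) = '2016-01-22'  -- repeated for every day between min and max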

2) Splitting a table into shards using the data in a field. For example, a table of 1 billion rows containing a field X with 1,000 different values. If I want to split the table into 1,000 different tables, one table for each value of X, then the naive approach is to run SELECT * FROM table WHERE X=[value] and insert the results into shard table_value. However, that would result in 1,000 queries, each touching the data of the whole table!

Surely I am missing something and there must be more efficient ways to do the above.

asked Jan 22 '16 by user3688176




2 Answers

A Requirements:

Let’s assume the simplified case/scenario below

1 We have one "big" table:
TableAll

Row a   b   c
1   1   11  12
2   1   13  14
3   1   15  16
4   1   17  18
5   2   21  22
6   2   23  24
7   2   25  26
8   2   27  28
9   3   31  32
10  3   33  34
11  3   35  36
12  3   37  38

2 We need to split the data into separate "smaller" tables, partitioned by field "a"
TableA1

Row b   c
1   11  12
2   13  14
3   15  16
4   17  18

TableA2

Row b   c
1   21  22
2   23  24
3   25  26
4   27  28

TableA3

Row b   c
1   31  32
2   33  34
3   35  36
4   37  38

3 Problem to address
The most straightforward way is to issue three separate statements, writing the output respectively to TableA1, TableA2, and TableA3:

SELECT b, c FROM TableAll WHERE a = 1;
SELECT b, c FROM TableAll WHERE a = 2;
SELECT b, c FROM TableAll WHERE a = 3;

Pros: Fast and Furious!
Cons: We need as many full scans of the whole table (each at full cost) as there are distinct values of "a" (just three in this simplified case, but in real life it can be up to, say, N = 1K distinct values).

So the final cost is $5 * N * SizeInTB(TableAll)

Our Target Goal

We want to minimize the cost as much as possible,
ideally down to the fixed price of $5 * SizeInTB(TableAll)

B Possible Solution (Idea and simple implementation):

Logical Step 1 – transform the data to be presented as below (pack the remaining columns into JSON)

Row a   json
1   1   {"b":"11", "c":"12"}
2   1   {"b":"13", "c":"14"}
3   1   {"b":"15", "c":"16"}
4   1   {"b":"17", "c":"18"}
5   2   {"b":"21", "c":"22"}
6   2   {"b":"23", "c":"24"}
7   2   {"b":"25", "c":"26"}
8   2   {"b":"27", "c":"28"}
9   3   {"b":"31", "c":"32"}
10  3   {"b":"33", "c":"34"}
11  3   {"b":"35", "c":"36"}
12  3   {"b":"37", "c":"38"}

Logical Step 2 – pivot the table so that the values of field "a" become field names (prefixed with "a" to comply with column naming conventions)

Row a1                    a2                    a3
1   {"b":"11", "c":"12"}  null                  null
2   {"b":"13", "c":"14"}  null                  null
3   {"b":"15", "c":"16"}  null                  null
4   {"b":"17", "c":"18"}  null                  null
5   null                  {"b":"21", "c":"22"}  null
6   null                  {"b":"23", "c":"24"}  null
7   null                  {"b":"25", "c":"26"}  null
8   null                  {"b":"27", "c":"28"}  null
9   null                  null                  {"b":"31", "c":"32"}
10  null                  null                  {"b":"33", "c":"34"}
11  null                  null                  {"b":"35", "c":"36"}
12  null                  null                  {"b":"37", "c":"38"}

Note: the size of the above data is of the same order as the size of the original table (without column a)
It is still bigger than the original data because the data is now in verbose JSON format versus native data types plus column names.
This can be optimized by eliminating spaces and unneeded quotes, normalizing/minimizing the original column names to just one character, etc.
I think this difference becomes negligible as N goes up (though I haven't had a chance to evaluate this).

Step 3 – Persist the resulting pivot into table TableAllPivot. Implementation example:

SELECT 
  IF(a=1, json, NULL) as a1,
  IF(a=2, json, NULL) as a2,
  IF(a=3, json, NULL) as a3 
FROM (
  SELECT a, CONCAT("{\"b\":\"",STRING(b), "\","," \"c\":\"", STRING(c), "\"}") AS json 
  FROM TableAll
)
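Note: this is legacy BigQuery SQL, so to materialize the result you would run it with TableAllPivot set as the query's destination table and "Allow Large Results" enabled, since the pivoted output can exceed the normal response size limit.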

Cost of Step 3: $5 * SizeInTB(TableAll)
Based on the comments in Step 2, assume: Size(TableAllPivot) = 2 * Size(TableAll)

Step 4 – Produce the shards by querying only one column per shard
To preserve schema/data types, the respective shard tables can be created in advance (see the sketch below)
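As a sketch of that pre-creation, assuming today's standard SQL DDL and a hypothetical dataset mydataset (note that JSON_EXTRACT_SCALAR returns strings, so you would CAST back to the original types if b and c should stay numeric):

#standardSQL
CREATE TABLE mydataset.TableA1 (b INT64, c INT64);  -- repeat for TableA2, TableA3, ...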

Data Extraction:
//For TableA1:

SELECT 
  JSON_EXTRACT_SCALAR(a1, '$.b') AS b, 
  JSON_EXTRACT_SCALAR(a1, '$.c') AS c 
FROM TableAllPivot
WHERE NOT a1 IS NULL

//For TableA2:

SELECT 
  JSON_EXTRACT_SCALAR(a2, '$.b') AS b, 
  JSON_EXTRACT_SCALAR(a2, '$.c') AS c 
FROM TableAllPivot
WHERE NOT a2 IS NULL

//For TableA3:

SELECT 
  JSON_EXTRACT_SCALAR(a3, '$.b') AS b, 
  JSON_EXTRACT_SCALAR(a3, '$.c') AS c 
FROM TableAllPivot
WHERE NOT a3 IS NULL

Cost of Step 4: $5 * SizeInTB(TableAllPivot)
Because BigQuery charges only for the columns a query references, each shard query reads just its own column, so all N extraction queries together read TableAllPivot roughly once.

Total Cost: Step 3 Cost + Step 4 Cost =
$5 * SizeInTB(TableAll) + $5 * SizeInTB(TableAllPivot) ~ $5 * 3 * SizeInTB(TableAll)

Summary:
Proposed approach fixed price = $5 * 3 * SizeInTB(TableAll)
vs.
Initial linear price = $5 * N * SizeInTB(TableAll)

Please note: the 3 in the $5 * 3 * SizeInTB(TableAll) formula is not determined by the number of shards in my simplified example, but is rather an estimated constant that mostly reflects the price of transforming the data to JSON. The number of shards doesn't matter here; the same formula holds for 100 shards, 1K shards, and so on. The only limitation in this solution is 10K shards, as this is a hard limit on the number of columns in one table.
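For a concrete sense of scale: with N = 1,000 distinct values and a 1 TB source table, the naive approach costs about $5 * 1000 * 1 = $5,000, while the proposed approach stays at roughly $5 * 3 * 1 = $15.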

C Some helper code and references:

1 Produce the pivoting query (its result is used in Step 3 of the above section)
This can be useful when the number of distinct values of "a" in the initial table is greater than, say, 10-20, and typing the query manually becomes tedious; in that case you can use the script/query below:

SELECT 'SELECT ' + 
   GROUP_CONCAT_UNQUOTED(
      'IF(a=' + STRING(a) + ', json, NULL) as a' + STRING(a) 
   ) 
   + ' FROM (
 SELECT a, 
 CONCAT("{\\\"b\\\":\\\"\",STRING(b),"\\\","," \\\"c\\\":\\\"\", STRING(c),"\\\"}") AS json
 FROM TableAll
       )'
FROM (
  SELECT a FROM TableAll GROUP BY a 
)
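Note that the generator itself references only column "a" (via the GROUP BY subselect), so it is cheap to run; its output is the SQL text that you then execute as the Step 3 query.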

2 In case you want to explore this option further, see also the references below to related and potentially useful code:

Pivot Repeated fields in BigQuery
How to scale Pivoting in BigQuery?
How to extract all the keys in a JSON object with BigQuery

answered by Mikhail Berlyant


2018 update

  • Instead of creating multiple tables, create a partitioned one.
  • Partition for free: Create partitioned table (by date), import into it.
  • Partition with one query (one scan): CREATE TABLE ... AS SELECT * FROM old-table
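As a sketch of that one-scan approach in standard SQL DDL (assuming a hypothetical TIMESTAMP column ts and hypothetical table names):

#standardSQL
CREATE TABLE mydataset.new_table
PARTITION BY DATE(ts)  -- ts is the hypothetical TIMESTAMP column to partition on
AS SELECT * FROM mydataset.old_table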

See the following post to also benefit from clustering:

  • https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b

I really like Mikhail's answer, but let me give you a different one: divide and conquer:

Let's say your table has 8 numbers (think of each number as a partition): 12345678. To shard this into 8 tables, you are looking at running a query 8 times over a table of size 8 (cost: 8*8=64).

What if you first divide this table into 2: 1234 and 5678? The cost is 8*2 (2 full scans), but we now have 2 tables. To partition these half tables, we only need to scan half the data twice each (2*4*2). We are then left with 4 tables: 12, 34, 56, 78. The cost of partitioning them is 4*2*2... so the total cost is 8*2 + 2*4*2 + 4*2*2 = 48. By doing halves we took the cost of sharding a table into 8 from 64 down to 48.
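In general, each level of splitting reads every row twice (once for each destination half), so with log2(n) levels the total cost is 2 * n * log2(n); for n = 8 that is 2 * 8 * 3 = 48, matching the sum above.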

Mathematically speaking, we are going from O(n^2) to O(n log n) - and that's always a good thing.

Cost-wise, Mikhail's answer is better, as it goes from O(n^2) to O(n), but writing the intermediate helper queries introduces additional complexity to the task.

answered by Felipe Hoffa