Hive: Creating smaller table from big table

Question

I currently have a Hive table that has 1.5 billion rows. I would like to create a smaller table (using the same table schema) with about 1 million rows from the original table. Ideally, the new rows would be randomly sampled from the original table, but getting the top 1M or bottom 1M of the original table would be ok, too. How would I do this?

Lukas Vermeer · Accepted Answer

As climbage suggested earlier, you could probably best use Hive's built-in sampling methods.

INSERT OVERWRITE TABLE my_table_sample 
SELECT * FROM my_table 
TABLESAMPLE (1m ROWS) t;

This syntax was introduced in Hive 0.11. If you are running an older version of Hive, you'll be confined to using the PERCENT syntax like so.

INSERT OVERWRITE TABLE my_table_sample 
SELECT * FROM my_table 
TABLESAMPLE (1 PERCENT) t;

You can change the percentage to match you specific sample size requirements.

Hive: Creating smaller table from big table

Tags:

hadoop

hive

stackoverflowuser2010

1 Answers

Lukas Vermeer

Recent Activity

Donate For Us

Hive: Creating smaller table from big table

Tags:

hadoop

hive

stackoverflowuser2010

1 Answers

Lukas Vermeer

Related questions

Recent Activity

Donate For Us