 

Hadoop - Hive: Delete data that is older than a specified number of days

I'm working on a telecom project that uses Hadoop/Hive for data analysis. We get millions of records per day. After a specified number of days, we need to delete the old data, since we don't have the storage capacity to keep it all. What is the best way to delete those records?

Additional Info:

The Hive tables have a column that holds the date on which each record was populated.

asked Jan 11 '13 by Abimaran Kugathasan




1 Answer

I think your use case lends itself very well to using a "day" partition in your Hive table. If "day" were only a regular column, it would become hard to maintain and clean up the table.

What a partition really means in Hive is that you get a separate directory for each "day".

So, for example (note that the partition column day goes only in the PARTITIONED BY clause, not in the regular column list):

create table mytable (
   ...
) partitioned by (day string);

So when you add data, you would end up with, for example, the following structure in HDFS:

/user/hive/warehouse/mytable/day=2013-01-10
/user/hive/warehouse/mytable/day=2013-01-11
...

And in each partition you would have the content for that day.
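
For completeness, here is a sketch of how a day's data could end up in its partition. The staging table staging_records and the columns msisdn and record_date are hypothetical, used only for illustration:

-- Hypothetical example: copy one day's rows from a staging table into that day's partition.
-- "staging_records", "msisdn" and "record_date" are placeholders, not from the original answer.
INSERT OVERWRITE TABLE mytable PARTITION (day='2013-01-10')
SELECT msisdn
FROM staging_records
WHERE record_date = '2013-01-10';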

Then deleting old data becomes trivial. You could, for example, have a cron job that runs daily and deletes data from x days ago (computing the cutoff with date --date '30 days ago' +\%Y-\%m-\%d; the backslashes are there because cron treats an unescaped % specially) and simply run the following command:

hadoop fs -rmr -skipTrash /user/hive/warehouse/mytable/day=`date --date '30 days ago' +\%Y-\%m-\%d`
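
For reference, the corresponding crontab entry might look something like the line below; the schedule (02:00 daily) and the 30-day retention period are assumptions for the sake of the example:

# Hypothetical crontab entry: every day at 02:00, remove the partition directory from 30 days ago.
0 2 * * * hadoop fs -rmr -skipTrash /user/hive/warehouse/mytable/day=`date --date '30 days ago' +\%Y-\%m-\%d`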

At this point you will still see the partition in your Hive table if you run show partitions mytable, but any query against a partition whose data you deleted will return nothing. If you want to really delete the partition from the metastore as well, you have to use the following command:

ALTER TABLE mytable DROP PARTITION (day='xxxx-xx-xx');
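
If you want the same cron job to clean up the metastore too, one possible sketch (assuming the hive command-line client is available on that machine) is to compute the date in the shell and pass it to hive -e:

# Sketch only: drop the 30-day-old partition from the metastore after its data has been removed.
OLD_DAY=`date --date '30 days ago' +%Y-%m-%d`
hive -e "ALTER TABLE mytable DROP PARTITION (day='$OLD_DAY');"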

I've been using a very similar pattern successfully for a while and it works well.

answered by Charles Menguy