I'm working on a telecom project, which uses Hadoop - hive for data analysis. For one day, we will get millions of records. After specified no of days, we need to delete the old data, since we don't have storage capacity. What is the best way to delete records?
Additional Info:
Those hive tables will have a column which having the populated date.
Use the following syntax to delete data from a Hive table. DELETE FROM tablename [WHERE expression]; Delete any rows of data from the students table if the gpa column has a value of 1 or 0.
You will find rm command in your Hadoop fs command. This command is similar to the Linux rm command, and it is used for removing a file from the HDFS file system. The command –rmr can be used to delete files recursively.
There is no way you can delete the DBPROPERTY. set hive. mapred.
You can use TRUNCATE TABLE to get rid of data.
I think your use case lends itself very well to using a "day" partition in your Hive table. If the "day" is only a column it will become hard to maintain and clean your table.
What a partition really means in Hive is that you will have a directory for each "day"
So for example:
create table mytable (
...
) partitioned by (day string)
So when you add data you would have for example the following structure in HDFS:
/user/hive/warehouse/mytable/day=2013-01-10
/user/hive/warehouse/mytable/day=2013-01-11
...
And in each partition you would have the content for that day.
Then deleting old data becomes trivial, you could for example have a cron job that runs daily and deletes data for x days ago (for example using the command date --date '30 days ago' +\%Y-\%m-\%d
) and simply do the following command:
hadoop fs -rmr -skiptrash /user/hive/warehouse/mytable/day=`date --date '30 days ago' +\%Y-\%m-\%d`
At this point you will still have your partition in your Hive table if you do a show partitions mytable
but any query to the partitions where you deleted data will return nothing. If you want to really delete the partition in the metastore you have to use the following command:
ALTER TABLE mytable DROP PARTITION (day='xxxx-xx-xx');
I've been using a very similar pattern successfully for a while and it works well.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With