
For a Hive partition based on date, why use a string type rather than int?

Tags:

hadoop

hive

If I'm defining a table in Hive, and will be partitioning based on date, and my dates are in the format YYYYMMDD, which should I choose for the type, int or string?

If it was just a field, and therefore in the files I'm supplying for the table, I could see using a string, even if only so that I can search for and identify malformed entries that might work their way into my data. But since I will be specifying the partition as part of the load process, I know I'll always have correctly formed values.

When used in a WHERE clause, the partition field will normally be compared using equality or less-than/greater-than logic.

asked Mar 04 '13 16:03 by libjack


1 Answer

Dates are typically treated as strings in Hive. If you look at all the date manipulation UDFs available, they use string types, so if you were using integers you would have to cast them every time.

Conceptually, I also think it makes more sense to use strings: your YYYYMMDD is just a literal representation of a date object, implicitly equivalent to something like YYYY-MM-DD or DDMMYYYY. With an integer, converting to or comparing against those other representations becomes painful.
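To illustrate the point about equivalent representations, here is a minimal sketch in Python (not Hive) that converts a YYYYMMDD string to other layouts. The helper name `reformat_date` is hypothetical and only for illustration; an integer value needs an explicit cast to a string before it can be parsed the same way.

```python
from datetime import datetime


def reformat_date(value: str, src: str = "%Y%m%d", dst: str = "%Y-%m-%d") -> str:
    """Parse a date string in the src layout and render it in the dst layout."""
    return datetime.strptime(value, src).strftime(dst)


# A string partition value can be parsed directly.
partition_value = "20130304"
print(reformat_date(partition_value))                # 2013-03-04
print(reformat_date(partition_value, dst="%d%m%Y"))  # 04032013

# An integer partition value has to be cast to a string first.
int_partition_value = 20130304
print(reformat_date(str(int_partition_value)))       # 2013-03-04
```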

Note that you can also compare strings in Hive with the equality and greater-than/less-than operators, so selecting a range of partitions is easy. Because YYYYMMDD is fixed-width and zero-padded with the year first, string order matches date order.
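The range selection described above relies on plain lexicographic comparison. The following is a rough sketch in Python rather than Hive, with made-up partition values and a hypothetical helper `select_range`, showing that string comparison on YYYYMMDD values picks out the same range an integer comparison would.

```python
# Hypothetical partition values, stored as YYYYMMDD strings.
partitions = ["20130228", "20130301", "20130302", "20130304", "20130401"]


def select_range(values, start, end):
    """Keep values with start <= value <= end using plain string comparison."""
    return [v for v in values if start <= v <= end]


selected = select_range(partitions, "20130301", "20130331")
print(selected)  # ['20130301', '20130302', '20130304']

# For this layout, string order and numeric order agree.
print(sorted(partitions) == sorted(partitions, key=int))  # True

# This does not hold for day-first layouts such as DDMMYYYY:
# "04032013" (4 March) sorts before "28022013" (28 February).
print("04032013" < "28022013")  # True
```

The last comparison is why the year-first layout matters: the string trick only works when the most significant part of the date comes first.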

The only case where I would use an integer for a "date" is a Unix-style timestamp, because it is a continuous value and represents a real, measurable quantity.
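One way to see the difference is to do arithmetic on both kinds of integer. This is a small Python sketch, assuming dates are interpreted as midnight UTC; the helper `to_epoch` is hypothetical. Subtracting two YYYYMMDD integers gives a number with no meaning, while subtracting two Unix timestamps gives elapsed seconds.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def to_epoch(yyyymmdd: str) -> int:
    """Convert a YYYYMMDD string to Unix seconds, assuming midnight UTC."""
    parsed = datetime.strptime(yyyymmdd, "%Y%m%d").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


first, second = "20130228", "20130301"

# Treating YYYYMMDD as an integer: the difference is not a number of days.
int_gap = int(second) - int(first)
print(int_gap)  # 73

# Unix timestamps: the difference is a real duration.
epoch_gap_days = (to_epoch(second) - to_epoch(first)) // SECONDS_PER_DAY
print(epoch_gap_days)  # 1
```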

answered Oct 04 '22 00:10 by Charles Menguy