
Pandas dataframe type datetime64[ns] is not working in Hive/Athena

I am working on a Python application that converts CSV files to Hive/Athena-compatible Parquet format, using the fastparquet and pandas libraries. The CSV file contains timestamp values like 2018-12-21 23:45:00 that need to be written as timestamp type in the Parquet file. Below is the code I am running:

import io

import boto3
import pandas as pd
import s3fs
from fastparquet import write

columnNames = ["contentid", "processed_time", "access_time"]
dtypes = {'contentid': 'str'}
dateCols = ['access_time', 'processed_time']

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucketname, Key=keyname)
df = pd.read_csv(io.BytesIO(obj['Body'].read()), compression='gzip', header=0,
                 sep=',', quotechar='"', names=columnNames,
                 error_bad_lines=False, dtype=dtypes, parse_dates=dateCols)

s3filesys = s3fs.S3FileSystem()
myopen = s3filesys.open
write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen,
      file_scheme='hive', partition_on=PARTITION_KEYS)

The code ran successfully. Below are the dtypes of the dataframe created by pandas:

contentid                 object
processed_time            datetime64[ns]
access_time               datetime64[ns]

And finally, when I queried the Parquet file in Hive and Athena, the timestamp value was +50942-11-30 14:00:00.000 instead of 2018-12-21 23:45:00.
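As a rough sanity check (not part of the original post), the bizarre year can be reproduced with plain epoch arithmetic: pandas stores datetime64[ns] as nanoseconds since the Unix epoch, so a reader that expects microseconds interprets the same integer as a value 1000 times larger. Using approximate calendar arithmetic:

```python
from datetime import datetime, timezone

# The timestamp pandas stores, as nanoseconds since the Unix epoch.
ts = datetime(2018, 12, 21, 23, 45, tzinfo=timezone.utc)
ns = int(ts.timestamp()) * 1_000_000_000

# A reader expecting microseconds divides by 1e6 to get seconds,
# which yields 1000x too many seconds.
seconds_misread = ns // 1_000_000
days_misread = seconds_misread // 86400
approx_year = 1970 + int(days_misread / 365.2425)
print(approx_year)  # 50942 -- the year seen in Athena
```

The misread time of day (14:00:00 rather than 23:45:00) falls out of the same arithmetic, which strongly suggests a nanosecond/microsecond unit mismatch rather than data corruption.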

Any help is highly appreciated

prasannads asked Dec 25 '18


2 Answers

I know this question is old but it is still relevant.

As mentioned before, Athena only supports int96 timestamps. Using fastparquet it is possible to generate a Parquet file in the correct format for Athena. The important part is times='int96', which tells fastparquet to convert pandas datetimes to int96 timestamps.

from fastparquet import write
import pandas as pd

def write_parquet():
    df = pd.read_csv('some.csv')
    write('/tmp/outfile.parquet', df, compression='GZIP', times='int96')
Ditlev Stjerne answered Oct 01 '22


You could try:

dataframe.to_parquet(file_path, compression=None, engine='pyarrow', allow_truncated_timestamps=True, use_deprecated_int96_timestamps=True)
Nguyễn Văn Thưởng answered Oct 01 '22