I have a CSV file with a time column containing POSIX timestamps in milliseconds. When I read it with pandas, the column is correctly read as Int64, but I would like to convert it to a DatetimeIndex. Right now I first convert the values to datetime objects and then cast them to a DatetimeIndex.
In [20]: df.time.head()
Out[20]:
0 1283346000062
1 1283346000062
2 1283346000062
3 1283346000062
4 1283346000300
Name: time
In [21]: map(datetime.fromtimestamp, df.time.head()/1000.)
Out[21]:
[datetime.datetime(2010, 9, 1, 9, 0, 0, 62000),
datetime.datetime(2010, 9, 1, 9, 0, 0, 62000),
datetime.datetime(2010, 9, 1, 9, 0, 0, 62000),
datetime.datetime(2010, 9, 1, 9, 0, 0, 62000),
datetime.datetime(2010, 9, 1, 9, 0, 0, 300000)]
In [22]: pandas.DatetimeIndex(map(datetime.fromtimestamp, df.time.head()/1000.))
Out[22]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-09-01 09:00:00.062000, ..., 2010-09-01 09:00:00.300000]
Length: 5, Freq: None, Timezone: None
Is there an idiomatic way of doing this? And more importantly, is this the recommended way of storing non-unique timestamps in pandas?
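For reference, a self-contained version of the two-step conversion I'm doing now (sample values taken from the session above; `datetime.fromtimestamp` uses the local timezone, so the wall-clock output depends on where it runs):

```python
from datetime import datetime

import pandas as pd

# Millisecond POSIX timestamps, as read_csv leaves them (integer dtype)
times = pd.Series([1283346000062, 1283346000062, 1283346000062,
                   1283346000062, 1283346000300], name='time')

# Step 1: ms -> seconds-as-float -> datetime objects
# Step 2: wrap the datetime objects in a DatetimeIndex
index = pd.DatetimeIndex([datetime.fromtimestamp(t / 1000.) for t in times])
```

Note that the duplicate entries are preserved: the resulting index is non-unique.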
You can use a converter in combination with read_csv.
In [423]: d = """\
timestamp data
1283346000062 a
1283346000062 b
1283346000062 c
1283346000062 d
1283346000300 e
"""
In [424]: fromtimestamp = lambda x:datetime.fromtimestamp(int(x) / 1000.)
In [425]: df = pandas.read_csv(StringIO(d), sep='\s+', converters={'timestamp': fromtimestamp}).set_index('timestamp')
In [426]: df.index
Out[426]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-09-01 15:00:00.062000, ..., 2010-09-01 15:00:00.300000]
Length: 5, Freq: None, Timezone: None
In [427]: df
Out[427]:
data
timestamp
2010-09-01 15:00:00.062000 a
2010-09-01 15:00:00.062000 b
2010-09-01 15:00:00.062000 c
2010-09-01 15:00:00.062000 d
2010-09-01 15:00:00.300000 e
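For completeness, on later pandas versions you can skip the converter and do the conversion after reading, using `pd.to_datetime` with `unit='ms'` (a sketch, not part of the original answer; note that this interprets the ints as UTC rather than local time, so the wall-clock values differ from the converter output above):

```python
from io import StringIO

import pandas as pd

d = """\
timestamp data
1283346000062 a
1283346000062 b
1283346000062 c
1283346000062 d
1283346000300 e
"""

df = pd.read_csv(StringIO(d), sep=r'\s+')

# Interpret the raw ints as milliseconds since the epoch (UTC, tz-naive)
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
df = df.set_index('timestamp')
```

The index is a DatetimeIndex with non-unique entries, which pandas allows.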
Internally, Timestamps are stored as int64 values representing nanoseconds, using NumPy's datetime64/timedelta64 types. The issue with your timestamps is that they are at ms precision, which you already know since you're dividing by 1000. In this case, it's easier to use astype('M8[ms]'). It's essentially saying: view these ints as datetime ints with ms precision.
In [21]: int_arr
Out[21]:
array([1283346000062, 1283346000062, 1283346000062, 1283346000062,
1283346000300])
In [22]: int_arr.astype('M8[ms]')
Out[22]:
array(['2010-09-01T09:00:00.062-0400', '2010-09-01T09:00:00.062-0400',
'2010-09-01T09:00:00.062-0400', '2010-09-01T09:00:00.062-0400',
'2010-09-01T09:00:00.300-0400'], dtype='datetime64[ms]')
Pandas will assume any regular int array is in M8[ns]. An array with a datetime64 dtype will be correctly interpreted. You can view the int64 representation of a DatetimeIndex by accessing its asi8 attribute.
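A self-contained sketch of the round trip (int_arr as in the session above; a caveat: on pandas 2.x the index may keep millisecond resolution, in which case asi8 returns milliseconds rather than nanoseconds):

```python
import numpy as np
import pandas as pd

int_arr = np.array([1283346000062, 1283346000062, 1283346000062,
                    1283346000062, 1283346000300])

# Reinterpret the ints as millisecond-precision datetimes; no scaling happens
dt_arr = int_arr.astype('M8[ms]')

# A datetime64 array carries its unit, so pandas interprets it correctly
idx = pd.DatetimeIndex(dt_arr)

# asi8 exposes the underlying int64 representation of the index
raw = idx.asi8
```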
[EDIT] I realize that this won't help you directly with read_csv. Just thought I'd throw out how to quickly convert between timestamp arrays.