
Selecting the first index after a certain timestamp with a pandas TimeSeries

This is a two-part question, with an immediate question and a more general one.

I have a pandas TimeSeries, ts. To get the first value after a certain time, I can do this:

ts.loc[ts[datetime(2012,1,1,15,0,0):].first_valid_index()]

a) Is there a better, less clunky way to do it?

b) Coming from C, I have a certain phobia when dealing with these somewhat opaque, possibly mutable but generally not, possibly lazy but not always types. So to be clear, when I do

ts[datetime(2012,1,1,15,0,0):].first_valid_index()

ts[datetime(2012,1,1,15,0,0):] is a pandas.TimeSeries object, right? And I could possibly mutate it.

Does it mean that whenever I take a slice, there's a copy of ts being allocated in memory? Does it mean that this innocuous line of code could actually trigger the copy of a gigabyte of TimeSeries just to get an index value?

Or perhaps they magically share memory, and a lazy copy is made only if one of the objects is mutated? But then, how do you know which specific operations trigger a copy? Maybe not slicing, but how about renaming columns? The documentation doesn't seem to say. Does that bother you? Should it bother me, or should I just learn not to worry and catch problems with a profiler?

Arthur B. asked Oct 23 '12 22:10


1 Answer

Some setup:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from datetime import datetime
In [4]: dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7), datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]

In [5]: ts = pd.Series(np.random.randn(6), index=dates)

In [6]: ts
Out[6]: 
2011-01-02   -0.412335
2011-01-05   -0.809092
2011-01-07   -0.442320
2011-01-08   -0.337281
2011-01-10    0.522765
2011-01-12    1.559876

Okay, now to answer your first question, (a): yes, there are less clunky ways, depending on your intention. This is pretty simple:

In [9]: ts[datetime(2011, 1, 8):]
Out[9]: 
2011-01-08   -0.337281
2011-01-10    0.522765
2011-01-12    1.559876

This is a slice containing all the values on or after your chosen date (label slicing includes the start label). You can select just the first one, as you wanted, with:

In [10]: ts[datetime(2011, 1, 8):].iloc[0]
Out[10]: -0.33728079849770815
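
If the goal is only the first index at or after a timestamp, here is a minimal sketch that avoids building the slice at all. It assumes the index is a sorted DatetimeIndex, and the helper name first_at_or_after is made up for illustration. Unlike first_valid_index, it does not skip NaN values.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
# Deterministic values (0.0 .. 5.0) instead of random ones, for readability.
ts = pd.Series(np.arange(6, dtype=float), index=dates)


def first_at_or_after(series, when):
    """Return (label, value) of the first entry at or after `when`, or None.

    Hypothetical helper: assumes `series.index` is sorted in ascending order.
    """
    # side="left" gives the position of the first label >= `when`.
    pos = series.index.searchsorted(when, side="left")
    if pos == len(series):
        return None  # every label is earlier than `when`
    return series.index[pos], series.iloc[pos]


result = first_at_or_after(ts, pd.Timestamp("2011-01-06"))
print(result)
```

The lookup is a binary search on the index, so no new Series is created just to read one label.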

To your second question, (b): this type of indexing returns a slice (a view) of the original, just as slicing a numpy array does. It is NOT a copy of the original. See this question, or one of the many similar ones: "Bug or feature: cloning a numpy array w/ slicing"

To demonstrate, let's modify the slice:

In [21]: ts2 = ts[datetime(2011, 1, 8):]
In [23]: ts2.iloc[0] = 99

This changes the original timeseries object ts, since ts2 is a slice and not a copy.

In [24]: ts
Out[24]: 
2011-01-02    -0.412335
2011-01-05    -0.809092
2011-01-07    -0.442320
2011-01-08    99.000000
2011-01-10     0.522765
2011-01-12     1.559876
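
On the worry about a gigabyte being copied: one way to check for yourself is to ask numpy whether the slice and the original share a buffer. This is only a sketch with made-up variable names, and it uses .loc for the label slice. Whether a write to the slice shows up in the original depends on the pandas version and settings: with Copy-on-Write enabled, the slice still shares memory until something is written, but the write itself no longer propagates back to ts.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6, dtype=float), index=dates)

# Label-based slice from 2011-01-08 onward; no mutation happens here.
tail = ts.loc[datetime(2011, 1, 8):]

# True means both Series are backed by overlapping memory (no data copied).
slice_shares = np.shares_memory(ts.to_numpy(), tail.to_numpy())

# An explicit copy gets its own buffer.
copy_shares = np.shares_memory(ts.to_numpy(), tail.copy().to_numpy())

print("slice shares memory with ts:", slice_shares)
print("copy shares memory with ts:", copy_shares)
```

So taking the slice just to read an index value should not duplicate the data.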

If you DO want a copy, you can (in general) use the copy method or (in this case) use truncate:

In [25]: ts3 = ts.truncate(before='2011-01-08')

In [26]: ts3  
Out[26]: 
2011-01-08    99.000000
2011-01-10     0.522765
2011-01-12     1.559876

Changing this copy will not change the original.

In [27]: ts3.iloc[1] = 99

In [28]: ts3
Out[28]: 
2011-01-08    99.000000
2011-01-10    99.000000
2011-01-12     1.559876

In [29]: ts                # The January 10th value is unchanged.
Out[29]: 
2011-01-02    -0.412335
2011-01-05    -0.809092
2011-01-07    -0.442320
2011-01-08    99.000000
2011-01-10     0.522765
2011-01-12     1.559876
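
For the general case, here is the same idea with the copy method mentioned earlier. It is a small sketch with illustrative names and fixed values rather than the random ones in the session.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6, dtype=float), index=dates)

# .copy() makes the result independent of ts before anything is modified.
tail_copy = ts.loc[datetime(2011, 1, 8):].copy()
tail_copy.iloc[0] = 99.0  # write to the copy by position

print(tail_copy)
print(ts)  # the 2011-01-08 entry in ts keeps its original value
```

The copy costs memory proportional to the slice, so it is worth doing only when you actually intend to modify the result.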

This example is straight out of "Python for Data Analysis" by Wes McKinney. Check it out. It's great.

Aman answered Oct 08 '22 21:10