I have a large DataFrame that I would like to slice so that calculations performed on the slice update the values in the original. In addition, I am slicing by a start and end time that may not exist in the index. Below is a simplified example; in practice I want to update several columns using different calculations.
In [1]: df
Out[1]:
A B C
TIME
2014-01-02 14:00:00 -1.172285 1.706200 NaN
2014-01-02 14:05:00 0.039511 -0.320798 NaN
2014-01-02 14:10:00 -0.192179 -0.539397 NaN
2014-01-02 14:15:00 -0.475917 -0.280055 NaN
2014-01-02 14:20:00 0.163376 1.124602 NaN
2014-01-02 14:25:00 -2.477812 0.656750 NaN
I have tried each of the statements below to create sdf as a view of my time range:
start = datetime.strptime('2014-01-02 14:07:00', '%Y-%m-%d %H:%M:%S')
end = datetime.strptime('2014-01-02 14:22:00', '%Y-%m-%d %H:%M:%S')
sdf = df[start:end]
sdf = df[start < df.index < end]
sdf = df.ix[start:end]
sdf = df.loc[start:end]
sdf = df.truncate(before=start, after=end, copy=False)
sdf['C'] = 100
Most return a copy, and I get a SettingWithCopyWarning. The loc call says the index is incompatible with datetime. Is this something I should be able to do? The result I would like after updating the slice is:
In [1]: df
Out[1]:
A B C
TIME
2014-01-02 14:00:00 -1.172285 1.706200 NaN
2014-01-02 14:05:00 0.039511 -0.320798 NaN
2014-01-02 14:10:00 -0.192179 -0.539397 100
2014-01-02 14:15:00 -0.475917 -0.280055 100
2014-01-02 14:20:00 0.163376 1.124602 100
2014-01-02 14:25:00 -2.477812 0.656750 NaN
Can anyone please suggest a way to do this? Am I approaching this the wrong way?
Thanks
One way is to use loc with a boolean mask: wrap each condition in parentheses and combine them with the bitwise operator &. The bitwise operator is required because you are comparing arrays of values rather than single values, and the parentheses are required because & binds more tightly than the comparison operators. We can then use the mask to perform label selection with loc and set the 'C' column like so:
In [15]:
import datetime as dt
start = dt.datetime.strptime('2014-01-02 14:07:00', '%Y-%m-%d %H:%M:%S')
end = dt.datetime.strptime('2014-01-02 14:22:00', '%Y-%m-%d %H:%M:%S')
df.loc[(df.index > start) & (df.index < end), 'C'] = 100
df
Out[15]:
A B C
TIME
2014-01-02 14:00:00 -1.172285 1.706200 NaN
2014-01-02 14:05:00 0.039511 -0.320798 NaN
2014-01-02 14:10:00 -0.192179 -0.539397 100
2014-01-02 14:15:00 -0.475917 -0.280055 100
2014-01-02 14:20:00 0.163376 1.124602 100
2014-01-02 14:25:00 -2.477812 0.656750 NaN
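For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The frame is rebuilt with random placeholder values (the numbers in the question are not reproduced), and the names `mask` and `updated` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question.
# The A and B values are random placeholders, not the original data.
idx = pd.date_range("2014-01-02 14:00:00", periods=6, freq="5min", name="TIME")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"A": rng.standard_normal(6), "B": rng.standard_normal(6), "C": np.nan},
    index=idx,
)

start = pd.Timestamp("2014-01-02 14:07:00")
end = pd.Timestamp("2014-01-02 14:22:00")

# Boolean mask: True for rows strictly between start and end.
mask = (df.index > start) & (df.index < end)

# Assign through df.loc so the original frame is modified in place.
df.loc[mask, "C"] = 100

# Timestamps of the rows that were updated.
updated = df.index[df["C"] == 100]
print(df)
```

Because the assignment goes through `df.loc` on the original frame, there is no intermediate slice and therefore nothing for SettingWithCopyWarning to complain about.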
If we look at each method you tried and why they didn't work:
sdf = df[start:end]  # can raise KeyError when start and end are not present in the index (for example, if the index is not sorted)
sdf = df[start < df.index < end]  # raises ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). The chained comparison needs a single bool, but you are comparing arrays of values
sdf = df.ix[start:end]  # same KeyError as the first example (.ix was later removed in pandas 1.0)
sdf = df.loc[start:end]  # same KeyError as the first example
sdf = df.truncate(before=start, after=end, copy=False)  # selects the correct rows, but assigning to the result raises SettingWithCopyWarning, as you found
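To make the chained-comparison failure concrete, here is a small sketch on a toy frame. It also shows one thing worth checking on your own data: as far as I know, when the index is a sorted DatetimeIndex (an assumption here, since the question's actual index may differ), label slicing through `df.loc` accepts bounds that are not present and can be assigned to directly. Note that label slices include both endpoints, unlike the strict `>` and `<` mask.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-01-02 14:00:00", periods=6, freq="5min", name="TIME")
df = pd.DataFrame({"A": np.arange(6.0), "C": np.nan}, index=idx)
start = pd.Timestamp("2014-01-02 14:07:00")
end = pd.Timestamp("2014-01-02 14:22:00")

# `start < df.index < end` expands to `(start < df.index) and (df.index < end)`.
# `and` needs a single bool, so converting the boolean array raises ValueError.
try:
    df[start < df.index < end]
    chained_error = None
except ValueError as exc:
    chained_error = exc

# Element-wise version: each comparison in parentheses, combined with `&`.
mask = (df.index > start) & (df.index < end)

# Assumption: the index is a sorted DatetimeIndex, so the slice bounds
# do not need to exist as labels. Both endpoints are inclusive.
df.loc[start:end, "C"] = 100
print(df)
```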
EDIT
You can assign the mask to sdf and use it with loc to set your 'C' column:
In [7]:
import datetime as dt
start = dt.datetime.strptime('2014-01-02 14:07:00', '%Y-%m-%d %H:%M:%S')
end = dt.datetime.strptime('2014-01-02 14:22:00', '%Y-%m-%d %H:%M:%S')
sdf = (df.index > start) & (df.index < end)
df.loc[sdf,'C'] = 100
df
Out[7]:
A B C
TIME
2014-01-02 14:00:00 -1.172285 1.706200 NaN
2014-01-02 14:05:00 0.039511 -0.320798 NaN
2014-01-02 14:10:00 -0.192179 -0.539397 100
2014-01-02 14:15:00 -0.475917 -0.280055 100
2014-01-02 14:20:00 0.163376 1.124602 100
2014-01-02 14:25:00 -2.477812 0.656750 NaN
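Since the question mentions updating a number of columns with different calculations, the same mask can be reused for each assignment. The sketch below uses made-up values and made-up calculations (a sum and a doubling) purely for illustration; `in_range` is just a hypothetical name for the mask.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-01-02 14:00:00", periods=6, freq="5min", name="TIME")
df = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "B": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
        "C": np.nan,
    },
    index=idx,
)
start = pd.Timestamp("2014-01-02 14:07:00")
end = pd.Timestamp("2014-01-02 14:22:00")

# Build the mask once and reuse it for every column update.
in_range = (df.index > start) & (df.index < end)

# Read the masked rows, compute, and write back through df.loc.
# The right-hand side is aligned to the selected rows by index label.
df.loc[in_range, "C"] = df.loc[in_range, "A"] + df.loc[in_range, "B"]
df.loc[in_range, "B"] = df.loc[in_range, "B"] * 2
print(df)
```

Each statement reads from and writes to `df` itself, so the original frame is updated and rows outside the time range are left untouched.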