I have a pandas DataFrame with a column of Timestamps. I can select date ranges from this column, but after I make a change to other columns in the DataFrame, I no longer can, and I get the error "TypeError: '>' not supported between instances of 'int' and 'str'".
The code below reproduces the problem:
Select on the date column:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
All good:
0 1 2 date
153 0.280575 0.810817 0.534509 2000-06-02
154 0.490319 0.873906 0.465698 2000-06-03
155 0.070790 0.898340 0.390777 2000-06-04
156 0.896007 0.824134 0.134484 2000-06-05
157 0.539633 0.814883 0.976257 2000-06-06
158 0.772454 0.420732 0.499719 2000-06-07
159 0.498020 0.495946 0.546043 2000-06-08
160 0.562385 0.460190 0.480170 2000-06-09
161 0.924412 0.611929 0.459360 2000-06-10
However, now I set column 0 to 0 if it exceeds 0.7 and repeat:
df[df[0] > 0.7] = 0
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
This gives the error:
TypeError: '>' not supported between instances of 'int' and 'str'
Why does this happen and how do I avoid it?
If you check the output, the problem is that the datetimes have been set to 0. No columns are specified for the assignment, so pandas sets every column in the matching rows:
df[df[0] > 0.7] = 0
print (df.head(10))
0 1 2 date
0 0.420593 0.519151 0.149883 2000-01-01 00:00:00
1 0.014364 0.503533 0.601206 2000-01-02 00:00:00
2 0.099144 0.090100 0.799383 2000-01-03 00:00:00
3 0.411158 0.144419 0.964909 2000-01-04 00:00:00
4 0.151470 0.424896 0.376281 2000-01-05 00:00:00
5 0.000000 0.000000 0.000000 0
6 0.292871 0.868168 0.353377 2000-01-07 00:00:00
7 0.536018 0.737273 0.356857 2000-01-08 00:00:00
8 0.364068 0.314311 0.475165 2000-01-09 00:00:00
9 0.000000 0.000000 0.000000 0
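A quick way to confirm this kind of damage is to look at the column's dtype and at the values that are no longer Timestamps. The sketch below is only an illustration: the `dates` Series is a hand-built, hypothetical stand-in for the damaged 'date' column, not the output of the code above.

```python
import pandas as pd

# Hypothetical stand-in for the damaged 'date' column: mostly Timestamps,
# with a plain integer 0 where a whole row was overwritten.
dates = pd.Series(
    [pd.Timestamp("2000-01-05"), 0, pd.Timestamp("2000-01-07")],
    dtype=object,
)

# A healthy datetime column reports a datetime64 dtype; this one reports object.
print(dates.dtype)

# Flag the entries that are no longer Timestamps.
is_bad = ~dates.map(lambda value: isinstance(value, pd.Timestamp))
print(dates[is_bad])
```

If `df.dtypes` shows `object` for a column you expect to hold dates, an assignment like the one above is a likely cause.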
One solution is to set only the numeric columns, selected with DataFrame.select_dtypes:
df.loc[df[0] > 0.7, df.select_dtypes(np.number).columns] = 0
#or specify columns by list
#df.loc[df[0] > 0.7, [0,1]] = 0
print (df.head(10))
0 1 2 date
0 0.416697 0.459268 0.146755 2000-01-01
1 0.645391 0.742737 0.023878 2000-01-02
2 0.000000 0.000000 0.000000 2000-01-03
3 0.456387 0.996946 0.450155 2000-01-04
4 0.000000 0.000000 0.000000 2000-01-05
5 0.000000 0.000000 0.000000 2000-01-06
6 0.265673 0.951874 0.175133 2000-01-07
7 0.434855 0.762386 0.653668 2000-01-08
8 0.000000 0.000000 0.000000 2000-01-09
9 0.000000 0.000000 0.000000 2000-01-10
Another solution, if all the other columns are numeric, is to move the dates into a DatetimeIndex:
df = df.set_index('date')
df.loc[df[0] > 0.7] = 0
print (df.head(10))
0 1 2
date
2000-01-01 0.316875 0.584754 0.925727
2000-01-02 0.000000 0.000000 0.000000
2000-01-03 0.326266 0.746555 0.825070
2000-01-04 0.492115 0.508553 0.971966
2000-01-05 0.160850 0.403678 0.107497
2000-01-06 0.000000 0.000000 0.000000
2000-01-07 0.047433 0.103412 0.789594
2000-01-08 0.527788 0.415356 0.926681
2000-01-09 0.468794 0.458531 0.435696
2000-01-10 0.261224 0.599815 0.435548
You can compare a timestamp (Timestamp('2000-01-01 00:00:00')) to a string; pandas will convert the string to a Timestamp for you. But once you set the value to 0, you cannot compare an int to a str.
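The difference can be seen in a small sketch. The `mixed` Series here is a hypothetical stand-in for a date column that has had a 0 written into it; the exact wording of the TypeError may differ between pandas versions, so the code only reports that one was raised.

```python
import pandas as pd

dates = pd.Series(pd.date_range("2000-01-01", periods=3, freq="D"))

# datetime64 column: the string is parsed as a date before comparing.
print((dates > "2000-01-01").tolist())  # [False, True, True]

# Object column holding an int (hypothetical stand-in for the damaged column).
mixed = pd.Series([pd.Timestamp("2000-01-01"), 0], dtype=object)

raised = False
try:
    mixed > "2000-01-01"
except TypeError as exc:
    # The values are compared one by one as plain Python objects,
    # and the string cannot be ordered against them.
    raised = True
    print("TypeError:", exc)
```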
Another way around this is to change the order of your operations, building the date mask before you overwrite the rows:
filters = df[0] > 0.7
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
df[filters] = 0
print(df.loc[mask & filters])
Also, you mentioned that you want to set column 0 to 0 if it exceeds 0.7, so df[df[0] > 0.7] = 0 does not do exactly what you want: it sets the entire rows to 0. Instead:
df.loc[df[0] > 0.7, 0] = 0
Then you should not have any problem with the original mask.
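As a small check of this, the sketch below uses a four-row frame with fixed, made-up values (rather than the random data from the question) so the result is easy to follow.

```python
import pandas as pd

# Small frame with made-up values; column labels mirror the question.
df = pd.DataFrame({0: [0.2, 0.9, 0.5, 0.8], 1: [0.1, 0.2, 0.3, 0.4]})
df["date"] = pd.date_range("2000-06-01", periods=4, freq="D")

# Overwrite only column 0 in the rows where it exceeds 0.7.
df.loc[df[0] > 0.7, 0] = 0

# The 'date' column was never touched, so the string comparison still works.
mask = (df["date"] > "2000-6-1") & (df["date"] <= "2000-6-10")
result = df.loc[mask]
print(result)
```

Column 1 keeps its original values, and the 'date' column keeps its datetime64 dtype.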
In my case the problem showed up inside a loop.
When you take a filtered DataFrame, make sure its date-like column is converted back to datetime with
df_new['date-like-column'] = pd.to_datetime(df_new['date-like-column'])
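A minimal sketch of that conversion, assuming the date-like column holds ISO-formatted strings. The frame and the column names `when` and `value` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical filtered frame whose date-like column holds strings.
df_new = pd.DataFrame(
    {"when": ["2000-06-01", "2000-06-05", "2000-06-12"], "value": [1, 2, 3]}
)

# Convert the column to datetime64 so range comparisons are done on dates.
df_new["when"] = pd.to_datetime(df_new["when"])

mask = (df_new["when"] > "2000-6-1") & (df_new["when"] <= "2000-6-10")
print(df_new.loc[mask])
```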