I have a pd.DataFrame (DF_test in the code below).
I want to apply a cutoff to the values to turn them into binary digits; my cutoff in this case is 0.85. I want the resulting dataframe to look like DF_want in the code below.
The script I wrote to do this is easy to understand but for large datasets it is inefficient. I'm sure Pandas has some way of taking care of these types of transformations.
Does anyone know of an efficient way to convert a column of floats to a column of integers using a threshold?
My extremely naive way of doing such a thing:
import numpy as np
import pandas as pd

DF_test = pd.DataFrame(np.array([list("abcde"),list("pqrst"),[0.12,0.23,0.93,0.86,0.33]]).T,columns=["c1","c2","value"])
DF_want = pd.DataFrame(np.array([list("abcde"),list("pqrst"),[0,0,1,1,0]]).T,columns=["c1","c2","value"])
threshold = 0.85
# Note: written for older pandas; .ix was removed in pandas 1.0 and DataFrame.append in 2.0
# Empty dataframe to append rows
DF_naive = pd.DataFrame()
for i in range(DF_test.shape[0]):
    # Get first 2 columns
    first2cols = list(DF_test.ix[i][:-1])
    # Check if value is greater than threshold
    binary_value = [int((bool(float(DF_test.ix[i][-1]) > threshold)))]
    # Create series object
    SR_row = pd.Series(first2cols + binary_value, name=i)
    # Add to empty dataframe container
    DF_naive = DF_naive.append(SR_row)
# Relabel columns
DF_naive.columns = DF_test.columns
DF_naive.head()
# Matches the sample DF_want
You can use np.where to set your desired value based on a boolean condition:
In [18]:
DF_test['value'] = np.where(DF_test['value'] > threshold, 1,0)
DF_test
Out[18]:
  c1 c2  value
0  a  p      0
1  b  q      0
2  c  r      1
3  d  s      1
4  e  t      0
Note that because your data is built from a heterogeneous NumPy array, the 'value' column contains strings rather than floats:
In [58]:
DF_test.iloc[0]['value']
Out[58]:
'0.12'
So you'll need to convert the dtype to float first:
DF_test['value'] = DF_test['value'].astype(float)
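Putting the two steps together, here is a minimal sketch. It assumes a frame shaped like the one in the question, with the 'value' column written out as strings directly to mimic what the mixed-type array produces:

```python
import numpy as np
import pandas as pd

# Assumed stand-in for DF_test: 'value' holds strings, as in the question.
DF_test = pd.DataFrame({
    "c1": list("abcde"),
    "c2": list("pqrst"),
    "value": ["0.12", "0.23", "0.93", "0.86", "0.33"],
})
threshold = 0.85

# Convert the string column to float before comparing against the threshold.
DF_test["value"] = DF_test["value"].astype(float)

# 1 where the value exceeds the threshold, 0 otherwise.
DF_test["value"] = np.where(DF_test["value"] > threshold, 1, 0)
```

Doing the conversion once up front keeps the comparison numeric; comparing the raw strings against a float would not give a meaningful result.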
You can compare the timings:
In [16]:
%timeit np.where(DF_test['value'] > threshold, 1,0)
1000 loops, best of 3: 297 µs per loop
In [17]:
%%timeit
DF_naive = pd.DataFrame()
for i in range(DF_test.shape[0]):
    # Get first 2 columns
    first2cols = list(DF_test.ix[i][:-1])
    # Check if value is greater than threshold
    binary_value = [int((bool(float(DF_test.ix[i][-1]) > threshold)))]
    # Create series object
    SR_row = pd.Series(first2cols + binary_value, name=i)
    # Add to empty dataframe container
    DF_naive = DF_naive.append(SR_row)
10 loops, best of 3: 39.3 ms per loop
The np.where version is over 100x faster. Admittedly, your code does a lot of unnecessary work, but you get the point.
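If you are not in IPython, a rough comparison can be sketched with the standard timeit module. The frame size, the random data, and the helper names here are assumptions for illustration, and the loop is a plain list comprehension rather than the row-by-row append from the question, so absolute numbers will differ from the ones above:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame so the difference is measurable.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.random(10_000)})
threshold = 0.85

def vectorized():
    # Single vectorized comparison over the whole column.
    return np.where(df["value"] > threshold, 1, 0)

def row_loop():
    # Pure-Python loop over the values, one comparison per element.
    return [int(v > threshold) for v in df["value"]]

t_vec = timeit.timeit(vectorized, number=20)
t_loop = timeit.timeit(row_loop, number=20)
print(f"np.where: {t_vec:.4f}s, Python loop: {t_loop:.4f}s")
```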
Since bool is a subclass of int, i.e. True == 1 and False == 0, you can convert a Boolean series to its integer form:
DF_test['value'] = (DF_test['value'] > threshold).astype(int)
Generally, including most uses in computation or indexing, the int conversion is not necessary and you may wish to forgo it altogether.
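As a small illustration of that last point, the Boolean series can be used directly for counting, averaging, and row selection. This sketch uses a standalone Series with the question's sample values; the variable names are made up for the example:

```python
import pandas as pd

# Sample values from the question, as floats.
values = pd.Series([0.12, 0.23, 0.93, 0.86, 0.33])
threshold = 0.85

above = values > threshold         # Boolean Series, no int conversion
count_above = int(above.sum())     # True counts as 1 when summed
share_above = float(above.mean())  # fraction of rows above the threshold
selected = values[above]           # Boolean indexing keeps matching rows
```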