I have a pandas.DataFrame
of the form
low_bound high_bound name
0 10 'a'
10 20 'b'
20 30 'c'
30 40 'd'
40 50 'e'
I have a very long pandas.Series
of the form:
value
5.7
30.4
21
35.1
I want to give to each value of the Series its corresponding name with respect to the low_bound/high_bound/name DataFrame. Here is my expected result:
value name
5.7 'a'
30.4 'd'
21 'c'
35.1 'd'
Indeed, 5.7 name is 'a' since 5.7 is between 0 and 10 excluded.
What would be the most efficient code? I know I can solve the problem by iterating through the Series, but maybe there is a quicker vectorial solution which is escaping me.
Note finally that my bounds can be custom and irregular. Here they are regular for the sake of the example.
Definition and Usage The values property returns all values in the DataFrame. The return value is a 2-dimensional array with one array for each row.
The dtypes property returns data type of each column in the DataFrame.
Pandas has a method called cut
that will do what you want:
import pandas as pd
data = [{"low": 0, "high": 10, "name": "a"},
{"low": 10, "high": 20, "name": "b"},
{"low": 20, "high": 30, "name": "c"},
{"low": 30, "high": 40, "name": "d"},
{"low": 40, "high": 50, "name": "e"},]
myDF = pd.DataFrame(data)
#data to be binned
mySeries = pd.Series([5.7, 30.4, 21, 35.1])
#create bins from original data
bins = list(myDF["high"])
bins.insert(0,0)
print pd.cut(mySeries, bins, labels = myDF["name"])
That will give you the following, which you can then put back into some dataframe or however you want to hold your data:
0 a
1 d
2 c
3 d
dtype: category
Categories (5, object): [a < b < c < d < e]
Depending on how irregular your bins are (and what you mean exactly by custom/irregular), you might have to resort to looping through the series. I can't think off the top of my head of a builtin that will handle this for you, especially given that it depends on the degree/type of irregularity in the bins.
Looping wise, this method will work if you have a lower and upper bound, regardless of "regularity":
for el in mySeries:
print myDF["name"][(myDF["low"] < el) & (myDF["high"] > el)]
I appreciate that you might not want to loop through a huge series, but at least we're not manually indexing into the dataframe, which would probably make things even slower
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With