Classify data by value in pandas

Tags: python, pandas

I have a pandas.DataFrame of the form

low_bound   high_bound   name
0           10           'a'
10          20           'b'
20          30           'c'
30          40           'd'
40          50           'e'

I have a very long pandas.Series of the form:

value
5.7
30.4
21
35.1

I want to assign to each value of the Series its corresponding name from the low_bound/high_bound/name DataFrame. Here is my expected result:

value         name
5.7           'a'
30.4          'd'
21            'c'
35.1          'd'

Indeed, the name for 5.7 is 'a', since 5.7 lies between 0 and 10 (excluded).

What would be the most efficient code? I know I can solve the problem by iterating through the Series, but maybe there is a quicker vectorized solution which is escaping me.

Note finally that my bounds can be custom and irregular. Here they are regular for the sake of the example.

asked Apr 05 '16 09:04

sweeeeeet

1 Answer

Pandas has a function called cut that will do what you want:

import pandas as pd

data = [{"low": 0, "high": 10, "name": "a"},
        {"low": 10, "high": 20, "name": "b"},
        {"low": 20, "high": 30, "name": "c"},
        {"low": 30, "high": 40, "name": "d"},
        {"low": 40, "high": 50, "name": "e"},]

myDF = pd.DataFrame(data)

#data to be binned
mySeries = pd.Series([5.7, 30.4, 21, 35.1])

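# create bin edges from the original data: first lower bound, then every upper bound
bins = list(myDF["high"])
bins.insert(0, myDF["low"].iloc[0])

# by default each bin is closed on the right, i.e. (low, high]
print(pd.cut(mySeries, bins, labels=list(myDF["name"])))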

That will give you the following, which you can then put back into some dataframe or however you want to hold your data:

0    a
1    d
2    c
3    d
dtype: category
Categories (5, object): [a < b < c < d < e]
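To illustrate putting the result back into a dataframe, here is a minimal sketch. The names bounds, values, edges and result are only illustrative and not from the question. It also passes right=False, on the assumption that the question wants each bin to include its lower bound and exclude its upper one. Drop that argument to keep the default (low, high] behaviour.

```python
import pandas as pd

bounds = pd.DataFrame({
    "low": [0, 10, 20, 30, 40],
    "high": [10, 20, 30, 40, 50],
    "name": ["a", "b", "c", "d", "e"],
})
values = pd.Series([5.7, 30.4, 21, 35.1], name="value")

# Bin edges: the first lower bound followed by every upper bound.
edges = [bounds["low"].iloc[0], *bounds["high"]]

# right=False makes each bin half-open as [low, high) (an assumption about
# the intended boundary handling), so a value of exactly 10 falls in 'b'.
names = pd.cut(values, bins=edges, labels=list(bounds["name"]), right=False)

# Hold the original values and their names side by side.
result = pd.DataFrame({"value": values, "name": names})
print(result)
```

The name column comes back as a categorical. If plain strings are easier to work with downstream, result["name"].astype(str) should convert it.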

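Depending on how irregular your bins are (and what exactly you mean by custom/irregular), this may or may not be enough. If the bins are contiguous but simply of unequal width, cut still works, since it only needs an increasing list of edges. If they have gaps or overlap, the single edge list built above no longer describes them, and you might have to resort to looping through the series. I can't think off the top of my head of a builtin that will handle this for you, especially given that it depends on the degree/type of irregularity in the bins.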

If you do loop, this approach works whenever each row has a lower and an upper bound, regardless of "regularity":

# both comparisons are strict, so a value exactly equal to a bound matches no row
for el in mySeries:
    print(myDF["name"][(myDF["low"] < el) & (myDF["high"] > el)])

I appreciate that you might not want to loop through a huge series, but at least we're not manually indexing into the dataframe, which would probably make things even slower.
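If the loop turns out to be too slow, one vectorized possibility for bins with gaps is pandas' IntervalIndex. The following is only a sketch under assumptions: the bounds below are made up (unequal widths, a gap between 10 and 15), the intervals are taken as [low, high), and as far as I know get_indexer only works when the intervals do not overlap.

```python
import numpy as np
import pandas as pd

# Hypothetical bounds: unequal widths and a gap between 10 and 15.
bounds = pd.DataFrame({
    "low": [0.0, 15.0, 20.0, 50.0],
    "high": [10.0, 20.0, 50.0, 51.0],
    "name": ["a", "b", "c", "d"],
})
values = pd.Series([5.7, 12.0, 21, 50.5], name="value")

# closed="left" treats each row as the half-open interval [low, high).
intervals = pd.IntervalIndex.from_arrays(
    bounds["low"].to_numpy(), bounds["high"].to_numpy(), closed="left"
)

# Row position of the interval containing each value, or -1 if none does.
pos = intervals.get_indexer(values.to_numpy())

# Look up the names for matched values; unmatched values stay None.
matched = pos >= 0
names = np.full(len(values), None, dtype=object)
names[matched] = bounds["name"].to_numpy()[pos[matched]]

result = pd.DataFrame({"value": values, "name": names})
print(result)
```

Here 12.0 falls in the gap and is left without a name, which the strict-inequality loop above would also report as an empty match.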

answered Sep 21 '22 06:09

Simon
