Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Classify data by value in pandas

Tags:

python

pandas

I have a pandas.DataFrame of the form

low_bound   high_bound   name
0           10           'a'
10          20           'b'
20          30           'c'
30          40           'd'
40          50           'e'

I have a very long pandas.Series of the form:

value
5.7
30.4
21
35.1

I want to give to each value of the Series its corresponding name with respect to the low_bound/high_bound/name DataFrame. Here is my expected result:

value         name
5.7           'a'
30.4          'd'
21            'c'
35.1          'd'

Indeed, 5.7 name is 'a' since 5.7 is between 0 and 10 excluded.

What would be the most efficient code? I know I can solve the problem by iterating through the Series, but maybe there is a quicker vectorial solution which is escaping me.

Note finally that my bounds can be custom and irregular. Here they are regular for the sake of the example.

like image 551
sweeeeeet Avatar asked Apr 05 '16 09:04

sweeeeeet


People also ask

What does .values in pandas do?

Definition and Usage The values property returns all values in the DataFrame. The return value is a 2-dimensional array with one array for each row.

What does Dtypes do in pandas?

The dtypes property returns data type of each column in the DataFrame.


1 Answers

Pandas has a method called cut that will do what you want:

import pandas as pd

data = [{"low": 0, "high": 10, "name": "a"},
        {"low": 10, "high": 20, "name": "b"},
        {"low": 20, "high": 30, "name": "c"},
        {"low": 30, "high": 40, "name": "d"},
        {"low": 40, "high": 50, "name": "e"},]

myDF = pd.DataFrame(data)

#data to be binned
mySeries = pd.Series([5.7, 30.4, 21, 35.1])

#create bins from original data
bins = list(myDF["high"])
bins.insert(0,0)

print pd.cut(mySeries, bins, labels = myDF["name"])

That will give you the following, which you can then put back into some dataframe or however you want to hold your data:

0    a
1    d
2    c
3    d
dtype: category
Categories (5, object): [a < b < c < d < e]

Depending on how irregular your bins are (and what you mean exactly by custom/irregular), you might have to resort to looping through the series. I can't think off the top of my head of a builtin that will handle this for you, especially given that it depends on the degree/type of irregularity in the bins.

Looping wise, this method will work if you have a lower and upper bound, regardless of "regularity":

for el in mySeries:
    print myDF["name"][(myDF["low"] < el) & (myDF["high"] > el)]

I appreciate that you might not want to loop through a huge series, but at least we're not manually indexing into the dataframe, which would probably make things even slower

like image 89
Simon Avatar answered Sep 21 '22 06:09

Simon