
classifying a series to a new column in pandas

Tags: python, pandas

I want to be able to take my current set of data, which is filled with ints, and classify them according to certain criteria. The table looks something like this:

[in]> df = pd.DataFrame({'A':[0,2,3,2,0,0],'B': [1,0,2,0,0,0],'C': [0,0,1,0,1,0]})
[out]>
   A  B  C
0  0  1  0
1  2  0  0
2  3  2  1
3  2  0  0
4  0  0  1
5  0  0  0

I'd like to classify these rows in a separate column of strings. Being more familiar with R, I tried to create a new column with the rules in that column's definition. After that I tried .ix and lambdas, both of which resulted in type errors (between ints and Series). I'm under the impression that this is a fairly simple question. Although the following is completely wrong, here is the logic from attempt 1:

df['D']=(
if ((df['A'] > 0) & (df['B'] == 0) & df['C']==0): 
    return "c1";
elif ((df['A'] == 0) & ((df['B'] > 0) | df['C'] >0)): 
    return "c2";
else:
    return "c3";)

for a final result of:

   A  B  C     D
0  0  1  0  "c2"
1  2  0  0  "c1"
2  3  2  1  "c3"
3  2  0  0  "c1"
4  0  0  1  "c2"
5  0  0  0  "c3"

If someone could help me figure this out it would be much appreciated.

stites asked Mar 07 '13 20:03




1 Answer

I can think of two ways. The first is to write a classifier function and then .apply it row-wise:

>>> import pandas as pd
>>> df = pd.DataFrame({'A':[0,2,3,2,0,0],'B': [1,0,2,0,0,0],'C': [0,0,1,0,1,0]})
>>> 
>>> def classifier(row):
...     if row["A"] > 0 and row["B"] == 0 and row["C"] == 0:
...         return "c1"
...     elif row["A"] == 0 and (row["B"] > 0 or row["C"] > 0):
...         return "c2"
...     else:
...         return "c3"
...
>>> df["D"] = df.apply(classifier, axis=1)
>>> df
   A  B  C   D
0  0  1  0  c2
1  2  0  0  c1
2  3  2  1  c3
3  2  0  0  c1
4  0  0  1  c2
5  0  0  0  c3

and the second is to use boolean indexing with .loc:

>>> df = pd.DataFrame({'A':[0,2,3,2,0,0],'B': [1,0,2,0,0,0],'C': [0,0,1,0,1,0]})
>>> df["D"] = "c3"
>>> # assign through .loc; chained df["D"][mask] = ... may write to a copy
>>> df.loc[(df["A"] > 0) & (df["B"] == 0) & (df["C"] == 0), "D"] = "c1"
>>> df.loc[(df["A"] == 0) & ((df["B"] > 0) | (df["C"] > 0)), "D"] = "c2"
>>> df
   A  B  C   D
0  0  1  0  c2
1  2  0  0  c1
2  3  2  1  c3
3  2  0  0  c1
4  0  0  1  c2
5  0  0  0  c3
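
As a variation on the mask-based approach (not one of the two ways described here, just a sketch), the same conditions can be handed to numpy.select, which picks the label of the first condition that is true for each row and falls back to a default otherwise. The names conditions and labels are chosen for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 2, 3, 2, 0, 0],
                   'B': [1, 0, 2, 0, 0, 0],
                   'C': [0, 0, 1, 0, 1, 0]})

# One boolean mask per class, in priority order (illustrative names).
conditions = [
    (df["A"] > 0) & (df["B"] == 0) & (df["C"] == 0),   # c1
    (df["A"] == 0) & ((df["B"] > 0) | (df["C"] > 0)),  # c2
]
labels = ["c1", "c2"]

# Rows matching neither mask receive the default label.
df["D"] = np.select(conditions, labels, default="c3")
print(df)
```

This keeps the rules in one place and avoids assigning to the column several times, at the cost of needing numpy imported explicitly.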

Which one is clearer depends upon the situation. Usually the more complex the logic the more likely I am to wrap it up in a function I can then document and test.
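
To illustrate what "document and test" could look like for the function version, here is a minimal sketch. The docstring wording and the sample rows are my own assumptions; plain dicts stand in for DataFrame rows because the function only uses row[...] lookups.

```python
import pandas as pd

def classifier(row):
    """Classify one row with integer fields A, B and C.

    Returns "c1" when only A is positive, "c2" when A is zero and
    B or C is positive, and "c3" in every other case.
    """
    if row["A"] > 0 and row["B"] == 0 and row["C"] == 0:
        return "c1"
    elif row["A"] == 0 and (row["B"] > 0 or row["C"] > 0):
        return "c2"
    else:
        return "c3"

# Quick checks on hand-written rows, one per branch.
assert classifier({"A": 2, "B": 0, "C": 0}) == "c1"
assert classifier({"A": 0, "B": 0, "C": 1}) == "c2"
assert classifier({"A": 3, "B": 2, "C": 1}) == "c3"
assert classifier({"A": 0, "B": 0, "C": 0}) == "c3"

# The same function applied row-wise to the example frame.
df = pd.DataFrame({'A': [0, 2, 3, 2, 0, 0],
                   'B': [1, 0, 2, 0, 0, 0],
                   'C': [0, 0, 1, 0, 1, 0]})
df["D"] = df.apply(classifier, axis=1)
```

Because the checks run on plain dicts, they do not need a DataFrame at all, which makes each rule easy to test in isolation.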

DSM answered Sep 28 '22 09:09