Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas element-wise comparison and create selection

In a dataframe I would like to compare the elements of a column with a value and sort the elements which pass the comparison into a new column.

df = pandas.DataFrame([{'A':3,'B':10},
                       {'A':2, 'B':30},
                       {'A':1,'B':20},
                       {'A':2,'B':15},
                       {'A':2,'B':100}])

df['C'] = [x for x in df['B'] if x > 18]

I can't find out what's wrongs and why I get:

ValueError: Length of values does not match length of index

like image 696
mati Avatar asked May 26 '26 17:05

mati


2 Answers

I think you can use loc with boolean indexing:

print (df)
   A    B
0  3   10
1  2   30
2  1   20
3  2   15
4  2  100

print (df['B'] > 18)
0    False
1     True
2     True
3    False
4     True
Name: B, dtype: bool

df.loc[df['B'] > 18, 'C'] = df['B']
print (df)
   A    B      C
0  3   10    NaN
1  2   30   30.0
2  1   20   20.0
3  2   15    NaN
4  2  100  100.0

If you need select by condition use boolean indexing:

print (df[df['B'] > 18])
   A    B
1  2   30
2  1   20
4  2  100

If need something more faster, use where:

df['C'] = df.B.where(df['B'] > 18)

Timings (len(df)=50k):

In [1367]: %timeit (a(df))
The slowest run took 8.34 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.14 ms per loop

In [1368]: %timeit (b(df1))
100 loops, best of 3: 15.5 ms per loop

In [1369]: %timeit (c(df2))
100 loops, best of 3: 2.93 ms per loop

Code for timings:

import pandas as pd

df = pd.DataFrame([{'A':3,'B':10},
                       {'A':2, 'B':30},
                       {'A':1,'B':20},
                       {'A':2,'B':15},
                       {'A':2,'B':100}])
print (df)
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()

def a(df):
    df['C'] = df.B.where(df['B'] > 18)
    return df

def b(df1):    
    df['C'] = ([x if x > 18 else None for x in df['B']])
    return df

def c(df2):    
    df.loc[df['B'] > 18, 'C'] = df['B']
    return df

print (a(df))
print (b(df1))
print (c(df2))
like image 60
jezrael Avatar answered May 30 '26 02:05

jezrael


As Darren mentioned, all columns in a DataFrame should have same length.

When you try print [x for x in df['B'] if x > 18], you get only [30, 20, 100] values. But you have got five index/rows. That's the reason you get Length of values does not match length of index error.

You can change your code as follows:

df['C'] = [x if x > 18 else None for x in df['B']]
print df

You will get:

   A    B      C
0  3   10    NaN
1  2   30   30.0
2  1   20   20.0
3  2   15    NaN
4  2  100  100.0
like image 39
Saranya Krishnamurthy Avatar answered May 30 '26 02:05

Saranya Krishnamurthy



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!