There are four cars: bmw, geo, vw and porsche:
import pandas as pd
df = pd.DataFrame({
'car': ['bmw','geo','vw','porsche'],
'warranty': ['yes','yes','yes','no'],
'dvd': ['yes','yes','no','yes'],
'sunroof': ['yes','no','no','no']})
I would like to create a filtered DataFrame that lists only those cars that have all three features: a DVD player, a sunroof and a warranty (here only bmw has all three set to 'yes').
I can filter on one column at a time with:
cars_with_warranty = df['car'][df['warranty']=='yes']
print(cars_with_warranty)
Then I need to repeat the same filter for the dvd and sunroof columns:
cars_with_dvd = df['car'][df['dvd']=='yes']
cars_with_sunroof = df['car'][df['sunroof']=='yes']
I wonder if there is a clever way of creating the filtered DataFrame?
The posted solution works well, but the resulting cars_with_all_three is a plain list. We need a DataFrame with 'bmw' as its only row and all three columns in place: dvd, sunroof and warranty (all three set to 'yes').
cars_with_all_three = []
for ind, car in enumerate(df['car']):
    if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes':
        cars_with_all_three.append(car)
You can use boolean indexing:
print ((df.dvd == 'yes') & (df.sunroof == 'yes') & (df.warranty == 'yes'))
0 True
1 False
2 False
3 False
dtype: bool
print (df[(df.dvd == 'yes') & (df.sunroof == 'yes') & (df.warranty == 'yes')])
car dvd sunroof warranty
0 bmw yes yes yes
# if you need only the 'car' column (.loc replaces .ix, which was removed in pandas 1.0)
print (df.loc[(df.dvd == 'yes')&(df.sunroof == 'yes')&(df.warranty == 'yes'), 'car'])
0 bmw
Name: car, dtype: object
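For reference, the boolean-indexing approach above can be written as a self-contained script. This is a minimal sketch assuming a recent pandas release and using .loc for the column selection; the names mask, all_three and cars are illustrative and not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd': ['yes', 'yes', 'no', 'yes'],
    'sunroof': ['yes', 'no', 'no', 'no']})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (df['dvd'] == 'yes') & (df['sunroof'] == 'yes') & (df['warranty'] == 'yes')

# Full rows that satisfy all three conditions (still a DataFrame)
all_three = df[mask]

# Only the 'car' column of those rows (a Series)
cars = df.loc[mask, 'car']

print(all_three)
print(cars)
```

Indexing with df[mask] keeps every column, which is what the question asks for; df.loc[mask, 'car'] narrows the result to a single column.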
Another solution compares the selected columns to 'yes' and then uses all to check whether every value in a row is True:
print ((df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1))
0 True
1 False
2 False
3 False
dtype: bool
print (df[(df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
car dvd sunroof warranty
0 bmw yes yes yes
print (df.loc[(df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1), 'car'])
0 bmw
Name: car, dtype: object
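The all-based variant scales more easily when the number of feature columns grows, because the column names live in one list. The following is a minimal sketch of that idea; the names features, is_yes, mask and result are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd': ['yes', 'yes', 'no', 'yes'],
    'sunroof': ['yes', 'no', 'no', 'no']})

# Columns that must all be 'yes' (illustrative name)
features = ['dvd', 'sunroof', 'warranty']

# Element-wise comparison gives a boolean DataFrame with one column per feature
is_yes = df[features] == 'yes'

# all(axis=1) reduces across the columns: True only where every feature is 'yes'
mask = is_yes.all(axis=1)

result = df[mask]
print(result)
```

Adding a fourth feature then only requires extending the features list, not writing another chained comparison.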
A solution with minimal code, if the DataFrame has only the 4 columns shown in the sample:
print (df[(df.set_index('car') == 'yes').all(1).values])
car dvd sunroof warranty
0 bmw yes yes yes
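The .values at the end of that one-liner matters: after set_index('car') the mask is labelled by car name, while df is still labelled 0 to 3, so the two indexes do not line up. The sketch below shows this step by step. It uses to_numpy(), which is equivalent to .values for this purpose in recent pandas releases; the names labelled_mask and positional_mask are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd': ['yes', 'yes', 'no', 'yes'],
    'sunroof': ['yes', 'no', 'no', 'no']})

# After set_index, every remaining column is a feature column,
# and the resulting mask is labelled by car name rather than 0..3
labelled_mask = (df.set_index('car') == 'yes').all(axis=1)
print(labelled_mask.index.tolist())

# Converting to a plain NumPy array drops the labels,
# so the mask is applied to df by position
positional_mask = labelled_mask.to_numpy()

result = df[positional_mask]
print(result)
```

This shortcut only works when every column other than car is a feature to be tested, which is why it is limited to DataFrames shaped like the sample.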
Timings:
In [44]: %timeit ([car for ind, car in enumerate(df['car']) if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes'])
10 loops, best of 3: 120 ms per loop
In [45]: %timeit (df[(df.dvd == 'yes')&(df.sunroof == 'yes')&(df.warranty == 'yes')])
The slowest run took 4.39 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.09 ms per loop
In [46]: %timeit (df[(df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
1000 loops, best of 3: 1.53 ms per loop
In [47]: %timeit (df[(df.ix[:, [u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.51 ms per loop
In [48]: %timeit (df[(df.set_index('car') == 'yes').all(1).values])
1000 loops, best of 3: 1.64 ms per loop
In [49]: %timeit (mer(df))
The slowest run took 4.17 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 3.85 ms per loop
Code for timings:
df = pd.DataFrame({
'car': ['bmw','geo','vw','porsche'],
'warranty': ['yes','yes','yes','no'],
'dvd': ['yes','yes','no','yes'],
'sunroof': ['yes','no','no','no']})
print (df)
df = pd.concat([df]*1000).reset_index(drop=True)
def mer(df):
    df = df.set_index('car')
    return df[df[[ u'dvd', u'sunroof', u'warranty']] == "yes"].dropna().reset_index()
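The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The helper names chained and with_all and the repeat counts are assumptions made for illustration, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'car': ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd': ['yes', 'yes', 'no', 'yes'],
    'sunroof': ['yes', 'no', 'no', 'no']})

# Same enlargement as in the timing setup above: 4 rows become 4000
df = pd.concat([df] * 1000).reset_index(drop=True)

cols = ['dvd', 'sunroof', 'warranty']


def chained(frame):
    # Three comparisons combined with &
    return frame[(frame.dvd == 'yes') & (frame.sunroof == 'yes') & (frame.warranty == 'yes')]


def with_all(frame):
    # One comparison over the column subset, reduced with all
    return frame[(frame[cols] == 'yes').all(axis=1)]


# Both approaches should select exactly the same rows
assert chained(df).equals(with_all(df))

for func in (chained, with_all):
    # Best of 3 runs of 100 calls each, reported per call
    seconds = min(timeit.repeat(lambda: func(df), number=100, repeat=3)) / 100
    print(f"{func.__name__}: {seconds * 1e3:.3f} ms per call")
```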