Suppose I have the below df:
import pandas as pd
data_dic = {
"a": [0,0,1,2],
"b": [0,3,4,5],
"c": [6,7,8,9]
}
df = pd.DataFrame(data_dic)
Result:
a b c
0 0 0 6
1 0 3 7
2 1 4 8
3 2 5 9
I need to assign a value to a new column based on the above columns, with these conditions:
if df.a > 0 then value df.a
else if df.b > 0 then value df.b
else value df.c
For now I am trying:
df['value'] = [x if x > 0 else 'ww' for x in df['a']]
but I don't know how to add more conditions to this.
Expected result:
a b c value
0 0 0 6 6
1 0 3 7 3
2 1 4 8 1
3 2 5 9 2
Thank you for your hard work.
Use numpy.select:
import numpy as np

df['value'] = np.select([df.a > 0, df.b > 0], [df.a, df.b], default=df.c)
print(df)
a b c value
0 0 0 6 6
1 0 3 7 3
2 1 4 8 1
3 2 5 9 2
Difference between the vectorized and loop solutions on 400k rows:
df = pd.concat([df] * 100000, ignore_index=True)
In [158]: %timeit df['value2'] = np.select([df.a > 0 , df.b > 0], [df.a, df.b], default=df.c)
9.86 ms ± 611 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [159]: %timeit df['value1'] = [x if x > 0 else y if y>0 else z for x,y,z in zip(df['a'],df['b'],df['c'])]
399 ms ± 52.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
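For reference, a minimal, self-contained sketch of that comparison (timings vary by machine, but both approaches produce the same column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 2], "b": [0, 3, 4, 5], "c": [6, 7, 8, 9]})
df = pd.concat([df] * 100000, ignore_index=True)  # 400k rows

# Vectorized: np.select evaluates all conditions at once
df['value2'] = np.select([df.a > 0, df.b > 0], [df.a, df.b], default=df.c)

# Loop: a plain Python conditional per row
df['value1'] = [x if x > 0 else y if y > 0 else z
                for x, y, z in zip(df['a'], df['b'], df['c'])]

# Both produce identical results; only the speed differs
print(df['value1'].equals(df['value2']))  # True
```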
You can also use a list comprehension:
df['value'] = [x if x > 0 else y if y>0 else z for x,y,z in zip(df['a'],df['b'],df['c'])]
You can write a function that takes a row as a parameter, tests whatever conditions you want to test, and returns a True or False result, which you can then use as a selection tool. (Though on rereading your question, this may not be what you're looking for; see Part 2 below.)
Part 1 - Perform a Selection
Apply this function to your dataframe, and use the returned Series of True/False answers as an index to select rows from the dataframe itself.
e.g.
def selector(row):
    if row['a'] > 0 and row['b'] == 3:
        return True
    elif row['c'] > 2:
        return True
    else:
        return False
You can build whatever logic you like, just ensure it returns True when you want a match and False when you don't.
Then try something like
df.apply(lambda row: selector(row), axis=1)
and it will return a Series of True/False answers. Plug that into your df to select only those rows that have a True value calculated for them:
df[df.apply(lambda row: selector(row), axis=1)]
And that should give you what you want.
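Putting Part 1 together with the df from the question (a sketch; with this particular data the example conditions happen to match every row, so tweak them to see actual filtering):

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 2], "b": [0, 3, 4, 5], "c": [6, 7, 8, 9]})

def selector(row):
    # True when the row matches, False otherwise
    if row['a'] > 0 and row['b'] == 3:
        return True
    elif row['c'] > 2:
        return True
    else:
        return False

mask = df.apply(selector, axis=1)  # Series of True/False, one per row
print(df[mask])                    # keeps only the rows where mask is True
```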
Part 2 - Perform a Calculation
If you want to create a new column containing some calculated result, it's a similar operation: create a function that performs your calculation:
def mycalc(row):
    if row['a'] > 5:
        return row['a'] + row['b']
    else:
        return 66
Only this time, apply the function and assign the result to a new column name:
df['value'] = df.apply(lambda row: mycalc(row), axis=1)
And this will give you that result.
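As a sketch with the question's df (none of its a values exceed 5, so every row falls through to the else branch and gets 66):

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 2], "b": [0, 3, 4, 5], "c": [6, 7, 8, 9]})

def mycalc(row):
    if row['a'] > 5:
        return row['a'] + row['b']
    else:
        return 66

# apply row-wise and assign the result to a new column
df['value'] = df.apply(mycalc, axis=1)
print(df['value'].tolist())  # [66, 66, 66, 66]
```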