python pandas operations on columns

Tags: python, pandas

Hi, I would like to know the best way to do operations on columns in Python using pandas.

I have a classical database which I have loaded as a DataFrame, and I often have to do operations such as: for each row, if the value in the column labeled 'A' is greater than x, replace this value with column 'C' minus column 'D'.

For now I do something like:

for i in df.index:
    if df.loc[i, 'A'] > x:
        df.loc[i, 'A'] = df.loc[i, 'C'] - df.loc[i, 'D']

I would like to know if there is a simpler way of doing these kinds of operations and, more importantly, the most efficient one, as I have large databases.

I tried without the for loop, as in R or Stata. I was advised to use "a.any" or "a.all", but I did not find anything either here or in the pandas docs.

Thanks in advance.

Anthony Martin asked Aug 12 '13 07:08


2 Answers

You can use a boolean mask with the .loc indexer of the DataFrame (the older .ix indexer did the same job but was removed in pandas 1.0).

mask = df['A'] > 2
df.loc[mask, 'A'] = df.loc[mask, 'C'] - df.loc[mask, 'D']
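
As a self-contained sketch of the boolean-mask approach: the sample data and the threshold x below are made up for illustration, and the code is written with .loc so that it runs on current pandas versions.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "A": [1, 5, 3, 8],
    "C": [10, 20, 30, 40],
    "D": [1, 2, 3, 4],
})
x = 2  # threshold, chosen arbitrarily for this example

# Boolean Series: True on the rows where column A exceeds the threshold.
mask = df["A"] > x

# Assign C - D on the masked rows only; the other rows keep their value of A.
df.loc[mask, "A"] = df.loc[mask, "C"] - df.loc[mask, "D"]

print(df)
```

The whole comparison and the subtraction run on entire columns at once, which is why this tends to scale much better than visiting rows one at a time.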

If you have a lot of branching logic, you can apply a function row by row:

def func(row):
    if row['A'] > 0:
        return row['B'] + row['C']
    elif row['B'] < 0:
        return row['D'] + row['A']
    else:
        return row['A']

df['A'] = df.apply(func, axis=1)

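apply is generally faster than an explicit for loop over the index, but it still calls the Python function once per row, so on large DataFrames the boolean-mask approach above is usually much faster.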
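
When each branch is a simple column expression, the same logic can often be written without a per-row function. The sketch below uses numpy's np.select, which is a swapped-in technique and not part of the answer above. The sample data is made up, and func is copied from the answer only to compare results.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data covering all three branches of func.
df = pd.DataFrame({
    "A": [1.0, -2.0, -3.0],
    "B": [4.0, -5.0, 6.0],
    "C": [7.0, 8.0, 9.0],
    "D": [10.0, 11.0, 12.0],
})

def func(row):
    # Row-wise version from the answer, kept for comparison.
    if row["A"] > 0:
        return row["B"] + row["C"]
    elif row["B"] < 0:
        return row["D"] + row["A"]
    else:
        return row["A"]

expected = df.apply(func, axis=1)

# np.select picks, for each row, the choice belonging to the first
# condition that is True, and falls back to `default` otherwise.
conditions = [df["A"] > 0, df["B"] < 0]
choices = [df["B"] + df["C"], df["D"] + df["A"]]
df["A"] = np.select(conditions, choices, default=df["A"])

# Both approaches should agree on every row.
print((df["A"] == expected).all())
```

The order of the conditions matters in the same way as the order of the if/elif branches, so the lists should be kept in the same sequence as the original function.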

Viktor Kerkez answered Sep 20 '22 10:09


The simplest way, in my opinion:

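from random import randrange
import pandas as pd
import numpy as np

# 'a' and 'b' are single random integers repeated on every row; 'c' holds 10 random floats
df = pd.DataFrame({'a': randrange(0, 10), 'b': randrange(10, 20), 'c': np.random.randn(10)})

# If column c > 0.5, then c = b - a
# A single .loc call avoids chained assignment, which may fail to modify df
df.loc[df['c'] > 0.5, 'c'] = df['b'] - df['a']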

Tested, it works.

a   b   c
2  11 -0.576309
2  11 -0.578449
2  11 -1.085822
2  11  9.000000
2  11  9.000000
2  11 -1.081405

Amrita Sawant answered Sep 20 '22 10:09