
Duplicating some rows and changing some values in pandas

I have a pandas DataFrame looking like this:

From    To    Val
GE      VD    1000
GE      VS    1600
VS      VD    1500
VS      GE     600
VD      GE    1200
VD      VS    1300

I would like to replace every row that does not have "GE" in the "From" or "To" column with two rows: one with "GE" in the "From" column and one with "GE" in the "To" column. In the example above, I would replace the third row with the following two rows:
GE VD 1500
VS GE 1500
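For reproducibility, the sample frame above can be rebuilt with (a sketch; column names and values taken from the table in the question):

```python
import pandas as pd

# The sample data from the question, for copy-paste reproduction:
df = pd.DataFrame({"From": ["GE", "GE", "VS", "VS", "VD", "VD"],
                   "To":   ["VD", "VS", "VD", "GE", "GE", "VS"],
                   "Val":  [1000, 1600, 1500, 600, 1200, 1300]})
```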

I tried using "apply", but I can't figure out how to return a correct DataFrame. For example:

def myfun(row):
    if "GE" not in (row["From"], row["To"]):
        row1 = pd.DataFrame(row).T
        row2 = row1.copy()
        row1["From"] = "GE"
        row2["To"] = "GE"
        return pd.concat([row1, row2])
    else:
        return pd.DataFrame(row).T

Gives a strange result:

>> df.apply(myfun, axis=1)
   From  To  Val
0  From  To  Val
1  From  To  Val
2  From  To  Val
3  From  To  Val
4  From  To  Val
5  From  To  Val

Although my function seems correct:

>> myfun(df.loc[5])
  From  To   Val
5   GE  VS  1300
5   VD  GE  1300

I can think of a way of doing it by splitting my DataFrame into two sub-DataFrames: one with the rows needing duplication and one with the rest. I would then duplicate the first, make the changes, and concatenate all three together. But it's ugly. Can anyone suggest a more elegant way?

In other words, can the applied function return a DataFrame, as we would do in R with ddply?

Thanks

asked Jan 13 '14 by user3190381

2 Answers

Filtering:

In [153]: sub = df[(~df[['From', 'To']].isin(['GE'])).all(1)]

In [154]: sub
Out[154]: 
  From  To   Val
2   VS  VD  1500
5   VD  VS  1300

[2 rows x 3 columns]


In [179]: good = df.loc[df.index.difference(sub.index)]

In [180]: good
Out[180]: 
  From  To   Val
0   GE  VD  1000
1   GE  VS  1600
3   VS  GE   600
4   VD  GE  1200

[4 rows x 3 columns]

Define a function that gives the desired values as a DataFrame:

def new_df(row):
    return pd.DataFrame({"From": ["GE", row["From"]],
                         "To": [row["To"], "GE"],
                         "Val": [row["Val"], row["Val"]]})

Apply that function to the rows:

In [181]: new = pd.concat([new_df(y) for _, y in sub.iterrows()], axis=0, ignore_index=True)

In [182]: new
Out[182]: 
  From  To   Val
0   GE  VD  1500
1   VS  GE  1500
2   GE  VS  1300
3   VD  GE  1300

[4 rows x 3 columns]

And concatenate everything together:

In [183]: pd.concat([good, new], axis=0, ignore_index=True)
Out[183]: 
  From  To   Val
0   GE  VD  1000
1   GE  VS  1600
2   VS  GE   600
3   VD  GE  1200
4   GE  VD  1500
5   VS  GE  1500
6   GE  VS  1300
7   VD  GE  1300

[8 rows x 3 columns]
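On current pandas (where `.ix` is gone), the same filter/expand/concat idea can be written more compactly; this is a sketch using the question's sample data, relying only on boolean masking and `DataFrame.assign`:

```python
import pandas as pd

df = pd.DataFrame({"From": ["GE", "GE", "VS", "VS", "VD", "VD"],
                   "To":   ["VD", "VS", "VD", "GE", "GE", "VS"],
                   "Val":  [1000, 1600, 1500, 600, 1200, 1300]})

# Rows with no "GE" in either column need to be split in two.
mask = (~df[["From", "To"]].isin(["GE"])).all(axis=1)
sub, good = df[mask], df[~mask]

# One copy gets "GE" in "From", the other gets "GE" in "To".
new = pd.concat([sub.assign(From="GE"), sub.assign(To="GE")])

result = pd.concat([good, new], ignore_index=True)
```

The row order differs from the original frame (kept rows first, then the expanded ones), but the content is the same eight rows as in the output above.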
answered Oct 05 '22 by TomAugspurger


This makes two passes through the data. It could be shortened by adding an else branch that concatenates the unchanged rows as it goes. However, I find this more readable, and since we're using itertuples to iterate over the rows, the cost is linear and each tuple is formed only as needed (not a big list of tuples for all rows at once).

Similarly, you could pop a row inside of the if statement and concatenate the two new rows in its place back onto the original data object df, so that you don't incur the memory cost of creating keeper_rows. It's just not usually worth it to make these kinds of optimizations for a task like this unless the DataFrame is gigantic.

import pandas

# itertuples yields the index as the first element, so the "From"/"To"
# columns are positions 1 and 2, and we unpack the index separately.
keeper_rows = df.loc[[i for i, x in enumerate(df.itertuples()) if 'GE' in x[1:3]]]
for row_as_tuple in df.itertuples():
    _, from_other, to_other, val = row_as_tuple
    if "GE" not in (from_other, to_other):
        new_rows = {"From": ["GE", from_other],
                    "To"  : [to_other, "GE"],
                    "Val" : [val, val]}
        keeper_rows = pandas.concat([keeper_rows, pandas.DataFrame(new_rows)],
                                    ignore_index=True)
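Repeatedly concatenating inside the loop copies keeper_rows on every iteration; the usual pattern is to collect the pieces in a list and concatenate once at the end. A sketch of that variant, using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({"From": ["GE", "GE", "VS", "VS", "VD", "VD"],
                   "To":   ["VD", "VS", "VD", "GE", "GE", "VS"],
                   "Val":  [1000, 1600, 1500, 600, 1200, 1300]})

pieces = []
for _, frm, to, val in df.itertuples():
    if "GE" in (frm, to):
        # Keep the row as-is.
        pieces.append(pd.DataFrame({"From": [frm], "To": [to], "Val": [val]}))
    else:
        # Split into two rows, one with GE at each end.
        pieces.append(pd.DataFrame({"From": ["GE", frm],
                                    "To":   [to, "GE"],
                                    "Val":  [val, val]}))

result = pd.concat(pieces, ignore_index=True)
```

This makes a single pass, and `pd.concat` is called exactly once.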
answered Oct 05 '22 by ely