I have a pandas DataFrame looking like this:
From To Val
GE VD 1000
GE VS 1600
VS VD 1500
VS GE 600
VD GE 1200
VD VS 1300
I would like to replace every row that has "GE" in neither the "from" nor the "to" column with two rows: one with "GE" in the "from" column and one with "GE" in the "to" column.
In the example above, I would replace the third line by the following two lines:
GE VD 1500
VS GE 1500
I tried using "apply" but I can't figure out how to return a correct data frame. For example
def myfun(row):
    if "GE" not in (row["from"], row["to"]):
        row1=pd.DataFrame(row).T
        row2=row1.copy()
        row1["from"]="GE"
        row2["to"]="GE"
        return pd.concat([row1, row2])
    else:
        return pd.DataFrame(row).T
Gives a strange result:
>> df.apply(myfun, axis=1)
Val from to
0 Val from to
1 Val from to
2 Val from to
3 Val from to
4 Val from to
5 Val from to
Although my function seems correct:
>> myfun(df.loc[5])
Val from to
5 13 GE VD
5 13 VS GE
I can think of a way of doing it by filtering my dataframe in two sub dataframes, one with rows needing duplication and one with the others. Then duplicating the first dataframe, making the changes and collating all three DF together. But it's ugly. Can anyone suggest a more elegant way?
In other words, can the applied function return a DataFrame, as in R we would do with ddply?
Thanks
Filtering:
In [153]: sub = df[(~df[['From', 'To']].isin(['GE'])).all(1)]
In [154]: sub
Out[154]:
From To Val
2 VS VD 1500
5 VD VS 1300
[2 rows x 3 columns]
In [179]: good = df.loc[df.index.difference(sub.index)]
In [180]: good
Out[180]:
From To Val
0 GE VD 1000
1 GE VS 1600
3 VS GE 600
4 VD GE 1200
[4 rows x 3 columns]
Define a function that gives the desired values as a DataFrame:
def new_df(row):
    return pd.DataFrame({"From": ["GE", row["From"]],
                         "To": [row["To"], "GE"],
                         "Val": [row["Val"], row["Val"]]})
Apply that function to the rows:
In [181]: new = pd.concat([new_df(y) for _, y in sub.iterrows()], axis=0, ignore_index=True)
In [182]: new
Out[182]:
From To Val
0 GE VD 1500
1 VS GE 1500
2 GE VS 1300
3 VD GE 1300
[4 rows x 3 columns]
And concatenate everything together:
In [183]: pd.concat([good, new], axis=0, ignore_index=True)
Out[183]:
From To Val
0 GE VD 1000
1 GE VS 1600
2 VS GE 600
3 VD GE 1200
4 GE VD 1500
5 VS GE 1500
6 GE VS 1300
7 VD GE 1300
[8 rows x 3 columns]
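The filter, expand and concat steps above can also be sketched as one function that avoids iterating over rows. This is only a sketch under some assumptions: the helper name expand_non_ge and its hub parameter are made up for illustration, and the columns are assumed to be named From, To and Val as in the sample data.

```python
import pandas as pd

def expand_non_ge(df, hub="GE"):
    """Hypothetical helper: replace rows without `hub` by two rows that use it."""
    # True for rows that already have the hub in From or To.
    has_hub = df[["From", "To"]].eq(hub).any(axis=1)
    good = df[has_hub]
    sub = df[~has_hub]

    # One copy with the hub as origin, one copy with the hub as destination.
    from_hub = sub.assign(From=hub)
    to_hub = sub.assign(To=hub)

    # A stable sort on the (duplicated) index keeps each pair of
    # replacement rows next to each other, origin copy first.
    new = pd.concat([from_hub, to_hub]).sort_index(kind="mergesort")
    return pd.concat([good, new], ignore_index=True)

df = pd.DataFrame({"From": ["GE", "GE", "VS", "VS", "VD", "VD"],
                   "To":   ["VD", "VS", "VD", "GE", "GE", "VS"],
                   "Val":  [1000, 1600, 1500, 600, 1200, 1300]})
result = expand_non_ge(df)
print(result)
```

On the sample data this should give the same eight rows, in the same order, as the concatenated output above.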
This uses two passes over the data. It could be shortened by adding an else branch that concatenates the rows to be kept unchanged. However, I find this more readable, and since we're using itertuples to go over the rows, the iteration cost is linear and each tuple is formed only as needed (rather than building one big list of tuples for all rows at once).
Similarly, you could pop a row inside the if statement and concatenate the two new rows in its place back onto the original data object df, so that you don't incur the memory cost of creating keeper_rows. It's just not usually worth making these kinds of optimizations for a task like this unless the DataFrame is gigantic.
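import pandas

# Rows that already contain "GE" in From or To are kept unchanged.
# index=False drops the index from each tuple, so x[0:2] is (From, To).
keeper_rows = df.iloc[[i for i, x in enumerate(df.itertuples(index=False)) if "GE" in x[0:2]]]

for row_as_tuple in df.itertuples(index=False):
    from_other, to_other, val = row_as_tuple
    if "GE" not in (from_other, to_other):
        # Replace the row by one GE -> To row and one From -> GE row.
        new_rows = {"From": ["GE", from_other],
                    "To":   [to_other, "GE"],
                    "Val":  [val, val]}
        keeper_rows = pandas.concat([keeper_rows, pandas.DataFrame(new_rows)],
                                    ignore_index=True)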
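For comparison, here is a rough sketch of the single-pass variant with an else branch mentioned above. It is not the code from this answer: it collects plain tuples in a list (the name pieces is just illustrative) and builds the DataFrame once at the end instead of calling concat inside the loop. It assumes the columns are From, To and Val, in that order.

```python
import pandas as pd

df = pd.DataFrame({"From": ["GE", "GE", "VS", "VS", "VD", "VD"],
                   "To":   ["VD", "VS", "VD", "GE", "GE", "VS"],
                   "Val":  [1000, 1600, 1500, 600, 1200, 1300]})

pieces = []
for from_other, to_other, val in df.itertuples(index=False):
    if "GE" not in (from_other, to_other):
        # Replace the row by one GE -> To row and one From -> GE row.
        pieces.append(("GE", to_other, val))
        pieces.append((from_other, "GE", val))
    else:
        # Row already involves GE: keep it unchanged.
        pieces.append((from_other, to_other, val))

# Build the result once, reusing the original column names.
result = pd.DataFrame(pieces, columns=df.columns)
print(result)
```

Note that this version puts the two replacement rows where the original row was, rather than appending them after the kept rows, so the row order differs from the loop above.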