I have two DataFrames: a large one with a lot of missing values, and a second one holding the data to fill those missing values in the first.
Dataframe examples:
In[34]:
import pandas as pd
import numpy as np
df2 = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [1, 0, 1, 1, 0, 0]})
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5], 'B1': [1, np.nan, np.nan, 8, 9, 1],'B2':[1, np.nan, np.nan, 7, 6, 1], 'B3':[1, np.nan, np.nan, 8, 7, 1] })
df=df.set_index(['A'])
df2=df2.set_index(['A'])
In[35]:
df
Out[35]:
    B1   B2   B3
A
0    1    1    1
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3    8    7    8
4    9    6    7
5    1    1    1
In[36]:
df2
Out[36]:
   B
A
1  1
1  0
1  1
2  1
2  0
2  0
What I want to do is fill df using the data from df2, where the first occurrence of an index value in df2 fills B1, the second fills B2, and the third fills B3. The desired output:
In[38]:
df
Out[38]:
   B1  B2  B3
A
0   1   1   1
1   1   0   1
2   1   0   0
3   8   7   8
4   9   6   7
5   1   1   1
The NaNs in B1, B2 and B3 for index 1 and 2 have been filled with the data from df2: 1 0 1 for index 1 and 1 0 0 for index 2. Below is my inefficient for-loop implementation:
In[37]:
count=1
seen=[]
for t in range(0, len(df2)):
    if df2.index[t] not in seen:
        count=1
        seen.append(df2.index[t])
    else:
        count=count+1
    tofill=pd.DataFrame(df2.iloc[t]).transpose()
    tofill_dict={"B"+str(count):tofill.B}
    df=df.fillna(value=tofill_dict)
This works, but on larger datasets it can take a significant amount of time. Is there a way to do this faster? I have heard vectorization could work; how would you implement it? Are there any other ways to speed this up?
First, do not reset the index of df2; the code below groups on it. You can try groupby on that index, transpose each group with T, and then fillna df with the reshaped values of df2:
import pandas as pd
import numpy as np
df2 = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [1, 0, 1, 1, 0, 0]})
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5], 'B1': [1, np.nan, np.nan, 8, 9, 1],'B2':[1, np.nan, np.nan, 7, 6, 1], 'B3':[1, np.nan, np.nan, 8, 7, 1] })
df=df.set_index(['A'])
df2=df2.set_index(['A'])
print(df)
    B1   B2   B3
A
0    1    1    1
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3    8    7    8
4    9    6    7
5    1    1    1
print(df2)
   B
A
1  1
1  0
1  1
2  1
2  0
2  0
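# Turn each group's B values into one row; reset_index(drop=True) numbers them 0, 1, 2
df2 = df2.groupby(df2.index).apply(lambda x: x.B.reset_index(drop=True).T)
# Rename the positional columns to match df (B1, B2, B3)
df2.columns = df.columns
print(df2)
   B1  B2  B3
A
1   1   0   1
2   1   0   0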
# Fill the NaNs in df with the values at the same index and column in df2
df = df.fillna(df2)
print(df)
   B1  B2  B3
A
0   1   1   1
1   1   0   1
2   1   0   0
3   8   7   8
4   9   6   7
5   1   1   1
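If the per-group apply is still slow on a large df2, a different technique that may help is to number the rows inside each A group with cumcount and reshape with unstack, which avoids calling a Python function per group. This is only a sketch of that alternative, not the approach used above; the names pos, wide and filled are made up for illustration, and it assumes no A value occurs in df2 more often than df has B columns.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [1, 0, 1, 1, 0, 0]})
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5],
                   'B1': [1, np.nan, np.nan, 8, 9, 1],
                   'B2': [1, np.nan, np.nan, 7, 6, 1],
                   'B3': [1, np.nan, np.nan, 8, 7, 1]}).set_index('A')

# Position of each row within its A group: 0, 1, 2, ...
pos = df2.groupby('A').cumcount()

# One row per A value, one column per position (hypothetical name: wide)
wide = df2.set_index(['A', pos])['B'].unstack()

# Map positions 0, 1, 2 to the column names B1, B2, B3
wide.columns = ['B' + str(i + 1) for i in wide.columns]

# Fill only the NaNs in df, aligned on index and column labels
filled = df.fillna(wide)
print(filled)
```

Here A is left as an ordinary column of df2 so that it can be used both for the grouping and as the first level of the temporary index.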
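If df = df.fillna(df2) doesn't work, you can use df = df.combine_first(df2) instead. Which one fits depends on the index: fillna only fills rows already present in df, while combine_first also adds rows from df2 whose index values are missing from df.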
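To see how the index affects the choice, here is a small toy comparison. The frames left and right are invented for this illustration and are not part of the question's data.

```python
import numpy as np
import pandas as pd

# Toy frames: right has an extra index value (3) that left lacks
left = pd.DataFrame({'B1': [np.nan, 5.0]}, index=[1, 2])
right = pd.DataFrame({'B1': [10.0, 20.0, 30.0]}, index=[1, 2, 3])

# fillna keeps left's index and only replaces its NaNs
via_fillna = left.fillna(right)

# combine_first uses the union of both indexes, so row 3 is added
via_combine = left.combine_first(right)

print(via_fillna)
print(via_combine)
```

In both results the existing value 5.0 at index 2 is kept; only the missing entry is taken from right.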