Stata has the function expand which adds rows to a database corresponding to values in a particular column. For example:
I have:
df = pd.DataFrame({"A":[1, 2, 3],
"B":[3,4,5]})
A B
0 1 3
1 2 4
2 3 5
What I need:
df2 = pd.DataFrame({"A":[1, 2, 3, 2, 3, 3],
"B":[3,4,5, 4, 5, 5]})
A B
0 1 3
1 2 4
2 3 5
3 2 4
4 3 5
6 3 5
The value in df.loc[0,'A'] is 1, so no additional row is added to the end of the DataFrame, since B=3 is only supposed to occur once.
The value in df.loc[1,'A'] is 2, so one observation was added to the end of the DataFrame, bringing the total occurrences of B=4 to 2.
The value in df.loc[2,'A'] is 3, so two observations were added to the end of the DataFrame, bringing the total occurrences of B=5 to 3.
I've scoured prior questions for something to get me started, but no luck. Any help is appreciated.
There are a number of possibilities, all built around np.repeat
:
def using_reindex(df):
return df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)
def using_dictcomp(df):
return pd.DataFrame({col:np.repeat(df[col].values, df['A'], axis=0)
for col in df})
def using_df_values(df):
return pd.DataFrame(np.repeat(df.values, df['A'], axis=0), columns=df.columns)
def using_loc(df):
return df.loc[np.repeat(df.index.values, df['A'])].reset_index(drop=True)
For example,
In [219]: df = pd.DataFrame({"A":[1, 2, 3], "B":[3,4,5]})
In [220]: df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)
Out[220]:
A B
0 1 3
1 2 4
2 2 4
3 3 5
4 3 5
5 3 5
Here is a benchmark on a 1000-row DataFrame; the result being a roughly 500K-row DataFrame:
In [208]: df = make_dataframe(1000)
In [210]: %timeit using_dictcomp(df)
10 loops, best of 3: 23.6 ms per loop
In [218]: %timeit using_reindex(df)
10 loops, best of 3: 35.8 ms per loop
In [211]: %timeit using_df_values(df)
10 loops, best of 3: 31.3 ms per loop
In [212]: %timeit using_loc(df)
1 loop, best of 3: 275 ms per loop
This is the code I used to generate df
:
import numpy as np
import pandas as pd
def make_dataframe(nrows=100):
df = pd.DataFrame(
{'A': np.arange(nrows),
'float': np.random.randn(nrows),
'str': np.random.choice('Lorem ipsum dolor sit'.split(), size=nrows),
'datetime64': pd.date_range('20000101', periods=nrows)},
index=pd.date_range('20000101', periods=nrows))
return df
df = make_dataframe(1000)
If there are only a few columns, using_dictcomp
is the fastest. But note that using_dictcomp
assumes df
has unique column names. The dict comprehension in using_dictcomp
won't repeat duplicated column names. The other alternatives will work with repeated column names, however.
Both using_reindex
and using_loc
assume df
has a unique index.
using_reindex
came from cᴏʟᴅsᴘᴇᴇᴅ's using_loc
, in an (unfortunately) now
deleted post. cᴏʟᴅsᴘᴇᴇᴅ showed it wasn't necessary to manually repeat all the values -- you only need to repeat the index and then let df.loc
(or df.reindex
) repeat all the rows for you. It also avoids accessing df.values
which could generate an intermediate NumPy array of object
dtype if df
contains columns of multiple dtypes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With