Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: expanding DataFrame by number of observations in column

Stata has the function expand which adds rows to a database corresponding to values in a particular column. For example:

I have:

df = pd.DataFrame({"A":[1, 2, 3], 
                   "B":[3,4,5]})

   A  B
0  1  3
1  2  4
2  3  5

What I need:

df2 = pd.DataFrame({"A":[1, 2, 3, 2, 3, 3], 
                    "B":[3,4,5, 4, 5, 5]})

   A  B
0  1  3
1  2  4
2  3  5
3  2  4
4  3  5
6  3  5

The value in df.loc[0,'A'] is 1, so no additional row is added to the end of the DataFrame, since B=3 is only supposed to occur once.

The value in df.loc[1,'A'] is 2, so one observation was added to the end of the DataFrame, bringing the total occurrences of B=4 to 2.

The value in df.loc[2,'A'] is 3, so two observations were added to the end of the DataFrame, bringing the total occurrences of B=5 to 3.

I've scoured prior questions for something to get me started, but no luck. Any help is appreciated.

like image 484
measure_theory Avatar asked Jul 28 '17 18:07

measure_theory


1 Answers

There are a number of possibilities, all built around np.repeat:

def using_reindex(df):
    return df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)

def using_dictcomp(df):
    return  pd.DataFrame({col:np.repeat(df[col].values, df['A'], axis=0) 
                          for col in df})

def using_df_values(df):
    return pd.DataFrame(np.repeat(df.values, df['A'], axis=0), columns=df.columns)

def using_loc(df):
    return df.loc[np.repeat(df.index.values, df['A'])].reset_index(drop=True)

For example,

In [219]: df = pd.DataFrame({"A":[1, 2, 3], "B":[3,4,5]})
In [220]: df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)
Out[220]: 
   A  B
0  1  3
1  2  4
2  2  4
3  3  5
4  3  5
5  3  5

Here is a benchmark on a 1000-row DataFrame; the result being a roughly 500K-row DataFrame:

In [208]: df = make_dataframe(1000)

In [210]: %timeit using_dictcomp(df)
10 loops, best of 3: 23.6 ms per loop

In [218]: %timeit using_reindex(df)
10 loops, best of 3: 35.8 ms per loop

In [211]: %timeit using_df_values(df)
10 loops, best of 3: 31.3 ms per loop

In [212]: %timeit using_loc(df)
1 loop, best of 3: 275 ms per loop

This is the code I used to generate df:

import numpy as np
import pandas as pd

def make_dataframe(nrows=100):
    df = pd.DataFrame(
        {'A': np.arange(nrows),
         'float': np.random.randn(nrows),
         'str': np.random.choice('Lorem ipsum dolor sit'.split(), size=nrows),
         'datetime64': pd.date_range('20000101', periods=nrows)},
        index=pd.date_range('20000101', periods=nrows))
    return df

df = make_dataframe(1000)

If there are only a few columns, using_dictcomp is the fastest. But note that using_dictcomp assumes df has unique column names. The dict comprehension in using_dictcomp won't repeat duplicated column names. The other alternatives will work with repeated column names, however.

Both using_reindex and using_loc assume df has a unique index.


using_reindex came from cᴏʟᴅsᴘᴇᴇᴅ's using_loc, in an (unfortunately) now deleted post. cᴏʟᴅsᴘᴇᴇᴅ showed it wasn't necessary to manually repeat all the values -- you only need to repeat the index and then let df.loc (or df.reindex) repeat all the rows for you. It also avoids accessing df.values which could generate an intermediate NumPy array of object dtype if df contains columns of multiple dtypes.

like image 163
unutbu Avatar answered Oct 07 '22 01:10

unutbu