 

Add Pandas Series as rows to existing dataframe efficiently

Tags:

pandas

numpy

I have a large DataFrame, about 160k rows by 24 columns. I also have a pandas Series of length 26 that I would like to add row-wise to my DataFrame, producing a final DataFrame of 160k rows by 50 columns, but my code is painfully slow.

Specifically, this is slow, but it works: final = df.apply(lambda x: x.append(my_series), axis=1)

Which yields the correct final shape: Out[49]: (163008, 50)

Where, df.shape is Out[48]: (163008, 24) and my_series.shape is Out[47]: (26,)

This method performs fine for smaller dataframes in the <50k rows range, but clearly it is not ideal.
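For reference, a small synthetic reproduction of the slow row-wise approach (shapes reduced, and the column/variable names here are stand-ins, not the real data). Note that Series.append was removed in pandas 2.0, so this sketch uses pd.concat instead:

```python
import numpy as np
import pandas as pd

# Reduced-size stand-ins for the real data (names assumed)
df = pd.DataFrame(np.random.randn(1000, 24),
                  columns=[f'c{i}' for i in range(24)])
my_series = pd.Series(range(26), index=[f's{i}' for i in range(26)])

# apply with axis=1 builds a brand-new Series for every row,
# which is what makes this approach slow on large frames
final = df.apply(lambda x: pd.concat([x, my_series]), axis=1)
print(final.shape)  # (1000, 50)
```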

Update: Added Benchmarks For the Solutions Below

Did a few tests using %timeit with a test dataframe and a test series, with the following sizes: test_df.shape

Out[18]: (156108, 24)

test_series.shape

Out[20]: (26,)

Where both the data-frame and the series contain a mix of strings, floats, integers, objects, etc.

Accepted Solution Using Numpy:

%timeit test_df.join(pd.DataFrame(np.tile(test_series.values, len(test_df.index)).reshape(-1, len(test_series)), index=test_df.index, columns=test_series.index))

10 loops, best of 3: 220 ms per loop

Using assign: I keep receiving ValueError: Length of values does not match length of index with my test series, though it works with the simpler example series; not sure what is going on here.
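One possible cause (a guess, since the failing test series isn't shown): assign broadcasts scalar values to every row, but treats list-like values as full columns that must match the DataFrame's length. A series element that is itself list-like would trigger exactly this ValueError. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

# A scalar value is broadcast to every row
out = df.assign(X=10)
print(out)

# A list-like value must match the number of rows exactly
try:
    df.assign(Y=[1, 2])  # 2 values vs 3 rows
except ValueError as e:
    print(e)
```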

Using Custom Function by @Divakar

%timeit rowwise_concat_df_series(test_df, test_series)

1 loop, best of 3: 424 ms per loop

guy asked Jul 19 '17 12:07



2 Answers

We can use DataFrame.assign() method:

Setup:

In [37]: df = pd.DataFrame(np.random.randn(5, 3), columns=['A','B','C'])

In [38]: my_series = pd.Series([10,11,12], index=['X','Y','Z'])

In [39]: df
Out[39]:
          A         B         C
0  1.129066  0.975453 -0.737507
1 -0.347736 -1.469583 -0.727113
2  1.158480  0.933604 -1.219617
3 -0.689830  3.063868  0.345233
4  0.184248  0.920349 -0.852213

In [40]: my_series
Out[40]:
X    10
Y    11
Z    12
dtype: int64

Solution:

In [41]: df = df.assign(**my_series)

Result:

In [42]: df
Out[42]:
          A         B         C   X   Y   Z
0  1.129066  0.975453 -0.737507  10  11  12
1 -0.347736 -1.469583 -0.727113  10  11  12
2  1.158480  0.933604 -1.219617  10  11  12
3 -0.689830  3.063868  0.345233  10  11  12
4  0.184248  0.920349 -0.852213  10  11  12

NOTE: the series should have string index elements.

PS: the ** syntax unpacks the Series into keyword arguments (dictionary unpacking).
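To illustrate the NOTE above: since ** unpacking turns each index label into a keyword-argument name, non-string labels fail before assign even runs. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2]})
s = pd.Series([10, 11], index=[0, 1])  # integer index labels

try:
    df.assign(**s)
except TypeError as e:
    # integer labels aren't valid keyword names
    print(e)
```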

MaxU - stop WAR against UA answered Oct 10 '22 17:10


I think you need numpy.tile with numpy.ndarray.reshape to build a new DataFrame from the Series values, and then DataFrame.join:

df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})

print (df)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  b  5  8  3  3  a
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b
5  f  4  3  0  4  b

s = pd.Series([1,5,6,7], index=list('abcd'))
print (s)
a    1
b    5
c    6
d    7
dtype: int64

df1 = pd.DataFrame(np.tile(s.values, len(df.index)).reshape(-1,len(s)), 
                   index=df.index, 
                   columns=s.index)
print (df1)
   a  b  c  d
0  1  5  6  7
1  1  5  6  7
2  1  5  6  7
3  1  5  6  7
4  1  5  6  7
5  1  5  6  7

df = df.join(df1)
print (df)
   A  B  C  D  E  F  a  b  c  d
0  a  4  7  1  5  a  1  5  6  7
1  b  5  8  3  3  a  1  5  6  7
2  c  4  9  5  6  a  1  5  6  7
3  d  5  4  7  9  b  1  5  6  7
4  e  5  2  1  2  b  1  5  6  7
5  f  4  3  0  4  b  1  5  6  7
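As a variation (not from the answer above), np.broadcast_to can stand in for the tile + reshape step; it returns a read-only view of the repeated row rather than materializing a full copy before the DataFrame is built:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abcdef')})
s = pd.Series([1, 5, 6, 7], index=list('abcd'))

# broadcast_to yields a (len(df), len(s)) view of s.values;
# the DataFrame constructor then owns the data for the new frame
df1 = pd.DataFrame(np.broadcast_to(s.values, (len(df), len(s))),
                   index=df.index, columns=s.index)
out = df.join(df1)
print(out.shape)  # (6, 5)
```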
jezrael answered Oct 10 '22 18:10