 

Add Pandas Series as rows to existing dataframe efficiently

Tags:

pandas

numpy

I have a large DataFrame, about 160k rows by 24 columns. I also have a pandas Series of length 26 that I would like to add row-wise to my DataFrame, producing a final DataFrame of 160k rows by 50 columns, but my code is painfully slow.

Specifically, this is slow, but it works: final = df.apply(lambda x: x.append(my_series), axis=1)

Which yields the correct final shape: Out[49]: (163008, 50)

Where, df.shape is Out[48]: (163008, 24) and my_series.shape is Out[47]: (26,)

This method performs fine for smaller dataframes in the <50k rows range, but clearly it is not ideal.
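For reference, a small synthetic reproduction of the slow row-wise approach (shapes reduced, and the column/variable names here are stand-ins, not the real data). Note that Series.append was removed in pandas 2.0, so this sketch uses pd.concat instead:

```python
import numpy as np
import pandas as pd

# Reduced-size stand-ins for the real data (names assumed)
df = pd.DataFrame(np.random.randn(1000, 24),
                  columns=[f'c{i}' for i in range(24)])
my_series = pd.Series(range(26), index=[f's{i}' for i in range(26)])

# apply with axis=1 builds a brand-new Series for every row,
# which is what makes this approach slow on large frames
final = df.apply(lambda x: pd.concat([x, my_series]), axis=1)
print(final.shape)  # (1000, 50)
```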

Update: Added Benchmarks For the Solutions Below

Did a few tests using %timeit with a test dataframe and a test series, with the following sizes: test_df.shape

Out[18]: (156108, 24)

test_series.shape

Out[20]: (26,)

Where both the data-frame and the series contain a mix of strings, floats, integers, objects, etc.

Accepted Solution Using Numpy:

%timeit test_df.join(pd.DataFrame(np.tile(test_series.values, len(test_df.index)).reshape(-1, len(test_series)), index=test_df.index, columns=test_series.index))

10 loops, best of 3: 220 ms per loop

Using assign: I keep receiving ValueError: Length of values does not match length of index with my test series, though it works with the simpler example series; not sure what is going on here.
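One possible cause (a guess, since the failing test series isn't shown): assign broadcasts scalar values to every row, but treats list-like values as full columns that must match the DataFrame's length. A series element that is itself list-like would trigger exactly this ValueError. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

# A scalar value is broadcast to every row
out = df.assign(X=10)
print(out)

# A list-like value must match the number of rows exactly
try:
    df.assign(Y=[1, 2])  # 2 values vs 3 rows
except ValueError as e:
    print(e)
```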

Using Custom Function by @Divakar

%timeit rowwise_concat_df_series(test_df, test_series)

1 loop, best of 3: 424 ms per loop

guy asked Jul 19 '17 12:07



2 Answers

We can use DataFrame.assign() method:

Setup:

In [37]: df = pd.DataFrame(np.random.randn(5, 3), columns=['A','B','C'])

In [38]: my_series = pd.Series([10,11,12], index=['X','Y','Z'])

In [39]: df
Out[39]:
          A         B         C
0  1.129066  0.975453 -0.737507
1 -0.347736 -1.469583 -0.727113
2  1.158480  0.933604 -1.219617
3 -0.689830  3.063868  0.345233
4  0.184248  0.920349 -0.852213

In [40]: my_series
Out[40]:
X    10
Y    11
Z    12
dtype: int64

Solution:

In [41]: df = df.assign(**my_series)

Result:

In [42]: df
Out[42]:
          A         B         C   X   Y   Z
0  1.129066  0.975453 -0.737507  10  11  12
1 -0.347736 -1.469583 -0.727113  10  11  12
2  1.158480  0.933604 -1.219617  10  11  12
3 -0.689830  3.063868  0.345233  10  11  12
4  0.184248  0.920349 -0.852213  10  11  12

NOTE: the series should have string index elements.

PS: the ** syntax unpacks the Series into keyword arguments (dictionary unpacking).
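To illustrate the NOTE above: since ** unpacking turns each index label into a keyword-argument name, non-string labels fail before assign even runs. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2]})
s = pd.Series([10, 11], index=[0, 1])  # integer index labels

try:
    df.assign(**s)
except TypeError as e:
    # integer labels aren't valid keyword names
    print(e)
```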

MaxU - stop WAR against UA answered Oct 10 '22 17:10


I think you need numpy.tile with numpy.ndarray.reshape to build a new DataFrame from the Series values, and then DataFrame.join:

df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})

print (df)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  b  5  8  3  3  a
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b
5  f  4  3  0  4  b

s = pd.Series([1,5,6,7], index=list('abcd'))
print (s)
a    1
b    5
c    6
d    7
dtype: int64

df1 = pd.DataFrame(np.tile(s.values, len(df.index)).reshape(-1,len(s)), 
                   index=df.index, 
                   columns=s.index)
print (df1)
   a  b  c  d
0  1  5  6  7
1  1  5  6  7
2  1  5  6  7
3  1  5  6  7
4  1  5  6  7
5  1  5  6  7

df = df.join(df1)
print (df)
   A  B  C  D  E  F  a  b  c  d
0  a  4  7  1  5  a  1  5  6  7
1  b  5  8  3  3  a  1  5  6  7
2  c  4  9  5  6  a  1  5  6  7
3  d  5  4  7  9  b  1  5  6  7
4  e  5  2  1  2  b  1  5  6  7
5  f  4  3  0  4  b  1  5  6  7
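As a variation (not from the answer above), np.broadcast_to can stand in for the tile + reshape step; it returns a read-only view of the repeated row rather than materializing a full copy before the DataFrame is built:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abcdef')})
s = pd.Series([1, 5, 6, 7], index=list('abcd'))

# broadcast_to yields a (len(df), len(s)) view of s.values;
# the DataFrame constructor then owns the data for the new frame
df1 = pd.DataFrame(np.broadcast_to(s.values, (len(df), len(s))),
                   index=df.index, columns=s.index)
out = df.join(df1)
print(out.shape)  # (6, 5)
```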
jezrael answered Oct 10 '22 18:10