I have a large DataFrame, about 160k rows by 24 columns. I also have a pandas Series of length 26 that I would like to append to every row of the DataFrame, producing a final DataFrame of 160k rows by 50 columns, but my code is painfully slow.
Specifically, this works but is slow:
final = df.apply(lambda x: x.append(my_series), axis=1)
This yields the correct final shape:
Out[49]: (163008, 50)
where df.shape is Out[48]: (163008, 24)
and my_series.shape is Out[47]: (26,)
This method performs fine for smaller dataframes in the <50k rows range, but clearly it is not ideal.
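For readers on pandas 2.0 or later, where Series.append no longer exists, a minimal sketch of the same row-wise approach could use pd.concat instead. The small df and my_series below are made-up stand-ins for the real data, and this is still the slow pattern, shown only so the original line can be reproduced:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real 163008 x 24 frame and length-26 series.
df = pd.DataFrame(np.arange(15, dtype=float).reshape(5, 3), columns=["A", "B", "C"])
my_series = pd.Series([10, 11, 12], index=["X", "Y", "Z"])

# Row-wise version: every row is concatenated with the series, one row at a time.
# pd.concat stands in for Series.append, which was removed in pandas 2.0.
final = df.apply(lambda row: pd.concat([row, my_series]), axis=1)

print(final.shape)
```

Because a new Series is built for every row, the cost grows with the row count, which is consistent with the slowdown described above.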
I ran a few tests using %timeit with a test DataFrame and a test Series of the following sizes:
test_df.shape
Out[18]: (156108, 24)
test_series.shape
Out[20]: (26,)
Both the DataFrame and the Series contain a mix of strings, floats, integers, objects, etc.
Accepted Solution Using Numpy:
%timeit test_df.join(pd.DataFrame(np.tile(test_series.values, len(test_df.index)).reshape(-1, len(test_series)), index=test_df.index, columns=test_series.index))
10 loops, best of 3: 220 ms per loop
Using assign:
I keep receiving ValueError: Length of values does not match length of index with my test series, although it works when I use the simpler series provided in the answer. I am not sure what is going on here.
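One possible cause, offered only as a guess since the real test series is not shown: if any value in the series is itself list-like (the series is said to contain objects), assign tries to spread that value down the column and fails when its length differs from the number of rows. A small sketch with made-up data reproduces the error and shows one possible workaround, which repeats each value once per row before assigning:

```python
import pandas as pd

# Hypothetical data: the "Y" entry is a list, i.e. a list-like object value.
df = pd.DataFrame({"A": [1, 2, 3]})
s = pd.Series({"X": 10, "Y": [7, 8]}, dtype=object)

# assign treats the 2-element list as a full column for a 3-row frame.
try:
    df.assign(**s)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Workaround sketch: build an explicit per-row Series for every value,
# so list-like values are stored one per cell instead of being spread out.
safe_columns = {
    key: pd.Series([value] * len(df), index=df.index) for key, value in s.items()
}
result = df.assign(**safe_columns)
print(result)
```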
Using Custom Function by @Divakar
%timeit rowwise_concat_df_series(test_df, test_series)
1 loop, best of 3: 424 ms per loop
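Outside IPython, a comparison like the one above can be approximated with the standard timeit module. The sketch below uses synthetic float-only data and two hypothetical helper names (join_tiled, assign_unpacked); the custom function by @Divakar is not reproduced because its body is not shown here, and timings will differ by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-ins, much smaller than the real 156108 x 24 test frame.
rng = np.random.default_rng(0)
test_df = pd.DataFrame(
    rng.standard_normal((2000, 24)), columns=[f"c{i}" for i in range(24)]
)
test_series = pd.Series(np.arange(26, dtype=float), index=[f"s{i}" for i in range(26)])


def join_tiled(df, s):
    """Repeat the series values once per row, then join on the index."""
    tiled = np.tile(s.values, len(df.index)).reshape(-1, len(s))
    return df.join(pd.DataFrame(tiled, index=df.index, columns=s.index))


def assign_unpacked(df, s):
    """Unpack the series into keyword arguments for DataFrame.assign."""
    return df.assign(**s)


for func in (join_tiled, assign_unpacked):
    seconds = timeit.timeit(lambda: func(test_df, test_series), number=5) / 5
    print(f"{func.__name__}: {seconds * 1000:.2f} ms per call")
```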
DataFrame.append() appends the rows of another DataFrame to the end of the given DataFrame and returns a new DataFrame object. Columns not in the original DataFrame are added as new columns, and the new cells are populated with NaN. Parameters: other : DataFrame or Series/dict-like object, or list of these. Note that DataFrame.append() and Series.append() were deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the replacement.
Vectorization is usually the first and best choice. When you do have to iterate, converting the DataFrame to a NumPy array or to a dictionary can speed up the loop. Iterating through the key-value pairs of a dictionary comes out fastest, with a reported speedup of around 280x for 20 million records.
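As a rough illustration of the two options just mentioned, the sketch below computes the same total once with a vectorized expression and once by looping over dictionary records. The column names and values are invented for the example, and no speedup figure is implied:

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"price": [2.0, 3.5, 1.25], "qty": [3, 2, 4]})

# Vectorized: one expression over whole columns.
vectorized_total = (df["price"] * df["qty"]).sum()

# Dictionary iteration: convert once, then loop over plain Python dicts.
records = df.to_dict("records")
looped_total = sum(rec["price"] * rec["qty"] for rec in records)

print(vectorized_total, looped_total)
```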
Rows represent the records (tuples) and columns represent the attributes. We can create a DataFrame with the pandas.DataFrame() constructor. We can also create a DataFrame from a dictionary, without passing columns or an index explicitly.
iterrows: According to the official documentation, iterrows() iterates "over the rows of a Pandas DataFrame as (index, Series) pairs". It converts each row into a Series object, which causes two problems:
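One well-known consequence of building a Series per row, which the pandas documentation also notes, is that dtypes are not preserved across the row. The sketch below shows this with made-up columns and contrasts it with itertuples; it is an illustration of that single effect, not a reconstruction of the list the passage above was leading into:

```python
import pandas as pd

# Made-up frame with one integer column and one float column.
df = pd.DataFrame({"n": [1, 2], "ratio": [0.5, 1.5]})

# iterrows packs the whole row into one Series, so the values share one dtype.
_, first_row = next(df.iterrows())
print(first_row.dtype)

# itertuples keeps each field separate, so the integer stays an integer.
first_tuple = next(df.itertuples(index=False))
print(type(first_tuple.n))
```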
(1) Convert a Single DataFrame Column into a Series
To start with a simple example, let’s create a DataFrame with a single column:
import pandas as pd
data = {'First_Name': ['Jeff','Tina','Ben','Maria','Rob']}
df = pd.DataFrame(data, columns=['First_Name'])
print(df)
print(type(df))
You can then use df.squeeze() to convert the single-column DataFrame into a Series. What if you have a DataFrame with multiple columns, and you’d like to convert a specific column into a Series?
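A short sketch of these conversions, using an invented two-column frame: squeeze() on a one-column selection, plain column selection for the multi-column case, and iloc for a single row. Each of the three results is a Series:

```python
import pandas as pd

# Made-up two-column frame for illustration.
df = pd.DataFrame({"First_Name": ["Jeff", "Tina", "Ben"], "Age": [31, 28, 45]})

# A one-column DataFrame squeezed down to a Series.
single_column = df[["First_Name"]].squeeze()

# With several columns, selecting one by name already returns a Series.
selected = df["Age"]

# A single row selected by position is also returned as a Series.
first_row = df.iloc[0]

print(type(single_column), type(selected), type(first_row))
```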
We can use the DataFrame.assign() method:
Setup:
In [37]: df = pd.DataFrame(np.random.randn(5, 3), columns=['A','B','C'])
In [38]: my_series = pd.Series([10,11,12], index=['X','Y','Z'])
In [39]: df
Out[39]:
          A         B         C
0  1.129066  0.975453 -0.737507
1 -0.347736 -1.469583 -0.727113
2  1.158480  0.933604 -1.219617
3 -0.689830  3.063868  0.345233
4  0.184248  0.920349 -0.852213
In [40]: my_series
Out[40]:
X    10
Y    11
Z    12
dtype: int64
Solution:
In [41]: df = df.assign(**my_series)
Result:
In [42]: df
Out[42]:
          A         B         C   X   Y   Z
0  1.129066  0.975453 -0.737507  10  11  12
1 -0.347736 -1.469583 -0.727113  10  11  12
2  1.158480  0.933604 -1.219617  10  11  12
3 -0.689830  3.063868  0.345233  10  11  12
4  0.184248  0.920349 -0.852213  10  11  12
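NOTE: the Series index elements must be strings, because they are passed to assign() as keyword argument names.
PS: **my_series unpacks the Series into keyword arguments, so the call above is equivalent to df.assign(X=10, Y=11, Z=12).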
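To illustrate the note about string index elements, here is a minimal sketch with an invented integer-indexed Series. Unpacking it fails because keyword names must be strings, and renaming the index with str is one possible way around that (the new columns are then named "0" and "1"):

```python
import pandas as pd

# Made-up frame and a Series whose index labels are integers, not strings.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})
int_indexed = pd.Series([10, 11], index=[0, 1])

# Unpacking integer labels as keyword names raises a TypeError.
try:
    df.assign(**int_indexed)
    failed = False
except TypeError:
    failed = True
print(failed)

# Converting the labels to strings first makes the unpacking valid.
result = df.assign(**int_indexed.rename(index=str))
print(result)
```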
I think you need numpy.tile with numpy.ndarray.reshape to build a new DataFrame from the Series values, and then join:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print(df)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  b  5  8  3  3  a
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b
5  f  4  3  0  4  b
s = pd.Series([1,5,6,7], index=list('abcd'))
print(s)
a    1
b    5
c    6
d    7
dtype: int64
df1 = pd.DataFrame(np.tile(s.values, len(df.index)).reshape(-1,len(s)),
                   index=df.index,
                   columns=s.index)
print(df1)
   a  b  c  d
0  1  5  6  7
1  1  5  6  7
2  1  5  6  7
3  1  5  6  7
4  1  5  6  7
5  1  5  6  7
df = df.join(df1)
print(df)
   A  B  C  D  E  F  a  b  c  d
0  a  4  7  1  5  a  1  5  6  7
1  b  5  8  3  3  a  1  5  6  7
2  c  4  9  5  6  a  1  5  6  7
3  d  5  4  7  9  b  1  5  6  7
4  e  5  2  1  2  b  1  5  6  7
5  f  4  3  0  4  b  1  5  6  7
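When the Series holds mixed types, as in the question, tiling its object-dtype values will typically leave every new column with object dtype. A possible variation, sketched here with made-up data and not part of the answer above, is to let the DataFrame constructor broadcast each scalar down the index so that each new column gets its own dtype:

```python
import pandas as pd

# Made-up frame and a mixed-type Series (object dtype).
df = pd.DataFrame({"A": list("abc"), "B": [4, 5, 6]})
s = pd.Series({"n": 1, "label": "x", "ratio": 2.5}, dtype=object)

# Each scalar in the dict is repeated for every label in the supplied index,
# and each resulting column is typed independently.
extra = pd.DataFrame(dict(s.items()), index=df.index)

# Column-wise concatenation aligned on the shared index.
result = pd.concat([df, extra], axis=1)
print(result.dtypes)
```

This assumes every value in the Series is a scalar; list-like values would need to be handled separately.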