I'm concatenating two dataframes side by side, so that one dataframe ends up next to the other. But first I applied a transformation to the initial dataframe:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
real_data = pd.DataFrame(scaler.fit_transform(df[real_columns]), columns=real_columns)
And then concatenate:
categorial_data = pd.get_dummies(df[categor_columns], prefix_sep='__')
train = pd.concat([real_data, categorial_data], axis=1, ignore_index=True)
I don't know why, but the number of rows increased:
print(df.shape, real_data.shape, categorial_data.shape, train.shape)
(1700645, 23) (1700645, 16) (1700645, 130) (1703915, 146)
What happened, and how do I fix the problem?
As you can see, the number of columns of train equals the sum of the columns of real_data and categorial_data, so the column-wise concatenation itself worked; only the row count is wrong.
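Here is a minimal sketch that reproduces the same symptom with toy data (the frames and column names are made up for illustration):

import pandas as pd

# left has a fresh default RangeIndex: 0, 1, 2
left = pd.DataFrame({'a': [1.0, 2.0, 3.0]})

# right kept the index of a frame whose rows were filtered earlier: 0, 2, 4
right = pd.DataFrame({'b': [10, 20, 30]}, index=[0, 2, 4])

# concat with axis=1 aligns on index labels, so the result covers the union {0, 1, 2, 4}
out = pd.concat([left, right], axis=1)
print(out.shape)  # (4, 2) -- more rows than either input, NaN where labels are missing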
Time is of the essence, so which one is faster? In one benchmark, concatenating multiple dataframes with pandas.concat was about 50 times faster than building the same result with DataFrame.append.
merge() combines data on common columns or indices; join() combines data on a key column or an index; concat() combines DataFrames across rows or columns.
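A quick sketch of the difference on toy frames (the names and data are made up):

import pandas as pd

users = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})
orders = pd.DataFrame({'id': [1, 1, 2], 'total': [5, 7, 3]})

# merge(): match rows on a common column
merged = users.merge(orders, on='id')

# join(): match rows on the other frame's index
joined = users.set_index('id').join(orders.set_index('id'))

# concat(): stack frames along an axis (rows here, columns with axis=1)
stacked = pd.concat([users, users], ignore_index=True)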
concat() finishes the whole job in a single operation, which makes it faster than append(). Because each append() call copies all the data accumulated so far, appending is only acceptable when the second dataframe is very small, so that only a few appends are needed.
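A sketch of the usual pattern behind that speed-up: collect the pieces in a list and call concat once instead of growing a dataframe step by step (DataFrame.append itself was deprecated in pandas 1.4 and removed in 2.0):

import pandas as pd

pieces = [pd.DataFrame({'x': range(3)}) for _ in range(100)]

# slow: each step copies everything accumulated so far,
# which is what repeated append() calls effectively did
slow = pd.DataFrame()
for piece in pieces:
    slow = pd.concat([slow, piece], ignore_index=True)

# fast: a single concat over the whole list, one copy
fast = pd.concat(pieces, ignore_index=True)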
The problem is that pd.concat(..., axis=1) aligns rows by index label, not by position. real_data was built from the NumPy array returned by fit_transform, so it got a fresh RangeIndex (0, 1, 2, ...), while categorial_data kept the original index of df. If df's index is not a clean 0..n-1 range (for example because rows were filtered out earlier), the two indexes don't fully overlap, and concat keeps the union of both, padding the missing side with NaN; that is where the extra 3,270 rows come from. Resetting the indexes with df.reset_index(drop=True) before concatenating solves the problem. Also note that ignore_index=True together with axis=1 renumbers the columns (not the rows), so drop it if you want to keep the column names.
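Applied to the code above, a minimal sketch of the fix, reusing the question's df, real_columns and categor_columns (either option alone is enough):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Option 1: build real_data with df's own index so the labels already match
real_data = pd.DataFrame(scaler.fit_transform(df[real_columns]),
                         columns=real_columns, index=df.index)

categorial_data = pd.get_dummies(df[categor_columns], prefix_sep='__')

# Option 2: reset both indexes to a plain 0..n-1 range before concatenating
train = pd.concat([real_data.reset_index(drop=True),
                   categorial_data.reset_index(drop=True)], axis=1)

print(df.shape[0] == train.shape[0])  # True: the row count is preserved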