Pandas Concat increases number of rows

I'm concatenating two DataFrames side by side, so that the columns of one are placed next to the columns of the other. Before that, I applied a transformation to the initial DataFrame:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
real_data = pd.DataFrame(scaler.fit_transform(df[real_columns]), columns=real_columns)

Then I concatenate:

categorial_data = pd.get_dummies(df[categor_columns], prefix_sep='__')
train = pd.concat([real_data, categorial_data], axis=1, ignore_index=True)

I don't know why, but the number of rows increased:

print(df.shape, real_data.shape, categorial_data.shape, train.shape)
(1700645, 23) (1700645, 16) (1700645, 130) (1703915, 146)

What happened, and how can I fix the problem?

As you can see, the number of columns in train equals the sum of the columns in real_data and categorial_data.

Rocketq asked May 16 '18 10:05


1 Answer

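The likely cause is an index mismatch. pd.concat with axis=1 aligns rows on their index labels, not on their position. real_data is built from the NumPy array returned by scaler.fit_transform, so it gets a fresh default index (0, 1, 2, ...), while categorial_data keeps the original index of df. If df no longer has a default index (for example, because rows were filtered out earlier), the labels only partly overlap, and concat returns the union of both indexes, filling the gaps with NaN. Note that ignore_index=True with axis=1 renumbers the columns, not the rows. So calling df = df.reset_index(drop=True) before the transformations, or passing index=df.index when creating real_data, will solve your problem.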
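
A minimal sketch of how the extra rows can appear, assuming the original df has a non-default index (the question does not show it). The toy data and column names are hypothetical, and a plain division stands in for MinMaxScaler so that the example needs only pandas:

```python
import pandas as pd

# Hypothetical toy frame; the index (2, 3, 5) mimics a frame whose rows
# were filtered earlier without resetting the index.
df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "color": ["red", "blue", "red"]},
    index=[2, 3, 5],
)

# Wrapping a bare NumPy array (as a scaler would return) drops df's index
# and assigns a fresh default index 0..2. The division is a stand-in scaler.
scaled = df[["x"]].to_numpy() / 3.0
real_data = pd.DataFrame(scaled, columns=["x"])

# get_dummies keeps df's original index (2, 3, 5).
categorial_data = pd.get_dummies(df[["color"]], prefix_sep="__")

# concat(axis=1) aligns on index labels, so the result covers the union
# of both indexes and contains NaN where a label is missing on one side.
misaligned = pd.concat([real_data, categorial_data], axis=1)
print(misaligned.shape)

# Option A: reuse df's index when wrapping the array.
real_fixed = pd.DataFrame(scaled, columns=["x"], index=df.index)
train_a = pd.concat([real_fixed, categorial_data], axis=1)

# Option B: reset the row index on both frames before concatenating.
train_b = pd.concat(
    [real_data.reset_index(drop=True), categorial_data.reset_index(drop=True)],
    axis=1,
)
print(train_a.shape, train_b.shape)
```

Option A keeps the original row labels, which helps if train is later joined back to df; Option B discards them and relies purely on row order.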

saket ram answered Oct 12 '22 23:10