
How can I concat multiple dataframes in Python? [duplicate]

I have multiple (more than 100) dataframes. How can I concat all of them?

The problem is that I have too many dataframes to write them into a list by hand, like this:

    >>> cluster_1 = pd.DataFrame([['a', 1], ['b', 2]],
    ...                          columns=['letter', 'number'])
    >>> cluster_1
      letter  number
    0      a       1
    1      b       2
    >>> cluster_2 = pd.DataFrame([['c', 3], ['d', 4]],
    ...                          columns=['letter', 'number'])
    >>> cluster_2
      letter  number
    0      c       3
    1      d       4
    >>> pd.concat([cluster_1, cluster_2])
      letter  number
    0      a       1
    1      b       2
    0      c       3
    1      d       4

The names of my N dataframes are cluster_1, cluster_2, cluster_3,..., cluster_N. The number N can be very high.

How can I concat N dataframes?

PParker asked Dec 21 '18 00:12



2 Answers

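I think you can just put them into a list and then concat the list. In pandas, chunked reading (presumably what I mean by "the chunk function" is the chunksize option of readers such as pd.read_csv) already hands you the data as a sequence of dataframes, and I personally concat them this way when I use it.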

    pdList = [df1, df2, ...]  # List of your dataframes
    new_df = pd.concat(pdList)
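As a hedged illustration of the chunked-reading remark: assuming "the chunk function" refers to the chunksize option of pd.read_csv (the answer does not name it), the sketch below reads a small in-memory CSV in pieces and concatenates the pieces. The CSV text and the variable names other than pdList and new_df are made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV data, kept in memory so the example is self-contained.
csv_text = "letter,number\na,1\nb,2\nc,3\nd,4\ne,5\n"

# With chunksize set, read_csv returns an iterator of dataframes
# (here at most 2 rows each) instead of one dataframe.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# Collect the chunks into a list, then concatenate the list.
pdList = list(chunks)
new_df = pd.concat(pdList)

print(len(pdList))   # number of chunks read
print(new_df.shape)  # (rows, columns) of the combined dataframe
```

pd.concat accepts any iterable of dataframes, so the list step is optional; it is kept here to mirror the pdList pattern above.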

To create pdList automatically, assuming the names of your dfs always start with "cluster_":

    pdList = []
    pdList.extend(value for name, value in locals().items() if name.startswith('cluster_'))
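A runnable sketch of the same idea follows, with two additions that go beyond the answer and are my own assumptions: the helper name collect_frames is hypothetical, and it orders the frames by their numeric suffix so that cluster_10 lands after cluster_2 rather than in whatever order the variables were defined. It also skips any variable that merely starts with the prefix but is not a dataframe.

```python
import re

import pandas as pd


def collect_frames(namespace, prefix='cluster_'):
    """Return dataframes named <prefix><integer>, ordered by that integer.

    `namespace` is a mapping of variable names to objects, e.g. locals().
    """
    pattern = re.compile(re.escape(prefix) + r'(\d+)')
    numbered = []
    # Iterate over a copy so the mapping cannot change size mid-loop.
    for name, value in list(namespace.items()):
        match = pattern.fullmatch(name)
        if match and isinstance(value, pd.DataFrame):
            numbered.append((int(match.group(1)), value))
    # Sort on the number only; dataframes themselves are not comparable.
    numbered.sort(key=lambda pair: pair[0])
    return [frame for _, frame in numbered]


# Sample frames, deliberately defined out of numeric order.
cluster_10 = pd.DataFrame([['e', 5]], columns=['letter', 'number'])
cluster_1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
cluster_2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])

pdList = collect_frames(locals())
new_df = pd.concat(pdList)
print(new_df)
```

If you control the code that creates the frames, appending each one to a list or storing it in a dict as it is created avoids the name lookup altogether.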
Rui Nian answered Oct 21 '22 21:10


Generally it goes like:

    frames = [df1, df2, df3]
    result = pd.concat(frames)

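Note: It will not reset the index automatically. By default concat keeps each frame's original index labels, so they can repeat (0, 1, 0, 1 in the question's output); pass ignore_index=True if you want a fresh 0 to n-1 index. Read more details on different types of merging here.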
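The index behaviour can be checked with a short sketch: by default pd.concat keeps each frame's own index labels, and ignore_index=True renumbers the rows. The frames df1 and df2 and their values are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'letter': ['a', 'b'], 'number': [1, 2]})
df2 = pd.DataFrame({'letter': ['c', 'd'], 'number': [3, 4]})
frames = [df1, df2]

# Default: original index labels are kept, so they repeat.
kept = pd.concat(frames)

# ignore_index=True discards the old labels and numbers rows from 0.
renumbered = pd.concat(frames, ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```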

For a large number of data frames: if you have hundreds of data frames, you can still build the list ("frames" in the code snippet) with a for loop, whether the frames are on disk or in memory. If they are on disk, the easy way is to save all the df's in a single folder and then read every file from that folder.

If you are generating the df's in memory, maybe try saving each one as a .pkl file first.
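A minimal sketch of the folder approach, under stated assumptions: the temporary folder, the cluster_<n>.pkl file naming, and the sample data are all invented for the example, and the files are sorted by their numeric suffix so that cluster_10 would come after cluster_2.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical setup: write three small dataframes as cluster_<n>.pkl
# into a temporary folder, standing in for frames generated earlier.
folder = Path(tempfile.mkdtemp())
for n in range(1, 4):
    df = pd.DataFrame({'cluster': [n, n], 'value': [n * 10, n * 10 + 1]})
    df.to_pickle(folder / f'cluster_{n}.pkl')

# Find every matching file and order the paths by the number in the name.
paths = sorted(folder.glob('cluster_*.pkl'),
               key=lambda p: int(p.stem.split('_')[1]))

# Build the list of frames by reading each file, then concatenate once.
frames = [pd.read_pickle(p) for p in paths]
result = pd.concat(frames, ignore_index=True)

print(result.shape)
```

Calling pd.concat once on the full list is generally cheaper than concatenating inside the loop, since each concat copies the data accumulated so far. Only unpickle files you created yourself or otherwise trust.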

zafrin answered Oct 21 '22 22:10