Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python pandas merge multiple csv files

I have around 600 csv file datasets, all have the very same column names [‘DateTime’, ‘Actual’, ‘Consensus’, ‘Previous’, ‘Revised’], all economic indicators and all-time series data sets.

the aim is to merge them all together in one csv file.

With ‘DateTime’ as an index.

The way I wanted this file to indexed in is the time line way which means let’s say the first event in the first csv dated in 12/18/2017 10:00:00 and first event in the second csv dated in 12/29/2017 09:00:00 and first event in the third csv dated in 12/20/2017 09:00:00.

So, I want to index them the later first and the newer after it, etc. despite the source csv it originally from.

I tried to merge just 3 of them as an experiment and the problem is the ‘DateTime’ because it prints the 3 of them together like this ('12/18/2017 10:00:00', '12/29/2017 09:00:00', '12/20/2017 09:00:00') Here is the code:

import pandas as pd


df1 = pd.read_csv("E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv")
df2 = pd.read_csv("E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv")
df3 = pd.read_csv("E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv")

df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.set_index('DateTime', inplace=True)

print(df.head())
df.to_csv('df.csv')
like image 451
Sayed Gouda Avatar asked Jan 01 '18 15:01

Sayed Gouda


People also ask

How do I merge multiple CSV files in pandas?

To merge all CSV files, use the GLOB module. The os. path. join() method is used inside the concat() to merge the CSV files together.

How do I merge two datasets in pandas?

Pandas DataFrame merge() function is used to merge two DataFrame objects with a database-style join operation. The joining is performed on columns or indexes. If the joining is done on columns, indexes are ignored. This function returns a new DataFrame and the source DataFrame objects are unchanged.


2 Answers

Consider using read_csv() args, index_col and parse_dates, to create indices during import and format as datetime. Then run your needed horizontal merge. Below assumes date is in first column of csv. And at the end use sort_index() on final dataframe to sort the datetimes.

df1 = pd.read_csv(r"E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
df2 = pd.read_csv(r"E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
df3 = pd.read_csv(r"E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])

finaldf = pd.concat([df1, df2, df3], axis=1, join='inner').sort_index()

And for DRY-er approach especially across the hundreds of csv files, use a list comprehension

import os
...
os.chdir('E:\\Business\\Economic Indicators')

dfs = [pd.read_csv(f, index_col=[0], parse_dates=[0])
        for f in os.listdir(os.getcwd()) if f.endswith('csv')]

finaldf = pd.concat(dfs, axis=1, join='inner').sort_index()
like image 118
Parfait Avatar answered Sep 22 '22 18:09

Parfait


You're trying to build one large dataframe out of the rows of many dataframes who all have the same column names. axis should be 0 (the default), not 1. Also you don't need to specify a type of join. This will have no effect since the column names are the same for each dataframe.

df = pd.concat([df1, df2, df3])

should be enough in order to concatenate the datasets.

(see https://pandas.pydata.org/pandas-docs/stable/merging.html )

Your call to set_index to define an index using the values in the DateTime column should then work.

like image 36
John Smith Optional Avatar answered Sep 21 '22 18:09

John Smith Optional