Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reading multiple csv files into a Pandas Data Frame

I am attempting to read multiple CSV files into a Pandas data frame. The CSVs aren't comma separated - the fields are delimited by a semicolon ";".

I based my code on the answers here.

My data is all in a specific subdirectory: /data/luftdaten/5331

This is what I run:

import glob
import pandas as pd

path =r'data/luftdaten/5331' # use your path

filenames = glob.glob(path + "/*.csv")
count_files = 0
dfs = []
for filename in filenames:
    if count_files ==0:
        dfs.append(pd.read_csv(filename, sep=";")) 
        count_files += 1
    else:
        dfs.append(pd.read_csv(filename, sep=";", skiprows=[0]))
        count_files +=1

big_frame = pd.concat(dfs, ignore_index=True)

I use count_files to monitor if it is the first CSV - in which case I import the headers. Otherwise, it skips the headers.

The code executes OK.

If I run it with a single file in that directory, everything is fine:

big_frame.info()

Output:

RangeIndex: 146 entries, 0 to 145
Data columns (**total 12 column**s):
sensor_id      146 non-null int64
sensor_type    146 non-null object
etc......

If I run it with 2 or more files in the directory things go wrong from the start.

Output with 4 files:

RangeIndex: 1893 entries, 0 to 1892
Data columns (total **33 columns**):
-2.077                 1164 non-null float64
-2.130                 145 non-null float64
2.40                   145 non-null float64

Running big_frame.head() on the single CSV version gives this, with the correct column names:

output from importing single CSV

While running the same with four files imported gives me this:

output from importing 4 CSV files (cropped right)

Is there anything obvious that I am doing which is causing not only the number of rows to grow but the columns too?

Your guidance would be gratefully appreciated!

like image 867
Watty62 Avatar asked Jan 02 '23 09:01

Watty62


1 Answers

The reason why it's currently not working is when you do skiprows=[0] for each file after your first, that new dataframe has its second (index 1) row used as the column titles. Hence, when the frames are concatenated there are lots and lots of column headers that don't match. If you remove the skiprows=[0] it should work.

Assuming all of your files have the same header (or you're okay with NaN when they differ), you should be able to do this in a one-liner:

big_frame = pd.concat([pd.read_csv(f, sep=';') for f in glob.glob(path + "/*.csv")],
                      ignore_index=True)
like image 172
asongtoruin Avatar answered Jan 05 '23 17:01

asongtoruin