I would like to read multiple CSV files (with a different number of columns) from a target directory into a single Python Pandas DataFrame to efficiently search and extract data.
Example file:
Events
1,0.32,0.20,0.67
2,0.94,0.19,0.14,0.21,0.94
3,0.32,0.20,0.64,0.32
4,0.87,0.13,0.61,0.54,0.25,0.43
5,0.62,0.21,0.77,0.44,0.16
Here is what I have so far:
# get a list of all csv files in target directory
my_dir = "C:\\Data\\"
filelist = []
os.chdir( my_dir )
for files in glob.glob( "*.csv" ) :
filelist.append(files)
# read each csv file into single dataframe and add a filename reference column
# (i.e. file1, file2, file 3) for each file read
df = pd.DataFrame()
columns = range(1,100)
for c, f in enumerate(filelist) :
key = "file%i" % c
frame = pd.read_csv( (my_dir + f), skiprows = 1, index_col=0, names=columns )
frame['key'] = key
df = df.append(frame,ignore_index=True)
(the indexing isn't working properly)
Essentially, the script below is exactly what I want (tried and tested) but needs to be looped through 10 or more csv files:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
columns = range(1,100)
df1 = pd.read_csv("C:\\Data\\Currambene_001y09h00m_events.csv",
skiprows = 1, index_col=0, names=columns)
df2 = pd.read_csv("C:\\Data\\Currambene_001y12h00m_events.csv",
skiprows = 1, index_col=0, names=columns)
keys = [('file1'), ('file2')]
df = pd.concat([df1, df2], keys=keys, names=['fileno'])
I have found many related links, however I am still not able to get this to work:
To merge all CSV files, use the GLOB module. The os. path. join() method is used inside the concat() to merge the CSV files together.
You can do this by reading each CSV file into DataFrame and appending or concatenating the DataFrames to create a single DataFrame with data from all files. Here, I will use read_csv() to read CSV files and concat() function to concatenate DataFrams together to create one big DataFrame.
Vectorization is always the first and best choice. You can convert the data frame to NumPy array or into dictionary format to speed up the iteration workflow. Iterating through the key-value pair of dictionaries comes out to be the fastest way with around 280x times speed up for 20 million records.
You need to decide in what axis you want to append your files. Pandas will always try to do the right thing by:
The trick to appending efficiently is to tip the files sideways, so you get the desired behaviour to match what pandas.concat
will be doing. This is my recipe:
from pandas import *
files = !ls *.csv # IPython magic
d = concat([read_csv(f, index_col=0, header=None, axis=1) for f in files], keys=files)
Notice that read_csv
is transposed with axis=1
, so it will be concatenated on the column axis, preserving its names. If you need, you can transpose the resulting DataFrame back with d.T
.
EDIT:
For different number of columns in each source file, you'll need to supply a header. I understand you don't have a header in your source files, so let's create one with a simple function:
def reader(f):
d = read_csv(f, index_col=0, header=None, axis=1)
d.columns = range(d.shape[1])
return d
df = concat([reader(f) for f in files], keys=files)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With