Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the fastest way to merge multiple csv files by column?

Tags:

python

merge

csv

I have about 50 CSV files with 60,000 rows in each, and a varying number of columns. I want to merge all the CSV files by column. I've tried doing this in MATLAB by transposing each csv file and re-saving to disk, and then using the command line to concatenate them. This took my computer over a week and the final result needs to transposed once again! I have to do this again, and I'm looking for a solution that won't take another week. Any help would be appreciated.

like image 993
ankit Avatar asked Aug 09 '13 05:08

ankit


People also ask

How do I merge two CSV files in the same column?

Now to merge the two CSV files you have to use the dataframe. merge() method and define the column, you want to do merging. If the data is not available for the specific columns in the other sheets then the corresponding rows will be deleted. You can verify using the shape() method.

Can you merge multiple CSV?

Option 1: Command Prompt If you are a Windows user, you can use the built-in Command Prompt to combine CSV files. Command Prompt is a text interface for your computer. You can type simple commands to merge files. First, put all of your CSV files in a folder and copy the full path of your folder.


2 Answers

[...] transposing each csv file and re-saving to disk, and then using the command line to concatenate them [...]

Sounds like Transpose-Cat-Transpose. Use paste for joining files horizontally.

paste -d ',' a.csv b.csv c.csv ... > result.csv
like image 148
Thomas B. Avatar answered Sep 18 '22 15:09

Thomas B.


The Python csv module can be set up so that each record is a dictionary with the column names as keys. You should that way be able to read in all the files as dictionaries, and write them to an out-file that has all columns.

Python is easy to use, so this should be fairly trivial for a programmer of any language.

If your csv-files doesn't have column headings, this will be quite a lot of manual work, though, so then it's perhaps not the best solution.

Since these files are fairly big, it's best not to read all of them into memory once. I'd recommend that you first open them only to collect all column names into a list, and use that list to create the output file. Then you can concatenate each input file to the output file without having to have all of the files in memory.

like image 30
Lennart Regebro Avatar answered Sep 22 '22 15:09

Lennart Regebro