
What is the fastest way to combine 100 CSV files with headers into one?

What is the fastest way to combine 100 CSV files with headers into one with the following setup:

  1. The total size of files is 200 MB. (The size is reduced to make the computation time visible)
  2. The files are located on an SSD with a maximum speed of 240 MB/s.
  3. The CPU has 4 cores so multi-threading and multiple processes are allowed.
  4. There exists only one node (important for Spark)
  5. The available memory is 15 GB. So the files easily fit into memory.
  6. The OS is Linux (Debian Jessie)
  7. The computer is actually an n1-standard-4 instance in Google Cloud.

(The detailed setup was included to make the scope of the question more specific. The changes were made according to the feedback here)

File 1.csv:

    a,b
    1,2

File 2.csv:

    a,b
    3,4

Final out.csv:

    a,b
    1,2
    3,4
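The expected result can also be stated as a small in-memory sketch. The function name `merge_csv_texts` is made up for illustration, and the sketch assumes every input ends with a newline and has exactly one header line:

```python
def merge_csv_texts(texts):
    """Concatenate CSV texts, keeping only the header of the first one."""
    merged = []
    for index, text in enumerate(texts):
        lines = text.splitlines(keepends=True)
        if index > 0:
            lines = lines[1:]  # drop the repeated header line
        merged.extend(lines)
    return "".join(merged)


# The two example files from above, as strings.
print(merge_csv_texts(["a,b\n1,2\n", "a,b\n3,4\n"]), end="")
```

This only pins down the desired output; it says nothing about speed.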

According to my benchmarks, the fastest of all the proposed methods is pure Python. Is there a faster method?

Benchmarks (Updated with the methods from comments and posts):

    Method                        Time
    pure python                   0.298s
    sed                           1.9s
    awk                           2.5s
    R data.table                  4.4s
    R data.table with colClasses  4.4s
    Spark 2                       40.2s
    python pandas                 1min 11.0s
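The `%%time` magic used for these numbers times a single run of a cell. Outside Jupyter, a comparable measurement could be taken with a helper like the one below. This is only a sketch: `best_time` is a hypothetical name, and repeated runs will mostly read from the OS page cache, so the result reflects warm-cache performance rather than raw SSD speed.

```python
import time


def best_time(func, repeats=3):
    """Return the shortest wall-clock duration of func() over `repeats` runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return min(durations)


# Example: time a trivial callable; replace it with a merge function.
print(f"{best_time(lambda: sum(range(1000))):.6f}s")
```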

Versions of tools:

    sed 4.2.2
    awk: mawk 1.3.3 Nov 1996
    Python 3.6.1
    Pandas 0.20.1
    R 3.4.0
    data.table 1.10.4
    Spark 2.1.1

Code in Jupyter notebooks:

sed:

    %%time
    # -n 1 copies only the header line; plain `head` would copy the first 10 lines
    !head -n 1 temp/in/1.csv > temp/merged_sed.csv
    !sed 1d temp/in/*.csv >> temp/merged_sed.csv

Pure Python, all-binary read/write, relying on the undocumented behavior of "next":

    %%time
    with open("temp/merged_pure_python2.csv", "wb") as fout:
        # first file:
        with open("temp/in/1.csv", "rb") as f:
            fout.write(f.read())
        # now the rest:
        for num in range(2, 101):
            with open("temp/in/" + str(num) + ".csv", "rb") as f:
                next(f)  # skip the header
                fout.write(f.read())
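The loop above hard-codes the file names 1.csv to 100.csv and mixes `next(f)` with `f.read()`. A variant that avoids both could look like the sketch below. It was not part of the benchmark, so no timing is claimed for it. The function name `merge_csv_files` is hypothetical, and the sketch assumes every file has exactly one header line, ends with a newline, and that the output path does not match the input pattern.

```python
import glob
import shutil


def merge_csv_files(pattern, out_path):
    """Merge all files matching `pattern` into `out_path`, keeping one header."""
    # sorted() gives lexicographic order (1, 10, 100, 2, ...), not numeric order.
    paths = sorted(glob.glob(pattern))
    with open(out_path, "wb") as fout:
        for index, path in enumerate(paths):
            with open(path, "rb") as fin:
                if index > 0:
                    fin.readline()  # skip the header of every file but the first
                shutil.copyfileobj(fin, fout)  # stream the remainder in chunks
    return len(paths)
```

Called as `merge_csv_files("temp/in/*.csv", "temp/merged.csv")`, it returns the number of files merged. `readline()` followed by further reads on the same binary file object is documented behavior, and `shutil.copyfileobj` avoids holding a whole file in memory.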

awk:

    %%time
    !awk 'NR==1; FNR==1{{next}} 1' temp/in/*.csv > temp/merged_awk.csv

R data.table:

    %%time
    %%R
    filenames <- paste0("temp/in/",list.files(path="temp/in/",pattern="*.csv"))
    files <- lapply(filenames, fread)
    merged_data <- rbindlist(files, use.names=F)
    fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)

R data.table with colClasses:

    %%time
    %%R
    filenames <- paste0("temp/in/",list.files(path="temp/in/",pattern="*.csv"))
    files <- lapply(filenames, fread,colClasses=c(
        V1="integer",
        V2="integer",
        V3="integer",
        V4="integer",
        V5="integer",
        V6="integer",
        V7="integer",
        V8="integer",
        V9="integer",
        V10="integer"))
    merged_data <- rbindlist(files, use.names=F)
    fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)

Spark (pyspark):

    %%time
    df = spark.read.format("csv").option("header", "true").load("temp/in/*.csv")
    df.coalesce(1).write.option("header", "true").csv("temp/merged_pyspark.csv")

Python pandas:

    %%time
    import glob  # needed for glob.glob below

    import pandas as pd

    interesting_files = glob.glob("temp/in/*.csv")
    df_list = []
    for filename in sorted(interesting_files):
        df_list.append(pd.read_csv(filename))
    full_df = pd.concat(df_list)

    full_df.to_csv("temp/merged_pandas.csv", index=False)

Data was generated by:

    %%R
    df=data.table(replicate(10,sample(0:9,100000,rep=TRUE)))
    for (i in 1:100){
        write.csv(df,paste0("temp/in/",i,".csv"), row.names=FALSE)
    }
asked May 26 '17 23:05 by keiv.fly



2 Answers

According to the benchmarks in the question, the fastest method is pure Python, relying on the undocumented behavior of the "next()" function with binary files. The method was proposed by Stefan Pochmann.

Benchmarks (Updated with the methods from comments and posts):

    Method                        Time
    pure python                   0.298s
    sed                           1.9s
    awk                           2.5s
    R data.table                  4.4s
    R data.table with colClasses  4.4s
    Spark 2                       40.2s
    python pandas                 1min 11.0s

Versions of tools:

    sed 4.2.2
    awk: mawk 1.3.3 Nov 1996
    Python 3.6.1
    Pandas 0.20.1
    R 3.4.0
    data.table 1.10.4
    Spark 2.1.1

Pure Python code:

    with open("temp/merged_pure_python2.csv", "wb") as fout:
        # first file:
        with open("temp/in/1.csv", "rb") as f:
            fout.write(f.read())
        # now the rest:
        for num in range(2, 101):
            with open("temp/in/" + str(num) + ".csv", "rb") as f:
                next(f)  # skip the header
                fout.write(f.read())
answered Sep 20 '22 18:09 by keiv.fly


sed is probably the fastest. I would also propose an awk alternative:

    awk 'NR==1; FNR==1{next} 1' file* > output

This prints the first line of the first file (the header), then skips the first line of every other file and prints all remaining lines.
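For readers less familiar with awk's `NR` (overall line number) and `FNR` (line number within the current file), the same logic can be sketched in Python with the standard `fileinput` module. The function name `merge_like_awk` is made up for illustration. The sketch assumes a non-empty list of paths (an empty list would make `fileinput` read standard input) and files that end with a newline.

```python
import fileinput


def merge_like_awk(paths, out_path):
    """Mimic: awk 'NR==1; FNR==1{next} 1' paths > out_path"""
    with fileinput.FileInput(files=paths) as lines, open(out_path, "w") as fout:
        for line in lines:
            if lines.lineno() == 1:   # NR==1: print the very first line
                fout.write(line)
            if lines.isfirstline():   # FNR==1{next}: skip each file's first line
                continue
            fout.write(line)          # 1: print every remaining line
```

This is meant to explain the one-liner rather than compete with it; line-by-line iteration in Python is likely slower than the binary read/write approach from the question.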

Timings: I tried 100 files, each 10,000 lines long, around 200MB (not sure). Here is the worst timing on my server:

    real    0m0.429s
    user    0m0.360s
    sys     0m0.068s

Server specs (little monster):

    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                12
    On-line CPU(s) list:   0-11
    Thread(s) per core:    1
    Core(s) per socket:    6
    Socket(s):             2
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 63
    Model name:            Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
    Stepping:              2
    CPU MHz:               2394.345
    BogoMIPS:              4789.86
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              15360K
    NUMA node0 CPU(s):     0-11
answered Sep 21 '22 18:09 by karakfa