What is the fastest way to combine 100 CSV files with headers into one with the following setup:
(The detailed setup is included to make the scope of the question more specific. It has been updated according to the feedback here.)
File 1.csv:

    a,b
    1,2

File 2.csv:

    a,b
    3,4

Final out.csv:

    a,b
    1,2
    3,4
According to my benchmarks, the fastest of all the proposed methods is pure Python. Is there any faster method?
Benchmarks (Updated with the methods from comments and posts):
    Method                          Time
    pure python                     0.298s
    sed                             1.9s
    awk                             2.5s
    R data.table                    4.4s
    R data.table with colClasses    4.4s
    Spark 2                         40.2s
    python pandas                   1min 11.0s
Versions of tools:
    sed 4.2.2
    awk: mawk 1.3.3 Nov 1996
    Python 3.6.1
    Pandas 0.20.1
    R 3.4.0
    data.table 1.10.4
    Spark 2.1.1
Code in Jupyter notebooks:
sed:
    %%time
    # copy the header from the first file, then append the data rows of every file
    !head -n 1 temp/in/1.csv > temp/merged_sed.csv
    !sed -s 1d temp/in/*.csv >> temp/merged_sed.csv
Pure Python, all binary read/write, using the undocumented behavior of next() on binary file objects:
    %%time
    with open("temp/merged_pure_python2.csv", "wb") as fout:
        # first file: copy it whole, including the header
        with open("temp/in/1.csv", "rb") as f:
            fout.write(f.read())
        # now the rest: skip each header, then copy the data
        for num in range(2, 101):
            with open("temp/in/" + str(num) + ".csv", "rb") as f:
                next(f)  # skip the header
                fout.write(f.read())
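If relying on next() advancing a binary file's position is a concern, the same idea can be written with only documented calls by reading and discarding the header line with readline(). This is a minimal sketch of that alternative (the output filename is illustrative), not one of the benchmarked methods:

    with open("temp/merged_pure_python_readline.csv", "wb") as fout:
        # first file: copy it whole, including the header
        with open("temp/in/1.csv", "rb") as f:
            fout.write(f.read())
        # remaining files: consume the header line, then copy the rest
        for num in range(2, 101):
            with open("temp/in/" + str(num) + ".csv", "rb") as f:
                f.readline()  # discard the header line
                fout.write(f.read())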
awk:
    %%time
    !awk 'NR==1; FNR==1{{next}} 1' temp/in/*.csv > temp/merged_awk.csv
R data.table:
    %%time
    %%R
    filenames <- paste0("temp/in/", list.files(path="temp/in/", pattern="*.csv"))
    files <- lapply(filenames, fread)
    merged_data <- rbindlist(files, use.names=F)
    fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)
R data.table with colClasses:
    %%time
    %%R
    filenames <- paste0("temp/in/", list.files(path="temp/in/", pattern="*.csv"))
    files <- lapply(filenames, fread, colClasses=c(
        V1="integer", V2="integer", V3="integer", V4="integer", V5="integer",
        V6="integer", V7="integer", V8="integer", V9="integer", V10="integer"))
    merged_data <- rbindlist(files, use.names=F)
    fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)
Spark (pyspark):
    %%time
    df = spark.read.format("csv").option("header", "true").load("temp/in/*.csv")
    df.coalesce(1).write.option("header", "true").csv("temp/merged_pyspark.csv")
Python pandas:
    %%time
    import glob
    import pandas as pd

    interesting_files = glob.glob("temp/in/*.csv")
    df_list = []
    for filename in sorted(interesting_files):
        df_list.append(pd.read_csv(filename))
    full_df = pd.concat(df_list)
    full_df.to_csv("temp/merged_pandas.csv", index=False)
Data was generated by:
    %%R
    df <- data.table(replicate(10, sample(0:9, 100000, rep=TRUE)))
    for (i in 1:100) {
        write.csv(df, paste0("temp/in/", i, ".csv"), row.names=FALSE)
    }
According to the benchmarks in the question, the fastest of all the proposed methods is pure Python using the undocumented behavior of next() with binary files. The method was proposed by Stefan Pochmann.
Benchmarks (Updated with the methods from comments and posts):
    Method                          Time
    pure python                     0.298s
    sed                             1.9s
    awk                             2.5s
    R data.table                    4.4s
    R data.table with colClasses    4.4s
    Spark 2                         40.2s
    python pandas                   1min 11.0s
Versions of tools:
    sed 4.2.2
    awk: mawk 1.3.3 Nov 1996
    Python 3.6.1
    Pandas 0.20.1
    R 3.4.0
    data.table 1.10.4
    Spark 2.1.1
Pure Python code:
with open("temp/merged_pure_python2.csv","wb") as fout: # first file: with open("temp/in/1.csv", "rb") as f: fout.write(f.read()) # now the rest: for num in range(2,101): with open("temp/in/"+str(num)+".csv", "rb") as f: next(f) # skip the header fout.write(f.read())
sed is probably the fastest. I would also propose an awk alternative:
    awk 'NR==1; FNR==1{next} 1' file* > output
It prints the first line of the first file (NR==1, the header), skips the first line of every subsequent file (FNR==1{next}), and prints everything else (the final 1).
Timings: I tried 100 files of 10,000 lines each, around 200 MB in total (not sure of the exact size). Here is the worst timing on my server:
    real    0m0.429s
    user    0m0.360s
    sys     0m0.068s
Server specs (little monster):

    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                12
    On-line CPU(s) list:   0-11
    Thread(s) per core:    1
    Core(s) per socket:    6
    Socket(s):             2
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 63
    Model name:            Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
    Stepping:              2
    CPU MHz:               2394.345
    BogoMIPS:              4789.86
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              15360K
    NUMA node0 CPU(s):     0-11