
What is the fastest way to combine 100 CSV files with headers into one?

What is the fastest way to combine 100 CSV files with headers into one with the following setup:

1. The total size of the files is 200 MB. (The size is reduced to make the computation time visible.)
2. The files are located on an SSD with a maximum speed of 240 MB/s.
3. The CPU has 4 cores, so multi-threading and multiple processes are allowed.
4. There is only one node (important for Spark).
5. The available memory is 15 GB, so the files easily fit into memory.
6. The OS is Linux (Debian Jessie).
7. The computer is actually an n1-standard-4 instance in Google Cloud.

(The detailed setup was included to make the scope of the question more specific. The changes were made according to the feedback here)

File 1.csv:

    a,b
    1,2

File 2.csv:

    a,b
    3,4

Final out.csv:

    a,b
    1,2
    3,4
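
The expected result for this toy example can be pinned down with a minimal Python sketch that works on in-memory strings. The function name `merge_csv_texts` is hypothetical, and the sketch assumes every input starts with the same header line and ends with a newline.

```python
def merge_csv_texts(texts):
    """Concatenate CSV texts, keeping only the first header line.

    Assumes every text starts with the same header and ends with a newline.
    """
    if not texts:
        return ""
    parts = [texts[0]]  # the first text is kept whole, header included
    for text in texts[1:]:
        # drop everything up to and including the first newline (the header)
        _header, _sep, body = text.partition("\n")
        parts.append(body)
    return "".join(parts)


file_1 = "a,b\n1,2\n"
file_2 = "a,b\n3,4\n"
print(merge_csv_texts([file_1, file_2]), end="")
```

The methods benchmarked in the question all implement this same rule against files on disk.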

According to my benchmarks, the fastest of all the proposed methods is pure Python. Is there a faster method?

Benchmarks (Updated with the methods from comments and posts):

    Method                        Time
    pure python                   0.298s
    sed                           1.9s
    awk                           2.5s
    R data.table                  4.4s
    R data.table with colClasses  4.4s
    Spark 2                       40.2s
    python pandas                 1min 11.0s

Versions of tools:

    sed 4.2.2
    awk: mawk 1.3.3 Nov 1996
    Python 3.6.1
    Pandas 0.20.1
    R 3.4.0
    data.table 1.10.4
    Spark 2.1.1

Code in Jupyter notebooks:

sed:

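    %%time
    # write only the header line of the first file (plain head prints 10 lines)
    !head -n 1 temp/in/1.csv > temp/merged_sed.csv
    # -s (GNU sed) treats each file separately, so line 1 of every file is deleted
    !sed -s 1d temp/in/*.csv >> temp/merged_sed.csv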

Pure Python all binary read-write with undocumented behavior of "next":

    %%time
    with open("temp/merged_pure_python2.csv", "wb") as fout:
        # first file:
        with open("temp/in/1.csv", "rb") as f:
            fout.write(f.read())
        # now the rest:
        for num in range(2, 101):
            with open("temp/in/" + str(num) + ".csv", "rb") as f:
                next(f)  # skip the header
                fout.write(f.read())
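
The same idea can be written without hard-coded file names. The following is a sketch, not the benchmarked code: the function name `merge_csv_files` and its parameters are hypothetical, `readline()` is used instead of `next()` to skip the header, and `shutil.copyfileobj` streams the rest of each file in chunks rather than reading it fully into memory. It assumes all files share one header line and end with a newline; its speed was not measured here.

```python
import shutil


def merge_csv_files(paths, out_path):
    """Merge CSV files in binary mode, keeping the header of the first only."""
    with open(out_path, "wb") as fout:
        for index, path in enumerate(paths):
            with open(path, "rb") as fin:
                if index > 0:
                    fin.readline()  # skip the header of every file but the first
                shutil.copyfileobj(fin, fout)  # stream the remaining bytes
```

With the layout from the question, a call might look like `merge_csv_files([f"temp/in/{n}.csv" for n in range(1, 101)], "temp/merged.csv")`, where the output file name is only an example.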

awk:

    %%time
    !awk 'NR==1; FNR==1{{next}} 1' temp/in/*.csv > temp/merged_awk.csv

R data.table:

    %%time
    %%R
    filenames <- paste0("temp/in/", list.files(path="temp/in/", pattern="*.csv"))
    files <- lapply(filenames, fread)
    merged_data <- rbindlist(files, use.names=F)
    fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)

R data.table with colClasses:

    %%time
    %%R
    filenames <- paste0("temp/in/", list.files(path="temp/in/", pattern="*.csv"))
    files <- lapply(filenames, fread, colClasses=c(
        V1="integer",
        V2="integer",
        V3="integer",
        V4="integer",
        V5="integer",
        V6="integer",
        V7="integer",
        V8="integer",
        V9="integer",
        V10="integer"))
    merged_data <- rbindlist(files, use.names=F)
    fwrite(merged_data, file="temp/merged_R_fwrite.csv", row.names=FALSE)

Spark (pyspark):

    %%time
    df = spark.read.format("csv").option("header", "true").load("temp/in/*.csv")
    df.coalesce(1).write.option("header", "true").csv("temp/merged_pyspark.csv")

Python pandas:

    %%time
    import glob  # needed for glob.glob below
    import pandas as pd

    interesting_files = glob.glob("temp/in/*.csv")
    df_list = []
    for filename in sorted(interesting_files):
        df_list.append(pd.read_csv(filename))
    full_df = pd.concat(df_list)

    full_df.to_csv("temp/merged_pandas.csv", index=False)

Data was generated by:

    %%R
    df=data.table(replicate(10,sample(0:9,100000,rep=TRUE)))
    for (i in 1:100){
        write.csv(df,paste0("temp/in/",i,".csv"), row.names=FALSE)
    }
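
For readers without R, a roughly equivalent data set could be generated in Python. This is a sketch under assumptions: `generate_inputs` and its parameters are hypothetical names, the column names `V1` to `V10` follow the `colClasses` call in the question, and unlike `write.csv` the header is written unquoted. As in the R snippet, the same table is written to every file.

```python
import csv
import os
import random


def generate_inputs(directory, n_files=100, n_rows=100_000, n_cols=10, seed=0):
    """Write n_files identical CSV files of random digits 0-9 into directory."""
    rng = random.Random(seed)
    header = [f"V{i}" for i in range(1, n_cols + 1)]
    # one table of random digits, reused for every file
    rows = [[rng.randint(0, 9) for _ in range(n_cols)] for _ in range(n_rows)]
    os.makedirs(directory, exist_ok=True)
    for number in range(1, n_files + 1):
        path = os.path.join(directory, f"{number}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f, lineterminator="\n")
            writer.writerow(header)
            writer.writerows(rows)
```

Calling `generate_inputs("temp/in")` with the defaults would produce 100 files of 100,000 rows each, matching the shape used in the question.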

asked May 26 '17 23:05

keiv.fly

2 Answers

According to the benchmarks in the question, the fastest method is pure Python, relying on the undocumented behavior of "next()" with binary files. The method was proposed by Stefan Pochmann.

Benchmarks (updated with the methods from comments and posts):

    Method                        Time
    pure python                   0.298s
    sed                           1.9s
    awk                           2.5s
    R data.table                  4.4s
    R data.table with colClasses  4.4s
    Spark 2                       40.2s
    python pandas                 1min 11.0s

Versions of tools:

    sed 4.2.2
    awk: mawk 1.3.3 Nov 1996
    Python 3.6.1
    Pandas 0.20.1
    R 3.4.0
    data.table 1.10.4
    Spark 2.1.1

Pure Python code:

    with open("temp/merged_pure_python2.csv", "wb") as fout:
        # first file:
        with open("temp/in/1.csv", "rb") as f:
            fout.write(f.read())
        # now the rest:
        for num in range(2, 101):
            with open("temp/in/" + str(num) + ".csv", "rb") as f:
                next(f)  # skip the header
                fout.write(f.read())

answered Sep 20 '22 18:09

keiv.fly

sed is probably the fastest. I would also propose an awk alternative:

    awk 'NR==1; FNR==1{next} 1' file* > output

This prints the first line of the first file, then skips the first line of every remaining file and prints everything else.
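
For comparison, the awk logic maps fairly directly onto Python's standard `fileinput` module, where `lineno()` plays the role of `NR` and `isfirstline()` corresponds to `FNR==1`. This is only an illustrative sketch (the function name `merge_like_awk` is hypothetical); it works line by line in text mode, and no claim is made about its speed.

```python
import fileinput


def merge_like_awk(paths, out_path):
    """Copy all lines, dropping the first line of every file except the first."""
    if not paths:
        # fileinput falls back to stdin for an empty file list, so refuse it
        raise ValueError("at least one input path is required")
    with open(out_path, "w") as fout, fileinput.input(files=paths) as lines:
        for line in lines:
            # isfirstline() ~ FNR==1, lineno() ~ NR (cumulative line number)
            if lines.isfirstline() and lines.lineno() != 1:
                continue
            fout.write(line)
```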

Timings: I tried 100 files, each 10,000 lines long, around ~~200MB~~ (not sure). Here is the worst timing on my server.

    real    0m0.429s
    user    0m0.360s
    sys     0m0.068s

Server specs (little monster):

    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                12
    On-line CPU(s) list:   0-11
    Thread(s) per core:    1
    Core(s) per socket:    6
    Socket(s):             2
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 63
    Model name:            Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
    Stepping:              2
    CPU MHz:               2394.345
    BogoMIPS:              4789.86
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              15360K
    NUMA node0 CPU(s):     0-11

answered Sep 21 '22 18:09

karakfa
