Split a CSV file into smaller files while keeping the header?

I have a huge CSV file with about 1 million lines. Is there a way to split it into smaller files while keeping the first line (the CSV header) in every file?

split seems very fast but also very limited: as far as I can tell, you cannot add a suffix such as .csv to the output filenames.

split -l11000 products.csv file_

Is there an efficient way to do this using only bash? A one-line command would be great.

asked Jul 19 '18 11:07

neisantos

2 Answers

Yes, this is possible with AWK.

The idea is to remember the header line and write the remaining lines to files named like filename.00001.csv, each of which starts with that header:

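awk -v l=11000 '(NR==1){header=$0;next}          # remember the header line
                (NR%l==2) {                       # first line of a new chunk
                   close(file)                    # finish the previous chunk
                   file=sprintf("%s.%0.5d.csv",FILENAME,++c)
                   sub(/csv[.]/,"",file)          # file.csv.00001.csv -> file.00001.csv
                   print header > file            # every chunk starts with the header
                }
                {print > file}' file.csv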

This works in the following way:

(NR==1){header=$0;next}: If the record/line is the first line, save it as the header and move on to the next line.
(NR%l==2){...}: After every l=11000 data lines we need to start writing to a new file. This happens whenever the line number NR modulo l equals 2, that is, on lines 2, 2+l, 2+2l, 2+3l, ... (l must be at least 3, otherwise NR%l can never equal 2). When such a line is found we do:
- close(file): close the file you just wrote to.
- file=sprintf("%s.%0.5d.csv",FILENAME,++c); sub(/csv[.]/,"",file): define the new filename as FILENAME.00XXX.csv, with the extra "csv." from the input name removed.
- print header > file: open the file and write the header to that file.
{print > file}: write the current line to the file.
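
To check the line arithmetic described above, here is a small sketch that runs the same program on a made-up file with l=3. The sample data (id,name plus 7 rows) and the temporary directory are assumptions for illustration only, not part of the original answer.

```shell
# Hypothetical sample: 1 header line + 7 data rows, in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
{ echo 'id,name'; for i in 1 2 3 4 5 6 7; do echo "$i,item$i"; done; } > file.csv

# Same program as in the answer, but with l=3 so a new file starts on lines 2, 5 and 8.
awk -v l=3 '(NR==1){header=$0;next}
            (NR%l==2) {
               close(file)
               file=sprintf("%s.%0.5d.csv",FILENAME,++c)
               sub(/csv[.]/,"",file)
               print header > file
            }
            {print > file}' file.csv

# Each output file should hold the header plus at most 3 data rows.
wc -l file.0*.csv
```

With 7 data rows this should give file.00001.csv and file.00002.csv with 4 lines each (header plus 3 rows) and file.00003.csv with 2 lines (header plus the last row).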

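Note: If you don't need the .csv extension on the output files, you can use the following shorter version (here with m=100 lines per file), which writes to file.csv.00001, file.csv.00002, and so on: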

awk -v m=100 '
    (NR==1){h=$0;next}
    (NR%m==2) { close(f); f=sprintf("%s.%0.5d",FILENAME,++c); print h > f }
    {print > f}' file.csv

answered Sep 18 '22 13:09

kvantour

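Using GNU split to divide file.csv into 16 parts without breaking lines, prepending the header to every part except the first (which already contains it):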

export inputPrefix='file' parts=16 && split --verbose -d -n l/${parts} --additional-suffix=.csv --filter='([ "$FILE" != "${inputPrefix}.00.csv" ] && head -1 "${inputPrefix}.csv" ; cat) > "$FILE"' "${inputPrefix}.csv" "${inputPrefix}."
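
If the parts should instead be sized by line count, as in the question's split -l11000, the same --filter idea can be sketched as follows. This assumes a GNU split recent enough to support --filter and --additional-suffix. The sample file contents, the chunk size of 3, and the products_ prefix are illustrative assumptions.

```shell
# Hypothetical sample: 1 header line + 7 data rows, in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
{ echo 'id,name'; for i in 1 2 3 4 5 6 7; do echo "$i,item$i"; done; } > products.csv

# tail -n +2 feeds only the data rows to split; the filter then writes the
# header followed by the chunk (read from stdin) into each output file $FILE.
tail -n +2 products.csv |
  split -l 3 -d --additional-suffix=.csv \
        --filter='{ head -n 1 products.csv; cat; } > "$FILE"' \
        - products_

# Expect products_00.csv, products_01.csv and products_02.csv.
wc -l products_*.csv
```

Because the header is stripped before splitting, every part is treated the same way and no special case for the first file is needed. For the real file, replace -l 3 with -l 11000.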

answered Sep 17 '22 13:09

nzkeith
