I have a huge CSV file, 1m lines. I was wondering if there is a way to split this file into smaller ones but keeping the first line (CSV header) on all the files.
It seems split
is very fast but is also very limited. You cannot add a suffix to the filenames like .csv
.
split -l11000 products.csv file_
Is there an effective way to do this task in just bash
? A one-line command would be great.
The CSV format is not standardized, so various implementations exist. In 2005, the Internet Society published guidelines for creating CSV files. They wrote down best practices to structure and process CSV data. From those guidelines and giving the lack of standardization, the header line is optional in a CSV file.
CSV File Splitter is a lightweight application which splits huge comma separated values files into multiple smaller files, allowing further data analysis in Excel with its 1,048,576 row limit.
The answer to this question is yes, this is possible with AWK.
The idea is to keep the header in mind and print all the rest in filenames of the form filename.00001.csv
:
awk -v l=11000 '(NR==1){header=$0;next}
(NR%l==2) {
close(file);
file=sprintf("%s.%0.5d.csv",FILENAME,++c)
sub(/csv[.]/,"",file)
print header > file
}
{print > file}' file.csv
This works in the following way:
(NR==1){header=$0;next}
: If the record/line is the first line, save that line as the header.(NR%l==2){...}
: Every time we wrote l=11000
records/lines, we need to start writing to a new file. This happens every time the modulo of the record/line number hits 2. This is on the lines 2, 2+l, 2+2l, 2+3l,.... When such a line is found we do:
close(file)
: close the file you just wrote too.file=sprintf("%s.%0.5d.csv",FILENAME,++c); sub(/csv[.]/,"",file)
: define the new filename as FILENAME.00XXX.csv
print header > file
: open the file and write the header to that file.{print > file}
: write the entries to the file.note: If you don't care about the filename, you can use the following shorter version:
awk -v m=100 '
(NR==1){h=$0;next}
(NR%m==2) { close(f); f=sprintf("%s.%0.5d",FILENAME,++c); print h > f }
{print > f}' file.csv
Using GNU split
to split file.csv
:
export inputPrefix='file' parts=16 && split --verbose -d -n l/${parts} --additional-suffix=.csv --filter='([ "$FILE" != "${inputPrefix}.00.csv" ] && head -1 "${inputPrefix}.csv" ; cat) > "$FILE"' "${inputPrefix}.csv" "${inputPrefix}."
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With