How can I split a large text file into smaller files with an equal number of lines?

Tags: file, bash, unix
I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).

I could do this fairly easily in Python, but I'm wondering if there's any kind of ninja way to do this using Bash and Unix utilities (as opposed to manually looping and counting / partitioning lines).

danben asked Jan 06 '10 22:01

11 Answers

Have a look at the split command:

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

You could do something like this:

split -l 200000 filename

which will create files each with 200000 lines named xaa xab xac ...
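
A small-scale sketch of the same idea, so the result is easy to inspect. The 25-line sample file, the 10-line chunk size, the part_ prefix and the temporary directory are all assumptions made for illustration; they are not part of the answer above.

```shell
# Work in a throwaway directory so the pieces do not clutter anything.
workdir=$(mktemp -d)

# Hypothetical sample input: 25 numbered lines.
seq 1 25 > "$workdir/sample.txt"

# 10 lines per piece; "part_" replaces the default "x" prefix.
split -l 10 "$workdir/sample.txt" "$workdir/part_"

# Show the line count of each piece; the last one holds the remainder.
wc -l "$workdir"/part_*

pieces=$(ls "$workdir"/part_* | wc -l | tr -d ' ')
last_lines=$(wc -l < "$workdir/part_ac" | tr -d ' ')
echo "pieces: $pieces, lines in last piece: $last_lines"
```

With 25 lines and 10 lines per piece, split should write part_aa and part_ab with 10 lines each and part_ac with the remaining 5.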

Another option, split by size of output file (still splits on line breaks):

 split -C 20m --numeric-suffixes input_filename output_prefix

creates files named output_prefix00, output_prefix01, output_prefix02, ... (numeric suffixes start at 00), each at most 20 MiB in size (the m suffix means 1024*1024 bytes).
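
To check that -C really keeps lines whole, here is a hedged sketch on a tiny input. It assumes a split that supports -C and -d (GNU coreutils does); the file names, the 500-byte limit and the generated records are hypothetical stand-ins for a real file and a 20m limit.

```shell
workdir=$(mktemp -d)

# Hypothetical input: 200 lines of varying length.
seq 1 200 | awk '{ print "record " $1 " " ($1 * $1) }' > "$workdir/input.txt"

# At most 500 bytes of complete lines per piece, numeric suffixes.
split -C 500 -d "$workdir/input.txt" "$workdir/out_"

# Find the largest piece and check that every piece ends with a newline.
largest=0
whole_lines=yes
for piece in "$workdir"/out_*; do
    size=$(wc -c < "$piece" | tr -d ' ')
    if [ "$size" -gt "$largest" ]; then
        largest=$size
    fi
    # Command substitution strips a trailing newline, so an empty
    # result means the last byte of the piece is a newline.
    if [ -n "$(tail -c 1 "$piece")" ]; then
        whole_lines=no
    fi
done
echo "largest piece: $largest bytes, whole lines only: $whole_lines"
```

No piece should exceed the limit, and concatenating the pieces in order should give back the original file.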

Mark Byers answered Oct 02 '22 00:10


Use the split command:

split -l 200000 mybigfile.txt

Robert Christie answered Oct 01 '22 23:10


Yes, there is a split command. It will split a file by lines or bytes.

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
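
The difference between the decimal and binary suffixes can be seen on a small file. This is a sketch that assumes GNU split (other implementations may not accept kB); the 5000-byte filler file and the dec_ and bin_ prefixes are made up for the demonstration.

```shell
workdir=$(mktemp -d)

# Hypothetical input: 5000 bytes of filler.
head -c 5000 /dev/zero > "$workdir/blob.bin"

# kB is the decimal multiplier (1000), K the binary one (1024).
split -b 1kB "$workdir/blob.bin" "$workdir/dec_"
split -b 1K "$workdir/blob.bin" "$workdir/bin_"

dec_first=$(wc -c < "$workdir/dec_aa" | tr -d ' ')
bin_first=$(wc -c < "$workdir/bin_aa" | tr -d ' ')
echo "first piece with 1kB: $dec_first bytes"
echo "first piece with 1K:  $bin_first bytes"
```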

Dave Kirby answered Oct 01 '22 23:10


Split the file "file.txt" into files of 10,000 lines each:

split -l 10000 file.txt

ialqwaiz answered Oct 02 '22 00:10


Use split:

split breaks a file into fixed-size pieces, writing consecutive sections of INPUT to the output files (standard input is read if no INPUT is given or if INPUT is -).

Syntax: split [options] [INPUT [PREFIX]]

zmbush answered Oct 01 '22 23:10


Use:

sed -n '1,100p' filename > output.txt

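Here, 1 and 100 are the first and last line numbers of the range that is written to output.txt. Each run extracts a single range, so splitting a whole file this way takes one invocation per chunk.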
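
As a rough sketch of how the sed approach generalizes, the range for the k-th chunk can be computed from the chunk size. The variable names, the 250-line sample file and the choice of chunk 2 are assumptions for illustration. The extra q command makes sed stop reading once the range has been printed.

```shell
workdir=$(mktemp -d)

# Hypothetical input: 250 numbered lines.
seq 1 250 > "$workdir/filename"

lines_per_chunk=100
chunk=2   # which chunk to extract, counting from 1

start=$(( (chunk - 1) * lines_per_chunk + 1 ))
end=$(( chunk * lines_per_chunk ))

# Print lines start..end, then quit instead of scanning the rest of the file.
sed -n "${start},${end}p;${end}q" "$workdir/filename" > "$workdir/output.txt"

first=$(head -n 1 "$workdir/output.txt")
last=$(tail -n 1 "$workdir/output.txt")
count=$(wc -l < "$workdir/output.txt" | tr -d ' ')
echo "chunk $chunk: lines $first to $last ($count lines)"
```

Because every chunk needs its own pass over the file, split is usually the better fit when all chunks are wanted.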

Harshwardhan answered Oct 01 '22 22:10


split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:

-n, --number=CHUNKS     generate CHUNKS output files; see explanation below

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

Thus, split -n 4 input output. will generate four files (output.a{a,b,c,d}) with about the same number of bytes each, but lines might be broken in the middle.

If we want to preserve full lines (i.e. split by lines), then this should work:

split -n l/4 input output.

Related answer: https://stackoverflow.com/a/19031247
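
A quick way to try the l/N form is on generated input. This sketch assumes GNU split 8.8 or later; the 1000-line sample file is hypothetical. Note that l/N balances the pieces by size in bytes, so the line counts of the pieces may differ slightly.

```shell
workdir=$(mktemp -d)

# Hypothetical input: 1000 numbered lines.
seq 1 1000 > "$workdir/input"

# Four pieces, never cutting a line in the middle.
split -n l/4 "$workdir/input" "$workdir/output."

# Line count per piece.
wc -l "$workdir"/output.*

pieces=$(ls "$workdir"/output.* | wc -l | tr -d ' ')
total=$(cat "$workdir"/output.* | wc -l | tr -d ' ')
echo "pieces: $pieces, total lines: $total"
```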

Denilson Sá Maia answered Oct 02 '22 00:10


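You can also use AWK. This version starts a new output file every 200000 lines and closes the previous one first, so each file gets exactly 200000 lines (except the last) and AWK does not run out of open file descriptors:

awk -v n=200000 'NR % n == 1 { if (out) close(out); out = ++c ".txt" } { print > out }' largefile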
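
A small-scale sketch of an AWK-based split, for checking the chunk boundaries by eye. The names n, dir, out and c, the 25-line sample and the 10-line chunk size are assumptions for the demonstration. This variant switches files when NR % n == 1, so it needs n to be at least 2.

```shell
workdir=$(mktemp -d)

# Hypothetical input: 25 numbered lines.
seq 1 25 > "$workdir/largefile"

# Start a new numbered file at lines 1, 11, 21, ... and close the old one.
awk -v n=10 -v dir="$workdir" '
    NR % n == 1 {
        if (out != "") close(out)
        out = dir "/" (++c) ".txt"
    }
    { print > out }
' "$workdir/largefile"

wc -l "$workdir"/[0-9]*.txt
second_start=$(head -n 1 "$workdir/2.txt")
last_count=$(wc -l < "$workdir/3.txt" | tr -d ' ')
echo "2.txt starts at line $second_start; 3.txt has $last_count lines"
```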

ghostdog74 answered Oct 02 '22 00:10


To split a large text file into smaller files of 1000 lines each:

split -l 1000 <file>

To split a large binary file into smaller files of 10M each:

split -b 10M <file>

To consolidate split files into a single file:

cat x* > <file>

Split a file, each split having 10 lines (except the last split):

split -l 10 filename

Split a file into 5 files of equal size in bytes (except possibly the last one; lines may be cut at the boundaries):

split -n 5 filename

Split a file with 512 bytes in each split (except the last split; use 512k for kilobytes and 512m for megabytes):

split -b 512 filename

Split a file with at most 512 bytes in each split without breaking lines:

split -C 512 filename

Source: cht.sh
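
The split and consolidate commands from this list can be checked together with a round trip. This is only a sketch: the random 100000-byte file, the 30000-byte piece size and the file names are invented for the test, and cmp is used to confirm that the rebuilt file is byte-for-byte identical.

```shell
workdir=$(mktemp -d)

# Hypothetical binary input: 100000 random bytes.
head -c 100000 /dev/urandom > "$workdir/original.bin"

# Split into 30000-byte pieces named xaa, xab, ...
split -b 30000 "$workdir/original.bin" "$workdir/x"

# Concatenate the pieces in glob (alphabetical) order.
cat "$workdir"/x* > "$workdir/rebuilt.bin"

if cmp -s "$workdir/original.bin" "$workdir/rebuilt.bin"; then
    roundtrip=ok
else
    roundtrip=failed
fi
pieces=$(ls "$workdir"/x* | wc -l | tr -d ' ')
echo "pieces: $pieces, round trip: $roundtrip"
```

The glob only reassembles the file correctly because the default suffixes sort in the order the pieces were written.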

BuGaU0 answered Oct 02 '22 00:10


In case you just want to split into files of x lines each, the given answers about split are fine. But I am curious why no one paid attention to these requirements:

  • "without having to count them" -> let wc count the lines
  • "having the remainder in extra file" -> split does this by default

I can't do that without wc, so I use this:

split -l $(( $(wc -l < "$filename") / $chunks )) "$filename"

This can easily be added as a function in your .bashrc, so you can just invoke it with the filename and the number of chunks:

split -l $(( $(wc -l < "$1") / $2 )) "$1"

If you want just x chunks, without the remainder in an extra file, round the division up instead, so that the leftover lines are spread over the x files. I use this approach because I usually want x files rather than x lines per file:

split -l $(( ($(wc -l < "$1") + $2 - 1) / $2 )) "$1"

You can add that to a script and call it your "ninja way", because if nothing suits your needs, you can build it :-)
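
One possible shape for such a helper, as a minimal sketch. The function name split_chunks, its optional prefix argument and the sample data are hypothetical. It rounds the lines-per-file value up, which yields at most the requested number of files (it can yield fewer when the line count is small compared with the number of chunks).

```shell
# Hypothetical helper: split_chunks FILE CHUNKS [PREFIX]
split_chunks() {
    file=$1
    chunks=$2
    prefix=${3:-x}
    # Count the lines; tr strips the padding some wc versions add.
    total=$(wc -l < "$file" | tr -d ' ')
    # Ceiling division, so no extra remainder file is produced.
    per_file=$(( (total + chunks - 1) / chunks ))
    # Guard against an empty input, since split rejects -l 0.
    if [ "$per_file" -lt 1 ]; then
        per_file=1
    fi
    split -l "$per_file" "$file" "$prefix"
}

# Example run on made-up data: 23 lines into 4 chunks.
workdir=$(mktemp -d)
seq 1 23 > "$workdir/data.txt"
split_chunks "$workdir/data.txt" 4 "$workdir/chunk_"

pieces=$(ls "$workdir"/chunk_* | wc -l | tr -d ' ')
last_lines=$(wc -l < "$workdir/chunk_ad" | tr -d ' ')
echo "pieces: $pieces, lines in last piece: $last_lines"
```

For 23 lines and 4 chunks the helper uses 6 lines per file, giving pieces of 6, 6, 6 and 5 lines.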

m3nda answered Oct 01 '22 22:10


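Merge small HDFS files with getmerge, then split the result into pieces of a suitable size.

This method can break lines in the middle, because -b splits purely by byte count:

split -b 125m -d -a 3 compact.file compact_prefix

Instead, I getmerge the files and then split the result into pieces of about 128 MB each, keeping lines whole.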

# Split into pieces of about 128 MB; the size unit is assumed to be M or G. Please test before use.

begainsize=$(hdfs dfs -du -s -h "/externaldata/$table_name/$date/" | awk '{ print $1 }')
sizeunit=$(hdfs dfs -du -s -h "/externaldata/$table_name/$date/" | awk '{ print $2 }')
if [ "$sizeunit" = "G" ]; then
    # 1 GB is 8 pieces of 128 MB.
    res=$(printf "%.f" "$(echo "scale=5; $begainsize * 8" | bc)")
else
    # Note: printf "%.f" rounds to the nearest integer. Ceiling ref: http://blog.csdn.net/naiveloafer/article/details/8783518
    res=$(printf "%.f" "$(echo "scale=5; $begainsize / 128" | bc)")
fi
# Never ask split for zero pieces.
[ "$res" -ge 1 ] || res=1
echo "$res"
# Split into $res files with a numeric suffix. Ref: http://blog.csdn.net/microzone/article/details/52839598
compact_file_name="${compact_file}_"
echo "compact_file_name: $compact_file_name"
split -n "l/$res" -d -a 3 "$basedir/$compact_file" "$basedir/$compact_file_name"
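
The piece count can also be derived with integer ceiling division on the byte size of the merged local file, which avoids parsing human-readable units. This is a sketch under assumptions, not the method above: it works on a local stand-in file, uses a 1024-byte target in place of 128 MB so that it runs quickly, and assumes GNU split for -n. All names are hypothetical.

```shell
workdir=$(mktemp -d)

# Stand-in for the merged file: 1000 numbered lines (3893 bytes).
seq 1 1000 > "$workdir/compact.file"

# Stand-in target size; a real run would use 128 * 1024 * 1024.
target_bytes=1024

size_bytes=$(wc -c < "$workdir/compact.file" | tr -d ' ')

# Ceiling division: enough pieces that none has to exceed the target on average.
res=$(( (size_bytes + target_bytes - 1) / target_bytes ))

# Split on line boundaries into $res pieces with 3-digit numeric suffixes.
split -n "l/$res" -d -a 3 "$workdir/compact.file" "$workdir/compact_"

pieces=$(ls "$workdir"/compact_* | wc -l | tr -d ' ')
echo "size: $size_bytes bytes, pieces: $pieces"
```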

Matiji66 answered Oct 01 '22 23:10