How to split file by percentage of no. of lines?

Let's say I want to split my file into 3 portions (60%/20%/20%). I could do this manually -_- :

$ wc -l brown.txt 
57339 brown.txt

$ bc <<< "57339 / 10 * 6"
34398
$ bc <<< "57339 / 10 * 2"
11466
$ bc <<< "34398 + 11466"
45864
$ bc <<< "34398 + 11466 + 11475"
57339

$ head -n 34398 brown.txt > part1.txt
$ sed -n 34399,45864p brown.txt > part2.txt
$ sed -n 45865,57339p brown.txt > part3.txt
$ wc -l part*.txt
   34398 part1.txt
   11466 part2.txt
   11475 part3.txt
   57339 total

But I'm sure there's a better way!

alvas asked Nov 04 '16 05:11



1 Answer

There is a utility, csplit, that takes as arguments the line numbers at which each new file should start. The following script is a wrapper that uses only the options of its POSIX version:

#!/bin/bash

usage () {
    printf '%s\n' "${0##*/} [-ks] [-f prefix] [-n number] file arg1..." >&2
}

# Collect csplit options
while getopts "ksf:n:" opt; do
    case "$opt" in
        k|s) args+=(-"$opt") ;;           # k: keep files on error, s: silent
        f|n) args+=(-"$opt" "$OPTARG") ;; # f: filename prefix, n: digits in suffix
        *) usage; exit 1 ;;
    esac
done
shift $(( OPTIND - 1 ))

# Need a file name and at least two ratios
if (( $# < 3 )); then
    usage
    exit 1
fi

fname=$1
shift
ratios=("$@")

len=$(wc -l < "$fname")

# Sum of ratios and array of cumulative ratios
for ratio in "${ratios[@]}"; do
    (( total += ratio ))
    cumsums+=("$total")
done

# Don't need the last element (quoted to avoid globbing; negative index needs Bash 4.3+)
unset 'cumsums[-1]'

# Array of numbers of first line in each split file
for sum in "${cumsums[@]}"; do
    linenums+=( $(( sum * len / total + 1 )) )
done

csplit "${args[@]}" "$fname" "${linenums[@]}"

After the name of the file to split, the script (saved here as percsplit) takes the sizes of the split files as ratios relative to their sum, so

percsplit brown.txt 60 20 20
percsplit brown.txt 6 2 2
percsplit brown.txt 3 1 1

are all equivalent.
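
To see why these calls are equivalent, the line-number arithmetic from the script can be pulled out on its own. The following is only a sketch: first_lines is a hypothetical helper that is not part of the script above, and it is written for plain sh rather than with Bash arrays. It prints the line numbers that would be handed to csplit for a given line count and list of ratios.

```shell
#!/bin/sh
# Hypothetical helper (not part of the script above): print the first line
# number of every split file except the first, given a line count and ratios.
first_lines () {
    len=$1
    shift
    total=0
    for ratio in "$@"; do
        total=$(( total + ratio ))
    done
    sum=0
    out=
    i=1
    for ratio in "$@"; do
        # The last ratio starts no new file, so stop before it
        [ "$i" -eq "$#" ] && break
        sum=$(( sum + ratio ))
        # Same formula as in the script: cumulative share of lines, plus one
        out="$out${out:+ }$(( sum * len / total + 1 ))"
        i=$(( i + 1 ))
    done
    printf '%s\n' "$out"
}

# All three calls print the same line numbers: 34404 45872
first_lines 57339 60 20 20
first_lines 57339 6 2 2
first_lines 57339 3 1 1
```

Scaling every ratio by the same factor scales both sum and total, so the integer division gives the same result each time.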

Usage similar to the case in the question is as follows:

$ percsplit -s -f part -n 1 brown.txt 60 20 20
$ wc -l part*
 34403 part0
 11468 part1
 11468 part2
 57339 total

Numbering starts at zero, though, and there is no .txt extension. GNU csplit supports a --suffix-format option that would allow a .txt extension. It could be added to the accepted arguments, but parsing a long option would require something more elaborate than getopts.
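
As a rough illustration of the GNU option mentioned here, csplit can be called directly with --suffix-format. This sketch assumes GNU csplit and seq are installed; sample.txt is a made-up 10-line stand-in for brown.txt, and the line numbers 7 and 9 are what the script's formula yields for a 60/20/20 split of 10 lines.

```shell
#!/bin/sh
# Assumes GNU csplit: --suffix-format is not part of POSIX csplit.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Stand-in for brown.txt: a 10-line sample file
seq 1 10 > sample.txt

# 60/20/20 of 10 lines: the second file starts at line 7, the third at line 9.
# The format '%d.txt' gives suffixes 0.txt, 1.txt, 2.txt instead of 00, 01, 02.
csplit -s -f part --suffix-format='%d.txt' sample.txt 7 9

wc -l part*.txt
```

The files still count from zero (part0.txt, part1.txt, part2.txt); only the suffix changes.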

This solution also works for very short files (for example, a two-line file split into two), and the heavy lifting is done by csplit itself.
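
The two-line case can be checked by hand. This is a minimal sketch that repeats the script's formula for the ratios 1 1 without calling the script itself; two.txt and the half prefix are made-up names.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# A file with exactly two lines
printf '%s\n' first second > two.txt

len=$(wc -l < two.txt)
# Ratios 1 1: the total is 2 and the cumulative sum before the last ratio is 1
start=$(( 1 * len / 2 + 1 ))   # first line of the second file

# Only POSIX options: -s silent, -f prefix, -n digits in the suffix
csplit -s -f half -n 1 two.txt "$start"

wc -l half*
```

With two lines, start evaluates to 2, so half0 gets the first line and half1 the second.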

Benjamin W. answered Oct 15 '22 17:10
