I have a huge file on a Linux machine. The file is ~20GB and the space on my box is ~25GB. I want to split the file into ~100MB parts. I know there's a 'split' command, but that keeps the original file. I don't have enough space to keep the original. Any ideas on how this can be accomplished? I'll even work with any node modules if they make the task easier than bash.
My attempt:
#!/bin/bash
if [ $# -gt 2 ] || [ $# -lt 1 ] || [ ! -f "$1" ]; then
    echo "Usage: ${0##*/} <filename> [<split size in M>]" >&2
    exit 1
fi

bsize=${2:-100}                               # chunk size in MiB
bucket=$( echo "$bsize * 1024 * 1024" | bc )  # chunk size in bytes
size=$( stat -c '%s' "$1" )
chunks=$( echo "$size / $bucket" | bc )
rest=$( echo "$size % $bucket" | bc )
[ "$rest" -ne 0 ] && let chunks++

# Work backwards: dd copies the last chunk off the end of the file,
# then truncate cuts that chunk off the original. Repeat until empty.
while [ "$chunks" -gt 0 ]; do
    let chunks--
    fn=$( printf '%s_%03d.%s' "${1%.*}" $chunks "${1##*.}" )
    skip=$(( bsize * chunks ))
    dd if="$1" of="$fn" bs=1M skip=${skip} || exit 1
    truncate -c -s ${skip}M "$1" || exit 1
done
The above assumes bash(1) and the Linux implementations of stat(1), dd(1), and truncate(1). It should be about as fast as it gets, since it uses dd(1) to copy chunks of the original file, and it uses bc(1) to make sure arithmetic operations in the 20GB range don't overflow anything. However, the script was only tested on smaller files, so double-check it before running it against your data.
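As a quick sanity check of the dd + truncate idea, here is a self-contained sketch on a small test file (the 1 MiB size, 256 KiB chunks, and file names are my own choices for illustration, not from the post):

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

# Create a 1 MiB test file and record its checksum.
dd if=/dev/urandom of=testfile bs=1024 count=1024 2>/dev/null
before=$(md5sum testfile | cut -d' ' -f1)

# Split into 256 KiB chunks, last chunk first, truncating as we go
# so the original file shrinks instead of being duplicated.
bs=$((256 * 1024))
size=$(stat -c '%s' testfile)
chunks=$(( (size + bs - 1) / bs ))
while [ "$chunks" -gt 0 ]; do
    chunks=$((chunks - 1))
    dd if=testfile of="part_$chunks" bs=$bs skip=$chunks 2>/dev/null
    truncate -s $((bs * chunks)) testfile
done

# Reassemble (glob order is fine for single-digit part counts) and verify.
cat part_* > restored
after=$(md5sum restored | cut -d' ' -f1)
[ "$before" = "$after" ] && echo "checksums match"
```

At no point does more than one extra chunk's worth of data exist on disk, which is what makes this workable with only ~5GB of headroom.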
You can use tail and truncate in a shell script to split a file in place while consuming the original. The file is split backwards so that each completed chunk can be cut off the end with truncate. Here is a sample Bash script:
#!/bin/bash
if [ -z "$2" ]; then
    echo "Usage: insplit.sh <splitsize> <filename>"
    exit 1
fi

FILE="$2"
SPLITSIZE="$1"
FILESIZE=$(stat -c '%s' "$FILE")
BLOCKCOUNT=$(( (FILESIZE + SPLITSIZE - 1) / SPLITSIZE ))  # round up
echo "Split count: $BLOCKCOUNT"
BLOCKCOUNT=$((BLOCKCOUNT - 1))

# Work backwards: copy the tail of the file into the last chunk,
# then truncate the original so the next pass sees a shorter file.
while [ "$BLOCKCOUNT" -ge 0 ]; do
    FNAME="$FILE.$BLOCKCOUNT"
    echo "writing $FNAME"
    OFFSET=$((BLOCKCOUNT * SPLITSIZE))
    BLOCKSIZE=$((FILESIZE - OFFSET))
    tail -c "$BLOCKSIZE" "$FILE" > "$FNAME"
    truncate -s "$OFFSET" "$FILE"
    FILESIZE=$((FILESIZE - BLOCKSIZE))
    BLOCKCOUNT=$((BLOCKCOUNT - 1))
done
I confirmed the results with a random file:
$ dd if=/dev/urandom of=largefile bs=512 count=1000
$ md5sum largefile
7ff913b62ef572265661a85f06417746 largefile
$ ./insplit.sh 200000 largefile
Split count: 3
writing largefile.2
writing largefile.1
writing largefile.0
$ cat largefile.0 largefile.1 largefile.2 | md5sum
7ff913b62ef572265661a85f06417746 -
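One caveat when putting the pieces back together: with ten or more parts, a plain `cat largefile.*` would read `largefile.10` before `largefile.2`, because globs sort lexically. A small self-contained sketch (the `f.N` file names and contents are hypothetical) that concatenates numeric suffixes in order instead:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

# Stand-in parts; a real split would produce largefile.0, largefile.1, ...
printf 'aaa' > f.0
printf 'bbb' > f.1
printf 'ccc' > f.2

# Walk the numeric suffixes in order until one is missing.
i=0
while [ -f "f.$i" ]; do
    cat "f.$i"
    i=$((i + 1))
done > restored
# restored now contains "aaabbbccc"
```

Note that reassembling this way needs enough free space for a full second copy; the reverse of the in-place split (append a part, delete it, repeat) avoids that.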