I am trying to generate a large CSV file with random content in bash. My machine has 6 cores and 12 GB of RAM, but my script (see below) takes 140 seconds for only 10k lines with 3 columns. Is there any way to optimize this script?
Are there considerably faster ways of generating random CSV files in other languages?
#!/bin/bash
csv="foo\tbar\tbaz"
start=$(date)
for i in $(seq 1 "$1"); do
    rand=$((i * RANDOM))
    str0="$$$i"                      # PID concatenated with the loop index
    str1=$(echo "$str0" | md5sum)    # one md5sum process spawned per iteration
    randstring1="${str1:2:8}"
    randstring2="${str1:0:2}"
    csv="$csv\n$randstring1\t$randstring2\t$rand"
done
end=$(date)
datediff=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))
echo -e "$csv" > my_csv.csv
echo "script took $datediff seconds for $(wc -l my_csv.csv) lines"
To replace this script fairly precisely (format-wise), you could use
hexdump -v -e '5/1 "%02x""\n"' /dev/urandom |
awk -v OFS='\t' '
NR == 1 { print "foo", "bar", "baz" }
{ print substr($0, 1, 8), substr($0, 9, 2), int(NR * 32768 * rand()) }' |
head -n "$1" > my_csv.csv
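The first few lines of the resulting my_csv.csv will look something like this (illustrative values only, since the data is random; the columns are tab-separated):

foo         bar    baz
3fb4a2c7    91     18204
77c01de9    5a     40981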
This falls into three parts:
hexdump -v -e '5/1 "%02x""\n"' /dev/urandom
extracts sequences of five bytes from /dev/urandom and formats them as hexadecimal strings,
awk -v OFS='\t' '
NR == 1 { print "foo", "bar", "baz" }
{ print substr($0, 1, 8), substr($0, 9, 2), int(NR * 32768 * rand()) }'
formats the lines appropriately, adding a header line and a field that is the equivalent of $(($i * $RANDOM)) (a worked sample follows this breakdown), and
head -n "$1"
takes the first $1 lines of this. When head quits, the pipe to awk is closed, awk quits, the pipe to hexdump is closed, and hexdump quits, so the whole pipeline ends at the right time.
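You can watch the same mechanism in a simpler pipeline: yes would run forever on its own, but it exits as soon as head closes the pipe, just as hexdump does above:

$ yes | head -n 3
y
y
y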
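As a worked sample of the middle stage, here is one hand-made input line pushed through the data-row rule of the awk program (the hex value is made up, and the last column depends on your awk's rand() seed, so it will vary):

$ echo 3fb4a2c791 | awk -v OFS='\t' '{ print substr($0, 1, 8), substr($0, 9, 2), int(NR * 32768 * rand()) }'
3fb4a2c7        91      829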
On my machine (a Haswell i5), running this takes 0.83 seconds for a million lines.
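To try it yourself, you could save the pipeline as a script, say fast_csv.sh (the name is arbitrary), and time it; head -n "$1" guarantees the output has exactly the requested number of lines, header included:

$ time bash fast_csv.sh 1000000
$ wc -l my_csv.csv
1000000 my_csv.csv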