
generate large csv with random content in bash

Tags: bash, unix, csv

I am trying to generate a large CSV file with random content in bash. My machine has 6 cores and 12 GB of RAM, but my script (see below) takes 140 seconds to produce only 10k lines with 3 columns. Is there any way to optimize this script?

Are there considerably faster ways of generating random CSV files in other languages?

#!/bin/bash

csv="foo\tbar\tbaz"
start=$(date)
for i in `seq 1 $1`
  do rand=$(($i * $RANDOM))
  str0="$$"$i
  str1=$( echo "$str0" | md5sum )
  randstring1="${str1:2:8}"
  randstring2="${str1:0:2}"
  csv="$csv\n$randstring1\t$randstring2\t$rand"
done
end=$(date)
datediff=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s)))
echo -e $csv > my_csv.csv
echo "script took $datediff seconds for $(wc -l my_csv.csv) lines"
asked Dec 12 '22 by jvdh

1 Answer

To replace this script fairly precisely (format-wise), you could use:

hexdump -v -e '5/1 "%02x""\n"' /dev/urandom |
  awk -v OFS='\t' '
    NR == 1 { print "foo", "bar", "baz" }
    { print substr($0, 1, 8), substr($0, 9, 2), int(NR * 32768 * rand()) }' |
  head -n "$1" > my_csv.csv
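As a quick sanity check (a sketch, assuming hexdump is available, e.g. from util-linux or bsdmainutils; the /tmp/sample.csv path is just an illustrative choice), you can generate a few lines and confirm that every row has three tab-separated fields:

```shell
# Generate 5 lines (header plus 4 data rows) into a scratch file.
hexdump -v -e '5/1 "%02x""\n"' /dev/urandom |
  awk -v OFS='\t' '
    NR == 1 { print "foo", "bar", "baz" }
    { print substr($0, 1, 8), substr($0, 9, 2), int(NR * 32768 * rand()) }' |
  head -n 5 > /tmp/sample.csv

wc -l < /tmp/sample.csv              # prints 5 on GNU coreutils
# Exits 0 only if every line has exactly 3 tab-separated fields.
awk -F'\t' 'NF != 3 { exit 1 }' /tmp/sample.csv && echo "all rows have 3 columns"
```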

This falls into three parts:

hexdump -v -e '5/1 "%02x""\n"' /dev/urandom

reads five-byte sequences from /dev/urandom and formats them as ten-digit hexadecimal strings, one per line.
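To see the output format without the randomness, you can feed hexdump a known ten-byte string in place of /dev/urandom (purely an illustrative stand-in, assuming the same hexdump as above):

```shell
# Ten known bytes become two lines of five bytes each, printed as hex pairs.
printf 'ABCDEFGHIJ' | hexdump -v -e '5/1 "%02x""\n"'
# 4142434445
# 464748494a
```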

awk -v OFS='\t' '
    NR == 1 { print "foo", "bar", "baz" }
    { print substr($0, 1, 8), substr($0, 9, 2), int(NR * 32768 * rand()) }'

formats the lines appropriately, prepends a header line, and adds a third field equivalent to $(($i * $RANDOM)) (bash's $RANDOM is uniform over 0–32767, hence the factor of 32768).
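The slicing itself is easy to verify in isolation on a fixed input line (the rand() column is left out here, since its value is not reproducible):

```shell
# Columns 1-8 become the first field, columns 9-10 the second.
printf '4142434445\n' | awk -v OFS='\t' '{ print substr($0, 1, 8), substr($0, 9, 2) }'
# 41424344	45
```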

head -n "$1"

takes the first $1 lines of this. When head exits, awk's next write to the now-closed pipe raises SIGPIPE, so awk quits; the same then happens to hexdump. This is what makes the otherwise endless pipeline stop at the right time.
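The same early-termination behaviour can be seen with any infinite producer; here yes stands in for the /dev/urandom pipeline:

```shell
# yes would print "hello" forever, but head exits after 3 lines;
# yes then receives SIGPIPE on its next write and terminates too,
# so the command finishes instantly.
yes hello | head -n 3
# hello
# hello
# hello
```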

On my machine (a Haswell i5), running this takes 0.83 seconds for a million lines.

answered Jan 07 '23 by Wintermute