We can use Python's random module to get a random line from a file. First read the file and split it into lines with the read() and splitlines() functions, then pass the resulting list to random.choice() to pick a random line.
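A minimal sketch of that approach might look like this (words.txt is just a placeholder filename):

import random

# Read the whole file into a list of lines.
# Simple, but it loads the entire file into memory.
with open("words.txt", "rt") as f:
    lines = f.read().splitlines()

# Pick one line at random.
print(random.choice(lines))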
The shuf command writes a random permutation of its input lines to standard output. Given a file (or several files), it shuffles the lines and prints the result; with the -n option it limits the number of output lines, which makes it a convenient way to pick random lines from a file or from a list of data.
The wc command reports the number of lines, words, characters, and bytes in a file. To get only the line count, add the -l option; wc then prints the total number of lines followed by the file name.
Use shuf with the -n option, as shown below, to get N random lines:
shuf -n N input > output
Sort the file randomly and pick the first 100 lines:
lines=100
input_file=/usr/share/dict/words
# This is the basic selection method
<$input_file sort -R | head -n $lines
# If the file has duplicates that must never cause duplicate results
<$input_file sort | uniq | sort -R | head -n $lines
# If the file has blank lines that must be filtered, use sed
<$input_file sed $'/^[ \t]*$/d' | sort -R | head -n $lines
Of course <$input_file can be replaced with any piped standard input. This (sort -R, and $'...\t...' to get sed to match tab characters) works on GNU/Linux and BSD/macOS.
Well, according to a comment on the shuf answer, he shuffled 78 000 000 000 lines in under a minute.
Challenge accepted...
EDIT: I beat my own record
$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 0.02s user 0.01s system 80% cpu 0.047 total
The reason it is so fast is that I don't read the whole file at all: I just move the file pointer to 10 random positions and print the line after each one.
Gitlab Repo
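The trick looks roughly like the sketch below. This is not the actual powershuf.py code, just an illustration of the seek-and-read idea; note that it picks each line with a probability proportional to the length of the line before it, so it is only approximately uniform unless all lines have similar lengths:

#!/bin/python3
import os
import random

path = "lines_78000000000.txt"
size = os.path.getsize(path)

with open(path, "rb") as f:
    for _ in range(10):
        # Jump to a random byte offset in the file.
        f.seek(random.randrange(size))
        # Skip the rest of the (probably partial) line we landed in...
        f.readline()
        # ...and print the next complete line; wrap around to the
        # first line if we happened to land inside the last one.
        line = f.readline()
        if not line:
            f.seek(0)
            line = f.readline()
        print(line.decode(), end="")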
First I needed a file of 78,000,000,000 lines:
seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt
This gives me a file with 78 billion newlines ;-)
Now for the shuf part:
$ time shuf -n 10 lines_78000000000.txt
shuf -n 10 lines_78000000000.txt 2171.20s user 22.17s system 99% cpu 36:35.80 total
The bottleneck was the CPU and the lack of multithreading: shuf pinned one core at 100% while the other 15 sat idle.
Python is what I regularly use, so that's what I'll use to make this faster:
#!/bin/python3
import random

# First pass: count the lines by reading the file in 64 KiB chunks.
f = open("lines_78000000000.txt", "rt")
count = 0
while 1:
    buffer = f.read(65536)
    if not buffer: break
    count += buffer.count('\n')

# Second pass: pick 10 random line numbers and print those lines.
targets = set(random.sample(range(count), 10))
f.seek(0)
for lineno, line in enumerate(f):
    if lineno in targets:
        print(line, end="")
        targets.remove(lineno)
        if not targets: break
f.close()
This got me just under a minute:
$ time ./shuf.py
./shuf.py 42.57s user 16.19s system 98% cpu 59.752 total
I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe SSD, which gives me plenty of read and write speed.
I know it can get faster but I'll leave some room to give others a try.
Line counter source: Luther Blissett
My preferred option is very fast. I sampled a tab-delimited data file with 13 columns, 23.1M rows, and 2.0 GB uncompressed.
# randomly sample select 5% of lines in file
# including header row, exclude blank lines, new seed
time \
awk 'BEGIN {srand()}
!/^$/ { if (rand() <= .05 || FNR==1) print > "data-sample.txt"}' data.txt
# awk tsv004 3.76s user 1.46s system 91% cpu 5.716 total