Select random lines from a file

People also ask

How do I read a random line from a file?

We can use Python's random module to get a random line from a file. First read all of the lines with read() and splitlines(), then pass the resulting list to random.choice() to pick one line at random.
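
A minimal sketch of that approach, assuming the file fits in memory. The function name random_line and its path argument are made up for illustration:

```python
import random


def random_line(path):
    """Return one line chosen uniformly at random from the file at `path`."""
    # read() loads the whole file into memory; splitlines() drops the newlines.
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    if not lines:
        raise ValueError("file has no lines")
    return random.choice(lines)
```

Calling random_line("/usr/share/dict/words") would return a single word on systems where that file exists.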

What was the command in the Unix terminal to get a random sample of a file?

The shuf command writes a random permutation of its input lines to standard output. Given a file (or standard input), it shuffles the lines and prints the result. With the -n option it limits the number of lines returned, which makes it suitable for selecting random lines from a file or random items from a list.

How do I count the number of lines in a file in bash?

The wc command is used to find the number of lines, characters, words, and bytes of a file. To find the number of lines using wc, we add the -l option. This will give us the total number of lines and the name of the file.

Use shuf with the -n option, as shown below, to get N random lines:

shuf -n N input > output
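
If shuf is not available, the same kind of result can be sketched in Python with reservoir sampling, which keeps only N lines in memory while reading the input once. This is an illustration, not a description of how shuf works internally; the function name sample_lines and its parameters are made up for this example:

```python
import random


def sample_lines(lines, n, rng=random):
    """Return up to `n` lines drawn uniformly, without replacement, from an iterable."""
    reservoir = []
    for index, line in enumerate(lines):
        if index < n:
            # Fill the reservoir with the first n lines.
            reservoir.append(line)
        else:
            # Keep this line with probability n / (index + 1),
            # replacing a randomly chosen earlier pick.
            slot = rng.randint(0, index)
            if slot < n:
                reservoir[slot] = line
    rng.shuffle(reservoir)  # return the picks in random order
    return reservoir
```

For example, with open("input") as f: print("".join(sample_lines(f, 10)), end="") would print 10 random lines of a file named input.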

Sort the file randomly and pick the first 100 lines:

lines=100
input_file=/usr/share/dict/words

# This is the basic selection method
<"$input_file" sort -R | head -n "$lines"

# If the file has duplicates that must never cause duplicate results
<"$input_file" sort | uniq        | sort -R | head -n "$lines"

# If the file has blank lines that must be filtered, use sed
<"$input_file" sed $'/^[ \t]*$/d' | sort -R | head -n "$lines"

Of course, <$input_file can be replaced with any piped standard input. Both sort -R and the $'...\t...' quoting, which lets sed match tab characters, work on GNU/Linux and BSD/macOS.
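
For comparison, here is a rough Python sketch that combines the duplicate and blank-line filters from the pipelines above in one step. It loads all unique lines into memory, so it is only an assumption-laden illustration for inputs that fit in RAM; the name pick_unique_nonblank is made up:

```python
import random


def pick_unique_nonblank(lines, n):
    """Drop blank lines and duplicates, then return up to `n` lines in random order."""
    # Lines made only of spaces or tabs count as blank, like the sed filter.
    nonblank = (line for line in lines if line.strip(" \t\r\n"))
    # dict.fromkeys keeps the first occurrence of each line and drops repeats.
    unique = list(dict.fromkeys(nonblank))
    random.shuffle(unique)
    return unique[:n]
```

Reading from sys.stdin instead of a file gives the "any piped standard input" behaviour, since the function accepts any iterable of lines.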

Well, according to a comment on the shuf answer, the commenter shuffled 78 000 000 000 lines in under a minute.

Challenge accepted...

EDIT: I beat my own record

powershuf did it in 0.047 seconds

$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null  0.02s user 0.01s system 80% cpu 0.047 total

It is so fast because I don't read the whole file: I just move the file pointer 10 times and print the line after the pointer each time.
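
The powershuf source is not reproduced here, so the following is only a guess at the idea, written as a minimal sketch: seek to a random byte offset, skip the rest of the line the pointer landed in, and return the next line. The function name lines_after_random_offsets is made up. Note the trade-off: lines that follow long lines are picked more often, and the same line can come back twice, so this is not a uniform sample the way shuf -n is.

```python
import os
import random


def lines_after_random_offsets(path, n):
    """Seek to `n` random byte offsets and return the line that follows each one."""
    size = os.path.getsize(path)
    picks = []
    if size == 0:
        return picks
    with open(path, "rb") as f:
        for _ in range(n):
            f.seek(random.randrange(size))
            f.readline()  # discard the rest of the line we landed in
            line = f.readline()
            if not line:
                # We landed in the last line, so wrap around to the first one.
                f.seek(0)
                line = f.readline()
            picks.append(line)
    return picks
```

Because only n seeks and 2n short reads happen, the running time does not depend on the file size, which matches the timing shown above in spirit.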

Gitlab Repo

Old attempt

First I needed a file of 78.000.000.000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a file with 78 billion newlines ;-)

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt










shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

The bottleneck was the CPU: shuf does not use multiple threads, so it pinned 1 core at 100% while the other 15 sat idle.

Python is what I regularly use, so that's what I'll use to make this faster:

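#!/usr/bin/env python3
import random

FILENAME = "lines_78000000000.txt"

# Pass 1: count the newlines in 64 KiB chunks.
count = 0
with open(FILENAME, "rt") as f:
    while True:
        buffer = f.read(65536)
        if not buffer:
            break
        count += buffer.count("\n")

# Pass 2: print 10 randomly chosen lines (fewer if the file is shorter).
wanted = set(random.sample(range(count), min(10, count)))
with open(FILENAME, "rt") as f:
    for number, line in enumerate(f):
        if number in wanted:
            print(line, end="")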

This got me just under a minute:

$ time ./shuf.py         










./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total

I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe drive, which gives me plenty of read and write speed.

I know it can get faster but I'll leave some room to give others a try.

Line counter source: Luther Blissett

My preferred option is very fast. I sampled a tab-delimited data file with 13 columns and 23.1M rows (2.0 GB uncompressed).

# randomly select about 5% of the lines in the file,
# always including the header row and excluding blank lines;
# srand() picks a new seed on each run

time \
awk 'BEGIN  {srand()} 
     !/^$/  { if (rand() <= .05 || FNR==1) print > "data-sample.txt"}' data.txt

# awk  tsv004  3.76s user 1.46s system 91% cpu 5.716 total
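
The same per-line coin flip can be sketched in Python. This is an illustrative equivalent, not a faster replacement for awk; the name sample_fraction and its parameters are made up. As with the awk version, the output size is only approximately 5% of the input, because each line is kept or dropped independently.

```python
import random


def sample_fraction(lines, fraction=0.05, rng=random):
    """Yield the first line plus each non-blank line with probability `fraction`."""
    for number, line in enumerate(lines, start=1):
        if line.rstrip("\n") == "":
            continue  # skip empty lines, like !/^$/ in the awk script
        # number == 1 mirrors FNR==1: the header row is always kept.
        if number == 1 or rng.random() <= fraction:
            yield line
```

For example, with open("data.txt") as src, open("data-sample.txt", "w") as dst: dst.writelines(sample_fraction(src)) writes the sample to a new file.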


Tags: bash, shell, random, text-processing
