
Randomly sampling lines from a file

Tags:

python

bash

sed

I have a CSV file which is ~40 GB and ~1,800,000 lines long.

I want to randomly sample 10,000 lines and print them to a new file.

Right now, my approach is to use sed as:

sed -n "$vars" < input.txt > output.txt

Where $vars is a randomly generated list of line numbers (e.g. 1p;14p;1700p;...;10203p).

While this works, it takes about 5 minutes per execution. It's not a huge time, but I was wondering if anybody had ideas on how to make it quicker?

Asked by MrD, Jan 01 '18

People also ask

How do I read a random line from a file?

We can use the Python random module to get a random line from a file. First, read all of the lines with read() and splitlines(). Then use random.choice() to pick a random line from the resulting list.
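For a file small enough to fit in memory, that approach can be sketched as follows (the demo file and its contents are illustrative, not from the answer):

```python
import random
import tempfile

# Write a tiny demo file; any text file small enough to fit in RAM works.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('alpha\nbeta\ngamma\n')
    path = f.name

# Read every line into a list, then pick one uniformly at random
with open(path) as f:
    lines = f.read().splitlines()
random_line = random.choice(lines)
```

Note that this reads the entire file into memory, so it is not suitable for the ~40 GB file in the question.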

How do I randomly select rows in Linux?

Using sed: in the snippet rnd=$(( 1 + $RANDOM % $(wc -l < example_file.txt) )), we choose a random number in the range from 1 to the number of lines in the file.

What was the command in the Unix terminal to get a random sample of a file?

The shuf command generates random permutations of its input lines on standard output. Given a file or series of files, it shuffles the lines and writes the result to standard output. It can also limit the number of lines returned, which supports selecting random lines from a file or items from a list.

How do you use shuf?

Ways of using the shuf command: suppose file.txt contains 6 lines; then shuf file.txt displays those lines in random order. The number of output lines can be limited with the -n option: shuf -n 2 file.txt displays two random lines from the file.


1 Answer

The biggest advantage to having lines of the same length is that you don't need to find newlines to know where each line starts. With a file size of ~40GB containing ~1.8M lines, the average line length is ~20KB. If you sample 10K lines, the average gap between them is ~4MB. That is roughly three orders of magnitude larger than a typical block on your disk, so seeking to the next read location is much more efficient than reading every byte in the file.
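Spelled out, the arithmetic behind those estimates looks like this (the 4KB disk-block size is a typical value assumed here, not stated in the answer):

```python
file_size = 40e9        # ~40 GB
line_count = 1.8e6      # ~1.8M lines
sample_count = 1e4      # 10K sampled lines
block_size = 4096       # typical disk block size (assumption)

line_size = file_size / line_count     # ~22 KB per line
gap = file_size / sample_count         # ~4 MB between sampled lines on average
ratio = gap / block_size               # ~1000, i.e. ~3 orders of magnitude
```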

Seeking will also work with files that have unequal line lengths (e.g., non-ASCII characters in UTF-8 encoding), but it requires minor modifications to the method: seek to an estimated location, then scan forward to the start of the next line. This is still quite efficient, because you skip ~4MB for every ~20KB you read. Your sampling uniformity will be compromised slightly, since you select byte locations instead of line locations, and you won't know for sure which line number you are reading.

You can implement your solution directly with the Python code that generates your line numbers. Here is a sample of how to deal with lines that all contain the same number of bytes (typically pure-ASCII content):

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000

file_size = getsize(file_name)
with open(file_name) as file:
    # Read the first line to get the length
    file.readline()
    line_size = file.tell()
    # You don't have to seek(0) here: if line #0 is selected,
    # the seek will happen regardless later.

    # Assuming you are 100% sure all lines are equal, this might
    # discard the last line if it doesn't have a trailing newline.
    # If that bothers you, use `round(file_size / line_size)`
    line_count = file_size // line_size
    # This is just a trivial example of how to generate the line numbers.
    # If it doesn't work for you, just use the method you already have.
    # By the way, this will just error out (ValueError) if you try to
    # select more lines than there are in the file, which is ideal
    selection_indices = random.sample(range(line_count), selection_count)
    selection_indices.sort()

    # Now skip to each line before reading it:
    prev_index = 0
    for line_index in selection_indices:
        # Conveniently, the default seek offset is the start of the file,
        # not from current position
        if line_index != prev_index + 1:
            file.seek(line_index * line_size)
        print('Line #{}: {}'.format(line_index, file.readline()), end='')
        # Small optimization to avoid seeking consecutive lines.
        # Might be unnecessary since seek probably already does
        # something like that for you
        prev_index = line_index
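Since the question asks for the sample to be written to a new file rather than printed, the same seek-and-read loop can write each selected line out directly. Here is a self-contained sketch of that idea (the demo file, its 9-byte lines, and the output path are all illustrative, not the author's code):

```python
import random
import tempfile

# Build a small demo file of fixed-width lines (9 bytes each,
# including the newline); any equal-line-length file works the same way.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    for i in range(1000):
        f.write('row{:05d}\n'.format(i))
    path = f.name

line_size = 9        # bytes per line, including the newline
sample_size = 10

indices = sorted(random.sample(range(1000), sample_size))
with open(path) as src, open(path + '.sample', 'w') as dst:
    for idx in indices:
        src.seek(idx * line_size)    # jump straight to the selected line
        dst.write(src.readline())
```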

If you are willing to sacrifice a (very) small amount of uniformity in the distribution of line numbers, you can easily apply a similar technique to files with unequal line lengths: generate random byte offsets, then skip to the start of the next full line after each offset. The implementation below assumes you know for a fact that no line is longer than 40KB. You would need this variant if your CSV contained non-ASCII characters encoded in UTF-8, because even if all lines contained the same number of characters, they would contain different numbers of bytes. In that case you must open the file in binary mode, since otherwise you might run into decoding errors when a random seek lands mid-character:

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000
# An upper bound on the line size in bytes, not chars
# This serves two purposes:
#   1. It determines the margin to use from the end of the file
#   2. It determines the closest two offsets are allowed to be and
#      still be 100% guaranteed to be in different lines
max_line_bytes = 40000

file_size = getsize(file_name)
# make_offsets is a function that returns `selection_count` monotonically
# increasing unique samples, at least `max_line_bytes` apart from each
# other, in the range [0, file_size - margin). Implementation not provided.
selection_offsets = make_offsets(selection_count, file_size, max_line_bytes)
with open(file_name, 'rb') as file:
    for offset in selection_offsets:
        # Skip to each offset
        file.seek(offset)
        # Readout to the next full line
        file.readline()
        # Print the next line. You don't know the number.
        # You also have to decode it yourself.
        print(file.readline().decode('utf-8'), end='')

All code here is Python 3.
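The answer leaves make_offsets unimplemented. One possible sketch (my own, not the author's): draw unique samples from a range compressed by the required gaps and the end-of-file margin, then re-space them so consecutive offsets are more than max_line_bytes apart:

```python
import random

def make_offsets(count, file_size, min_gap):
    # Sample `count` unique points in a range shrunk by the gaps and the
    # end-of-file margin, then re-expand: adding i * min_gap to the i-th
    # sorted pick guarantees consecutive offsets differ by more than min_gap.
    usable = file_size - min_gap               # margin at the end of the file
    compressed = usable - (count - 1) * min_gap
    if compressed < count:
        raise ValueError('file too small for this many samples')
    picks = sorted(random.sample(range(compressed), count))
    return [p + i * min_gap for i, p in enumerate(picks)]
```

Every returned offset is in [0, file_size - min_gap), so there is always at least one full line after each seek point.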

Answered by Mad Physicist, Sep 28 '22