I have a csv file which is ~40gb and 1800000 lines.
I want to randomly sample 10,000 lines and print them to a new file.
Right now, my approach is to use sed as:
(sed -n '$vars' < input.txt) > output.txt
Where $vars is a randomly generated list of lines. (Eg: 1p;14p;1700p;...;10203p)
While this works, it takes about 5 minutes per execution. It's not a huge time, but I was wondering if anybody had ideas on how to make it quicker?
We can use the Python random module to help us get a random line from a file. To get all of the lines in a file, first use the read() and splitlines() functions. Then, you can use the random. choice() function to get a random line from the file.
Using sed In the section rnd=$(( 1 + $RANDOM % $(wc -l < example_file. txt) )), we choose a random number from within range of 1 to the number of lines in the file.
The shuf command generates random permutations from input lines to standard output. If given a file or series of files it will shuffle the lines and write the result to standard output. It can also limit the number of results returned supporting selecting random lines from a file or data from a list.
Ways of using shuf command: Suppose file. txt contains 6 lines, then the shuf command displays the input lines in random order as output. Any number of lines can be randomized by using -n option. This will display any two random lines from the file.
The biggest advantage to having lines of the same length is that you don't need to find newlines to know where each line starts. With a file size of ~40GB containing ~1.8M lines, you have a line length of ~20KB/line. If you want to sample 10K lines, you have ~40MB between lines. This is almost certainly around three orders of magnitude larger than the size of a block on your disk. Therefore, seeking to the next read location is much much more efficient than reading every byte in the file.
Seeking will work with files that have unequal line lenghs (e.g., non-ascii characters in UTF-8 encoding), but will require minor modifications to the method. If you have unequal lines, you can seek to an estimated location, then scan to the start of the next line. This is still quite efficient because you will be skipping ~40MB for every ~20KB you need to read. Your sampling uniformity will be compromised slightly since you will select byte locations instead of line locations, and you won't know which line number you are reading for sure.
You can implement your solution directly with the Python code that generates your line numbers. Here is a sample of how to deal with lines that all have the same number of bytes (usually ascii encoding):
import random
from os.path import getsize
# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000
file_size = getsize(file_name)
with open(file_name) as file:
    # Read the first line to get the length
    file.readline()
    line_size = file.tell()
    # You don't have to seek(0) here: if line #0 is selected,
    # the seek will happen regardless later.
    # Assuming you are 100% sure all lines are equal, this might
    # discard the last line if it doesn't have a trailing newline.
    # If that bothers you, use `math.round(file_size / line_size)`
    line_count = file_size // line_size
    # This is just a trivial example of how to generate the line numbers.
    # If it doesn't work for you, just use the method you already have.
    # By the way, this will just error out (ValueError) if you try to
    # select more lines than there are in the file, which is ideal
    selection_indices = random.sample(range(line_count), selection_count)
    selection_indices.sort()
    # Now skip to each line before reading it:
    prev_index = 0
    for line_index in selection_indices:
        # Conveniently, the default seek offset is the start of the file,
        # not from current position
        if line_index != prev_index + 1:
            file.seek(line_index * line_size)
        print('Line #{}: {}'.format(line_index, file.readline()), end='')
        # Small optimization to avoid seeking consecutive lines.
        # Might be unnecessary since seek probably already does
        # something like that for you
        prev_index = line_index
If you are willing to sacrifice a (very) small amount of uniformity in the distribution of line numbers, you can easily apply a similar technique to files with unequal line lengths. You just generate random byte offsets, and skip to the next full line after the offset. In the following implementation, it is assumed that you know for a fact that no line is longer than 40KB in length. You would have to do something like this if your CSV had non-ascii unicode characters encoded in UTF-8, because even if the lines all contained the same number of characters, they would contain different numbers of bytes. In this case, you would have to open the file in binary mode, since otherwise you might run into decoding errors when you skip to a random byte, if that byte happens to be mid-character:
import random
from os.path import getsize
# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000
# An upper bound on the line size in bytes, not chars
# This serves two purposes:
#   1. It determines the margin to use from the end of the file
#   2. It determines the closest two offsets are allowed to be and
#      still be 100% guaranteed to be in different lines
max_line_bytes = 40000
file_size = getsize(file_name)
# make_offset is a function that returns `selection_count` monotonically
# increasing unique samples, at least `max_line_bytes` apart from each
# other, in the range [0, file_size - margin). Implementation not provided.
selection_offsets = make_offsets(selection_count, file_size, max_line_bytes)
with open(file_name, 'rb') as file:
    for offset in selection_offsets:
        # Skip to each offset
        file.seek(offset)
        # Readout to the next full line
        file.readline()
        # Print the next line. You don't know the number.
        # You also have to decode it yourself.
        print(file.readline().decode('utf-8'), end='')
All code here is Python 3.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With