Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python random lines from subfolders

I have many tasks in .txt files in multiple sub folders. I am trying to pick up a total 10 tasks randomly from these folders, their contained files and finally a text line within a file. The selected line should be deleted or marked so it will be not picked in the next execution. This may be too broad a question but I'd appreciate any input or direction.

Here's the code I have so far:

#!/usr/bin/python  
import random   
with open('C:\\Tasks\\file.txt') as f:  
    lines = random.sample(f.readlines(),10)    
print(lines)
like image 977
user1582596 Avatar asked Aug 26 '12 09:08

user1582596


2 Answers

Here's a simple solution that makes just one pass through the files per sample. If you know exactly how many items you will be sampling from the files, it is probably optimal.

First off is the sample function. This uses the same algorithm that @NedBatchelder linked to in a comment on an earlier answer (though the Perl code shown there only selected a single line, rather than several). It selects values from of an iterable of lines, and only requires the currently selected lines to be kept in memory at any given time (plus the next candidate line). It raises a ValueError if the iterable has fewer values than the requested sample size.

import random

def random_sample(n, items):
    results = []

    for i, v in enumerate(items):
        r = random.randint(0, i)
        if r < n:
            if i < n:
                results.insert(r, v) # add first n items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    return results

edit: In another question, user @DzinX noticed that the use of insert in this code makes the performance bad (O(N^2)) if you're sampling a very large number of values. His improved version which avoids that issue is here. /edit

Now we just need to make a suitable iterable of items for our function to sample from. Here's how I'd do it using a generator. This code will only keep one file open at a time, and it does not need more than one line in memory at a time. The optional exclude parameter, if present, should be a set containing lines that have been selected on a previous run (and so should not be yielded again).

import os

def lines_generator(base_folder, exclude = None):
    for dirpath, dirs, files in os.walk(base_folder):
        for filename in files:
            if filename.endswith(".txt"):
                fullPath = os.path.join(dirpath, filename)
                with open(fullPath) as f:
                     for line in f:
                         cleanLine = line.strip()
                         if exclude is None or cleanLine not in exclude:
                             yield cleanLine

Now, we just need a wrapper function to tie those two pieces together (and manage a set of seen lines). It can return a single sample of size n or a list of count samples, taking advantage of the fact that a slice from a random sample is also a random sample.

_seen = set()

def get_sample(n, count = None):
    base_folder = r"C:\Tasks"
    if count is None:
        sample = random_sample(n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return sample
    else:
        sample = random_sample(count * n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return [sample[i * n:(i + 1) * n] for i in range(count)]

Here's how it can be used:

def main():
    s1 = get_sample(10)
    print("Sample1:", *s1, sep="\n")

    s2, s3 = get_sample(10,2) # get two samples with only one read of the files
    print("\nSample2:", *s2, sep="\n")
    print("\nSample3:", *s3, sep="\n")

    s4 = get_sample(5000) # this will probably raise a ValueError!
like image 57
Blckknght Avatar answered Nov 03 '22 18:11

Blckknght


To get a proper random distribution across all these files, you'd need to view them as one big set of lines and pick 10 at random. In other words, you'll have to read all these files at least once to at least figure out how many lines you have.

You do not need to hold all the lines in memory however. You'd have to do this in two phases: index your files to count the number of lines in each, then pick 10 random lines to be read from these files.

First indexing:

import os

root_path = r'C:\Tasks\\'
total_lines = 0
file_indices = dict()

# Based on https://stackoverflow.com/q/845058, bufcount function
def linecount(filename, buf_size=1024*1024):
    with open(filename) as f:
        return sum(buf.count('\n') for buf in iter(lambda: f.read(buf_size), ''))

for dirpath, dirnames, filenames in os.walk(root_path):
    for filename in filenames:
         if not filename.endswith('.txt'):
             continue
         path = os.path.join(dirpath, filename)
         file_indices[total_lines] = path
         total_lines += linecount(path)

offsets = list(file_indices.keys())
offsets.sort()

Now we have a mapping of offsets, pointing to filenames, and a total line count. Now we pick ten random indices, and read these from your files:

import random
import bisect

tasks = list(range(total_lines))
task_indices = random.sample(tasks, 10)

for index in task_indices:
     # find the closest file index
     file_index = offsets[bisect.bisect(offsets, index) - 1]
     path = file_indices[file_index]
     curr_line = file_index
     with open(path) as f:
         while curr_line <= index:
             task = f.readline()
             curr_line += 1
     print(task)
     tasks.remove(index)

Note that you only need the indexing once; you can store the result somewhere and only update it when your files update.

Also note that your tasks are now 'stored' in the tasks list; these are indices to lines in your files, and I remove the index from that variable when printing the task selected. Next time you run the random.sample() choices, the tasks previously picked will no longer be available for picking the next time. This structure will need updating if your files ever do change, as the indexes have to be re-calculated. The file_indices will help you with that task, but that is outside the scope of this answer. :-)

If you need only one 10-item sample, use Blckknght's solution instead, as it only will go through the files once, while mine require 10 extra file openings. If you need multiple samples, this solution only requires 10 extra file openings every time you need your sample, it won't scan through all the files again. If you have fewer than 10 files, still use Blckknght's answer. :-)

like image 4
Martijn Pieters Avatar answered Nov 03 '22 18:11

Martijn Pieters