
append items from shuffled list to a new list

For a text classification project (age) I'm making a subset of my data. I've made 3 lists of filenames, grouped by age. I want to shuffle these lists and then append 5000 filenames from each shuffled list to a new list. The result should be a data subset with 15000 files (5000 10s, 5000 20s, 5000 30s). Below you can see what I've written so far. But I know that random.shuffle returns None, and a NoneType object is not iterable. How can I solve this problem?

def seed():
   return 0.47231099848

teens = [list of files]
tweens = [list of files]
thirthies = [list of files]
data = []
for categorie in random.shuffle([teens, tweens, thirthies],seed):
    data.append(teens[:5000])
    data.append(tweens[:5000])
    data.append(thirthies[:5000])
Bambi asked Apr 23 '17 11:04



2 Answers

The first problem is that you are shuffling a list consisting of the 3 items [teens, tweens, thirthies] (even though each item is itself a list) instead of shuffling each sublist.

Second, you can use random.sample instead of random.shuffle:

for categ in [teens, tweens, thirthies]:
    data.append(random.sample(categ, 5000))

or, as @JonClements suggested in the comments, you can use a list comprehension, which builds one flat list of file names:

categories = [teens, tweens, thirthies]
data = [e for categ in categories for e in random.sample(categ, 5000)]
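
One caveat with random.sample: it raises ValueError if a list holds fewer items than you ask for. A minimal sketch of one way to guard against that, assuming you would rather take everything from a short list than fail (take is a hypothetical helper name, and short is a made-up list):

```python
import random

def take(flist, k):
    # Hypothetical helper: sample k items, or every item (in random order)
    # when the list is shorter than k.
    return random.sample(flist, min(k, len(flist)))

short = ["a.txt", "b.txt", "c.txt"]  # made-up file names

try:
    random.sample(short, 5000)
except ValueError as err:
    print("too few files:", err)

print(len(take(short, 5000)))  # 3
```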
Luchko answered Sep 28 '22 10:09


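You're correct that random.shuffle returns None. That's because it shuffles its list argument in-place, and it's a Python convention that functions which take a mutable arg and mutate it return None. However, you misunderstand the random arg to random.shuffle: it needs to be a zero-argument function that returns a new random float in [0.0, 1.0) on each call (like random.random), not a function like your seed that always returns the same number. Note that this arg was deprecated in Python 3.9 and removed in 3.11, so on current versions random.shuffle takes only the list.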
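
A quick way to see the in-place behaviour for yourself, using a small made-up list (names, original, and result are only illustrative names):

```python
import random

names = ["a.txt", "b.txt", "c.txt", "d.txt"]  # made-up file names
original = list(names)           # keep a copy to compare against

result = random.shuffle(names)   # shuffles names itself

print(result)                              # None: nothing to iterate over
print(sorted(names) == sorted(original))   # True: same items, possibly reordered
```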

BTW, you can seed the standard random number generator provided by the random module using its seed function. Historically random.seed accepted any hashable object as its argument, although it's customary to pass it a number or string; since Python 3.11 the seed must be None, an int, a float, a str, bytes, or a bytearray. You can also pass it None (which is equivalent here to not passing it an arg at all), and it will seed the randomiser with the system random source (if there isn't a system random source then the system time is used as the seed). If you don't explicitly call seed after importing the random module, that's equivalent to calling seed().

The benefit of supplying a seed is that each time you run the program with the same seed, the random numbers produced by the various random module functions will be exactly the same. This is very useful while developing and debugging your code: it can be hard to track down errors when the output keeps on changing. :)
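
Here's a small sketch of that reproducibility. The helper pick and the seed value 12345 are arbitrary choices for illustration:

```python
import random

def pick(seed_value):
    # Re-seed the module-level generator, then draw a sample.
    random.seed(seed_value)
    return random.sample(range(100), 5)

first = pick(12345)
second = pick(12345)

print(first == second)  # True: same seed, same sequence of choices
```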


There are two main ways to do what you want. You can shuffle the lists and then slice the first 5000 file names from them. Or you can use the random.sample function to take 5000 random samples. That way you don't need to shuffle the whole list.

import random

random.seed(0.47231099848)

# teens, tweens, thirties are lists of file names
file_lists = [teens, tweens, thirties]

# Shuffle
data = []
for flist in file_lists:
    random.shuffle(flist)
    data.append(flist[:5000])

Using sample

# Sample
data = []
for flist in file_lists:
    data.append(random.sample(flist, 5000))

I haven't performed speed tests on this code, but I suspect that sample will be faster, since it just needs to randomly select items rather than move all the list items. shuffle is fairly efficient, so you probably wouldn't notice much difference in the run time unless your teens, tweens, and thirties file lists each have a lot more than 5000 file names.
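
If you want to check that for yourself, the standard timeit module gives a rough comparison. This is only a sketch: files is a made-up list of 50000 names, by_shuffle and by_sample are illustrative helper names, and by_shuffle also pays for copying the list so that every run starts from the same order. The timings will vary with your machine and Python version.

```python
import random
import timeit

files = ["file_%d.txt" % i for i in range(50000)]  # made-up file names

def by_shuffle():
    flist = list(files)      # copy so each run starts from the same order
    random.shuffle(flist)
    return flist[:5000]

def by_sample():
    return random.sample(files, 5000)

t_shuffle = timeit.timeit(by_shuffle, number=3)
t_sample = timeit.timeit(by_sample, number=3)

print("shuffle + slice: %.4f s" % t_shuffle)
print("sample:          %.4f s" % t_sample)
```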

Both of those loops make data a nested list containing 3 sublists, with 5000 file names in each sublist. However, if you want it to be a flat list of 15000 file names, you just need to use the list.extend method instead of list.append. E.g.,

data = []
for flist in file_lists:
    data.extend(random.sample(flist, 5000))

Or we can do it using a list comprehension with a double for loop:

data = [fname for flist in file_lists for fname in random.sample(flist, 5000)]
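
To make the difference between append and extend concrete, here is the same idea on tiny made-up lists (10 names per group, 4 picked from each, instead of 5000):

```python
import random

# Three made-up lists of 10 file names each, standing in for the real ones.
file_lists = [
    ["%s_%d.txt" % (age, i) for i in range(10)]
    for age in ("10s", "20s", "30s")
]

nested = []
flat = []
for flist in file_lists:
    picked = random.sample(flist, 4)
    nested.append(picked)   # adds the whole list as a single item
    flat.extend(picked)     # adds each file name separately

print(len(nested))  # 3
print(len(flat))    # 12
```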

If you need to filter the contents of data to build your final file list, the simplest way is to add an if condition to the list comprehension.

Let's say we have a function that can test whether a file name is one we want to keep:

def keep_file(fname):
    # If we want to keep fname, return True, otherwise return False.
    return True  # placeholder: replace with your real test

Then we can do

data = [fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname)]

and data will only contain the file names that pass the keep_file test.
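
Bear in mind that the if test runs after sampling, so data can end up with fewer than 5000 names per group. If you need exactly 5000 names that pass the test, one option is to filter each list first and sample afterwards. A minimal sketch, assuming a hypothetical keep_file that just checks the extension, with tiny made-up lists and n standing in for 5000:

```python
import random

def keep_file(fname):
    # Hypothetical test: keep only .txt files.
    return fname.endswith(".txt")

file_lists = [
    ["a%d.txt" % i for i in range(8)] + ["a.log"],
    ["b%d.txt" % i for i in range(8)] + ["b.log"],
]
n = 5  # stands in for 5000

data = []
for flist in file_lists:
    wanted = [fname for fname in flist if keep_file(fname)]  # filter first
    data.extend(random.sample(wanted, n))                    # then sample

print(len(data))  # 10
```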

Another way to do it is to create the file names using a generator expression instead of a list comprehension and then pass that to the built-in filter function:

data_gen = filter(keep_file, (fname for flist in file_lists for fname in random.sample(flist, 5000)))

In Python 3, filter returns an iterator, so data_gen is itself an iterator. You can build a list from it like this:

data_final = list(data_gen)

Or if you don't actually need all the names as a collection and you can just process them one by one, you can put it in a for loop, like this:

for fname in data_gen:
    print(fname)
    # Do other stuff with fname

This uses less RAM, but the downside is that it "consumes" the file names, so once the for loop is finished data_gen will be exhausted and can't be looped over again.
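
A tiny demonstration of that "consuming" behaviour, assuming Python 3 (where filter returns an iterator) and a made-up list of names:

```python
names = ["a.txt", "b.log", "c.txt"]  # made-up file names
data_gen = filter(lambda fname: fname.endswith(".txt"), names)

first_pass = list(data_gen)   # consumes the iterator
second_pass = list(data_gen)  # nothing left to yield

print(first_pass)   # ['a.txt', 'c.txt']
print(second_pass)  # []
```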

Let's assume that you've written a function that extracts the desired data from each file:

def age_and_text(fname):
    # Do stuff that extracts the age and desired text from the file
    age, text = None, ""  # placeholders: replace with the real extraction
    return fname, age, text

You could create a list of those (filename, age, text) tuples like this:

data_gen = (fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname))

final_data = [age_and_text(fname) for fname in data_gen]

Notice the slice in my first snippet: flist[:5000]. That takes the first 5000 items in flist, the items with indices 0 to 4999 inclusive. Your version had teens[:5001], which is an off-by-one error. Slices work the same way as ranges. Thus range(5000) yields the 5000 numbers from 0 to 4999. It works this way because Python (like most modern programming languages) uses zero-based indexing.
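
The same thing on a small scale, with 5 standing in for 5000:

```python
flist = list(range(10))

print(flist[:5])                     # [0, 1, 2, 3, 4]
print(len(flist[:5]))                # 5
print(len(flist[:6]))                # 6: one item too many
print(list(range(5)) == flist[:5])   # True: slices and ranges share the same bounds
```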

PM 2Ring answered Sep 28 '22 10:09