For a text classification project (age) I'm making a subset of my data. I've made 3 lists of filenames, sorted by age. I want to shuffle these lists and then append 5000 filenames from each shuffled list to a new list. The result should be a data subset with 15000 files (5000 10s, 5000 20s, 5000 30s). Below you can see what I've written so far. But I know that random.shuffle returns None, and a NoneType object is not iterable. How can I solve this problem?
def seed():
    return 0.47231099848
teens = [list of files]
tweens = [list of files]
thirthies = [list of files]
data = []
for categorie in random.shuffle([teens, tweens, thirthies],seed):
    data.append(teens[:5000])
    data.append(tweens[:5000])
    data.append(thirthies[:5000])
append() adds its argument as a single new element at the end of the list, so appending a list produces a nested list. To concatenate lists, adding every item of one list to another, use the .extend() method.
The shuffle() method of Python's random module takes a mutable sequence, such as a list, and reorders its items. Note: this method changes the original list; it does not return a new list.
In Python, you can shuffle (randomize) a list in place with random.shuffle(), and get a new shuffled list from a list, string, or tuple with random.sample().
The first problem is that you are shuffling the list made up of the 3 items [teens, tweens, thirthies] (even though each item is itself a list) instead of shuffling each sublist.
Second, you can use random.sample instead of random.shuffle:
for categ in [teens, tweens, thirthies]:
    data.append(random.sample(categ, 5000))
Or, as @JonClements suggested in the comments, you can use a list comprehension:
categories = [teens, tweens, thirthies]
data = [e for categ in categories for e in random.sample(categ, 5000)]
You're correct that random.shuffle returns None. That's because it shuffles its list argument in-place, and it's a Python convention that functions which take a mutable arg and mutate it return None. However, you misunderstand the random arg to random.shuffle: it needs to be a random number generator, not a function like your seed that always returns the same number. (That optional arg was deprecated in Python 3.9 and removed in 3.11, so on current versions random.shuffle takes only the list.)
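As a small illustration of the in-place behaviour described above, here is a sketch using a made-up list of file names (the names are placeholders, not taken from the question):

```python
import random

# Hypothetical file names, used only for illustration.
names = ["a.txt", "b.txt", "c.txt", "d.txt"]

# shuffle() reorders `names` itself and returns None,
# so its return value cannot be looped over.
result = random.shuffle(names)

print(result)         # None
print(sorted(names))  # ['a.txt', 'b.txt', 'c.txt', 'd.txt'] (same items, only the order changed)
```

So the pattern is to call random.shuffle(some_list) on its own line and then use some_list afterwards.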
BTW, you can seed the standard random number generator provided by the random module using its seed function. In older Python versions random.seed accepts any hashable object as its argument; since Python 3.11 it must be None, an int, float, str, bytes, or bytearray. Either way, it's customary to pass it a number or string. You can also pass it None (which is equivalent here to not passing it an arg at all), and it will seed the randomiser with the system random source (if there isn't a system random source then the system time is used as the seed). If you don't explicitly call seed after importing the random module, that's equivalent to calling seed().
The benefit of supplying a seed is that each time you run the program with the same seed, the random numbers produced by the various random module functions will be exactly the same. This is very useful while developing and debugging your code: it can be hard to track down errors when the output keeps on changing. :)
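To make the reproducibility point concrete, here is a minimal sketch. It uses a separate random.Random instance rather than the module-level generator; that is just one way to do it, and the helper name sample_with_seed and the file names are made up for this example.

```python
import random

def sample_with_seed(items, k, seed):
    """Pick k items using a private generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, leaves the global one alone
    return rng.sample(items, k)

# Hypothetical file names, used only for illustration.
files = [f"file_{i}.txt" for i in range(100)]

first = sample_with_seed(files, 5, 42)
second = sample_with_seed(files, 5, 42)

print(first == second)  # True: same seed, same selection
```

Calling random.seed(42) before each sampling step has the same effect on the module-level functions.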
There are two main ways to do what you want. You can shuffle the lists and then slice the first 5000 file names from them. Or you can use the random.sample function to take 5000 random samples. That way you don't need to shuffle the whole list.
import random

random.seed(0.47231099848)

# teens, tweens, thirties are lists of file names
file_lists = [teens, tweens, thirties]

# Shuffle
data = []
for flist in file_lists:
    random.shuffle(flist)
    data.append(flist[:5000])
Using sample
# Sample
data = []
for flist in file_lists:
    data.append(random.sample(flist, 5000))
I haven't performed speed tests on this code, but I suspect that sample will be faster, since it just needs to randomly select items rather than moving all the list items. shuffle is fairly efficient, so you probably wouldn't notice much difference in the run time unless your teens, tweens, and thirties file lists each have a lot more than 5000 file names.
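If you want to check the speed yourself, a rough timing sketch along these lines could be used. The list size, the repeat count, and the helper names by_shuffle and by_sample are arbitrary choices for this example, and the numbers will vary by machine, so no result is claimed here.

```python
import random
import timeit

# Hypothetical file names; 100,000 is an arbitrary size for the test.
files = [f"file_{i}.txt" for i in range(100_000)]

def by_shuffle(flist, k=5000):
    copy = list(flist)    # copy so the original order is left alone
    random.shuffle(copy)  # reorders every item in the copy
    return copy[:k]

def by_sample(flist, k=5000):
    return random.sample(flist, k)  # picks k items without reordering flist

shuffle_time = timeit.timeit(lambda: by_shuffle(files), number=3)
sample_time = timeit.timeit(lambda: by_sample(files), number=3)
print(f"shuffle + slice: {shuffle_time:.3f}s, sample: {sample_time:.3f}s")
```

Note that by_shuffle also pays for copying the list, which the in-place version in the answer does not.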
Both of those loops make data a nested list containing 3 sublists, with 5000 file names in each sublist. However, if you want it to be a flat list of 15000 file names you just need to use the list.extend method instead of list.append. E.g.,
data = []
for flist in file_lists:
    data.extend(random.sample(flist, 5000))
Or we can do it using a list comprehension with a double for loop:
data = [fname for flist in file_lists for fname in random.sample(flist, 5000)]
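The difference between the nested and flat results can be seen on a scaled-down example. This sketch uses three made-up lists of 10 names each and samples 3 from each instead of 5000:

```python
import random

# Hypothetical, tiny stand-ins for the teens, tweens and thirties lists.
file_lists = [
    [f"teen_{i}.txt" for i in range(10)],
    [f"tween_{i}.txt" for i in range(10)],
    [f"thirty_{i}.txt" for i in range(10)],
]

nested = []
flat = []
for flist in file_lists:
    picked = random.sample(flist, 3)
    nested.append(picked)  # adds the whole list as one element
    flat.extend(picked)    # adds each file name individually

print(len(nested))  # 3 (three sublists)
print(len(flat))    # 9 (nine file names)
```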
If you need to filter the contents of data to build your final file list, the simplest way is to add an if condition to the list comprehension.
Let's say we have a function that can test whether a file name is one we want to keep:
def keep_file(fname):
    # If we want to keep fname, return True, otherwise return False.
    # Placeholder body: replace it with your own test.
    return True
Then we can do
data = [fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname)]
and data will only contain the file names that pass the keep_file test.
Another way to do it is to create the file names using a generator expression instead of a list comprehension and then pass that to the built-in filter function:
data_gen = filter(keep_file, (fname for flist in file_lists for fname in random.sample(flist, 5000)))
data_gen is itself an iterator. You can build a list from it like this:
data_final = list(data_gen)
Or if you don't actually need all the names as a collection and you can just process them one by one, you can put it in a for loop, like this:
for fname in data_gen:
    print(fname)
    # Do other stuff with fname
This uses less RAM, but the downside is that it "consumes" the file names, so once the for loop is finished data_gen will be exhausted and will yield nothing more.
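The "consumed" behaviour is easy to see on a tiny example. In this sketch the file names and the .txt test are made up for illustration:

```python
# Hypothetical file names; the filter keeps only the .txt ones.
names = ["a.txt", "b.log", "c.txt"]
gen = filter(lambda n: n.endswith(".txt"), names)

first_pass = list(gen)   # walks through the iterator once
second_pass = list(gen)  # nothing is left the second time

print(first_pass)   # ['a.txt', 'c.txt']
print(second_pass)  # []
```

If you need to go over the names more than once, build a list from the iterator first and loop over that list.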
Let's assume that you've written a function that extracts the desired data from each file:
def age_and_text(fname):
    # Do stuff that extracts the age and desired text from the file
    return fname, age, text
You could create a list of those (filename, age, text) tuples like this:
data_gen = (fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname))
final_data = [age_and_text(fname) for fname in data_gen]
Notice the slice in my first snippet: flist[:5000]. That takes the first 5000 items in flist, the items with indices 0 to 4999 inclusive. Your version had teens[:5001], which is an off-by-one error. Slices work the same way as ranges. Thus range(5000) yields the 5000 numbers from 0 to 4999. It works this way because Python (like most modern programming languages) uses zero-based indexing.
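The same rule can be checked on a short list, with 5 standing in for 5000:

```python
items = list(range(10))  # the numbers 0 to 9

first_five = items[:5]   # indices 0 to 4 inclusive

print(first_five)        # [0, 1, 2, 3, 4]
print(len(items[:6]))    # 6, one item too many if you wanted 5
print(list(range(5)))    # [0, 1, 2, 3, 4]
```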