append items from shuffled list to a new list

Tags: python, list, random, append, shuffle

For a text classification project (age) I'm making a subset of my data. I've made 3 lists of filenames, sorted by age. I want to shuffle these lists and then append 5000 filenames from each shuffled list to a new list. The result should be a data subset with 15000 files (5000 10s, 5000 20s, 5000 30s). Below you can see what I've written so far. But I know that random.shuffle returns None, and a NoneType object is not iterable. How can I solve this problem?

def seed():
   return 0.47231099848

teens = [list of files]
tweens = [list of files]
thirthies = [list of files]
data = []
for categorie in random.shuffle([teens, tweens, thirthies],seed):
    data.append(teens[:5000])
    data.append(tweens[:5000])
    data.append(thirthies[:5000])


asked Apr 23 '17 11:04

Bambi

2 Answers

The first problem is that you are shuffling the outer list of 3 items, [teens, tweens, thirthies] (even though each item is itself a list), instead of shuffling each sublist.

Second, you can use random.sample instead of random.shuffle:

for categ in [teens, tweens, thirthies]:
    data.append(random.sample(categ, 5000))

Or, as @JonClements suggested in the comments, you can use a list comprehension, which also gives you a flat list of file names:

categories = [teens, tweens, thirthies]
data = [e for categ in categories for e in random.sample(categ, 5000)]
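
One caveat worth illustrating: random.sample raises ValueError when asked for more items than the list contains, so a category with fewer than 5000 files would fail. The sketch below shows that behaviour and one possible workaround. The helper name sample_up_to and the short list are hypothetical, not part of the original answer.

```python
import random

def sample_up_to(flist, k):
    """Return k random items, or every item in random order if flist is shorter than k."""
    # Hypothetical helper: caps the sample size at the list length.
    return random.sample(flist, min(k, len(flist)))

short_list = ["a.txt", "b.txt", "c.txt"]  # hypothetical file names

try:
    random.sample(short_list, 5)  # asks for more items than exist
except ValueError as exc:
    print("ValueError:", exc)

picked = sample_up_to(short_list, 5)
print(len(picked))  # 3
```

Whether a short category should be capped like this or treated as an error depends on how balanced the subset needs to be.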


answered Sep 28 '22 10:09

Luchko

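You're correct that random.shuffle returns None. That's because it shuffles its list argument in-place, and it's a Python convention that functions which take a mutable arg and mutate it return None. However, you misunderstand the random arg to random.shuffle: it needs to be a random number generator, that is, a function that returns a new random float in [0.0, 1.0) on each call, not a function like your seed that always returns the same number. (That optional argument was deprecated in Python 3.9 and removed in 3.11, so in current versions random.shuffle takes only the list.)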
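
A minimal sketch of the in-place behaviour, using a short list of hypothetical file names:

```python
import random

names = ["a.txt", "b.txt", "c.txt", "d.txt"]  # hypothetical file names
original = list(names)                        # keep a copy for comparison

result = random.shuffle(names)  # reorders names itself

print(result)                             # None
print(sorted(names) == sorted(original))  # True: same items, order may differ
```

This is why `for x in random.shuffle(...)` fails: the loop receives None rather than the shuffled list.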

BTW, you can seed the standard random number generator provided by the random module using its seed function. random.seed accepts any hashable object as its argument (from Python 3.11 on, the seed must be None, an int, float, str, bytes, or bytearray), although it's customary to pass it a number or string. You can also pass it None (which is equivalent here to not passing it an arg at all), and it will seed the randomiser with the system random source (if there isn't a system random source then the system time is used as the seed). If you don't explicitly call seed after importing the random module, that's equivalent to calling seed().

The benefit of supplying a seed is that each time you run the program with the same seed, the random numbers produced by the various random module functions will be exactly the same. This is very useful while developing and debugging your code: it can be hard to track down errors when the output keeps on changing. :)
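
To make the reproducibility concrete, here is a small sketch. The seed value 12345 and the file names are arbitrary placeholders.

```python
import random

files = [f"file_{i}.txt" for i in range(100)]  # hypothetical file names

random.seed(12345)                # arbitrary seed value
first = random.sample(files, 5)

random.seed(12345)                # reseed with the same value
second = random.sample(files, 5)

print(first == second)  # True
```

If other code in the program also draws from the module-level generator, a separate random.Random(12345) instance with its own sample and shuffle methods keeps the sequences independent.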

There are two main ways to do what you want. You can shuffle the lists and then slice the first 5000 file names from them. Or you can use the random.sample function to take 5000 random samples. That way you don't need to shuffle the whole list.

import random

random.seed(0.47231099848)

# teens, tweens, thirties are lists of file names
file_lists = [teens, tweens, thirties]

# Shuffle
data = []
for flist in file_lists:
    random.shuffle(flist)
    data.append(flist[:5000])

Using sample

# Sample
data = []
for flist in file_lists:
    data.append(random.sample(flist, 5000))

I haven't performed speed tests on this code, but I suspect that sample will be faster, since it just needs to randomly select items rather than moving all the list items. shuffle is fairly efficient, so you probably wouldn't notice much difference in the run time unless your teens, tweens, and thirties file lists each have a lot more than 5000 file names.
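
If you want to check this on your own machine, a rough timing sketch along these lines could be used. The list size, the repeat count, and the helper names by_shuffle and by_sample are assumptions for illustration; the results will vary by machine and Python version, and by_shuffle also pays for copying the list so the original order is preserved.

```python
import random
import timeit

big_list = [f"file_{i}.txt" for i in range(100_000)]  # hypothetical file names

def by_shuffle(flist, k=5000):
    copy = list(flist)      # copy so the caller's list is not reordered
    random.shuffle(copy)
    return copy[:k]

def by_sample(flist, k=5000):
    return random.sample(flist, k)

# Time five runs of each approach
t_shuffle = timeit.timeit(lambda: by_shuffle(big_list), number=5)
t_sample = timeit.timeit(lambda: by_sample(big_list), number=5)
print(f"shuffle + slice: {t_shuffle:.3f}s, sample: {t_sample:.3f}s")
```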

Both of those loops make data a nested list containing 3 sublists, with 5000 file names in each sublist. However, if you want it to be a flat list of 15000 file names, you just need to use the list.extend method instead of list.append. E.g.:

data = []
for flist in file_lists:
    data.extend(random.sample(flist, 5000))

Or we can do it using a list comprehension with a double for loop:

data = [fname for flist in file_lists for fname in random.sample(flist, 5000)]
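
The difference between the two methods can be seen on a scaled-down example. The three lists below are small hypothetical stand-ins, with 5 names sampled from each instead of 5000.

```python
import random

# Hypothetical stand-ins for the three file lists
teens = [f"teen_{i}.txt" for i in range(20)]
tweens = [f"tween_{i}.txt" for i in range(20)]
thirties = [f"thirty_{i}.txt" for i in range(20)]
file_lists = [teens, tweens, thirties]

nested, flat = [], []
for flist in file_lists:
    picked = random.sample(flist, 5)
    nested.append(picked)  # adds one sublist per category
    flat.extend(picked)    # adds the five names individually

print(len(nested), len(nested[0]))  # 3 5
print(len(flat))                    # 15
```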

If you need to filter the contents of data to build your final file list, the simplest way is to add an if condition to the list comprehension.

Let's say we have a function that can test whether a file name is one we want to keep:

def keep_file(fname):
    # If we want to keep fname, return True, otherwise return False.
    # Placeholder body: replace with your own test.
    return True

Then we can do

data = [fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname)]

and data will only contain the file names that pass the keep_file test.

Another way to do it is to create the file names using a generator expression instead of a list comprehension and then pass that to the built-in filter function:

data_gen = filter(keep_file, (fname for flist in file_lists for fname in random.sample(flist, 5000)))

data_gen is itself an iterator. You can build a list from it like this:

data_final = list(data_gen)

Or if you don't actually need all the names as a collection and you can just process them one by one, you can put it in a for loop, like this:

for fname in data_gen:
    print(fname)
    # Do other stuff with fname

This uses less RAM, but the downside is that it "consumes" the file names, so once the for loop is finished data_gen will be empty.
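
A short sketch of that exhaustion, assuming Python 3 (where filter returns an iterator) and a hypothetical keep_file that keeps even-numbered files:

```python
import random

names = [f"file_{i}.txt" for i in range(10)]  # hypothetical file names

def keep_file(fname):
    # Hypothetical test: keep files whose number is even
    number = int(fname.split("_")[1].split(".")[0])
    return number % 2 == 0

data_gen = filter(keep_file, random.sample(names, 10))

first_pass = list(data_gen)   # consumes the iterator
second_pass = list(data_gen)  # nothing is left to yield

print(len(first_pass))  # 5
print(second_pass)      # []
```

If the names are needed more than once, build the list once and reuse it.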

Let's assume that you've written a function that extracts the desired data from each file:

def age_and_text(fname):
    # Do stuff that extracts the age and desired text from the file
    age, text = None, None  # placeholders: replace with the real extraction
    return fname, age, text

You could create a list of those (filename, age, text) tuples like this:

data_gen = (fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname))

final_data = [age_and_text(fname) for fname in data_gen]

Notice the slice in my first snippet: flist[:5000]. That takes the first 5000 items in flist, the items with indices 0 to 4999 inclusive. Writing teens[:5001] instead would be an off-by-one error, since it takes 5001 items. Slices work the same way as ranges. Thus range(5000) yields the 5000 numbers from 0 to 4999. It works this way because Python (like most modern programming languages) uses zero-based indexing.
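
The slice bounds can be checked directly. The list here is a hypothetical one with 6000 generated names.

```python
flist = [f"file_{i}.txt" for i in range(6000)]  # hypothetical file names

first_5000 = flist[:5000]

print(len(first_5000))     # 5000
print(first_5000[-1])      # file_4999.txt
print(len(flist[:5001]))   # 5001
print(list(range(5)))      # [0, 1, 2, 3, 4]
```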

answered Sep 28 '22 10:09

PM 2Ring
