Randomly select values from list but with character length restriction

Tags:

python

I have two lists of strings like the following:

test1 = ["abc", "abcdef", "abcedfhi"]

test2 = ["The", "silver", "proposes", "the", "blushing", "number", "burst", "explores", "the", "fast", "iron", "impossible"]

The second list is longer, so I want to downsample it to the length of the first list by randomly sampling.

import random

def downsample(data):
    # Shrink every list to the length of the shortest one by random sampling
    min_len = min(len(x) for x in data)
    return [random.sample(x, min_len) for x in data]

downsample([test1, test2])

However, I want to add a restriction: the words chosen from the second list must match the length distribution of the first list. That is, the first randomly chosen word must have the same length as the first word of the shorter list, and so on for the remaining words. Sampling must also be done without replacement.

How can I randomly select n elements (where n is the length of the shorter list) from test2 so that they match the character length distribution of test1? Thanks, Jack

460

asked Jun 16 '18 03:06

Jack Arnestad

1 Answer

Setup

from collections import defaultdict
import random
dct = defaultdict(list)
l1 = ["abc", "abcdef", "abcedfhi"]
l2 = ["The", "silver", "proposes", "the", "blushing", "number", "burst", "explores", "the", "fast", "iron", "impossible"]

First, use collections.defaultdict to create a dictionary where the key is word length:

for word in l2:
    dct[len(word)].append(word)

# Result
defaultdict(<class 'list'>, {3: ['The', 'the', 'the'], 6: ['silver', 'number'], 8: ['proposes', 'blushing', 'explores'], 5: ['burst'], 4: ['fast', 'iron'], 10: ['impossible']})

Then you may use a simple list comprehension along with random.choice to select a random word that matches the length of each element in your first list. If a word length is not found in your dictionary, fill with -1:

final = [random.choice(dct.get(len(w), [-1])) for w in l1]

# Output (one possible result, since the selection is random)
['The', 'silver', 'blushing']
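
Note that random.choice draws independently for each element, so the same word can be selected more than once when the first list contains several words of the same length. A minimal sketch of that behaviour, using hypothetical inputs (words, targets) chosen so that only one candidate exists and the result is deterministic:

```python
import random
from collections import defaultdict

# Hypothetical inputs: only one 5-letter word is available,
# but two 5-letter words are requested.
words = ["burst", "fast", "iron"]
targets = ["abcde", "vwxyz"]

# Group the candidate words by length, as in the answer above.
by_len = defaultdict(list)
for word in words:
    by_len[len(word)].append(word)

# Each draw is independent, so "burst" is chosen for both targets.
picked = [random.choice(by_len.get(len(t), [-1])) for t in targets]
print(picked)  # ['burst', 'burst']
```

With larger groups the repetition happens only some of the time, which makes it easy to miss in testing.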

Edit based on clarified requirements
Here is an approach that samples without replacement, so a word can appear twice in the result only if it also appears twice in the second list:

dct = defaultdict(list)  # start from an empty dictionary so words are not appended twice
for word in l2:
    dct[len(word)].append(word)

for k in dct:
    random.shuffle(dct[k])

final = [dct[len(w)].pop() for w in l1]
# e.g. ['The', 'silver', 'proposes']

This approach will raise an IndexError if not enough words exist in the second list to fulfill the distribution.
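
If a clearer failure is preferable to a bare IndexError, the availability of each word length can be checked before any word is drawn. The following is a sketch under that assumption, not part of the original answer; the function name sample_matching_lengths and its ValueError message are hypothetical choices. It uses random.sample, which draws without replacement within each length group:

```python
import random
from collections import Counter, defaultdict


def sample_matching_lengths(targets, pool):
    """Pick one word from pool per word in targets, matching lengths, without replacement."""
    # Group the candidate words by their length.
    by_len = defaultdict(list)
    for word in pool:
        by_len[len(word)].append(word)

    # Count how many words of each length are required.
    needed = Counter(len(t) for t in targets)

    # Validate up front so the error names the length that falls short.
    for length, count in needed.items():
        available = len(by_len.get(length, []))
        if available < count:
            raise ValueError(
                f"need {count} word(s) of length {length}, found {available}"
            )

    # Draw the required number of words per length, without replacement.
    drawn = {
        length: random.sample(by_len[length], count)
        for length, count in needed.items()
    }
    # Hand the drawn words out in the order of the target list.
    return [drawn[len(t)].pop() for t in targets]


l1 = ["abc", "abcdef", "abcedfhi"]
l2 = ["The", "silver", "proposes", "the", "blushing", "number",
      "burst", "explores", "the", "fast", "iron", "impossible"]

result = sample_matching_lengths(l1, l2)
print(result)  # three words with lengths 3, 6 and 8
```

Because the check runs before any sampling, the input lists are left unmodified and no partial result is produced when the second list cannot satisfy the distribution.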

117

answered Sep 28 '22 06:09

user3483203
