Efficiently counting items in large python lists

I have two very large python lists that look like this:

List A: [0,0,0,0,0,0,0,1,1,1,1,2,2,3,3,3,4.........]
List B: [0,0,0,0,0,0,2,2,2,2,3,3,4,4.........]

These lists go on to very large numbers, but I specify a maximum value, say 100, and can discard everything after it.

Now I need to calculate, for each value (0, 1, 2, ..., 100), the ratio: occurrences in list A / occurrences in list B. Since this ratio cannot always be computed, I decided to calculate it only if there are more than 5 occurrences of the value in each list. If this condition does not hold, I combine the occurrences of consecutive values until it does, and give the same ratio to every value in the combined group. For example, for the above lists, I want to create a Series that looks like this:

0 : 7/6 = 1.166
1 : 9/6 = 1.5
2 : 9/6 = 1.5
3 : 9/6 = 1.5
.
.
.
100 : some_number

asked Sep 06 '18 12:09

Triple Nipple

1 Answer

You can use a Counter to count the occurrences and takewhile to stop at 100, as you require.

Note that values missing from list b are not discarded: their ratio is set to nan.

from collections import Counter
from itertools import takewhile

def get_ratios(a, b, max_=None, min_count=0):
    # takewhile stops at the first value above max_, so a and b must be sorted
    if max_ is not None:
        a = takewhile(lambda x: x <= max_, a)
        b = takewhile(lambda x: x <= max_, b)

    count_a, count_b = Counter(a), Counter(b)

    # nan marks values absent from b; values with fewer than min_count
    # occurrences in either list are left out of the result
    return {k: float('nan') if not count_b[k] else count_a[k] / count_b[k]
            for k in set(count_a) | set(count_b)
            if count_a[k] >= min_count <= count_b[k]}

Example

a = [1, 1, 1, 2, 3, 101]
b = [1, 1, 2, 2, 4, 101]

print(get_ratios(a, b, max_=100))

Output

{ 1: 1.5,
  2: 0.5,
  3: nan,
  4: 0.0 }
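
The takewhile cutoff relies on the lists being sorted, as they are in the question. If that is not guaranteed, a plain filter is a safer way to count; the sketch below compares the two on a small unsorted list. The name count_up_to is hypothetical and only used for illustration.

```python
from collections import Counter
from itertools import takewhile


def count_up_to(values, max_value):
    # Hypothetical helper: keep every value <= max_value, wherever it appears.
    return Counter(x for x in values if x <= max_value)


unsorted_values = [1, 101, 1, 2]

# takewhile stops at the first 101, so the later 1 and 2 are never counted.
stopped_early = Counter(takewhile(lambda x: x <= 100, unsorted_values))

# The filter scans the whole list and only skips the values above the limit.
filtered = count_up_to(unsorted_values, 100)

print(stopped_early)  # Counter({1: 1})
print(filtered)       # Counter({1: 2, 2: 1})
```

The filter has to read the entire list, so on sorted input takewhile remains the cheaper choice.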

To ignore under-represented values, set min_count. The check is inclusive (>=), so min_count=6 keeps only the values with more than 5 occurrences in each list, as mentioned in your question.
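
As a quick illustration of min_count, here is a sketch run on the first few values from the question. The lists are truncated after the value 3, which is an assumption since the question elides the rest, and the function is a trimmed copy of get_ratios so that the snippet runs on its own.

```python
from collections import Counter


def get_ratios(a, b, min_count=0):
    # Trimmed copy of the function above (no max_ cutoff) so this runs alone.
    count_a, count_b = Counter(a), Counter(b)
    return {k: float('nan') if not count_b[k] else count_a[k] / count_b[k]
            for k in set(count_a) | set(count_b)
            if count_a[k] >= min_count <= count_b[k]}


# Assumption: the question's lists, cut off after the value 3.
a = [0] * 7 + [1] * 4 + [2] * 2 + [3] * 3
b = [0] * 6 + [2] * 4 + [3] * 2

# min_count=6 means "more than 5 occurrences in each list".
kept = get_ratios(a, b, min_count=6)
print(kept)  # only the value 0 remains, with ratio 7/6
```

Only 0 appears at least 6 times in both lists, so the values 1, 2 and 3 are left out rather than merged.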

Notice I didn't fill in empty slots with the ratio of the previous value. Unless you have a very specific use case that requires it, I recommend against doing so, as it would mix actual data with extrapolated data. It is better to fall back to the previous value at lookup time, when a value is not found, than to pollute the actual data.
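
If the grouped ratios from the question are still needed, one option is to compute them in a separate function so they stay apart from the raw ratios. The following is only a sketch of one reading of the question: consecutive values are merged until both lists have more than min_count occurrences, and every value in the group gets the same ratio. The name grouped_ratios is hypothetical, the values are assumed to be non-negative integers, and marking leftover values at the end as nan is my own assumption.

```python
from collections import Counter


def grouped_ratios(a, b, max_value, min_count=5):
    """Hypothetical helper: share one ratio across each merged group."""
    count_a, count_b = Counter(a), Counter(b)
    ratios = {}
    group = []               # values still waiting for enough occurrences
    total_a = total_b = 0    # occurrences accumulated over the current group

    for value in range(max_value + 1):
        group.append(value)
        total_a += count_a[value]
        total_b += count_b[value]
        # "More than min_count occurrences in each list" closes the group.
        if total_a > min_count and total_b > min_count:
            ratio = total_a / total_b
            for member in group:
                ratios[member] = ratio
            group, total_a, total_b = [], 0, 0

    # Assumption: values left over at the end never reached the threshold,
    # so they are marked as missing instead of being merged backwards.
    for member in group:
        ratios[member] = float('nan')
    return ratios


# Assumption: the question's lists, cut off after the value 3.
a = [0] * 7 + [1] * 4 + [2] * 2 + [3] * 3
b = [0] * 6 + [2] * 4 + [3] * 2

result = grouped_ratios(a, b, max_value=3)
print(result)  # 0 maps to 7/6; 1, 2 and 3 all map to 9/6 = 1.5
```

On these truncated lists the result matches the expected output in the question: 0 stands alone, while 1, 2 and 3 are merged into one group.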

answered Oct 07 '22 14:10

Olivier Melançon

Tags:

python

algorithm

list

pandas