Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to efficiently get count for item in list of lists in python

Tags:

python

I have three lists as follows.

mylist = [[5274919, ["my cat", "little dog", "fish", "rat"]], 
          [5274920, ["my cat", "parrot", "little dog"]], 
          [5274991, ["little dog", "fish", "duck"]]] 
myconcepts = ["my cat", "little dog"]
hatedconcepts = ["rat", "parrot"]

For each concept in myconcepts, I want to get the count every other concept connected to it using mylist. Then remove the hatedconcepts from it. So, my output should looks like as follows.

{"my cat": [("my cat", 2), ("little dog", 2), ("fish", 1)],
"little dog": [("little dog", 3), ("my cat", 2), ("fish", 2), ("duck", 1)]}

I was using this code to do it.

import collections
myoutput = []
for concept in myconcepts:
    mykeywords = []

    for item in mylist:
        if concept in item[1]:
            for mykeyword in item[1]:
                if mykeyword in hatedconcepts:
                    pass
                else:
                    mykeywords.append(mykeyword)
    if len(mykeywords) > 0:            
        sorted_keywords = collections.Counter(mykeywords).most_common()
        myoutput.append(tuple((concept, sorted_keywords)))
print(myoutput)

The output of the code is:

[('my cat', [('my cat', 2), ('little dog', 2), ('fish', 1)]), ('little dog', [('little dog', 3), ('my cat', 2), ('fish', 2), ('duck', 1)])]

However now I have a huge mylist with a size of 3GB and nearly 9000 myconcepts. The hatedconcepts count is only 20. It looks like it takes about two weeks to run using my current code. The main reason for this could be that my current program is O^3 which is not very efficient. Therefore, I am looking for ways to make my current program more efficient. I am even fine with pythonic solutions that even take 5-6 days to run. Please let me know your thoughts.

I have added a portion of mylist in: https://drive.google.com/file/d/1M3EhIRwwKwD3Kv4zDsmXaH1D73tx0eF3/view?usp=sharing just to get some idea how it looks like.

I am happy to provide more details if needed.

like image 543
EmJ Avatar asked Jun 11 '20 13:06

EmJ


People also ask

How do I count a list inside a list in Python?

The most straightforward way to get the number of elements in a list is to use the Python built-in function len() . As the name function suggests, len() returns the length of the list, regardless of the types of elements in it.

How do you count the occurrences of a particular element in the list in Python list we can count the occurrences of an individual element by using a count ()> function?

count() is the in-built function by which python count occurrences in list. It is the easiest among all other methods used to count the occurrence. Count() methods take one argument, i.e., the element for which the number of occurrences is to be counted.

How do you count multiple items in a list Python?

If you want to count multiple items in a list, you can call count() in a loop. This approach, however, requires a separate pass over the list for every count() call; which can be catastrophic for performance. Use couter() method from class collections , instead.


2 Answers

As other comments and answers have indicated, this operation is better handled by Spark or a database. That said, here's my take on it, I introduced some sets operations and minimized repeated loops.

from collections import defaultdict

def get_counts(lst, concepts, hated_concepts):
    result = {concept: defaultdict(int) for concept in concepts}

    concepts_set = set(concepts)
    hated_concepts_set = set(hated_concepts)

    for _, inner_list in lst:
        # ignore hated concepts
        relevant = set(inner_list).difference(hated_concepts_set)

        # determine which concepts need to be updated
        to_update = relevant.intersection(concepts_set)

        for concept in to_update:
            for word in relevant:
                result[concept][word] += 1

    return result

Output is below. You mention the output "must be sorted", but it's unclear to me what the desired sorting is. Some timing tests indicate this is 9x faster than the code you provided on your sample data.

{
    'my cat': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 1, 'little dog': 2}), 
    'little dog': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 2, 'little dog': 3, 'duck': 1})
}

Performance Improvement

emj_functn avg 0.9355s
get_counts avg 0.1141s

Performance testing script:

import random
import string
import time

words = list({
    ''.join(random.choice(string.ascii_lowercase) for _ in range(5))
    for _ in range(1000)
})
test_list = [[random.randint(1e6, 1e7), [random.choice(words) for _ in range(100)]] for _ in range(1000)]
test_concepts = [random.choice(words) for _ in range(100)]
test_hated_concepts = [random.choice(words) for _ in range(50)]


def emj_functn(lst, concepts, hated_concepts):
    ...


def get_counts(lst, concepts, hated_concepts):
    ...


TEST_CASES = 10

start_time = time.time()
for _ in range(TEST_CASES):
    emj_functn(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'emj_functn avg {avg:.4}s')

start_time = time.time()
for _ in range(TEST_CASES):
    get_counts(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'get_counts avg {avg:.4}s')

like image 113
bphi Avatar answered Oct 11 '22 15:10

bphi


I have tried to make it fast, avoided some repeated loops. Please check if this speeds things up.

from itertools import chain
from collections import Counter, defaultdict

database = defaultdict(set)
output = {}

# created a map for different concepts, so we only search the indices where a certain concept is
for index, (_, concepts) in enumerate(mylist):
    for concept in concepts:
        database[concept].add(index)

for concept in myconcepts:
    search_indices = database[concept]
    all_counts = Counter(chain.from_iterable(mylist[i][1] for i in search_indices))
    for hc in hatedconcepts:
        if hc in all_counts: all_counts.pop(hc)
    output[concept] = sorted(all_counts.items(), key=lambda x: x[1], reverse=True)
like image 2
Sayandip Dutta Avatar answered Oct 11 '22 13:10

Sayandip Dutta