I have three lists as follows.
mylist = [[5274919, ["my cat", "little dog", "fish", "rat"]],
          [5274920, ["my cat", "parrot", "little dog"]],
          [5274991, ["little dog", "fish", "duck"]]]
myconcepts = ["my cat", "little dog"]
hatedconcepts = ["rat", "parrot"]
For each concept in myconcepts, I want to count how often every other concept is connected to it through mylist, and then remove the hatedconcepts from the result. So, my output should look like this:
{"my cat": [("my cat", 2), ("little dog", 2), ("fish", 1)],
"little dog": [("little dog", 3), ("my cat", 2), ("fish", 2), ("duck", 1)]}
I was using this code to do it.
import collections

myoutput = []
for concept in myconcepts:
    mykeywords = []
    for item in mylist:
        if concept in item[1]:
            for mykeyword in item[1]:
                if mykeyword in hatedconcepts:
                    pass
                else:
                    mykeywords.append(mykeyword)
    if len(mykeywords) > 0:
        sorted_keywords = collections.Counter(mykeywords).most_common()
        myoutput.append(tuple((concept, sorted_keywords)))
print(myoutput)
The output of the code is:
[('my cat', [('my cat', 2), ('little dog', 2), ('fish', 1)]), ('little dog', [('little dog', 3), ('my cat', 2), ('fish', 2), ('duck', 1)])]
However, now I have a huge mylist with a size of 3 GB and nearly 9000 myconcepts. The hatedconcepts count is only 20. It looks like it would take about two weeks to run with my current code. The main reason is probably that my current program is O(n³), which is not very efficient. Therefore, I am looking for ways to make it more efficient. I am even fine with pythonic solutions that take 5-6 days to run. Please let me know your thoughts.
I have added a portion of mylist at https://drive.google.com/file/d/1M3EhIRwwKwD3Kv4zDsmXaH1D73tx0eF3/view?usp=sharing just to give some idea of what it looks like.
I am happy to provide more details if needed.
The most straightforward way to get the number of elements in a list is the built-in function len(). As the name suggests, len() returns the length of the list, regardless of the types of its elements.

count() is the built-in method by which Python counts occurrences in a list. It is the simplest of the counting approaches: it takes one argument, the element whose occurrences are to be counted.

If you want to count multiple items in a list, you can call count() in a loop. That approach, however, requires a separate pass over the list for every count() call, which can be catastrophic for performance. Use the Counter class from the collections module instead.
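A minimal sketch of the difference, using made-up sample data: the loop-over-count() version scans the list once per distinct item, while Counter tallies everything in a single pass.

```python
from collections import Counter

words = ["fish", "duck", "fish", "rat", "fish"]

# Calling count() in a loop: one full scan of the list per distinct item.
naive = {w: words.count(w) for w in set(words)}

# Counter: all tallies built in a single pass over the list.
counts = Counter(words)

print(counts.most_common())  # pairs sorted by descending count
```

Both produce the same counts; the difference only matters when the list and the number of distinct items are large.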
As other comments and answers have indicated, this operation is better handled by Spark or a database. That said, here is my take on it: I introduced some set operations and minimized repeated loops.
from collections import defaultdict

def get_counts(lst, concepts, hated_concepts):
    result = {concept: defaultdict(int) for concept in concepts}
    concepts_set = set(concepts)
    hated_concepts_set = set(hated_concepts)
    for _, inner_list in lst:
        # ignore hated concepts
        relevant = set(inner_list).difference(hated_concepts_set)
        # determine which concepts need to be updated
        to_update = relevant.intersection(concepts_set)
        for concept in to_update:
            for word in relevant:
                result[concept][word] += 1
    return result
Output is below. You mention the output "must be sorted", but it's unclear to me what the desired sorting is. Some timing tests indicate this is 9x faster than the code you provided on your sample data.
{
'my cat': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 1, 'little dog': 2}),
'little dog': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 2, 'little dog': 3, 'duck': 1})
}
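If the desired order is descending count, as in the question's expected output, the defaultdicts above can be converted in one extra step. This is a hedged sketch; the `result` dict here is built by hand to stand in for what get_counts() returns.

```python
from collections import defaultdict

# Stand-in for a get_counts() result: one concept mapped to its tallies.
result = {
    "my cat": defaultdict(int, {"my cat": 2, "fish": 1, "little dog": 2}),
}

# Convert each inner mapping to a list of (word, count) tuples,
# sorted by descending count, matching the question's expected shape.
sorted_result = {
    concept: sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for concept, counts in result.items()
}

print(sorted_result["my cat"])
```

sorted() is stable, so words with equal counts keep their original relative order.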
emj_functn avg 0.9355s
get_counts avg 0.1141s
Performance testing script:
import random
import string
import time

words = list({
    ''.join(random.choice(string.ascii_lowercase) for _ in range(5))
    for _ in range(1000)
})
test_list = [[random.randint(10**6, 10**7), [random.choice(words) for _ in range(100)]] for _ in range(1000)]
test_concepts = [random.choice(words) for _ in range(100)]
test_hated_concepts = [random.choice(words) for _ in range(50)]

def emj_functn(lst, concepts, hated_concepts):
    ...

def get_counts(lst, concepts, hated_concepts):
    ...

TEST_CASES = 10

start_time = time.time()
for _ in range(TEST_CASES):
    emj_functn(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'emj_functn avg {avg:.4}s')

start_time = time.time()
for _ in range(TEST_CASES):
    get_counts(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'get_counts avg {avg:.4}s')
I have tried to make it fast and avoided some repeated loops. Please check whether this speeds things up.
from itertools import chain
from collections import Counter, defaultdict

database = defaultdict(set)
output = {}

# Build a map from each concept to the row indices where it appears,
# so we only search the rows that actually contain a given concept.
for index, (_, concepts) in enumerate(mylist):
    for concept in concepts:
        database[concept].add(index)

for concept in myconcepts:
    search_indices = database[concept]
    all_counts = Counter(chain.from_iterable(mylist[i][1] for i in search_indices))
    for hc in hatedconcepts:
        if hc in all_counts:
            all_counts.pop(hc)
    output[concept] = sorted(all_counts.items(), key=lambda x: x[1], reverse=True)
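Wrapped in a function (the name `build_cooccurrence` is my own, not from the answer) and run on the question's sample data, the inverted-index approach above can be sketched as:

```python
from itertools import chain
from collections import Counter, defaultdict

def build_cooccurrence(mylist, myconcepts, hatedconcepts):
    # Inverted index: map each concept to the set of row indices it appears in.
    database = defaultdict(set)
    for index, (_, concepts) in enumerate(mylist):
        for concept in concepts:
            database[concept].add(index)

    output = {}
    for concept in myconcepts:
        # Count co-occurring words only in the rows that contain this concept.
        all_counts = Counter(
            chain.from_iterable(mylist[i][1] for i in database[concept])
        )
        for hc in hatedconcepts:
            all_counts.pop(hc, None)  # drop hated concepts if present
        output[concept] = sorted(all_counts.items(), key=lambda x: x[1], reverse=True)
    return output

mylist = [[5274919, ["my cat", "little dog", "fish", "rat"]],
          [5274920, ["my cat", "parrot", "little dog"]],
          [5274991, ["little dog", "fish", "duck"]]]
print(build_cooccurrence(mylist, ["my cat", "little dog"], ["rat", "parrot"]))
```

On this sample the counts match the question's expected output; only the tie-breaking order among equal counts may differ, since it depends on insertion order.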