I have three lists as follows.
mylist = [[5274919, ["my cat", "little dog", "fish", "rat"]],
          [5274920, ["my cat", "parrot", "little dog"]],
          [5274991, ["little dog", "fish", "duck"]]]
myconcepts = ["my cat", "little dog"]
hatedconcepts = ["rat", "parrot"]
For each concept in myconcepts, I want to count how often every other concept is connected to it through mylist, and then remove the hatedconcepts from the result. So, my output should look like this:
{"my cat": [("my cat", 2), ("little dog", 2), ("fish", 1)],
"little dog": [("little dog", 3), ("my cat", 2), ("fish", 2), ("duck", 1)]}
I was using this code to do it.
import collections

myoutput = []
for concept in myconcepts:
    mykeywords = []
    for item in mylist:
        if concept in item[1]:
            for mykeyword in item[1]:
                if mykeyword in hatedconcepts:
                    pass
                else:
                    mykeywords.append(mykeyword)
    if len(mykeywords) > 0:
        sorted_keywords = collections.Counter(mykeywords).most_common()
        myoutput.append(tuple((concept, sorted_keywords)))
print(myoutput)
The output of the code is:
[('my cat', [('my cat', 2), ('little dog', 2), ('fish', 1)]), ('little dog', [('little dog', 3), ('my cat', 2), ('fish', 2), ('duck', 1)])]
However, now I have a huge mylist with a size of 3 GB and nearly 9000 myconcepts. The hatedconcepts count is only 20. It looks like it would take about two weeks to run with my current code. The main reason is probably that my current program is O(n³), which is not very efficient. Therefore, I am looking for ways to make it more efficient. I am even fine with pythonic solutions that take 5-6 days to run. Please let me know your thoughts.
I have added a portion of mylist at https://drive.google.com/file/d/1M3EhIRwwKwD3Kv4zDsmXaH1D73tx0eF3/view?usp=sharing just to give some idea of what it looks like.
I am happy to provide more details if needed.
The most straightforward way to get the number of elements in a list is the built-in function len(). As the name suggests, len() returns the length of the list, regardless of the types of its elements.

count() is the built-in method by which Python counts occurrences in a list. It is the simplest of the counting approaches: it takes one argument, the element whose occurrences are to be counted.

If you want to count multiple items in a list, you can call count() in a loop. That approach, however, requires a separate pass over the list for every count() call, which can be catastrophic for performance. Use the Counter class from the collections module instead.
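A minimal sketch of the difference, using made-up sample data: the loop-over-count() version scans the list once per distinct item, while Counter tallies everything in a single pass.

```python
from collections import Counter

words = ["fish", "duck", "fish", "rat", "fish"]

# Calling count() in a loop: one full scan of the list per distinct item.
naive = {w: words.count(w) for w in set(words)}

# Counter: all tallies built in a single pass over the list.
counts = Counter(words)

print(counts.most_common())  # pairs sorted by descending count
```

Both produce the same counts; the difference only matters when the list and the number of distinct items are large.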
As other comments and answers have indicated, this operation is better handled by Spark or a database. That said, here is my take on it: I introduced some set operations and minimized repeated loops.
from collections import defaultdict

def get_counts(lst, concepts, hated_concepts):
    result = {concept: defaultdict(int) for concept in concepts}
    concepts_set = set(concepts)
    hated_concepts_set = set(hated_concepts)
    for _, inner_list in lst:
        # ignore hated concepts
        relevant = set(inner_list).difference(hated_concepts_set)
        # determine which concepts need to be updated
        to_update = relevant.intersection(concepts_set)
        for concept in to_update:
            for word in relevant:
                result[concept][word] += 1
    return result
Output is below. You mention the output "must be sorted", but it's unclear to me what the desired sorting is. Some timing tests indicate this is 9x faster than the code you provided on your sample data.
{
'my cat': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 1, 'little dog': 2}),
'little dog': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 2, 'little dog': 3, 'duck': 1})
}
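If the desired order is descending count, as in the question's expected output, the defaultdicts above can be converted in one extra step. This is a hedged sketch; the `result` dict here is built by hand to stand in for what get_counts() returns.

```python
from collections import defaultdict

# Stand-in for a get_counts() result: one concept mapped to its tallies.
result = {
    "my cat": defaultdict(int, {"my cat": 2, "fish": 1, "little dog": 2}),
}

# Convert each inner mapping to a list of (word, count) tuples,
# sorted by descending count, matching the question's expected shape.
sorted_result = {
    concept: sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for concept, counts in result.items()
}

print(sorted_result["my cat"])
```

sorted() is stable, so words with equal counts keep their original relative order.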
emj_functn avg 0.9355s
get_counts avg 0.1141s
Performance testing script:
import random
import string
import time

words = list({
    ''.join(random.choice(string.ascii_lowercase) for _ in range(5))
    for _ in range(1000)
})
test_list = [[random.randint(10**6, 10**7), [random.choice(words) for _ in range(100)]] for _ in range(1000)]
test_concepts = [random.choice(words) for _ in range(100)]
test_hated_concepts = [random.choice(words) for _ in range(50)]

def emj_functn(lst, concepts, hated_concepts):
    ...

def get_counts(lst, concepts, hated_concepts):
    ...

TEST_CASES = 10

start_time = time.time()
for _ in range(TEST_CASES):
    emj_functn(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'emj_functn avg {avg:.4}s')

start_time = time.time()
for _ in range(TEST_CASES):
    get_counts(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'get_counts avg {avg:.4}s')
I have tried to make it fast and avoided some repeated loops. Please check whether this speeds things up.
from itertools import chain
from collections import Counter, defaultdict

database = defaultdict(set)
output = {}

# Build a map from each concept to the row indices where it appears,
# so we only search the rows that actually contain a given concept.
for index, (_, concepts) in enumerate(mylist):
    for concept in concepts:
        database[concept].add(index)

for concept in myconcepts:
    search_indices = database[concept]
    all_counts = Counter(chain.from_iterable(mylist[i][1] for i in search_indices))
    for hc in hatedconcepts:
        if hc in all_counts:
            all_counts.pop(hc)
    output[concept] = sorted(all_counts.items(), key=lambda x: x[1], reverse=True)
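Wrapped in a function (the name `build_cooccurrence` is my own, not from the answer) and run on the question's sample data, the inverted-index approach above can be sketched as:

```python
from itertools import chain
from collections import Counter, defaultdict

def build_cooccurrence(mylist, myconcepts, hatedconcepts):
    # Inverted index: map each concept to the set of row indices it appears in.
    database = defaultdict(set)
    for index, (_, concepts) in enumerate(mylist):
        for concept in concepts:
            database[concept].add(index)

    output = {}
    for concept in myconcepts:
        # Count co-occurring words only in the rows that contain this concept.
        all_counts = Counter(
            chain.from_iterable(mylist[i][1] for i in database[concept])
        )
        for hc in hatedconcepts:
            all_counts.pop(hc, None)  # drop hated concepts if present
        output[concept] = sorted(all_counts.items(), key=lambda x: x[1], reverse=True)
    return output

mylist = [[5274919, ["my cat", "little dog", "fish", "rat"]],
          [5274920, ["my cat", "parrot", "little dog"]],
          [5274991, ["little dog", "fish", "duck"]]]
print(build_cooccurrence(mylist, ["my cat", "little dog"], ["rat", "parrot"]))
```

On this sample the counts match the question's expected output; only the tie-breaking order among equal counts may differ, since it depends on insertion order.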