Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python dictionary with multiple keys pointing to same list in memory efficient way

Tags:

python

I have this unique requirement which can be explained by this code. This is working code but not memory efficient.

data = [[
        "A 5408599",
        "B 8126880",
        "A 2003529",
    ],
    [
        "C 9925336",
        "C 3705674",
        "A 823678571",
        "C 3205170186",
    ],
    [
        "C 9772980",
        "B 8960327",
        "C 4185139021",
        "D 1226285245",
        "C 2523866271",
        "D 2940954504",
        "D 5083193",
    ]]

temp_dict = {
    item: index for index, sublist in enumerate(data)
        for item in sublist
}

print(data[temp_dict["A 2003529"]])

out: ['A 5408599', 'B 8126880', 'A 2003529']

In short, I want each item of sub-list to be indexable and should return the sublist.

The above method works but It takes a lot of memory when data is large. Is there any better, memory and CPU friendly way? The data is stored as a JSON file.

Edit I tried the answers for the largest possible use case scenario (1000 sublist, 100 items in each sublist, 1 million queries) and here are results (mean of 10 runs):

Method,    Time (seconds),    Extra Memory used
my,        0.637              40 Mb
deceze,    0.63               40 Mb
James,     0.78               200 kb
Pant,      > 300              0 kb
mcsoini,   forever            0 kb
like image 597
Rahul Avatar asked Nov 14 '19 08:11

Rahul


People also ask

Can Python dictionaries have multiple keys?

Python dictionary multiple keys with same name In Python dictionary, if you want to display multiple keys with the same value then you have to use the concept of for loop and list comprehension method. Here we create a list and assign them a value 2 which means the keys will display the same name two times.

How do you store multiple key value pairs in Python?

In Python, we can add multiple key-value pairs to an existing dictionary. This is achieved by using the update() method. This method takes an argument of type dict or any iterable that has the length of two - like ((key1, value1),) , and updates the dictionary with new key-value pairs.

Can a dictionary have two keys with the same value?

Answer. No, each key in a dictionary should be unique. You can't have two keys with the same value. Attempting to use the same key again will just overwrite the previous value stored.

Are Dictionaries space efficient?

There's no significant difference between the size of one large dictionary and multiple smaller ones. See below. @tgambin - as I said in the answer, the space efficiency is due to the creation of multiple dicts. OF COURSE there will be additional space required when when you're allocating more objects.


1 Answers

You are really in a trade-off space between the time/memory it takes to generate the dictionary versus the time it takes to scan the entire data for an on-the-fly method.

If you want a low memory method, you can use a function that searches each sublist for the value. Using a generator will get initial results faster to the user, but for large data sets, this will be slow between returns.

data = [[
        "A 5408599",
        "B 8126880",
        "A 2003529",
    ],
    [
        "C 9925336",
        "C 3705674",
        "A 823678571",
        "C 3205170186",
    ],
    [
        "C 9772980",
        "B 8960327",
        "C 4185139021",
        "D 1226285245",
        "C 2523866271",
        "D 2940954504",
        "D 5083193",
    ]]


def find_list_by_value(v, data):
    for sublist in data:
        if v in sublist:
            yield sublist

for s in find_list_by_value("C 9772980", data):
    print(s)

As mentioned in the comments, building a hash table based just on the first letter or first 2 or 3 character might be a good place to start. This will allow you to build a candidate list of sublists, then scan those to see if the value is in the sublist.

from collections import defaultdict

def get_key(v, size=3):
    return v[:size]

def get_keys(sublist, size=3):
    return set(get_key(v, size) for v in sublist)

def find_list_by_hash(v, data, hash_table, size=3):
    key = get_key(v, size)
    candidate_indices = hash_table.get(key, set())
    for ix in candidates:
        if v in data[ix]:
            yield data[ix]

# generate the small hash table
quick_hash = defaultdict(set)
for i, sublist in enumerate(data):
    for k in get_keys(sublist, 3):
        quick_hash[k].add(i)

# lookup a value by the small hash
for s in find_list_by_hash("C 9772980", data, quick_hash, 3):
    print(s)

In this code quick_hash will take some time to build, because you are scanning your entire data structure. However, the memory foot print will be much smaller. You main parameter for tuning performance is size. Smaller size will have a smaller memory footprint, but will take longer when running find_list_by_hash because your candidate pool will be larger. You can do some testing to see what the right size should be for your data. Just be mindful that all of your values are at least as long as size.

like image 190
James Avatar answered Oct 23 '22 22:10

James