Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dictionary search & insert optimization

Tags:

python

I'm experiencing serious slow downs when working with dictionaries as the dictionary grows to a few thousands keys.

I'm processing a file with ~1,000,000 lines of data, I'm constructing a graph like data structure using dictionaries

here's my bottle neck function

def create_edge(node_a, node_b, graph):
    if node_a not in graph.keys():
        graph[node_a] = {node_b: 1}
    elif node_b in graph[node_a].keys():
        graph[node_a][node_b] += 1
    else:
        graph[node_a][node_b] = 1

create_edge will create and edge from node_a to node_b, or add 1 to the weight of an already existing edge between them.

Since my nodes are identified by a string unique id, I'm using dictionaries for storage, assuming that search if a key exist, and insert would take O(1) on average.

If I comment out create_edge I can process around 20,000 records per seconds, with create_edge as a part of my pipeline it's about 20 records per sec.

The first 100 records takes about 500ms to process. When the dictionary size grows to around 10,000 - Processing 100 records take about 15,000ms, every record process invokes create_edge about 4 times on average - So 400 calls for create_edge takes 15 seconds when the dictionary size is 10,000.

First, do these runtimes make sense? Seems way to much to me, correct me if I'm wrong.

Second, suggestions of optimizing the dictionary usage for better run time would be greatly appreciated.

I'm expecting the dictionary size to be at least 100,000 to complete processing for the entire 1,000,000 records.


Edit: Conclusions

You were right on the money, did two noob mistakes here.. :)

The keys() calls increase complexity dramaticly, taking it from constant time to poly time (squared) per edge insertion, replacing if node in graph.keys() with if node in graph produces a constant process time of 100 records in ~300ms.

Second mistake was virtualenv config which led me to believe im using python3 while I was actualy using python2.

python3 do optimizes the keys() code into a constant time search, which is good for run time but less for proper code style.

Thanks a lot for your assistance.


I performed a run time comparison after removing the keys() calls.

# graph = {}
python version: 3.6.3
start time 11:44:56
Number of records: 1029493
graph created, 1231630 nodes
end time 11:50:35
total ~05:39

# graph = defaultdict(lambda : defaultdict(int))
python version: 3.6.3
start time 11:54:52
Number of records: 1029493
graph created, 1231630 nodes
end time  12:00:34
total ~05:42

# graph = {}
python version: 2.7.10
start time 12:03:25
Number of records: 1029493
graph created, 1231630 nodes
end time 12:09:40
total ~06:15
like image 297
Aviran Avatar asked Mar 07 '23 06:03

Aviran


1 Answers

Can't u just use a defaultdict with defaultdict(int) as seen here: Python: defaultdict of defaultdict?

from collections import defaultdict

graph = defaultdict(lambda : defaultdict(int))

graph['a']['b'] += 1
graph['a']['b'] += 1
graph['a']['c'] += 1

graph

Returns:

defaultdict(<function __main__.<lambda>>,
            {'a': defaultdict(int, {'b': 2, 'c': 1})})
# equal to: {'a': {'b': 2, 'c': 1}}
like image 200
Anton vBR Avatar answered Mar 16 '23 21:03

Anton vBR