Optimizing Python code with many attribute and dictionary lookups

I have written a program in Python which spends a large amount of time looking up attributes of objects and values from dictionary keys. I would like to know if there's any way I can optimize these lookup times, potentially with a C extension, to reduce the time of execution, or if I need to simply re-implement the program in a compiled language.

The program implements some algorithms on a graph. It runs prohibitively slowly on our data sets, so I profiled the code with cProfile using a reduced data set that could actually complete. The vast majority of the time is spent in one function, and specifically in two statements within it, both generator expressions:

The generator expression at line 202 is

    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)

and the generator expression at line 204 is

    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)

The full source code for this function is provided below for context.

selected_nodes is a set of nodes in the interaction_graph, which is a NetworkX Graph instance. node_neighbors is an iterator from Graph.neighbors_iter().

Graph itself uses dictionaries for storing nodes and edges. Its Graph.node attribute is a dictionary mapping each node to a dictionary of that node's attributes (e.g., 'weight').
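
For illustration, the weight lookup in the second generator expression walks one attribute and two dictionaries; the node names and weights here are invented, but the shape matches what is described above:

    # Invented example data; only the shape matches the real graph:
    #     interaction_graph.node == {'geneA': {'weight': 2.7},
    #                                'geneB': {'weight': 0.4}, ...}
    # Each weight lookup is one attribute lookup plus two dict lookups.
    weight = interaction_graph.node['geneA']['weight']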

Each of these lookups should take amortized constant time (i.e., O(1)); however, I am still paying a large penalty for them. Is there some way I can speed up these lookups (e.g., by writing parts of this as a C extension), or do I need to move the program to a compiled language?


Below is the full source code for the function in question; the vast majority of execution time is spent within it.

def calculate_node_z_prime(
        node,
        interaction_graph,
        selected_nodes
    ):
    """Calculates a z'-score for a given node.

    The z'-score is based on the z-scores (weights) of the neighbors of
    the given node, and proportional to the z-score (weight) of the
    given node. Specifically, we find the maximum z-score of all
    neighbors of the given node that are also members of the given set
    of selected nodes, multiply this z-score by the z-score of the given
    node, and return this value as the z'-score for the given node.

    If the given node has no neighbors in the interaction graph, the
    z'-score is defined as zero.

    Returns the z'-score as zero or a positive floating point value.

    :Parameters:
    - `node`: the node for which to compute the z-prime score
    - `interaction_graph`: graph containing the gene-gene or gene
      product-gene product interactions
    - `selected_nodes`: a `set` of nodes fitting some criterion of
      interest (e.g., annotated with a term of interest)

    """
    node_neighbors = interaction_graph.neighbors_iter(node)
    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)
    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)
    try:
        max_z_score = max(neighbor_z_scores)
    # max() throws a ValueError if its argument has no elements; in this
    # case, we need to set the max_z_score to zero
    except ValueError, e:
        # Check to make certain max() raised this error
        if 'max()' in e.args[0]:
            max_z_score = 0
        else:
            raise e

    z_prime = interaction_graph.node[node]['weight'] * max_z_score
    return z_prime

Here are the top few calls according to cProfile, sorted by time.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
156067701  352.313    0.000  642.072    0.000 bpln_contextual.py:204(<genexpr>)
156067701  289.759    0.000  289.759    0.000 bpln_contextual.py:202(<genexpr>)
 13963893  174.047    0.000  816.119    0.000 {max}
 13963885   69.804    0.000  936.754    0.000 bpln_contextual.py:171(calculate_node_z_prime)
  7116883   61.982    0.000   61.982    0.000 {method 'update' of 'set' objects}
asked by gotgenes


1 Answer

How about keeping the iteration order of interaction_graph.neighbors_iter(node) sorted (or partially sorted, using the heapq module)? Since you're just trying to find the maximum value, you can iterate over node_neighbors in descending order of weight; the first node that is also in selected_nodes must have the maximum weight among the selected neighbors.
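
A minimal sketch of that idea, assuming the per-node sorted neighbor lists can be built once up front; the helper names below are made up for illustration and are not part of NetworkX:

    def build_sorted_neighbors(interaction_graph):
        """Map each node to its neighbors sorted by weight, descending."""
        sorted_neighbors = {}
        for node in interaction_graph:
            sorted_neighbors[node] = sorted(
                interaction_graph.neighbors_iter(node),
                key=lambda n: interaction_graph.node[n]['weight'],
                reverse=True)
        return sorted_neighbors

    def max_selected_neighbor_weight(node, sorted_neighbors,
                                     interaction_graph, selected_nodes):
        # The first neighbor found in selected_nodes already carries the
        # maximum weight, so we can stop scanning immediately.
        for neighbor in sorted_neighbors[node]:
            if neighbor in selected_nodes:
                return interaction_graph.node[neighbor]['weight']
        return 0

As long as the weights don't change, the sorting cost is paid once, and each call only scans until the first selected neighbor.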

Second, how often does selected_nodes change? If it changes rarely, you can save a lot of lookups by keeping a precomputed mapping of the weights, "interaction_graph.node[x]['weight'] for x in selected_nodes", instead of looking them up through the graph on every call.
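
A sketch of that caching idea, assuming selected_nodes and the weights are stable between calls (the selected_weights cache and the function name are made up for illustration):

    # Rebuild only when selected_nodes (or the weights) change.
    selected_weights = dict(
        (n, interaction_graph.node[n]['weight']) for n in selected_nodes)

    def calculate_node_z_prime_cached(node, interaction_graph, selected_weights):
        # One dict serves as both the membership test and the weight lookup.
        neighbor_weights = [selected_weights[neighbor]
                            for neighbor in interaction_graph.neighbors_iter(node)
                            if neighbor in selected_weights]
        max_z_score = max(neighbor_weights) if neighbor_weights else 0
        return interaction_graph.node[node]['weight'] * max_z_score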

EDIT: to reply to the comments

"A sort() would take O(n log n)"

Not necessarily; you're looking too much at your textbook. Despite what your textbook says, you can sometimes break the O(n log n) barrier by exploiting structure in your data. If you keep your list of neighbors in a naturally sorted data structure in the first place (e.g., a heap or a binary tree), you don't need to re-sort at every iteration. Of course this is a space-time tradeoff, since you will need to store redundant lists of neighbors, and there is code complexity in ensuring that the lists of neighbors are updated when the neighbors change.
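
A rough sketch of maintaining such a structure incrementally with the standard bisect module; the bookkeeping helpers are hypothetical, and keeping them in sync with the real graph is exactly the code-complexity cost mentioned above:

    import bisect

    # node -> list of (weight, neighbor) tuples, kept in ascending order
    sorted_neighbors = {}

    def add_neighbor(node, neighbor, weight):
        # Insert into the already-sorted list rather than re-sorting later.
        bisect.insort(sorted_neighbors.setdefault(node, []), (weight, neighbor))

    def max_selected_weight(node, selected_nodes):
        # Walk from the high-weight end; the first selected neighbor wins.
        for weight, neighbor in reversed(sorted_neighbors.get(node, [])):
            if neighbor in selected_nodes:
                return weight
        return 0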

Also, Python's list.sort(), which uses the Timsort algorithm, is very fast on nearly sorted data (it can approach O(n) in the best case). It still doesn't break O(n log n) in general; that has been proven mathematically impossible for comparison sorts.

You need to profile before dismissing a solution as unlikely to improve performance. When doing extreme optimization, you may well find that in certain very special edge cases the old, slow bubble sort wins over a glorified quicksort or mergesort.
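
For example, a quick timing comparison on a representative node (reusing the hypothetical names from the sketches above) is enough to see whether the caching pays off:

    import timeit

    baseline = timeit.timeit(
        lambda: calculate_node_z_prime(node, interaction_graph, selected_nodes),
        number=10000)
    cached = timeit.timeit(
        lambda: calculate_node_z_prime_cached(node, interaction_graph,
                                              selected_weights),
        number=10000)
    print 'baseline: %.3f s  cached: %.3f s' % (baseline, cached)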

answered by Lie Ryan


