Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can python dictionary comprehension be used to create a dictionary of substrings and their locations?

Given a string of characters, I want to create a dictionary of all the n-character substrings contained in the string, where the dictionary key is the substring and the value is a list. The first element of the list is the number of occurrences of the substring, and the second element of the list is a list of starting locations for those occurrences.

For example, with n=3, the string 'abcdabcxdabc' results in this dictionary:

{'abc': [3, [0, 4, 9]], 
 'cda': [1, [2]], 
 'dab': [2, [3, 8]], 
 'bcd': [1, [1]], 
 'cxd': [1, [6]], 
 'bcx': [1, [5]], 
 'xda': [1, [7]]}

The code below works and is efficient since it only walks through the string once, but I'm wondering if there is a more elegant and/or more pythonic way to do this, maybe using a dictionary comprehension. I'm pretty new to python and still trying to figure out when it makes sense (or is even possible) to use comprehensions, etc.

text = 'abcdabcxdabc'
n = 3
d = {}
for i in range(len(text) - n + 1):
    sub = text[i:i + n]
    if sub in d:
        d[sub][0] += 1
        d[sub][1].append(i)
    else:
        d[sub] = [1, [i]]
print(d)    

Update: Thanks for all the replies. They generally confirm my suspicion that this is too complex to be efficiently implemented in a single comprehension (but thanks to volcano for showing that it is possible if efficiency isn't a concern). Thanks to RemcoGerlich and Ignacio Vazquez-Abrams for pointing me to defaultdict. I'll have to dig into that more. My timing results suggest that there's a little more overhead in initializing a defaultdict compared to a dict, but that the running time might be slightly faster, at least for this example. (Timing results posted in a separate comment below.) Now I'm wondering if there are any situations where dict would be preferred over defaultdict. Also, thanks to Narcolei for pointing me to the timeit functions.

like image 811
user3065699 Avatar asked Dec 04 '13 12:12

user3065699


People also ask

Why do we use dictionary comprehension in Python?

Dictionaries in Python allow us to store a series of mappings between two sets of values, namely, the keys and the values. All items in the dictionary are enclosed within a pair of curly braces {} . Each item in a dictionary is a mapping between a key and a value - called a key-value pair.

What is a dictionary comprehension in Python?

Dictionary comprehension is a method for transforming one dictionary into another dictionary. During this transformation, items within the original dictionary can be conditionally included in the new dictionary and each item can be transformed as needed.

What is the difference between list comprehension and dictionary comprehension in Python?

Its syntax is the same as List Comprehension. It returns a generator object. A dict comprehension, in contrast, to list and set comprehensions, needs two expressions separated with a colon. The expression can also be tuple in List comprehension and Set comprehension.


2 Answers

The problem is that v[0] depends on the length or v[1], which means that either the operation to generate v[1] would have to operate twice, or that the dictionary would have to be iterated over in order to fill in v[0] to replace the dummy value included the first time.

Another problem is that dict comprehensions expect the entire key and value to be available immediately, which means that you would have to run a list comprehension to get all the indexes of the character, which means that the entire operation becomes O(n2).

The only optimization I would make would be to replace the creation of d so that you don't need to check for key containment.

d = collections.defaultdict(lambda: [0, []])
like image 105
Ignacio Vazquez-Abrams Avatar answered Oct 24 '22 10:10

Ignacio Vazquez-Abrams


It is scary, but (I added just offsets, number of occurrences you may get from list of offsets). Yes, it may be done

In [83]: my_str = 'abcdabcxdabc'

In [84]: n=3

In [85]: {substr: [my_str.replace(substr, ' '*n, c).index(substr) 
                   for c in xrange(my_str.count(substr))]
   ....: for substr in set(my_str[idx:idx+n] for idx in xrange(len(my_str)-n))}
Out[85]: 
{'abc': [0, 4, 9],
 'bcd': [1],
 'bcx': [5],
 'cda': [2],
 'cxd': [6],
 'dab': [3, 8],
 'xda': [7]}
like image 39
volcano Avatar answered Oct 24 '22 11:10

volcano