Given a string of characters, I want to create a dictionary of all the n-character substrings contained in the string, where the dictionary key is the substring and the value is a list. The first element of the list is the number of occurrences of the substring, and the second element of the list is a list of starting locations for those occurrences.
For example, with n=3
, the string 'abcdabcxdabc'
results in this dictionary:
{'abc': [3, [0, 4, 9]],
'cda': [1, [2]],
'dab': [2, [3, 8]],
'bcd': [1, [1]],
'cxd': [1, [6]],
'bcx': [1, [5]],
'xda': [1, [7]]}
The code below works and is efficient since it only walks through the string once, but I'm wondering if there is a more elegant and/or more pythonic way to do this, maybe using a dictionary comprehension. I'm pretty new to python and still trying to figure out when it makes sense (or is even possible) to use comprehensions, etc.
text = 'abcdabcxdabc'
n = 3
d = {}
for i in range(len(text) - n + 1):
sub = text[i:i + n]
if sub in d:
d[sub][0] += 1
d[sub][1].append(i)
else:
d[sub] = [1, [i]]
print(d)
Update: Thanks for all the replies. They generally confirm my suspicion that this is too complex to be efficiently implemented in a single comprehension (but thanks to volcano for showing that it is possible if efficiency isn't a concern). Thanks to RemcoGerlich and Ignacio Vazquez-Abrams for pointing me to defaultdict. I'll have to dig into that more. My timing results suggest that there's a little more overhead in initializing a defaultdict compared to a dict, but that the running time might be slightly faster, at least for this example. (Timing results posted in a separate comment below.) Now I'm wondering if there are any situations where dict would be preferred over defaultdict. Also, thanks to Narcolei for pointing me to the timeit functions.
Dictionaries in Python allow us to store a series of mappings between two sets of values, namely, the keys and the values. All items in the dictionary are enclosed within a pair of curly braces {} . Each item in a dictionary is a mapping between a key and a value - called a key-value pair.
Dictionary comprehension is a method for transforming one dictionary into another dictionary. During this transformation, items within the original dictionary can be conditionally included in the new dictionary and each item can be transformed as needed.
Its syntax is the same as List Comprehension. It returns a generator object. A dict comprehension, in contrast, to list and set comprehensions, needs two expressions separated with a colon. The expression can also be tuple in List comprehension and Set comprehension.
The problem is that v[0]
depends on the length or v[1]
, which means that either the operation to generate v[1]
would have to operate twice, or that the dictionary would have to be iterated over in order to fill in v[0]
to replace the dummy value included the first time.
Another problem is that dict comprehensions expect the entire key and value to be available immediately, which means that you would have to run a list comprehension to get all the indexes of the character, which means that the entire operation becomes O(n2).
The only optimization I would make would be to replace the creation of d
so that you don't need to check for key containment.
d = collections.defaultdict(lambda: [0, []])
It is scary, but (I added just offsets, number of occurrences you may get from list of offsets). Yes, it may be done
In [83]: my_str = 'abcdabcxdabc'
In [84]: n=3
In [85]: {substr: [my_str.replace(substr, ' '*n, c).index(substr)
for c in xrange(my_str.count(substr))]
....: for substr in set(my_str[idx:idx+n] for idx in xrange(len(my_str)-n))}
Out[85]:
{'abc': [0, 4, 9],
'bcd': [1],
'bcx': [5],
'cda': [2],
'cxd': [6],
'dab': [3, 8],
'xda': [7]}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With