Given a string, I want to count how many substrings of length 5 it contains.
For example: Input: "ABCDEFG" Output: 3
I'm not sure what the easiest and fastest way to do this in Python would be. Any ideas?
Update:
I want only to count different substrings.
Input: "AAAAAA" Substrings: 2 times "AAAAA" Output: 1
>>> n = 5
>>> for s in 'ABCDEF', 'AAAAAA':
... len({s[i:i+n] for i in range(len(s)-n+1)})
...
2
1
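The same idea as a reusable function, as a rough sketch. The name count_distinct_substrings is only illustrative and not from the snippet above. Every length-n slice goes into a set, so duplicates collapse, and a string shorter than n gives 0 because the range is empty.

```python
def count_distinct_substrings(s, n=5):
    """Count the distinct substrings of length n in s."""
    # range() is empty when len(s) < n, so short strings give 0.
    return len({s[i:i + n] for i in range(len(s) - n + 1)})


print(count_distinct_substrings("ABCDEFG"))  # 3
print(count_distinct_substrings("AAAAAA"))   # 1
```

This does one pass over the string and one slice per position, which should be plenty fast for ordinary inputs.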
To get the substrings you could use NLTK like this:
>>> from nltk.util import ngrams
>>> for gram in ngrams("ABCDEFG", 5):
...     print(gram)
...
('A', 'B', 'C', 'D', 'E')
('B', 'C', 'D', 'E', 'F')
('C', 'D', 'E', 'F', 'G')
You could apply a Counter and then get the unique n-grams (and their frequency) like so:
>>> from collections import Counter
>>> Counter(ngrams("AAAAAAA", 5))
Counter({('A', 'A', 'A', 'A', 'A'): 3})
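If installing NLTK just for this feels heavy, the frequency count can probably be done with the standard library alone. This is a sketch under that assumption; substring_counts is a made-up name, and it counts plain string slices rather than tuples of characters.

```python
from collections import Counter


def substring_counts(s, n=5):
    """Map each length-n substring of s to how often it occurs."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


counts = substring_counts("AAAAAAA")
print(counts)       # Counter({'AAAAA': 3})
print(len(counts))  # 1 distinct substring
```

len(counts) gives the number of different substrings, and the values give how often each one appears.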
Using list comprehension (code golf):
findSubs=lambda s,v:[''.join([s[i+j] for j in range(v)]) for i,x in enumerate(s) if i<=len(s)-v]
findCount=lambda s,v:len(findSubs(s,v))
print(findSubs('ABCDEFG', 5))  # prints ['ABCDE', 'BCDEF', 'CDEFG']
print(findCount('ABCDEFG', 5))  # prints 3
Update
For your update, you could convert the list above to a set to drop the duplicates, then sort the result (sorted() already returns a list).
findUnique=lambda s,v:sorted(set(findSubs(s,v)))
findUniqueCount=lambda s,v:len(findUnique(s,v))
print(findUnique('AAAAAA', 5))  # prints ['AAAAA']
print(findUniqueCount('AAAAAA', 5))  # prints 1