I have a file consisting of words, one word on each line. The file looks like this:
aaa
bob
fff
err
ddd
fff
err
I want to count the frequency of the pair of words which occur one after the other.
For example,
aaa,bob: 1
bob,fff:1
fff,err:2
and so on. I have tried this
f=open(file,'r')
content=f.readlines()
f.close()
dic={}
it=iter(content)
for line in content:
    print line, next(line);
    dic.update({[line,next(line)]: 1})
I got the error:
TypeError: str object is not an iterator
I then tried using an iterator:
it=iter(content)
for x in it:
    print x, next(x);
Got the same error again. Please help!
You just need to keep track of the previous line. A file object is its own iterator, so you don't need iter or readlines at all: call next once at the very start to create a variable prev, then keep updating prev in the loop:
from collections import defaultdict
d = defaultdict(int)
with open("in.txt") as f:
    prev = next(f).strip()
    for line in map(str.strip, f):  # on Python 2 use itertools.imap
        d[prev, line] += 1
        prev = line
Which would give you:
defaultdict(<type 'int'>, {('aaa', 'bob'): 1, ('fff', 'err'): 2, ('err', 'ddd'): 1, ('bob', 'fff'): 1, ('ddd', 'fff'): 1})
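One edge case worth noting: next(f) raises StopIteration if the file is empty. Below is a minimal sketch of the same previous-line approach wrapped in a function, assuming Python 3. The name count_adjacent_pairs and the use of io.StringIO as a stand-in for the file are illustrative and not part of the answer above.

```python
import io
from collections import defaultdict


def count_adjacent_pairs(lines):
    """Count (previous, current) pairs from any iterable of lines."""
    counts = defaultdict(int)
    stripped = map(str.strip, lines)
    # A default of None avoids StopIteration on empty input.
    prev = next(stripped, None)
    if prev is None:
        return counts
    for line in stripped:
        counts[prev, line] += 1
        prev = line
    return counts


# io.StringIO stands in for an open file holding the question's sample data.
sample = io.StringIO("aaa\nbob\nfff\nerr\nddd\nfff\nerr\n")
pairs = count_adjacent_pairs(sample)
for (first, second), n in pairs.items():
    print("{},{}: {}".format(first, second, n))
```

Passing any iterable of lines (an open file, a list, a generator) should work the same way, since the function only relies on iteration.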
line, like all strs, is an iterable, which means it has an __iter__ method. But next works with iterators, which have a __next__ method (in Python 2 it's a next method). When the interpreter executes next(line), it attempts to call line.__next__. Since line does not have a __next__ method, it raises TypeError: str object is not an iterator.
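The distinction can be checked directly in an interpreter. A small sketch, assuming Python 3 (where the method is named __next__); the variable names are illustrative:

```python
line = "aaa"

# A str is iterable (has __iter__) but is not an iterator (no __next__).
has_iter = hasattr(line, "__iter__")
has_next = hasattr(line, "__next__")
print(has_iter, has_next)

# Calling next() on the string itself therefore fails with TypeError.
try:
    next(line)
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```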
Since line is an iterable and has an __iter__ method, we can set it = iter(line). it is an iterator with a __next__ method, and next(it) returns the next character in line. But you are looking for the next line in the file, so try something like:
from collections import defaultdict
dic = defaultdict(int)
with open('file.txt') as f:
    content = f.readlines()
for i in range(len(content) - 1):
    key = content[i].rstrip() + ',' + content[i+1].rstrip()
    dic[key] += 1
for k,v in dic.items():
    print(k,':',v)
Output (file.txt as in OP)
err,ddd : 1
ddd,fff : 1
aaa,bob : 1
fff,err : 2
bob,fff : 1
from collections import Counter
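with open(file, 'r') as f:
    # strip the trailing newline so the keys are the bare words
    content = [line.rstrip('\n') for line in f]
# pair each line with the one that follows it
result = Counter(zip(content, content[1:]))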
That will be a dictionary whose keys are the line pairs (in order) and whose values are the number of times that pair occurred.
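The approach above reads the whole file into a list first. A hedged alternative, assuming Python 3, pairs the lines lazily with itertools.tee so only one line is buffered at a time. The helper name pair_counts is hypothetical, and io.StringIO stands in for the file; on Python 3.10+ itertools.pairwise produces the same pairing.

```python
import io
from collections import Counter
from itertools import tee


def pair_counts(lines):
    """Count consecutive line pairs without building an intermediate list."""
    stripped = (line.rstrip("\n") for line in lines)
    first, second = tee(stripped)
    next(second, None)  # advance the second iterator by one line
    return Counter(zip(first, second))


# io.StringIO stands in for an open file holding the question's sample data.
sample = io.StringIO("aaa\nbob\nfff\nerr\nddd\nfff\nerr\n")
result = pair_counts(sample)
print(result.most_common(1))
```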
As others said, line is a string and thus cannot be passed to the next() function. Also, you can't use a list as a dictionary key because lists are not hashable. You can use a tuple instead. A simple solution:
f=open(file,'r')
content=f.readlines()
f.close()
dic={}
for i in range(len(content)-1):
    print(content[i], content[i+1])
    try:
        dic[(content[i], content[i+1])] += 1
    except KeyError:
        dic[(content[i], content[i+1])] = 1
Also notice that by using readlines() you also keep the '\n' of each line. You might want to strip it off first:
content = []
with open(file,'r') as f:
    for line in f:
        content.append(line.strip('\n'))
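The points about list keys and the trailing newline can be verified quickly. A short sketch, assuming Python 3, with io.StringIO standing in for the file; the variable names are illustrative:

```python
import io

dic = {}

# A list is unhashable, so using it as a dictionary key raises TypeError.
try:
    dic[["aaa", "bob"]] = 1
    list_key_ok = True
except TypeError:
    list_key_ok = False

# A tuple with the same contents works as a key.
dic[("aaa", "bob")] = 1

# readlines() keeps the trailing newline on each line.
raw = io.StringIO("aaa\nbob\n").readlines()
clean = [line.strip("\n") for line in raw]
print(list_key_ok, raw, clean)
```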
You can use a deque that holds two lines at a time and a Counter:
from collections import Counter, deque
lc=Counter()
d=deque(maxlen=2)
with open(fn) as f:
    d.append(next(f))
    for line in f:
        d.append(line)
        lc+=Counter(["{},{}".format(*[e.rstrip() for e in d])])
>>> lc
Counter({'fff,err': 2, 'ddd,fff': 1, 'bob,fff': 1, 'aaa,bob': 1, 'err,ddd': 1})
You can also use a regex with a capturing look ahead:
import re
with open(fn) as f:
    lc=Counter(m.group(1)+','+m.group(2) for m in re.finditer(r"(\w+)\n(?=(\w+))", f.read()))
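The lookahead matters because re.finditer returns non-overlapping matches: if the second word were consumed by the match, every other pair would be skipped. A small sketch on the question's sample data, assuming the words contain only \w characters; the variable names are illustrative:

```python
import re
from collections import Counter

text = "aaa\nbob\nfff\nerr\nddd\nfff\nerr\n"

# The second word sits inside a lookahead, so it is captured but not consumed
# and can start the next match.
overlapping = Counter(
    m.group(1) + "," + m.group(2)
    for m in re.finditer(r"(\w+)\n(?=(\w+))", text)
)

# Without the lookahead each match consumes both words, so pairs are skipped.
consuming = Counter(
    m.group(1) + "," + m.group(2)
    for m in re.finditer(r"(\w+)\n(\w+)", text)
)

print(overlapping)
print(consuming)
```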