I have a text file, say:
text1 text2 text text
text text text text
I am looking to first count the number of strings in the file (all delimited by spaces) and then output the first two (text1 text2).
Any ideas?
Thanks in advance for the help
Edit: This is what I have so far:
>>> f=open('test.txt')
>>> for line in f:
    print line
text1 text2 text text text text hello
>>> words=line.split()
>>> words
['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']
>>> len(words)
7
if len(words) > 2:
    print "there are more than 2 words"
The first problem I have is that my text file is: text1 text2 text text text
But when I print words I get: ['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']
Where does the '\xef\xbb\xbf' come from?
Method 1: fileobject.readlines(). Open the file to get a file object, then call its readlines() method to read every line into a list. This is convenient when you need a single line or a range of lines and the whole file fits comfortably in memory.
Use readlines() to read a range of lines from the file. The readlines() method reads all lines from a file and stores them in a list. You can index or slice that list by line number (counting from 0) to extract the lines you want. This is a straightforward way to read specific lines from a file in Python.
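A minimal sketch of the readlines() approach described above. It is written for Python 3 (the transcript in the question is Python 2), the helper name read_line_range is made up for illustration, and io.StringIO stands in for a real open file so the snippet runs on its own:

```python
import io

def read_line_range(f, start, stop):
    """Return lines start..stop-1 (0-based) from an open file object."""
    lines = f.readlines()   # the whole file as a list of strings
    return lines[start:stop]

# io.StringIO stands in for open('test.txt') here (assumption for the demo)
sample = io.StringIO("first\nsecond\nthird\nfourth\n")
print(read_line_range(sample, 1, 3))   # ['second\n', 'third\n']
```

Each returned line keeps its trailing newline; call .rstrip('\n') on it if that is unwanted.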
To read a file line by line, just loop over the open file object in a for loop:
for line in open(filename):
    # do something with line
To split a line by whitespace into a list of separate words, use str.split():
words = line.split()
To count the number of items in a Python list, use len(yourlist):
count = len(words)
To select the first two items from a Python list, use slicing:
firsttwo = words[:2]
I'll leave constructing the complete program to you, but you won't need much more than the above, plus an if statement to see if you already have your two words.
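For reference, here is one way the pieces above might fit together. This is only a sketch, written for Python 3; the function name count_words_and_first_two is hypothetical, and io.StringIO stands in for the open file so it runs without a test.txt on disk:

```python
import io

def count_words_and_first_two(lines):
    """Return (total word count, first two words) for an iterable of lines."""
    total = 0
    first_two = []
    for line in lines:
        words = line.split()          # split on any whitespace
        total += len(words)
        if len(first_two) < 2:        # still missing one or both words
            first_two.extend(words[:2 - len(first_two)])
    return total, first_two

# Stand-in for open('test.txt'), using the sample text from the question
sample = io.StringIO("text1 text2 text text\ntext text text text\n")
print(count_words_and_first_two(sample))   # (8, ['text1', 'text2'])
```

Because the function only needs something it can loop over, passing it a real file object from open(filename) should work the same way.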
The three extra bytes you see at the start of your file are the UTF-8 BOM (Byte Order Mark); it marks your file as UTF-8 encoded, but it is redundant and only really used on Windows.
You can remove it with:
import codecs
if line.startswith(codecs.BOM_UTF8):
    line = line[3:]
You may want to decode your strings to unicode using that encoding:
line = line.decode('utf-8')
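The two steps above (dropping the BOM bytes, then decoding) can be sketched together. This version assumes Python 3, where you only get byte strings if the file was opened in binary mode ('rb'); the helper name strip_bom_and_decode is made up for illustration:

```python
import codecs

def strip_bom_and_decode(raw):
    """Drop a leading UTF-8 BOM from a byte string, then decode it."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]   # the BOM is 3 bytes long
    return raw.decode("utf-8")

# The same bytes as in the question's output
line = b"\xef\xbb\xbftext1 text2 text"
print(strip_bom_and_decode(line).split())   # ['text1', 'text2', 'text']
```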
You could also open the file using codecs.open():
file = codecs.open(filename, encoding='utf-8')
Note that codecs.open() will not strip the BOM for you; the easiest way to do that is to use .lstrip():
import codecs
BOM = codecs.BOM_UTF8.decode('utf8')
with codecs.open(filename, encoding='utf-8') as f:
    for line in f:
        line = line.lstrip(BOM)
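As an alternative not mentioned above, Python also ships a utf-8-sig codec that decodes UTF-8 and skips a leading BOM if one is present. The sketch below assumes io.open() (the same function as the built-in open() in Python 3); the helper name first_line_words is hypothetical, and the snippet writes its own small BOM-prefixed file so it can run standalone:

```python
import codecs
import io
import os
import tempfile

def first_line_words(path):
    """Return the words on the first line, ignoring any UTF-8 BOM."""
    # utf-8-sig decodes UTF-8 and drops a leading BOM if there is one
    with io.open(path, encoding="utf-8-sig") as f:
        return f.readline().split()

# Write a sample file that starts with a BOM (demo data only)
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(codecs.BOM_UTF8 + b"text1 text2 text\n")

print(first_line_words(path))   # ['text1', 'text2', 'text']
os.remove(path)
```

Files without a BOM decode normally under utf-8-sig, so the same call should be safe for both kinds of input.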