When I call readlines() on a .srt file, I get a list of strings with lots of leading and trailing whitespace, as shown below.
def read_lines(infile):
    with open(infile) as f:
        r = f.readlines()
    return r
I got this list
['\xef\xbb\xbf1\r\n', '00:00:00,000 --> 00:00:03,000\r\n', "[D. Evans] Now that you've written your first Python program,\r\n",'\r\n', '2\r\n', '00:00:03,000 --> 00:00:06,000\r\n', 'you might be wondering why we need to invent new languages like Python\r\n', '\r\n']
I have only included a few elements for brevity. How do I clean this list so that all the whitespace characters are removed and I get only the relevant elements, like
['1','00:00:00,000 --> 00:00:03,000',"[D. Evans] Now that you've written your first Python program"...]
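You can strip each line as you read it. Writing the function as a generator could also save you some memory if you're working on a big file, since lines are produced one at a time instead of being held in a list.
Also, it looks like you're working on a UTF-8 file with a BOM (which is sort of silly, or at least unnecessary): the '\xef\xbb\xbf' at the start of your first element is the BOM, not part of the text. Opening the file with the "utf-8-sig" codec decodes it as UTF-8 and drops the BOM.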
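import codecs

def strip_it_good(file):
    # "utf-8-sig" decodes UTF-8 and skips a leading BOM if there is one
    with codecs.open(file, "r", "utf-8-sig") as f:
        for line in f:
            # strip() removes leading/trailing whitespace, including '\r\n'
            yield line.strip()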
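One thing to watch: strip() turns the blank separator lines ('\r\n') into empty strings rather than dropping them, and the output you asked for has no empty entries. Below is a minimal sketch of a variant that also skips those lines. The name clean_lines is just made up for illustration, and it uses io.open with an encoding argument instead of codecs.open, which should behave the same way for this purpose.

```python
import io


def clean_lines(path):
    """Yield each non-blank line of a UTF-8 file, stripped of whitespace."""
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        for line in f:
            stripped = line.strip()
            # Skip lines that contained only whitespace, such as '\r\n'.
            if stripped:
                yield stripped
```

If you need an actual list rather than a generator, wrap the call: list(clean_lines(infile)).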