I'm trying to read a text file that contains a lot of non-traditional line breaks.
There are two files, both with 18846 lines. But when I read one of these files in Python 3 and split it into lines, I get 19010 lines.
This does not happen with Python 2, nor with Unix commands like awk 'END {print NR}' file and wc -l. I know that Python 3 splits lines based on 12 criteria (listed in [1]).
I've tried strategies like using replace:
content = content.replace (u"\v", "")
content = content.replace (u"\x0b", "")
content = content.replace (u"\f", "")
content = content.replace (u"\x0c", "")
content = content.replace (u"\x1c", "")
content = content.replace (u"\x1d", "")
content = content.replace (u"\x1e", "")
content = content.replace (u"\x85", "")
content = content.replace (u"\u2029", "")
content = content.replace (u"\u2028", "")
content = content.replace (u"\u001D", "")
opening files with "rt" and even using ftfy, but no alternative was successful.
Does anyone have any idea how to read the files so that lines are split the same way wc and awk split them? A solution that alters the file itself would also be acceptable.
[1] https://docs.python.org/3/library/stdtypes.html#str.splitlines
Use io.open and set the newline argument to the line ending of your choice (like \n as in Unix tools):
import io

# With newline='\n', only '\n' ends a line
with io.open(file_path, 'r', encoding='utf8', newline='\n') as sr:
    for line in sr:
        pass  # do stuff
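To see the difference in line counts, here is a minimal sketch. The sample text, the temporary file, and the helper name count_lines_like_wc are made up for illustration; only io.open with newline='\n' and str.splitlines() come from the discussion above.

```python
import io
import os
import tempfile

# Hypothetical sample: three lines by the "\n" convention, each containing
# one extra character that str.splitlines() also treats as a line boundary.
sample = "first\u2028still first\nsecond\x0bstill second\nthird\x85still third\n"

def count_lines_like_wc(path):
    # newline="\n" makes "\n" the only line terminator while iterating.
    with io.open(path, "r", encoding="utf8", newline="\n") as sr:
        return sum(1 for _ in sr)

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Write the sample without any newline translation.
    with io.open(path, "w", encoding="utf8", newline="\n") as f:
        f.write(sample)
    unix_count = count_lines_like_wc(path)
    splitlines_count = len(sample.splitlines())
finally:
    os.remove(path)

print(unix_count, splitlines_count)  # 3 6
```

Iterating over the file yields three lines, matching what wc -l would report for this sample, while splitlines() reports six because it also splits on LSEP, VT and NEL.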
Note that you may also want to remove all other line breaks or replace them with spaces. This can be done with a regex like
import re
line = re.sub('[\u000B\u000C\u000D\u0085\u2028\u2029]+', ' ', line)
where the pattern matches one or more chars like
\u000B - VT, vertical tab
\u000C - FF, form feed
\u000D - CR, carriage return
\u0085 - NEL, next line (a very frequent one)
\u2028 - LSEP, line separator
\u2029 - PSEP, paragraph separator
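As a rough sketch of how that replacement could be applied to each line, the following uses the same character class as above. The helper name normalize_line and the sample lines are hypothetical. Note that the class does not cover \x1c, \x1d and \x1e from the question's list; they would need to be added if the data contains them.

```python
import re

# Same character class as in the answer above, compiled once for reuse.
OTHER_BREAKS = re.compile("[\u000B\u000C\u000D\u0085\u2028\u2029]+")

def normalize_line(line):
    # Hypothetical helper: drop the trailing "\n", then replace each run
    # of other line-break characters with a single space.
    body = line.rstrip("\n")
    return OTHER_BREAKS.sub(" ", body)

# Made-up lines, as they would come from a file opened with newline="\n".
raw_lines = ["a\u2028\u2029b\n", "c\x0bd\x0ce\n", "f\x85g\n"]
cleaned = [normalize_line(raw) for raw in raw_lines]

print(cleaned)  # ['a b', 'c d e', 'f g']
```

After this step, calling splitlines() on any cleaned line returns a single element, so later processing no longer sees extra line boundaries.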