
How to split a string into lines based only on \n in python3?

I'm trying to read a text file that contains a lot of non-traditional line breaks.

There are two files, both with 18846 lines. But when I read one of these files in python3 and split it into lines, I get 19010 lines.

This does not happen with python2 or with Unix commands like awk 'END {print NR}' file and wc -l. I know that python3's str.splitlines splits lines based on 12 criteria (listed in [1]).
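
To illustrate the discrepancy, here is a minimal sketch using a made-up sample string (not the actual files). It assumes the extra lines come from characters such as vertical tab and NEL, which str.splitlines treats as line boundaries but which wc -l, counting only "\n", does not.

```python
# Hypothetical sample: two "\n" terminators plus a vertical tab and a NEL.
text = "first\x0bstill first\nsecond\x85still second\n"

# str.splitlines() also breaks on \x0b and \x85 (among others).
lines_splitlines = text.splitlines()

# wc -l counts "\n" characters only.
newline_count = text.count("\n")

# Splitting on "\n" alone keeps the other characters inside each line.
lines_split = text.split("\n")
if lines_split and lines_split[-1] == "":
    lines_split.pop()  # drop the empty piece after the final "\n"

print(len(lines_splitlines), newline_count, len(lines_split))
```

With this sample, splitlines reports more lines than there are "\n" characters, which is the same kind of mismatch as 19010 versus 18846.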

I've tried strategies like using replace:

content = content.replace(u"\v", "")
content = content.replace(u"\x0b", "")
content = content.replace(u"\f", "")
content = content.replace(u"\x0c", "")
content = content.replace(u"\x1c", "")
content = content.replace(u"\x1d", "")
content = content.replace(u"\x1e", "")
content = content.replace(u"\x85", "")
content = content.replace(u"\u2029", "")
content = content.replace(u"\u2028", "")
content = content.replace(u"\u001D", "")

opening the files with "rt", and even using ftfy, but none of these worked.

Does anyone have any idea how to read the files so that lines are split the same way wc and awk do it? A solution that alters the file would also be acceptable.

[1] https://docs.python.org/3/library/stdtypes.html#str.splitlines

Vítor Mangaravite asked Mar 19 '19 20:03



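1 Answer

Use io.open (in Python 3 this is the same function as the built-in open) and set the newline argument to the line ending of your choice, such as \n as in Unix tools. With newline='\n', lines read from the file end only at \n: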

import io

with io.open(file_path, 'r', encoding='utf8', newline='\n') as sr:
    for line in sr:
        pass  # do stuff with line; only '\n' terminates a line here
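
As a self-contained check of this approach, the sketch below writes a small made-up sample to a temporary file and reads it back with newline='\n'. The sample content and the temporary file are assumptions for illustration only; they stand in for the real input.

```python
import io
import os
import tempfile

# Hypothetical sample: two "\n"-terminated lines that also contain
# a form feed (\x0c) and a line separator (\u2028).
sample = "a\x0cb\nc\u2028d\n"

# Write the sample without newline translation so the file holds it verbatim.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf8", newline="\n", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    file_path = tmp.name

try:
    # Iterating the file object ends a line only at "\n".
    with io.open(file_path, "r", encoding="utf8", newline="\n") as sr:
        lines = [line.rstrip("\n") for line in sr]
finally:
    os.remove(file_path)

print(len(lines))
```

Here the form feed and the line separator stay inside their lines, so the count matches the number of "\n" characters in the file.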

Note that you may also want to remove all other line breaks or replace them with spaces. This can be done with a regex like

import re
line = re.sub('[\u000B\u000C\u000D\u0085\u2028\u2029]+', ' ', line)

where the pattern matches one or more of the following characters:

  • \u000B - VT, vertical tab
  • \u000C - FF, form feed
  • \u000D - CR, carriage return
  • \u0085 - NEL, next line (a very frequent one)
  • \u2028 - LSEP, line separator
  • \u2029 - PSEP, paragraph separator
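
A short sketch of that replacement applied to a made-up string is shown below. The helper name normalize_line and the sample input are hypothetical; the character class is the one from the regex above.

```python
import re

# Same character class as above: VT, FF, CR, NEL, LSEP, PSEP.
EXTRA_BREAKS = re.compile("[\u000B\u000C\u000D\u0085\u2028\u2029]+")


def normalize_line(line):
    """Replace each run of the listed line-break characters with one space."""
    return EXTRA_BREAKS.sub(" ", line)


# Hypothetical input containing NEL, LSEP + PSEP together, and VT.
cleaned = normalize_line("one\u0085two\u2028\u2029three\x0bfour")
print(cleaned)
```

Because of the + quantifier, adjacent break characters collapse into a single space. Note that this class does not include \x1c, \x1d and \x1e, which str.splitlines also treats as line boundaries, so they would need to be added if the cleaned text is later passed to splitlines.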

Wiktor Stribiżew answered Oct 30 '22 11:10