Given a text string of unknown origin, how does one best rewrite it to use a known line-ending convention?
I usually do:
lines = text.splitlines()
text = '\n'.join(lines)
... but this doesn't handle "mixed" text files with utterly confused conventions (yes, they still exist!).
The one-liner for what I'm doing is of course:
'\n'.join(text.splitlines())
... that's not what I'm asking about.
The total number of lines should be the same afterwards, so no stripping of empty lines.
Splitting
'a\nb\n\nc\nd'
'a\r\nb\r\n\r\nc\r\nd'
'a\rb\r\rc\rd'
'a\rb\n\rc\rd'
'a\rb\r\nc\nd'
'a\nb\r\nc\rd'
... should all result in 5 lines. In a mixed context, splitlines() treats '\r\n' as a single logical newline, which gives 4 lines for the last two test cases.
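To make the line counts concrete, here is a small check of the six test strings with splitlines(). It is only a sketch (written for Python 3; the names `cases` and `counts` are mine), but it shows which inputs come out as 4 lines instead of 5:

```python
# The six test strings from the question, in the same order.
cases = [
    'a\nb\n\nc\nd',
    'a\r\nb\r\n\r\nc\r\nd',
    'a\rb\r\rc\rd',
    'a\rb\n\rc\rd',
    'a\rb\r\nc\nd',
    'a\nb\r\nc\rd',
]

# splitlines() treats '\r\n' as one terminator, so the last two
# strings yield 4 lines rather than 5.
counts = [len(c.splitlines()) for c in cases]

for c, n in zip(cases, counts):
    print(repr(c), '->', n)
```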
Hm, a mixed context that contains '\r\n' can be detected by comparing the result of splitlines() and split('\n'), and/or split('\r')...
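A rough sketch of that detection idea follows. Instead of comparing list lengths, which is thrown off by a trailing newline, it counts each kind of terminator directly. The helper names `count_terminators` and `is_mixed` are made up for illustration, and "mixed" is assumed to mean "more than one kind of terminator appears":

```python
def count_terminators(text):
    """Return (crlf, lone_cr, lone_lf) counts for text."""
    crlf = text.count('\r\n')
    # Every CRLF pair contributes one '\r' and one '\n', so subtract
    # the pairs to get the terminators that stand alone.
    lone_cr = text.count('\r') - crlf
    lone_lf = text.count('\n') - crlf
    return crlf, lone_cr, lone_lf


def is_mixed(text):
    """True if more than one kind of line terminator occurs in text."""
    return sum(1 for n in count_terminators(text) if n) > 1


print(is_mixed('a\r\nb\r\n\r\nc\r\nd'))  # consistent CRLF
print(is_mixed('a\rb\r\nc\nd'))          # CR, CRLF and LF together
```

Whether a '\r\n' inside a mixed file should then count as one newline or two is still a judgment call; this only tells you that the file is inconsistent.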
mixed.replace('\r\n', '\n').replace('\r', '\n')
should handle all possible variants.
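Wrapped up as a function, that could look like the sketch below. The name `normalize_newlines` and the `newline` parameter are my own additions, not part of any library. Note that it still treats '\r\n' as one newline, so the last two test strings from the question come out as 4 lines, not 5:

```python
def normalize_newlines(text, newline='\n'):
    """Rewrite CRLF, lone CR and lone LF in text as `newline`."""
    # Replace CRLF first; otherwise its '\r' would be converted on its
    # own and every CRLF would turn into two newlines.
    unified = text.replace('\r\n', '\n').replace('\r', '\n')
    if newline == '\n':
        return unified
    return unified.replace('\n', newline)


print(repr(normalize_newlines('a\rb\n\rc\rd')))
print(repr(normalize_newlines('a\nb\r\nc\rd', newline='\r\n')))
```

One difference from '\n'.join(text.splitlines()): a trailing newline survives here, whereas the join version silently drops it.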
... but this doesn't handle "mixed" text-files of utterly confused conventions (Yes, they still exist!)
Actually it should work fine:
>>> s = 'hello world\nline 1\r\nline 2'
>>> s.splitlines()
['hello world', 'line 1', 'line 2']
>>> '\n'.join(s.splitlines())
'hello world\nline 1\nline 2'
What version of Python are you using?
EDIT: I still don't see how splitlines() is not working for you:
>>> s = '''\
... First line, with LF\n\
... Second line, with CR\r\
... Third line, with CRLF\r\n\
... Two blank lines with LFs\n\
... \n\
... \n\
... Two blank lines with CRs\r\
... \r\
... \r\
... Two blank lines with CRLFs\r\n\
... \r\n\
... \r\n\
... Three blank lines with a jumble of things:\r\n\
... \r\
... \r\n\
... \n\
... End without a newline.'''
>>> s.splitlines()
['First line, with LF', 'Second line, with CR', 'Third line, with CRLF', 'Two blank lines with LFs', '', '', 'Two blank lines with CRs', '', '', 'Two blank lines with CRLFs', '', '', 'Three blank lines with a jumble of things:', '', '', '', 'End without a newline.']
>>> print '\n'.join(s.splitlines())
First line, with LF
Second line, with CR
Third line, with CRLF
Two blank lines with LFs


Two blank lines with CRs


Two blank lines with CRLFs


Three blank lines with a jumble of things:



End without a newline.
As far as I know, splitlines() doesn't split the list twice or anything.
Can you paste a sample of the kind of input that's giving you trouble?