
How do I convert LF to CRLF?

Tags: python, unix

I found a list of most English words online, but its line breaks are Unix-style (LF) and the files are encoded as Unicode (UTF-8). I found it on this website: http://dreamsteep.com/projects/the-english-open-word-list.html

How do I convert the line breaks to CRLF so I can iterate over them? The program I will be using them in goes through each line in the file, so the words have to be one per line.

This is a portion of the file: bitbackbitebackbiterbackbitersbackbitesbackbitingbackbittenbackboard

It should be:

bit
backbite
backbiter
backbiters
backbites
backbiting
backbitten
backboard

How can I convert my files to this type? Note: it's 26 files (one per letter) with 80,000 words or so in total (so the program should be very fast).

I don't know where to start because I've never worked with Unicode. Thanks in advance!

Using rU as the parameter (as suggested), with this in my code:

with open(my_file_name, 'rU') as my_file:
    for line in my_file:
        new_words.append(str(line))
my_file.close()

I get this error:

Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    addWords('B Words')
  File "D:\my_stuff\Google Drive\documents\SCHOOL\Programming\Python\Programming Class\hangman.py", line 138, in addWords
    for line in my_file:
  File "C:\Python3.3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7488: character maps to <undefined>

Can anyone help me with this?

Rushy Panchal asked Dec 19 '12 14:12



3 Answers

You can use the replace method of strings, for example:

txt.replace('\n', '\r\n')

EDIT: In your case:

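# newline='' turns off newline translation, so '\r\n' is written exactly once
# (text mode on Windows would otherwise turn each '\n' into '\r\n' again).
with open('input.txt', encoding='utf-8', newline='') as inp, \
     open('output.txt', 'w', encoding='utf-8', newline='') as out:
    txt = inp.read()
    txt = txt.replace('\n', '\r\n')
    out.write(txt)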
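
Since the question mentions 26 files, the same replacement can be applied to a whole folder. The following is only a sketch: the helper names lf_to_crlf and convert_folder and the "*.txt" pattern are assumptions for illustration, not part of the original answer. It works on raw bytes, so no encoding has to be guessed, and it leaves line endings that are already CRLF unchanged.

```python
import re
from pathlib import Path


def lf_to_crlf(data: bytes) -> bytes:
    """Return data with every LF or CRLF line ending written as CRLF."""
    # Match an optional CR before each LF so existing CRLF pairs are not doubled.
    return re.sub(rb"\r?\n", b"\r\n", data)


def convert_folder(folder: str, pattern: str = "*.txt") -> int:
    """Rewrite every matching file in place; return how many were converted."""
    count = 0
    for path in sorted(Path(folder).glob(pattern)):
        # Read and write bytes so Python performs no newline translation.
        path.write_bytes(lf_to_crlf(path.read_bytes()))
        count += 1
    return count
```

Each file is read fully into memory, which should be fine for word lists of this size. Work on a copy of the files first, because the conversion overwrites them.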
dugres answered Oct 23 '22 07:10


Instead of converting, you should be able to just open the file using Python's universal newline support:

f = open('words.txt', 'rU')

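(Note the U. In Python 3, text mode already applies universal newlines by default, so a plain 'r' behaves the same way; the 'U' flag was deprecated and then removed in Python 3.11.)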
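
As a rough illustration of universal newline handling in Python 3, the sketch below writes a small sample with mixed line endings and reads it back in default text mode. The helper name read_lines, the sample words, and the utf-8 encoding are assumptions made for the example.

```python
import os
import tempfile


def read_lines(path: str) -> list:
    """Read a text file and return its lines without their line endings."""
    # Default text mode (newline=None) translates \n, \r\n and \r to \n.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Write a sample with mixed line endings in binary mode so nothing is translated.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"backbite\nbackbiter\r\nbackbiters\rbackbites")
    sample_path = tmp.name

words = read_lines(sample_path)
os.remove(sample_path)
print(words)
```

All three line-ending styles come back as separate lines, which is why the files do not need to be converted before iterating over them.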

NPE answered Oct 23 '22 06:10


You don't need to convert the line endings in the files in order to iterate over them. As suggested by NPE, simply use Python's universal newlines mode.

The UnicodeDecodeError happens because the files you are processing are encoded as UTF-8, but open() was called without an encoding, so Python 3 decodes the bytes it reads with the platform default, which here is cp1252 (as the traceback shows). The decoding happens while the file is being iterated, not in str(line), since each line is already a Python 3 string (i.e. a sequence of Unicode code points). The files contain bytes, such as 0x8d, that cannot be decoded with the cp1252 encoding, and that causes the UnicodeDecodeError.

If you pass the encoding explicitly, e.g. open(my_file_name, 'r', encoding='utf-8'), you should no longer get the UnicodeDecodeError, provided the files really are UTF-8. (Calling line.decode('utf-8') only applies if the file is opened in binary mode, where each line is a bytes object.) Check out the Text Vs. Data Instead of Unicode Vs. 8-bit writeup for some more details.
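
One way to avoid depending on the platform default codec is to name the encoding when opening the file. The sketch below assumes the word lists really are UTF-8, as stated in the question; load_words is a hypothetical helper name, not a function from the asker's program.

```python
def load_words(path: str, encoding: str = "utf-8") -> list:
    """Return the non-empty lines of a word list, one word per entry."""
    words = []
    # An explicit encoding overrides the locale default (cp1252 in the traceback).
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            word = line.strip()  # drops the trailing newline and stray spaces
            if word:
                words.append(word)
    return words
```

If the files turn out not to be valid UTF-8, this will still raise a UnicodeDecodeError, which is usually preferable to silently reading garbled words.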

Finally, you might also find The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky useful.

Eric Rahmig answered Oct 23 '22 06:10