
Python html2text adds random \n

When I use the html2text Python package to convert HTML to Markdown, it inserts '\n' characters into the text. I also see this behaviour in the demo at http://www.aaronsw.com/2002/html2text/

Is there any way to turn this off? Of course I can remove them myself, but there might be occurrences of '\n' in the original text which I don't want to remove.

    html2text('Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.')

    u'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod\ntempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,\nquis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo\nconsequat. Duis aute irure dolor in reprehenderit in voluptate velit esse\ncillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non\nproident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n\n'
asked Oct 11 '12 12:10 by arno_v


2 Answers

In recent versions of html2text, create an HTML2Text instance and set its body_width to 0:

    import html2text
    h = html2text.HTML2Text()
    h.body_width = 0  # 0 disables line wrapping
    note = h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!")

This turns off the word wrapping that html2text otherwise applies, which is where the extra '\n' characters come from.

answered Sep 27 '22 02:09 by Christoffer


Looking at the source of html2text.py, it looks like you can disable the wrapping behavior by setting the module-level BODY_WIDTH to 0. Something like this:

    import html2text
    html2text.BODY_WIDTH = 0  # module-level setting; 0 disables wrapping
    text = html2text.html2text('...')

Of course, resetting BODY_WIDTH globally changes the module's behavior for every caller. If I needed this functionality, I'd probably patch the module to add a parameter to html2text() that controls wrapping per call, and send that patch back to the author.

answered Sep 25 '22 02:09 by zigg