Convert HTML to plain text and keep basic formatting

I am looking for a way to convert HTML-formatted text to plain text while keeping its basic structure, with a few slight tweaks. For example:

<p>This is a paragraph.</p>
<ol>
  <li>List item 1.</li>
  <li>List item 2.</li>
</ol>
<p>This is an <a href="www.google.com">anchor</a>.</p>

Becomes:

This is a paragraph.

  • List item 1.
  • List item 2.

This is an anchor (www.google.com).

Any ideas on how to achieve this efficiently for a very large number of HTML-formatted templates?

  • Note that, apart from the structure, the most important part is keeping the anchors.
rebelliard asked Sep 02 '25 17:09



1 Answer

Use a text-based browser such as lynx and have it dump the rendered page to stdout. I'm not sure it will cover all your tweaking needs, but it's a very quick and easy start:

lynx -crawl -dump http://stackoverflow.com/questions/13279364/convert-html-to-plain-text-and-keep-basic-formatting

(Actually, I would expect your list to be

1. List item 1.
2. List item 2.

since it's an ordered list.)

Edit: having looked more closely at your actual use case, it works perfectly:

> echo '<p>This is a paragraph.</p>
<ol>
  <li>List item 1.</li>
  <li>List item 2.</li>
</ol>
<p>This is an <a href="http://www.google.com">anchor</a>.</p>' | lynx -stdin -dump

becomes

   This is a paragraph.
    1. List item 1.
    2. List item 2.

   This is an [1]anchor.

References

   1. http://www.google.com/
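
Note that lynx collects link targets in a numbered References list rather than placing them inline. If the exact layout from the question is needed (bullets, and the URL in parentheses right after the anchor text), a small script may be an option. The following is only a minimal sketch, assuming Python's standard html.parser module and templates limited to p, ol/ul, li and a tags; the names TextRenderer and html_to_text are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class TextRenderer(HTMLParser):
    """Render a small HTML subset (p, ol/ul, li, a) as plain text.

    Hypothetical helper for illustration; not part of lynx or any library.
    """

    def __init__(self):
        super().__init__()
        self.out = []         # output chunks, joined at the end
        self.href = None      # href of the <a> currently open, if any
        self.in_text = False  # True while inside <p> or <li>

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "li"):
            self.in_text = True
            if tag == "li":
                # Every list item gets a bullet, as in the question.
                self.out.append("  \u2022 ")
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_text = False
            self.out.append("\n\n")   # blank line after a paragraph
        elif tag == "li":
            self.in_text = False
            self.out.append("\n")
        elif tag in ("ol", "ul"):
            self.out.append("\n")     # blank line after a list
        elif tag == "a" and self.href:
            # Keep the link target inline, right after the anchor text.
            self.out.append(f" ({self.href})")
            self.href = None

    def handle_data(self, data):
        # Ignore whitespace between block elements.
        if self.in_text:
            self.out.append(data)


def html_to_text(html):
    """Convert an HTML fragment to plain text with inline link targets."""
    renderer = TextRenderer()
    renderer.feed(html)
    renderer.close()
    return "".join(renderer.out).rstrip() + "\n"


if __name__ == "__main__":
    sample = (
        "<p>This is a paragraph.</p>\n"
        "<ol>\n"
        "  <li>List item 1.</li>\n"
        "  <li>List item 2.</li>\n"
        "</ol>\n"
        '<p>This is an <a href="www.google.com">anchor</a>.</p>'
    )
    print(html_to_text(sample), end="")
```

For the sample from the question this should reproduce the desired layout, including "This is an anchor (www.google.com)." The sketch does not re-wrap text or handle nested lists, tables or headings, so lynx remains the quicker route when its output format is acceptable.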
Claude answered Sep 05 '25 08:09
