Possible Duplicate:
Which CPAN module would you recommend for turning HTML into plain text?
<tt>
, <b>
, <i>
, etc and break-line <br>
, similar to Lynx. For example:
# cat test.html
<body>
<div id="foo" class="blah">
<tt>test<br>
<b>test</b><br>
whatever<br>
test</tt>
</div>
</body>
# lynx.exe --dump test.html
test
test
whatever
test
Note: the second line should be bold.
Lynx is a big program and its html rendering will be non trivial.
How about this:
my $lynx = '/path/to/lynx';
my $html = [ html here ];
my $txt = `$lynx --dump --width 9999 -stdin <<EOF\n$html\nEOF\n`;
Go to search.cpan.org and search for HTML text which will give you lots of options to suit your particular needs. HTML::FormatText is a good baseline, and then branch out into specific variations of it, for example HTML::FormatText::WithLinks if you want to preserve links as footnotes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With