
Is there a simple way in Linux to strip a web page down to its text from the command line?

I've been searching for a command-line tool that turns HTML into just the text that would appear on the page. The result should be equivalent to selecting everything in a web browser and pasting it into a text editor.

Does anyone know of something in Ubuntu that would do this? I'm trying to write a script to parse some web pages, and I would prefer to parse only the text that appears on the page rather than deal with the HTML.

Thanks,

Dan

Dan asked Feb 24 '10 22:02


2 Answers

lynx -dump http://example.com/
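Since the goal is to parse the text in a script, a small wrapper may be convenient. This is only a sketch, assuming lynx is installed; `page_text` is a hypothetical helper name, and `-nolist` is an addition not in the answer above, used here to leave out the list of link references that lynx otherwise appends to the dump:

```shell
# page_text: hypothetical helper (not from the answer) that prints the
# rendered text of a URL or local HTML file.
page_text() {
    # -dump   : write the formatted page to stdout and exit
    # -nolist : omit the list of link references at the end of the dump
    lynx -dump -nolist "$1"
}
```

It could then feed ordinary text tools, for example `page_text http://example.com/ | grep -i 'domain'`.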
Ignacio Vazquez-Abrams answered Sep 28 '22 10:09


If you already have the HTML file:

lynx -dump file.html > file.txt

Otherwise, use @Ignacio's answer.
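If there are several saved pages, the same command can be run in a loop. A minimal sketch, assuming the files end in `.html` and sit in one directory; `dump_all` is a hypothetical helper name, not part of lynx:

```shell
# dump_all: hypothetical helper that writes a .txt file next to every
# .html file in the given directory (default: current directory).
dump_all() {
    dir=${1:-.}
    for f in "$dir"/*.html; do
        [ -e "$f" ] || continue            # skip when the glob matches nothing
        lynx -dump "$f" > "${f%.html}.txt" # page.html -> page.txt
    done
}
```

Calling `dump_all ./pages` would leave `./pages/index.txt` beside `./pages/index.html`, and so on for each file.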

John Boker answered Sep 28 '22 09:09