 

Delete html tags in sed or similar

Tags: html, sed, tags

I am trying to fetch the contents of a table from a webpage. I just need the contents, not the tags such as <tr></tr>. I don't even need "tr" or "td", just the content. For example:

<td> I want only this </td>
<tr> and also this </tr>
<TABLE> only texts/numbers in between tags and not the tags. </TABLE>

I would also like to write the output to a new CSV file, with the first column leading each row, like this:

column1,info1,info2,info3
column2,info1,info2,info3

I tried sed to delete the patterns <tr> and <td>, but when I fetch the table there are also other tags like <color>, <span>, etc. So what I want is to delete all the tags; in short, everything between < and >.

user913492 asked Sep 29 '11 06:09




2 Answers

sed 's/<[^>]\+>//g' will strip all tags out, but you might want to replace them with a space so that text from adjacent tags doesn't run together: <td>one</td><td>two</td> would otherwise become onetwo. So you could do sed 's/<[^>]\+>/ /g', which outputs one two (well, actually " one  two ", since every tag becomes its own space).
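As a minimal sketch of the replace-with-a-space variant, the extra spaces can be squeezed and trimmed in the same sed call. The sample row below is made up for illustration, and `[^>]*` is used in place of `\+` only so the sketch does not depend on GNU sed:

```shell
# Hypothetical sample input: one table row with two adjacent cells.
html='<td>one</td><td>two</td>'

# 1) replace every tag with a space
# 2) squeeze each run of spaces down to a single space
# 3) trim the space left at the start and end by the outer tags
text=$(printf '%s\n' "$html" |
  sed -e 's/<[^>]*>/ /g' -e 's/  */ /g' -e 's/^ //' -e 's/ $//')

printf '%s\n' "$text"
```

With this input the last line prints "one two".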

That said, unless you need just the raw text (and it sounds like you want to transform the data after stripping the tags), a scripting language like Perl might be a better fit for this kind of job.
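For a simple, regular table, the CSV layout from the question can still be sketched with grep and sed alone. This assumes every <tr>...</tr> row sits on a single line and that no cell contains a comma or a nested table; the sample table is hypothetical:

```shell
# Hypothetical input: every <tr>...</tr> row sits on a single line.
table='<table>
<tr><td>column1</td><td>info1</td><td>info2</td><td>info3</td></tr>
<tr><td>column2</td><td>info1</td><td>info2</td><td>info3</td></tr>
</table>'

# Keep only the row lines, turn each cell boundary (</td><td> or </th><th>)
# into a comma, then strip whatever tags are left.
csv=$(printf '%s\n' "$table" |
  grep -i '<tr' |
  sed -e 's#</[Tt][DdHh]>[^<]*<[Tt][DdHh][^>]*>#,#g' \
      -e 's/<[^>]*>//g')

printf '%s\n' "$csv"
```

This prints one line per table row, with the first cell leading each line. Real pages rarely keep rows this tidy, which is where a scripting language starts to pay off.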

As mu is too short mentioned, scraping HTML can be a bit dicey; using something that actually parses the HTML for you would be the best way to do this. PHP's DOM API is pretty good for these kinds of things.
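One concrete way the regex approach gets dicey: sed works line by line, so a tag that is split across two lines is never matched as a whole. The snippet below is a made-up example, and joining the lines first is only a partial workaround (it still does nothing for a > inside an attribute value, comments, or script blocks):

```shell
# Hypothetical input: the opening <a ...> tag is split across two lines.
html='<p>see <a
href="http://example.com/">here</a>.</p>'

# Line by line, sed never sees the whole <a ...> tag, so part of it leaks through.
naive=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

# Workaround: join all lines with spaces first so every tag sits on one line.
joined=$(printf '%s\n' "$html" | paste -s -d ' ' - | sed 's/<[^>]*>//g')

printf 'naive:  %s\n' "$naive"
printf 'joined: %s\n' "$joined"
```

Here the naive result still contains the href attribute, while the joined result is "see here.".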

Useless Code answered Oct 14 '22 19:10


Original:

The sed that ships with macOS handles regular expressions a bit differently from GNU sed (GNU extensions such as \+ may not be supported). I was able to do this on my Mac using the following example:

$ curl google.com | sed 's/<[^>]*>//g'
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   219  100   219    0     0    385      0 --:--:-- --:--:-- --:--:--   385

301 Moved
301 Moved
The document has moved
here.

$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.

Edit:

Just for clarification's sake, the original looked like:

$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

Also, the annoying curl progress meter can be suppressed with the -s option:

$ curl -s google.com | sed 's/<[^>]*>//g' 

301 Moved
301 Moved
The document has moved
here.

$
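If "one or more characters between < and >" is wanted rather than "zero or more", there are two spellings that should not depend on GNU sed. This is a sketch: the sample line is taken from the curl output above, and \+ is a GNU extension to basic regular expressions, so it is not guaranteed to work elsewhere:

```shell
# Sample line taken from the curl output shown above.
html='<H1>301 Moved</H1>'

# POSIX basic regex: the interval \{1,\} means "one or more",
# the portable spelling of GNU's \+.
bre=$(printf '%s\n' "$html" | sed 's/<[^>]\{1,\}>//g')

# Extended regex: -E is accepted by both GNU sed and the BSD sed on macOS,
# and a plain + works there.
ere=$(printf '%s\n' "$html" | sed -E 's/<[^>]+>//g')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands print "301 Moved".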
Robert J answered Oct 14 '22 18:10