Easiest way to extract the URLs from an HTML page using sed or awk only

I want to extract the URLs from within the anchor tags of an HTML file. This needs to be done in bash using sed/awk. No Perl please.

What is the easiest way to do this?

Asked by codaddict, Dec 10 '09


11 Answers

You could also do something like this (provided you have lynx installed)...

Lynx versions < 2.8.8

lynx -dump -listonly my.html 

Lynx versions >= 2.8.8 (courtesy of @condit)

lynx -dump -hiddenlinks=listonly my.html 
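
Since the question asks for sed or awk, the numbered "References" list that lynx emits can be reduced to bare URLs with a small awk filter (a sketch, assuming lynx's usual numbered-list output format):

lynx -dump -listonly my.html | awk '/^ *[0-9]+\. /{print $2}'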
Answered by Hardy


You asked for it:

$ wget -O - http://stackoverflow.com | \
    grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
    sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'

This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
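
For instance, a quick sanity check on a single hand-written anchor (the sample URL is made up):

$ printf '<a href="http://a.example/">x</a>\n' | \
    grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
    sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'
http://a.example/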

Answered by Greg Bacon


grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
  1. The first grep looks for lines containing urls. You can add more elements after if you want to look only on local pages, so no http, but relative path.
  2. The first sed will add a newline in front of each a href url tag with the \n
  3. The second sed will shorten each url after the 2nd " in the line by replacing it with the /a tag with a newline Both seds will give you each url on a single line, but there is garbage, so
  4. The 2nd grep href cleans the mess up
  5. The sort and uniq will give you one instance of each existing url present in the sourcepage.html
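
For example, given a minimal page (file name and URLs made up; GNU sed is assumed, since \n in the replacement text is a GNU extension):

printf '<p><a href="http://a.example/">one</a> <a href="http://b.example/">two</a></p>' > sourcepage.html

the pipeline above prints one reconstructed anchor per line:

<a href="http://a.example/"></a>
<a href="http://b.example/"></a>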
Answered by kerkael


With Xidel, an HTML/XML data extraction tool, this can be done via:

$ xidel --extract "//a/@href" http://example.com/

With conversion to absolute URLs:

$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
Answered by Ingo Karkat


I made a few changes to Greg Bacon's solution:

cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

This fixes two problems:

  1. It now matches cases where the anchor does not start with href as its first attribute.
  2. It covers the possibility of several anchors appearing on the same line.
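
Both fixes can be checked with two anchors on one line, the first of which carries a class attribute before href (sample markup and URLs made up):

printf '<a class="x" href="http://a.example/">a</a><a href="http://b.example/">b</a>\n' > index.html
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

which prints:

http://a.example/
http://b.example/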
Answered by Crisboot


An example, since you didn't provide any sample input:

awk 'BEGIN{
RS="</a>"      # treat each anchor element as one record (multi-character RS is a gawk extension)
IGNORECASE=1   # make all matching case-insensitive (gawk-specific)
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)  # strip up to and including the opening quote (\042 is ")
      gsub(/\042.*/,"",$o)       # strip the closing quote and everything after it
      print $(o)
    }
  }
}' index.html
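
The same script can be run as a one-liner for a quick test (gawk is required for IGNORECASE and the multi-character RS; the sample anchors are made up):

printf '<A HREF="http://a.example/">a</A> <a href="http://b.example/">b</a>' | gawk 'BEGIN{RS="</a>";IGNORECASE=1}{for(o=1;o<=NF;o++){if($o~/href/){gsub(/.*href=\042/,"",$o);gsub(/\042.*/,"",$o);print $o}}}'

which prints:

http://a.example/
http://b.example/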
Answered by ghostdog74


You can do it quite easily with the following regex, which is quite good at finding URLs:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

I took it from John Gruber's article on how to find URLs in text.

That lets you find all URLs in a file f.html as follows:

cat f.html | grep -o \
    -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
Answered by nes1983


I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.

OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!

Answered by Alok Singhal


In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)

$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq

for example:

$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq

generates

//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
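
The same split-on-quotes idea can be expressed in awk alone, staying closer to what the question asked for (a sketch; gawk is required, since a regular-expression record separator is a gawk extension):

$ gawk -v RS='["'"'"']' '/^(https?:)?\/\//' source_file.html | sort | uniq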
Answered by Brad Parks


Expanding on kerkael's answer:

grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
# now adding some more
  |grep -v "<a href=\"#"
  |grep -v "<a href=\"../"
  |grep -v "<a href=\"http"

The first grep I added removes links to local bookmarks.

The second removes relative links to upper levels.

The third removes links that do start with http, keeping only relative ones.

Pick and choose which one of these you use as per your specific requirements.

Answered by Nikhil VJ


This is my first post, so I will try to do my best to explain why I am posting this answer...

  1. Of the first seven most-voted answers, four use grep even though the question explicitly says "using sed or awk only".
  2. Even though the question requires "No perl please", grep -P (used below) matches with Perl-compatible regular expressions.
  3. And because this is, as far as I know, the simplest way to do what was asked in bash.

So here is the simplest script, using GNU grep 2.28:

grep -Po 'href="\K.*?(?=")'
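
For example, to list each unique link in a local file (the file name here is just an example):

grep -Po 'href="\K.*?(?=")' index.html | sort -u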

About the \K escape: no information about it was to be found in the man and info pages, so I came here for the answer... \K drops the preceding characters (including the href=" key itself) from the reported match. Bear in mind the advice from the man page: "This is highly experimental and grep -P may warn of unimplemented features."

Of course, you can modify the script to suit your tastes or needs, but I found it pretty much exactly what was requested in the post, and probably useful for many of us...

I hope you folks find it very useful.

Thanks!

Answered by X00D45