I want to extract the URL from within the anchor tags of an html file. This needs to be done in BASH using SED/AWK. No perl please.
What is the easiest way to do this?
You could also do something like this (provided you have lynx installed)...
Lynx versions < 2.8.8
lynx -dump -listonly my.html
Lynx versions >= 2.8.8 (courtesy of @condit)
lynx -dump -hiddenlinks=listonly my.html
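If you want just the bare URLs from that listing, the numbered output is easy to post-process with sed, in the spirit of the question. A minimal sketch, assuming lynx's usual "   1. http://..." reference format (use the -hiddenlinks=listonly variant on newer lynx):

$ lynx -dump -listonly my.html | sed -n 's/^ *[0-9][0-9]*\. //p'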
You asked for it:
$ wget -O - http://stackoverflow.com | \
    grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
    sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
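For example, grep is line-based, so the pattern only sees anchors whose <a href= sits on one physical line; a hypothetical file with the attribute wrapped onto the next line slips through silently:

$ cat > multiline.html <<'EOF'
<a
  href="http://example.com/split">split anchor</a>
EOF
$ grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' multiline.html
$ # no output: the newline between <a and href= defeats the pattern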
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
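One portability note: the \n in the first sed replacement is a GNU sed feature; BSD/macOS sed wants a literal newline in the replacement instead. A sketch of the equivalent substitution there:

sed 's/<a href/\
<a href/g'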
With Xidel, an HTML/XML data extraction tool, this can be done via:
$ xidel --extract "//a/@href" http://example.com/
With conversion to absolute URLs:
$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
I made a few changes to Greg Bacon's solution:
grep -o '<a .*href=.*>' index.html | \
  sed -e 's/<a /\n<a /g' | \
  sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
This fixes two problems: anchors where href is not the first attribute (hence the .* after <a), and lines that contain more than one anchor (each anchor is first split onto its own line).
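A quick check of both fixes, on a hypothetical one-line file with a leading attribute on one anchor and two anchors on the same line:

$ cat > t.html <<'EOF'
<p><a class="b" href="http://a.example/">a</a> <a href='http://b.example/'>b</a></p>
EOF
$ grep -o '<a .*href=.*>' t.html | \
  sed -e 's/<a /\n<a /g' | \
  sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
http://a.example/
http://b.example/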
An example, since you didn't provide any sample input:
awk 'BEGIN{
  RS="</a>"        # each anchor element becomes one record (gawk: multi-char RS)
  IGNORECASE=1     # gawk: case-insensitive matching, so <A HREF=...> works too
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)   # drop everything up to and including href="
      gsub(/\042.*/,"",$o)        # drop the closing quote and anything after it
      print $o
    }
  }
}' index.html

Here \042 is the octal escape for the double-quote character.
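Note that IGNORECASE and a multi-character RS are GNU awk (gawk) extensions, so this wants gawk rather than strict POSIX awk. As a quick sanity check, on a hypothetical file with uppercase tags and an anchor broken across lines:

$ cat > index.html <<'EOF'
<A HREF="http://upper.example/">caps</A>
<a
  href="http://split.example/">split</a>
EOF

running the awk script above on this file prints:

http://upper.example/
http://split.example/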
You can do it quite easily with the following regex, which is good at finding URLs:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
I took it from John Gruber's article on how to find URLs in text.
That lets you find all URLs in a file f.html as follows. Note that the pattern uses Perl-style constructs such as \w and the (?:...) group, so it needs GNU grep's -P option rather than -E:

$ grep -oP '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' f.html
I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.
OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)

$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e '^//' | sort | uniq
for example:
$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
generates
//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
Expanding on kerkael's answer:
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
# now adding some more
|grep -v "<a href=\"#"
|grep -v "<a href=\"../"
|grep -v "<a href=\"http"
The first grep I added removes links to local bookmarks.
The second removes relative links to upper levels.
The third removes absolute links, i.e. those that start with http, in case you want only relative ones.
Pick and choose which one of these you use as per your specific requirements.
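Conversely, to keep only the absolute links, drop the -v on the last grep and skip the other two filters:

grep "<a href=" sourcepage.html |
  sed "s/<a href/\\n<a href/g" |
  sed 's/\"/\"><\/a>\n/2' |
  grep href |
  sort | uniq |
  grep "<a href=\"http"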
This is my first post, so I'll do my best to explain why I'm posting this answer...

Here is the simplest script for this, using GNU grep (2.28):
grep -Po 'href="\K.*?(?=")'
About the \K escape: I found no information about it in the man and info pages, so I came here for the answer. \K discards from the reported match everything matched before it, including the href=" key itself.
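A one-line illustration of what \K changes in the reported match:

$ echo '<a href="http://example.com/">x</a>' | grep -Po 'href=".*?(?=")'
href="http://example.com/
$ echo '<a href="http://example.com/">x</a>' | grep -Po 'href="\K.*?(?=")'
http://example.com/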
Bear in mind this warning from the man page: "This is highly experimental and grep -P may warn of unimplemented features."
Of course, you can modify the script to meet your tastes or needs, but I found it pretty straightforward for what was requested in the post, and for many of us. I hope you find it useful. Thanks!