Spider a Website and Return URLs Only

I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the right magic to make it work:

wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'

The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's more geared towards providing this kind of limited result set?

UPDATE

So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it if it's in there). Once I redirected stderr to stdout, I got closer to what I need:

wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'
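
A minimal sketch of why the redirect matters, assuming a stand-in function (log_to_stderr is hypothetical, not part of wget) that writes its message to stderr the way wget writes its log. A pipe only carries stdout, so grep sees the text only after 2>&1 merges the two streams:

```shell
# Hypothetical stand-in for a tool that, like wget, logs to stderr.
log_to_stderr() {
  echo "Saving to: index.html" >&2
}

# The pipe carries only stdout, so grep counts no matching lines here.
# (2>/dev/null just keeps the message off the terminal; it never enters the pipe.)
without=$(log_to_stderr 2>/dev/null | grep -c 'Saving to:' || true)

# 2>&1 sends stderr to the same place as stdout, so the message reaches grep.
with=$(log_to_stderr 2>&1 | grep -c 'Saving to:' || true)

echo "matches without 2>&1: $without"
echo "matches with 2>&1: $with"
```

The first count is 0 and the second is 1, which matches what I saw with the real wget command.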

I'd still be interested in other/better means for doing this kind of thing, if any exist.

285

asked May 10 '10 16:05

Rob Wilkerson

3 Answers

The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.

wget --spider --force-html -r -l2 $url 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
  > urls.m3u

This gives me a list of the URIs of the content resources that are spidered (that is, resources that aren't images, CSS, or JS source files). From there, I can send the URIs off to a third-party tool for processing to meet my needs.

The output still needs to be streamlined slightly (it produces duplicates as it's shown above), but it's almost there and I haven't had to do any parsing myself.
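
One possible way to drop the duplicates is to finish the pipeline with sort -u. The sketch below runs the same filters on a few made-up log lines instead of live wget output; sample_log and unique_urls are hypothetical names, and the exact log format may differ between wget versions:

```shell
# Hypothetical sample of wget log lines; real output depends on the
# wget version and on the site being spidered.
sample_log() {
  cat <<'EOF'
--2010-05-10 16:05:00--  http://somesite.com/
--2010-05-10 16:05:01--  http://somesite.com/about
--2010-05-10 16:05:02--  http://somesite.com/style.css
--2010-05-10 16:05:03--  http://somesite.com/about
EOF
}

# Same filters as above, reading the log on stdin, with sort -u added
# so that each URI is printed only once.
unique_urls() {
  grep '^--' | awk '{ print $3 }' \
    | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
    | sort -u
}

sample_log | unique_urls
```

sort -u reorders the list alphabetically. If the crawl order matters, awk '!seen[$0]++' in place of sort -u keeps the first occurrence of each line in its original position.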

102

answered Sep 28 '22 18:09

Rob Wilkerson

Create a few regular expressions to extract the addresses from all <a href="(ADDRESS_IS_HERE)"> tags.

Here is the solution I would use:

wget -q http://example.com -O - | \
    tr "\t\r\n'" '   "' | \
    grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
    sed -e 's/^.*"\([^"]\+\)".*$/\1/g'

This will output all http, https, ftp, and ftps links from a webpage. It will not give you relative urls, only full urls.

Explanation regarding the options used in the series of piped commands:

wget -q suppresses wget's normal log output (quiet mode). wget -O - writes the downloaded document to stdout rather than saving it to disk.

tr is the Unix character translator, used in this example to translate newlines, carriage returns, and tabs to spaces, as well as convert single quotes into double quotes so we can simplify our regular expressions.

grep -i makes the search case-insensitive. grep -o makes it output only the matching portions.

sed is the Stream EDitor Unix utility, which allows for filtering and transformation operations.

sed -e just lets you feed it an expression.

Running this little script on "http://craigslist.org" yielded quite a long list of links:

http://blog.craigslist.org/
http://24hoursoncraigslist.com/subs/nowplaying.html
http://craigslistfoundation.org/
http://atlanta.craigslist.org/
http://austin.craigslist.org/
http://boston.craigslist.org/
http://chicago.craigslist.org/
http://cleveland.craigslist.org/
...
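
To see what the tr, grep, and sed stages do without touching the network, the same pipeline can be fed a small inline page. This is only a sketch: extract_links and sample_page are hypothetical names, the HTML is made up, and the patterns rely on GNU grep and GNU sed extensions (\+, \?, \|) just like the command above.

```shell
# The tr | grep | sed stages from the answer, reading HTML on stdin.
extract_links() {
  tr "\t\r\n'" '   "' \
    | grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' \
    | sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
}

# Hypothetical sample page: one plain link, one upper-case tag with
# single quotes, and one relative link that the pattern skips.
sample_page() {
  cat <<'EOF'
<a href="http://example.com/a">A</a>
<A HREF='https://example.org/b'>B</A>
<a href="/relative">C</a>
EOF
}

sample_page | extract_links
```

This prints http://example.com/a and https://example.org/b, one per line. The /relative link is dropped because the pattern requires a scheme right after the opening quote.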

answered Sep 28 '22 18:09

Jay Taylor

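I've used a tool called xidel:

# List relative links on the start page, then fetch each one and list its relative links.
# xargs -I {} already runs one command per input line, so -L1 is not needed.
xidel http://server -e '//a/@href' |
grep -v "http" |
sort -u |
xargs -I {} xidel http://server/{} -e '//a/@href' |
grep -v "http" | sort -u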

A little hackish but gets you closer! This is only the first level. Imagine packing this up into a self recursive script!
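
A rough sketch of what such a recursive script might look like, under several assumptions: BASE, SEEN_FILE, list_links, and crawl are hypothetical names, xidel is installed, and every href that survives the grep is a path relative to the site root. It is meant to show the shape of the recursion, not to be a robust crawler.

```shell
# Placeholder site root and a temp file that records visited paths.
BASE="${BASE:-http://server}"
SEEN_FILE="${SEEN_FILE:-$(mktemp)}"

# Print the relative links found on one page (same filter as above).
list_links() {
  xidel "$BASE/$1" -e '//a/@href' 2>/dev/null </dev/null \
    | grep -v "http" | sort -u
}

# crawl PATH DEPTH: print the URL for PATH once, then recurse into its
# links while DEPTH is greater than 0.
crawl() {
  # Skip paths that were already visited, which also prevents loops.
  grep -qxF -e "/$1" "$SEEN_FILE" && return 0
  printf '/%s\n' "$1" >> "$SEEN_FILE"
  printf '%s/%s\n' "$BASE" "$1"
  [ "$2" -gt 0 ] || return 0
  list_links "$1" | while IFS= read -r link; do
    if [ -n "$link" ]; then
      crawl "$link" "$(( $2 - 1 ))"
    fi
  done
}

# Example: crawl "" 1   # start page plus one level of links
```

The visited set lives in a file because the while loop may run in a subshell, where variable changes would be lost. Since the walk is depth-first, a page first reached at the depth limit is not expanded again if it turns up later at a shallower level.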

answered Sep 28 '22 18:09

Rick

Tags: grep, uri, wget, web-crawler
