Quickest way to get list of <title> values from all pages on localhost website

Question

I essentially want to spider my local site and create a list of all the titles and URLs as in:

http://localhost/mySite/Default.aspx      My Home Page
http://localhost/mySite/Preferences.aspx  My Preferences
http://localhost/mySite/Messages.aspx     Messages

I'm running Windows. I'm open to anything that works: a C# console app, PowerShell, some existing tool, etc. We can assume that the <title> tag does exist in the document.

Note: I need to actually spider the files since the title may be set in code rather than markup.

Adam Rosenfield · Accepted Answer

A quick and dirty Cygwin Bash script which does the job:

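#!/bin/bash
# Print "<path><TAB><title>" for every .aspx file under $WWWROOT.
find "$WWWROOT" -iname '*.aspx' | while IFS= read -r file; do
  # Join lines so a <title> split across lines still matches, then extract its text.
  title=$(tr '\r\n' '  ' < "$file" | sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p')
  printf '%s\t%s\n' "$file" "$title"
done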

Explanation: this finds every .aspx file under the root directory $WWWROOT, replaces the newlines in each file with spaces so that a <title> element split across several lines ends up on a single line, and then extracts the text between <title> and </title>.
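
Since the script reads the .aspx source files from disk, a title that is set in code rather than markup would not show up. A rough variation, not part of the original answer, is to request each page over HTTP and run the same extraction on the rendered HTML. The sketch below assumes curl is installed, that the site is running, and that files under $WWWROOT are served at $BASE_URL with the same relative paths. BASE_URL, extract_title, file_to_url and list_titles are names made up for this illustration.

```shell
#!/bin/bash
# Assumption: pages under $WWWROOT are served at $BASE_URL with the same relative paths.
BASE_URL=${BASE_URL:-http://localhost/mySite}

# Hypothetical helper: read HTML on stdin and print the text inside <title>...</title>.
# Line breaks are turned into spaces first, then surrounding spaces are trimmed.
extract_title() {
  tr '\r\n' '  ' |
    sed -n 's/.*<title[^>]*>\([^<]*\)<\/title>.*/\1/p' |
    sed 's/^ *//; s/ *$//'
}

# Hypothetical helper: map a file path under $WWWROOT to its URL under $BASE_URL.
file_to_url() {
  printf '%s/%s\n' "$BASE_URL" "${1#"$WWWROOT"/}"
}

# Fetch every .aspx page over HTTP and print "<url><TAB><title>".
list_titles() {
  find "$WWWROOT" -iname '*.aspx' | while IFS= read -r file; do
    url=$(file_to_url "$file")
    title=$(curl -s "$url" | extract_title)
    printf '%s\t%s\n' "$url" "$title"
  done
}
```

After sourcing the script with WWWROOT set, calling list_titles prints one line per page. This still discovers pages from the file system rather than by following links, so a page reachable only through routing or query strings would be missed, and the pattern only matches a lowercase <title> tag.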

Tags:

screen-scraping

web-crawler

Larsenal