Using wget, grep, and sed to extract links and save them to a file

I need to extract all of the page links from http://en.wikipedia.org/wiki/Meme and save them to a file, all with one command.

First time using the command line, so I'm unsure of the exact commands, flags, etc. to use. I only have a general idea of what to do and had to search around just to learn what href means. Here's my attempt:

wget http://en.wikipedia.org/wiki/Meme -O links.txt | grep 'href=".*"' | sed -e 's/^.*href=".*".*$/\1/'

The links in the output file don't need to be in any particular format.
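I think part of the problem is that -O links.txt writes the page itself to the file, so nothing reaches grep on stdout, and that the sed replacement refers to \1 without defining a capture group. This is my best guess at a fix, but I'm not sure it's right:

wget -q -O - http://en.wikipedia.org/wiki/Meme | grep -o 'href="[^"]*"' | sed -e 's/href="\(.*\)"/\1/' > links.txt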

asked Feb 18 '14 by cajole0110

2 Answers

Using GNU grep:

grep -Po '(?<=href=")[^"]*' links.txt

or with wget directly:

wget http://en.wikipedia.org/wiki/Meme -q -O - | grep -Po '(?<=href=")[^"]*'
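Here -P enables Perl-compatible regexes, the lookbehind (?<=href=") matches the position just after each href=" without including it in the output, and -o prints only the matched text. To save the links to a file, as the question asks, just redirect the output:

wget -q -O - http://en.wikipedia.org/wiki/Meme | grep -Po '(?<=href=")[^"]*' > links.txt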
answered Sep 26 '22 by BMW

You could use wget's spider mode; see the SO answer "wget spider" for an example.
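A rough sketch of that approach (the flags here are reconstructed from memory of the linked answer, so check it there): --spider makes wget walk the links without saving pages, -r -l1 limits recursion to one level, and the visited URLs appear in wget's progress lines (which start with --), so they can be pulled out with grep and awk:

wget --spider -r -l1 http://en.wikipedia.org/wiki/Meme 2>&1 | grep '^--' | awk '{print $3}' > links.txt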

answered Sep 22 '22 by Ken