
Run curl command on each line of a file and fetch data from result

Tags: regex, bash, curl, awk

Suppose I have a file containing a list of links to web pages, one per line:

www.xyz.com/asdd
www.wer.com/asdas
www.asdas.com/asd
www.asd.com/asdas

I know that running curl www.xyz.com/asdd will fetch the HTML of that web page. I want to extract some data from each page.

So the scenario is: use curl to request every link in the file, one by one, extract some data from each page, and store it somewhere else. Any ideas or suggestions?

asked Mar 20 '14 15:03 by aelor


2 Answers

As indicated in the comments, this will loop through your_file and curl each line:

while IFS= read -r line
do
   curl "$line"
done < your_file

To get the <title> of a page, you can use grep with a Perl-compatible lookaround pattern (-P, available in GNU grep), something like this:

grep -iPo '(?<=<title>).*(?=</title>)' file

So, all together, you could do:

while IFS= read -r line
do
   curl -s "$line" | grep -Po '(?<=<title>).*(?=</title>)'
done < your_file

Note that curl -s runs in silent mode: it suppresses the progress meter and error messages. For example, with the Google home page:

$ curl -s http://www.google.com | grep -Po '(?<=<title>).*(?=</title>)'
302 Moved
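The question also asks to store the result somewhere else. A minimal sketch of one way to do that follows, assuming GNU grep (for -P) and bash. The names extract_title, fetch_titles, urls.txt and titles.tsv are illustrative and not part of the original answer. It adds -L so that curl follows redirects (the "302 Moved" above is the title of a redirect page), and it uses a non-greedy .*? so that only the first title on a line is captured.

```shell
#!/usr/bin/env bash
# Sketch: fetch each URL listed in a file and print "URL<TAB>title" lines.
# extract_title, fetch_titles, urls.txt and titles.tsv are illustrative names.

# Print the text of the first <title>...</title> found on stdin (GNU grep -P).
extract_title() {
    grep -iPo '(?<=<title>).*?(?=</title>)' | head -n 1
}

# Read URLs from the file given as $1, one per line.
fetch_titles() {
    local url title
    # "|| [ -n "$url" ]" also handles a last line without a trailing newline.
    while IFS= read -r url || [ -n "$url" ]; do
        [ -z "$url" ] && continue      # skip blank lines
        # -s: silent, -L: follow redirects, --max-time: give up after 10 s.
        # </dev/null stops curl from consuming the loop's input.
        title=$(curl -sL --max-time 10 "$url" </dev/null | extract_title)
        printf '%s\t%s\n' "$url" "$title"
    done < "$1"
}

# Usage: fetch_titles urls.txt > titles.tsv
```

Writing tab-separated lines keeps each title next to the URL it came from, so the output file can be opened in a spreadsheet or processed further with awk or cut.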
answered Oct 12 '22 05:10 by fedorqui 'SO stop harming'


You can accomplish this in just one line with xargs. Let's say you have a file called sitemap in the working directory with all your URLs (one per line):

xargs -I{} curl -s {} <sitemap | grep title

This extracts every line that contains the word "title". To extract just the title tags, change the grep a little. The -o flag makes grep print only the matched part of each line:

xargs -I{} curl -s {} <sitemap | grep -o "<title>.*</title>"
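One limitation of the pipeline above is that all pages are concatenated before grep sees them, so the output does not say which URL produced which title. A possible variation is sketched below, assuming bash and an xargs that supports -P (GNU and BSD do). The helper name title_for is hypothetical. Each URL is passed to the helper as an argument rather than spliced into the command string.

```shell
#!/usr/bin/env bash
# Sketch: print "URL<TAB><title>...</title>" for one URL.
# title_for is an illustrative helper name, not part of the original answer.
title_for() {
    local url=$1 title
    # [^<]* stops at the first closing tag; head keeps only the first match.
    title=$(curl -sL --max-time 10 "$url" | grep -io '<title>[^<]*</title>' | head -n 1)
    printf '%s\t%s\n' "$url" "$title"
}
export -f title_for   # make the function visible to the bash started by xargs

# Usage (4 downloads at a time):
#   xargs -P 4 -I{} bash -c 'title_for "$1"' _ {} < sitemap
```

With -P the lines appear in the order the downloads finish, not the order of the file, which is why the URL is printed alongside each title.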

A couple of things to note:
  • If you want to extract other data, you will need to escape special characters in the pattern with a backslash (\).
    • For HTML attributes, for example, you should match both single and double quotes, escaping them like [\"\']
  • Sometimes, depending on the page's character set, curl output may contain garbled special characters. If you see this, convert the encoding with a utility like iconv.
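The two notes above can be sketched as small filters. This is only an illustration under stated assumptions: extract_hrefs and to_utf8 are hypothetical names, href is used as a sample attribute, and ISO-8859-1 is an assumed source encoding that you would replace with the page's real one. The pattern is a rough match and does not handle a quote character inside an attribute value.

```shell
#!/usr/bin/env bash
# Sketch: filters for quoted HTML attributes and for charset conversion.
# extract_hrefs and to_utf8 are illustrative names, not from the answer.

# Print the value of every href="..." or href='...' found on stdin, one per line.
extract_hrefs() {
    # Inside double quotes, \" is a literal double quote for the shell,
    # so grep receives the bracket expression ["'].
    grep -io "href=[\"'][^\"']*[\"']" |
        sed -e "s/^[^\"']*[\"']//" -e "s/[\"']\$//"   # strip href=" and the closing quote
}

# Convert stdin from the given charset (assumed default: ISO-8859-1) to UTF-8.
to_utf8() {
    iconv -f "${1:-ISO-8859-1}" -t UTF-8
}

# Usage: curl -s "$url" | to_utf8 ISO-8859-1 | extract_hrefs
```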
answered Oct 12 '22 05:10 by Orun