Suppose I have a file containing a list of links to webpages.
www.xyz.com/asdd
www.wer.com/asdas
www.asdas.com/asd
www.asd.com/asdas
I know that running curl www.xyz.com/asdd will fetch me the HTML of that webpage, and I want to extract some data from it.
So the scenario is: use curl to hit all the links in the file one by one, extract some data from each webpage, and store it somewhere else. Any ideas or suggestions?
As indicated in the comments, this will loop through your_file and curl each line:
while IFS= read -r line
do
curl "$line"
done < your_file
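Since you also want to store the data somewhere, one option is to save each page to its own file. A minimal sketch, assuming a downloads/ directory already exists and that the last path segment of each URL (e.g. asdd) makes an acceptable filename:
while IFS= read -r line
do
curl -s "$line" -o "downloads/$(basename "$line").html"
done < your_file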
To get the <title> of a page, you can grep it with something like this:
grep -iPo '(?<=<title>).*(?=</title>)' file
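For example, running it against a small HTML snippet (a quick self-contained check rather than one of your real pages):
$ echo '<html><head><title>Example Domain</title></head></html>' | grep -iPo '(?<=<title>).*(?=</title>)'
Example Domain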
So, all together, you could do:
while IFS= read -r line
do
curl -s "$line" | grep -Po '(?<=<title>).*(?=</title>)'
done < your_file
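If you'd rather store the results than just print them, a minimal variation (assuming a tab-separated titles.txt output file is what you want) would be:
while IFS= read -r line
do
title=$(curl -s "$line" | grep -Po '(?<=<title>).*(?=</title>)')
printf '%s\t%s\n' "$line" "$title"
done < your_file > titles.txt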
Note that curl -s enables silent mode. See an example with the Google homepage:
$ curl -s http://www.google.com | grep -Po '(?<=<title>).*(?=</title>)'
302 Moved
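The 302 Moved here just means Google is redirecting; if you want curl to follow redirects and grab the title of the final page, add the -L flag, which should print something like:
$ curl -sL http://www.google.com | grep -Po '(?<=<title>).*(?=</title>)'
Google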
You can accomplish this in just one line with xargs. Let's say you have a file in the working directory called sitemap with all your URLs (one per line):
xargs -I{} curl -s {} <sitemap | grep title
This would extract any lines with the word "title" in them. To extract just the title tags you'll want to adjust the grep a little. The -o flag ensures that only the matched text is printed:
xargs -I{} curl -s {} <sitemap | grep -o "<title>.*</title>"
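If the list is long, xargs can also run several requests in parallel with the -P flag (4 here is an arbitrary choice); note that the output order may then differ from the input order:
xargs -P 4 -I{} curl -s {} <sitemap | grep -o "<title>.*</title>"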
Note that you may need to \-escape special characters such as [\"\'] in your grep pattern. Also, curl sometimes returns output with special characters in a different encoding. If you detect this, you'll need to convert the encoding with a utility like iconv.
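For example, if a page turns out to be Latin-1 encoded (an assumption; check the page's charset), you could convert it before grepping:
xargs -I{} curl -s {} <sitemap | iconv -f ISO-8859-1 -t UTF-8 | grep -o "<title>.*</title>"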