Just a little disclaimer: I am not very familiar with programming, so please excuse me if I'm using any terms incorrectly or in a confusing way.
I want to extract specific information from a webpage and tried doing this by piping the output of a curl command into grep. This is in Cygwin, if that matters.
When I just type
$ curl www.ncbi.nlm.nih.gov/gene/823951
the terminal prints the whole webpage in what I believe to be HTML. From here I thought I could just pipe this output into grep with whatever search term I want:
$ curl www.ncbi.nlm.nih.gov/gene/823951 | grep "Gene Symbol"
But instead of printing the webpage at all, the terminal gives me:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  142k    0  142k    0     0  41857      0 --:--:--  0:00:03 --:--:-- 42083
Can anyone explain why it does this and how I can search for specific lines of text in a webpage? I eventually want to compile information like gene names, types, and descriptions into a database, so I was hoping to export the results from grep into a text file after that.
Any help is extremely appreciated, thanks in advance!
By default, curl prints the HTML page it fetched. To save the page to a file instead, use the capital -O flag, which names the local file after the last part of the URL path so you do not have to choose a file name yourself.
When you ask curl to get a URL, it sends the output to stdout by default. You can change this with options such as -o or -O, or with your shell's redirection (>), but without either it writes everything to stdout.
The curl command transfers data to or from a network server, using one of the supported protocols (HTTP, HTTPS, FTP, FTPS, SCP, SFTP, TFTP, DICT, TELNET, LDAP or FILE). It is designed to work without user interaction, so it is ideal for use in a shell script.
What is a flag in curl? A flag is a command-line option that changes how curl behaves, such as -s or -O. curl has more than two hundred command-line options, and the number grows over time.
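curl detects that its output is not going to a terminal and shows you the progress meter. The meter is written to stderr, so it appears on the terminal while the HTML itself goes down the pipe to grep. You can suppress the progress meter with -s.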
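To see why a message can reach the terminal while the page data goes into the pipe, here is a minimal sketch that runs without network access. It uses a hypothetical function, fake_fetch, as a stand-in for curl (it is not part of curl): page data goes to stdout and a progress-style message goes to stderr.

```shell
# fake_fetch is a hypothetical stand-in for curl: it writes page data to
# stdout and a progress-style message to stderr.
fake_fetch() {
    printf 'progress: 100%%\n' >&2
    printf '<dt class="noline"> Gene symbol </dt>\n'
    printf '<dd class="noline">AT3G47960</dd>\n'
}

# Only stdout travels through the pipe, so grep sees the HTML;
# the stderr message bypasses the pipe and lands on the terminal.
fake_fetch | grep "Gene symbol"

# Discarding stderr hides the message, which is roughly what curl -s
# does for the real progress meter.
fake_fetch 2>/dev/null | grep "Gene symbol"
```

With real curl, note that -s also silences error messages; adding -S (so -sS) keeps errors visible while still hiding the meter.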
The HTML data is indeed being sent to grep. However, that page does not contain the text "Gene Symbol". Grep is case-sensitive (unless invoked with -i), and the text on the page is "Gene symbol", with a lowercase s.
$ curl -s www.ncbi.nlm.nih.gov/gene/823951 | grep "Gene symbol"
<dt class="noline"> Gene symbol </dt>
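The case-sensitivity point can be checked offline. This is a small sketch that tests one line copied from the output above, held in a variable named line (a name chosen here for illustration), against the capitalised pattern with and without -i:

```shell
# One line of HTML copied from the page output shown in this thread.
line='<dt class="noline"> Gene symbol </dt>'

# Case-sensitive: "Gene Symbol" (capital S) does not occur in the line.
if printf '%s\n' "$line" | grep -q "Gene Symbol"; then
    echo "case-sensitive: match"
else
    echo "case-sensitive: no match"
fi

# With -i the capitalisation of the pattern no longer matters.
if printf '%s\n' "$line" | grep -qi "Gene Symbol"; then
    echo "case-insensitive: match"
else
    echo "case-insensitive: no match"
fi
```

The -q option only sets grep's exit status and prints nothing, which is handy for checking a pattern in a script before relying on it.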
You probably also want the next line of HTML, which you can make grep output with the -A option:
$ curl -s www.ncbi.nlm.nih.gov/gene/823951 | grep -A1 "Gene symbol"
<dt class="noline"> Gene symbol </dt>
<dd class="noline">AT3G47960</dd>
See man curl and man grep for more information about these and other options.
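For the end goal of collecting values into a text file, one possible approach is to strip the HTML tags from the line after the match and append the result to a file. The sketch below is an assumption-laden example, not the only way to do it: extract_symbol is a hypothetical helper, the two sample lines are copied from the grep -A1 output in this thread so the example runs offline, and a temporary file stands in for your real output file.

```shell
# Two lines copied from the grep -A1 output in this thread. In real use they
# would come from: curl -s www.ncbi.nlm.nih.gov/gene/823951
sample='<dt class="noline"> Gene symbol </dt>
<dd class="noline">AT3G47960</dd>'

# extract_symbol (hypothetical helper): reads HTML on stdin and prints the
# text inside the <dd> line that follows the "Gene symbol" line.
extract_symbol() {
    grep -A1 "Gene symbol" \
        | grep "<dd" \
        | sed -e 's/<[^>]*>//g' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

symbol=$(printf '%s\n' "$sample" | extract_symbol)
echo "$symbol"

# Append "gene ID <tab> symbol" to an output file. A temporary file is used
# here; in practice this would be something like genes.txt.
outfile=$(mktemp)
printf '%s\t%s\n' "823951" "$symbol" >> "$outfile"
cat "$outfile"
```

Using >> rather than > appends to the file, so running the same steps for several gene IDs builds up one line per gene. Keep in mind that matching HTML with grep and sed depends on the page keeping its current layout, so the patterns may need adjusting if the site changes.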