Given this curl command: curl --user-agent "fogent" --silent -o page.html "http://www.google.com/search?q=insansiate"
* Spelling is intentionally incorrect. I want to grab the suggestion as my result.
I want to be able to either grep into the page.html file perhaps with grep -oE or pipe it right from curl and never store a file.
The result should be: 'instantiate'
I need only the word 'instantiate' (or the phrase, whatever Google is auto-correcting); that is what I am after.
Here is the basic html that is returned:
<span class=spell style="color:#cc0000">Did you mean: </span><a href="/search?hl=en&ie=UTF-8&&sa=X&ei=VEMUTMDqGoOINraK3NwL&ved=0CB0QBSgA&q=instantiate&spell=1"class=spell><b><i>instantiate</i></b></a> <span class=std>Top 2 results shown</span>
So perhaps from/to of the string below, which I hope is unique enough to cover all my bases.
class=spell><b><i>instantiate</i></b></a>
I keep running into issues with greedy grep; perhaps I should run it through an html prettify tool first to get a line break or 50 in there. I don't know of any simple way to do so in bash, which is what I would ideally like this to be in. I really don't want to deal with firing up perl and making sure I have the correct module.
Any suggestions? Thank you.
As I'm sure you're aware, screen scraping is a delicate business. This command sequence is no exception, since it relies on the specific structure of the page, which could change at any time without notice.
grep -o 'Did you mean:\([^>]*>\)\{5\}' page.html | sed 's/.*<i>\([^<]*\)<.*/\1/'
In a pipe:
curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" | grep -o 'Did you mean:\([^>]*>\)\{5\}' | sed 's/.*<i>\([^<]*\)<.*/\1/'
This relies on finding five ">" characters between "Did you mean:" and the "</i>" after the word you're looking for.
Have you considered other methods of getting spelling suggestions or are you specifically interested in what Google provides?
If you have ispell or aspell installed, you can do:
echo insansiate | ispell -a
and parse the result.
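For example, a minimal sketch of the parsing step, assuming the standard ispell/aspell pipe-mode output where suggestion lines start with "&" and the comma-separated suggestions follow a colon (whether the first suggestion is actually "instantiate" depends on your dictionary):
echo insansiate | ispell -a | awk -F': ' '/^&/ { split($2, s, ", "); print s[1] }'
The /^&/ filter also skips the version banner ispell prints on its first line.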
xidel is a great utility for scraping web pages; it supports retrieving pages and extracting information in various query languages (CSS selectors, XPath).
In the case at hand, the simple CSS selector a.spell will do the trick.
xidel --user-agent "fogent" "http://google.com/search?q=insansiate" -e 'a.spell'
Note how xidel does its own page retrieval, so there is no need for curl in this case.
If, however, you needed curl for more exotic retrieval options, here's how you'd combine the two tools (line break for readability):
curl --user-agent "fogent" --silent "http://google.com/search?q=insansiate" |
xidel - -e 'a.spell'
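If you then need the suggestion in a script, here's a small follow-up sketch (assuming xidel writes only the extracted text to stdout, and falling back to the original word when no correction is offered):
suggestion=$(curl --user-agent "fogent" --silent "http://google.com/search?q=insansiate" |
  xidel - -e 'a.spell')
echo "${suggestion:-insansiate}"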