Extract a specific string from a curl'd result

Question

Given this curl command: curl --user-agent "fogent" --silent -o page.html "http://www.google.com/search?q=insansiate"

* Spelling is intentionally incorrect. I want to grab the suggestion as my result.

I want to be able to either grep into the page.html file perhaps with grep -oE or pipe it right from curl and never store a file.

The result should be: 'instantiate'

I need only the word 'instantiate' (or the phrase); whatever Google suggests as the correction is what I am after.

Here is the basic html that is returned:

<span class=spell style="color:#cc0000">Did you mean: </span><a href="/search?hl=en&amp;ie=UTF-8&amp;&amp;sa=X&amp;ei=VEMUTMDqGoOINraK3NwL&amp;ved=0CB0QBSgA&amp;q=instantiate&amp;spell=1"class=spell><b><i>instantiate</i></b></a>&nbsp;&nbsp;<span class=std>Top 2 results shown</span>

So perhaps from/to of the string below, which I hope is unique enough to cover all my bases.

class=spell><b><i>instantiate</i></b></a>&nbsp;&nbsp;

I keep running into issues with greedy grep; perhaps I should run it through an HTML prettify tool first to get a line break or 50 in there. I don't know of any simple way to do so in bash, which is what I would ideally like this to be in. I really don't want to deal with firing up perl and making sure I have the correct module.

Any suggestions? Thank you.

Dennis Williamson · Accepted Answer

As I'm sure you're aware, screen scraping is a delicate business. This command sequence is no exception since it relies on the specific structure of the page which could change at any time without notice.

grep -o 'Did you mean:\([^>]*>\)\{5\}' page.html | sed 's/.*<i>\([^<]*\)<.*/\1/'

In a pipe:

curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" | grep -o 'Did you mean:\([^>]*>\)\{5\}' | sed 's/.*<i>\([^<]*\)<.*/\1/'

This relies on finding five ">" characters between "Did you mean:" and the "</i>" after the word you're looking for.
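To see the five-">" match without touching the network, here is a minimal sketch that runs the same grep and sed pair against the HTML fragment quoted in the question. The helper name extract_suggestion and the variable names are made up for illustration, and the sketch assumes the markup still looks like that fragment.

```shell
#!/usr/bin/env bash
# Sample markup copied from the question, so no request to Google is needed.
sample='<span class=spell style="color:#cc0000">Did you mean: </span><a href="/search?hl=en&amp;ie=UTF-8&amp;&amp;sa=X&amp;ei=VEMUTMDqGoOINraK3NwL&amp;ved=0CB0QBSgA&amp;q=instantiate&amp;spell=1"class=spell><b><i>instantiate</i></b></a>&nbsp;&nbsp;<span class=std>Top 2 results shown</span>'

# extract_suggestion is a hypothetical helper name: it reads HTML on stdin.
# grep keeps "Did you mean:" plus the next five runs of text ending in ">",
# which stops right after "</i>"; sed then keeps only the text inside <i>.
extract_suggestion() {
    grep -o 'Did you mean:\([^>]*>\)\{5\}' | sed 's/.*<i>\([^<]*\)<.*/\1/'
}

suggestion=$(printf '%s\n' "$sample" | extract_suggestion)
printf '%s\n' "$suggestion"
```

With the sample above this prints instantiate. If Google adds or removes a tag between "Did you mean:" and the suggestion, the count of five has to change with it.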

Have you considered other methods of getting spelling suggestions or are you specifically interested in what Google provides?

If you have ispell or aspell installed, you can do:

echo insansiate | ispell -a

and parse the result.
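One possible way to do the parsing, as a rough sketch: in ispell's -a (pipe) mode, a misspelled word with suggestions is reported on a line of the form "& word count offset: suggestion, suggestion, ...". The helper name first_suggestion is hypothetical, and the canned output below is illustrative only; the banner and the actual suggestions depend on the installed version and dictionary.

```shell
#!/usr/bin/env bash
# first_suggestion is a hypothetical helper name: it reads `ispell -a` style
# output on stdin and prints the first suggestion from the first "&" line.
# Lines that do not start with "& " (banner, "*", "#") are ignored.
first_suggestion() {
    sed -n 's/^& [^:]*: \([^,]*\).*/\1/p' | head -n 1
}

# Canned, illustrative output so the sketch runs without ispell installed.
sample_output='@(#) International Ispell Version 3.2.06
& insansiate 2 0: instantiate, insatiate'

result=$(printf '%s\n' "$sample_output" | first_suggestion)
printf '%s\n' "$result"
```

Against a real installation the same helper would sit at the end of the pipe: echo insansiate | ispell -a | first_suggestion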

mklement0 · Answer

xidel is a great utility for scraping web pages; it supports retrieving pages and extracting information in various query languages (CSS selectors, XPath).

In the case at hand, the simple CSS selector a.spell will do the trick.

xidel --user-agent "fogent" "http://google.com/search?q=insansiate" -e 'a.spell'

Note how xidel does its own page retrieval, so no need for curl in this case.

If, however, you needed curl for more exotic retrieval options, here's how you'd combine the two tools (line break for readability):

curl --user-agent "fogent" --silent "http://google.com/search?q=insansiate" |
xidel - -e 'a.spell'

Tags: grep, bash

user170579
