Parse HTML with CURL in Shell Script

Question

I'm trying to parse a specific part of a webpage in a shell script.

I need to grep the content inside the <div> tag.

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>

If I use grep -E -m 1 -o '<div class="tracklistInfo">', the result is only <div class="tracklistInfo">.

How can I access the artist (Diplo - Justin Bieber - Skrillex) and the title (Where Are U Now)?

Casimir et Hippolyte · Accepted Answer

Using xmllint:

a='<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>'

xmllint --html --xpath 'concat(//div[@class="tracklistInfo"]/p[1]/text(), "#", //div[@class="tracklistInfo"]/p[2]/text())' <<<"$a"

You obtain:

Diplo - Justin Bieber - Skrillex#Where Are U Now

The two fields can then easily be split on the # separator.
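As a minimal sketch of that separation step, the combined string can be split with plain POSIX parameter expansion. The variable names result, artist and title are illustrative, and the string is hard-coded here instead of being captured from xmllint. This assumes the artist text itself never contains a # character.

```shell
# Assumed input: the string printed by the xmllint command above (hard-coded here)
result='Diplo - Justin Bieber - Skrillex#Where Are U Now'

# Everything before the first '#'
artist=${result%%#*}
# Everything after the first '#'
title=${result#*#}

printf 'Artist: %s\n' "$artist"
printf 'Title:  %s\n' "$title"
```

In a real script, result would be filled with result=$(xmllint ... ) rather than a literal.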

Martin Tournoij · Answer

Don't. Use an HTML parser. For example, BeautifulSoup for Python is easy to use and handles this well.

That being said, remember that grep works on lines. The pattern is matched for every line, not for the entire string.
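A small sketch of what line-by-line matching means in practice, using the snippet from the question held in a variable (the names html, same_line and across_lines are illustrative). A pattern that fits within one line is found, while a pattern that would have to span two lines is not.

```shell
html='<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>'

# Pattern contained in a single line: grep -c counts one matching line
same_line=$(printf '%s\n' "$html" | grep -c 'class="artist">Diplo' || true)

# Pattern that would need to span lines 1 and 2: no line matches
# (grep exits non-zero when nothing matches, hence the "|| true")
across_lines=$(printf '%s\n' "$html" | grep -c 'tracklistInfo">.*artist' || true)

printf 'same line: %s, across lines: %s\n' "$same_line" "$across_lines"
```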

What you can use is -A to also print out lines after the match:

grep -A2 -E -m 1 '<div class="tracklistInfo">'

Should output:

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>

You can then get the last or second-last line by piping it to tail:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1
<p>Where Are U Now</p>

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' |  tail -n2 | head -n1
<p class="artist">Diplo - Justin Bieber - Skrillex</p>

And strip the HTML with sed:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1 | sed 's/<[^>]*>//g'
Where Are U Now

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' |  tail -n2 | head -n1 | sed 's/<[^>]*>//g'
Diplo - Justin Bieber - Skrillex
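The commands above read from standard input, so they need the page piped into them. Here is a sketch that puts the steps together, with the HTML held in a variable instead of being downloaded. The names html, block, artist, title and the helper strip_tags are illustrative, and the markup is assumed to keep one tag per line exactly as in the question.

```shell
html='<body>
<p>Blah text</p>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
</body>'

# Hypothetical helper: remove anything that looks like an HTML tag
strip_tags() { sed 's/<[^>]*>//g'; }

# First matching <div> line plus the two lines that follow it
block=$(printf '%s\n' "$html" | grep -A2 -E -m 1 '<div class="tracklistInfo">')

# Line 2 of the block is the artist, line 3 is the title
artist=$(printf '%s\n' "$block" | sed -n '2p' | strip_tags)
title=$(printf '%s\n' "$block" | sed -n '3p' | strip_tags)

printf '%s\n%s\n' "$artist" "$title"
```

With a live page, the printf that feeds grep would be replaced by curl -s followed by the page URL.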

But as I said, this is fragile, likely to break, and not very pretty. Here's the same with BeautifulSoup, by the way:

html = '''<body>
<p>Blah text</p>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
</body>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for track in soup.find_all(class_='tracklistInfo'):
    print(track.find_all('p')[0].text)
    print(track.find_all('p')[1].text)

This also works with multiple tracklistInfo blocks; adding that to the shell command requires more work ;-)
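To give an idea of that extra work, here is one possible awk sketch that handles several tracklistInfo blocks. It is not a real HTML parser: it assumes one tag per line, as in the question, and the second track ("Example Artist", "Example Title") is made-up sample data. The variable names are illustrative.

```shell
html='<body>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
<p>Blah text</p>
<div class="tracklistInfo">
<p class="artist">Example Artist</p>
<p>Example Title</p>
</div>
</body>'

tracks=$(printf '%s\n' "$html" | awk '
  # Entering a track block: reset the paragraph counter
  /<div class="tracklistInfo">/ { inside = 1; n = 0; next }
  # Leaving the block
  inside && /<\/div>/ { inside = 0; next }
  # Inside a block: strip tags, first <p> is the artist, second is the title
  inside && /<p/ {
      gsub(/<[^>]*>/, "")
      n++
      if (n == 1) artist = $0
      else if (n == 2) print artist "#" $0
  }
')

printf '%s\n' "$tracks"
```

Each output line uses the same artist#title layout as the xmllint answer, so it can be split the same way.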

Tags:

html

shell

curl

Fab ian
