ruby regex, parsing html

Question

I'm trying to parse some returned HTML (from http://www.google.com/movies?near=37130) to look for currently playing movies. The pattern I'm trying to match looks like:
Clash of the Titans

Of which there are several in the returned html.

I'm trying to get an array of the movie titles with the following command:
titles = listings_html.split(/().*(<\/span>)/)

But I'm not getting the results I'm expecting. Can anyone see a problem with my approach or regex?

Alice · Accepted Answer

Parsing HTML with regular expressions is generally considered a very bad idea, since HTML is not a regular language. See the list of links to explanations (some from SO) here.

You should instead use a dedicated HTML library, such as this one.

tiftik · Answer

I didn't read the whole code you posted since it burned my eyes.

<span>.*</span>

This regex matches <span>hello</span> correctly, but fails on <span>hello</span><span>there</span>, where it matches the whole string. Remember that the * operator is greedy, so it matches the longest string possible. Making it non-greedy with .*? should make it work.
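To make the greedy vs. non-greedy difference concrete, here is a minimal sketch. The sample string and the method names are hypothetical stand-ins, since the real page markup isn't shown in the thread. It also uses String#scan rather than split: scan collects the matches themselves, while split cuts the string apart at each match.

```ruby
# Hypothetical sample markup; the real page's HTML is not shown in the thread.
SAMPLE = "<span>hello</span><span>there</span>"

# Greedy: .* runs to the LAST </span>, so everything collapses into one match.
def greedy_spans(html)
  html.scan(/<span>(.*)<\/span>/).flatten
end

# Non-greedy: .*? stops at the FIRST </span> after each <span>.
def lazy_spans(html)
  html.scan(/<span>(.*?)<\/span>/).flatten
end

p greedy_spans(SAMPLE)  # => ["hello</span><span>there"]
p lazy_spans(SAMPLE)    # => ["hello", "there"]
```

This only behaves for flat, attribute-free tags like the sample; it is not a general HTML parser.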

However, it's not wise to use regular expressions to parse HTML code.

1- You can't always parse HTML with regex. HTML is not regular.

2- It's very hard to write or maintain regex.

3- It's easy to break the regex by using an input like <a href=""></a>.
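As a rough illustration of the points above, the sketch below shows how even a non-greedy pattern goes wrong on perfectly legal markup. The method name naive_titles and all three inputs are made up for the example; they are not taken from the actual Google page.

```ruby
# A naive extractor: allows attributes on the tag, captures lazily.
# The name and the inputs below are hypothetical, for illustration only.
def naive_titles(html)
  html.scan(/<span[^>]*>(.*?)<\/span>/).flatten
end

# Simple case: works as hoped.
p naive_titles('<span>Clash of the Titans</span>')
# => ["Clash of the Titans"]

# A ">" inside an attribute value ends [^>]* too early.
p naive_titles('<span title="a > b">Clash of the Titans</span>')
# => [" b\">Clash of the Titans"]

# Nested spans: the lazy match stops at the inner closing tag.
p naive_titles('<span><span>Inner</span> outer</span>')
# => ["<span>Inner"]
```

Each failure can be patched with a more elaborate pattern, but every patch invites another edge case, which is why a real HTML parser is the usual recommendation.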

Tags:

regex

ruby

danwoods
