Suppose I have a string such as str1 = "IWantToMasterPython".
To extract "Py" from it, I would like to write something like:
extractedString = foo("Master","thon")
I want to do this because I am trying to extract lyrics from an HTML page. The lyrics are written like <div class = "lyricbox"> ....lyrics go here....</div>.
Any suggestions on how I can implement this?
The solution is to use a regexp:
import re
r = re.compile('Master(.*?)thon')
m = r.search(str1)
if m:
    lyrics = m.group(1)
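If you want the regexp idea wrapped up as a reusable function, here is a minimal sketch. The name extract_between is made up for illustration. It assumes you want None back when nothing matches, and it escapes the two markers so that characters like "." or "(" in them are treated literally.

```python
import re

def extract_between(s, leader, trailer):
    """Return the text between leader and trailer, or None if not found."""
    # re.escape keeps characters such as '.' or '(' in the markers literal.
    pattern = re.escape(leader) + r"(.*?)" + re.escape(trailer)
    # re.DOTALL lets '.' match newlines, so multi-line text is captured too.
    m = re.search(pattern, s, re.DOTALL)
    return m.group(1) if m else None

str1 = "IWantToMasterPython"
print(extract_between(str1, "Master", "thon"))  # Py
```

The non-greedy (.*?) stops at the first trailer after the leader, which is usually what you want when the same closing marker appears several times on a page.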
BeautifulSoup is the easiest way to do what you want. It can be installed with:
pip install beautifulsoup4
The sample code to do what you want is:
from bs4 import BeautifulSoup
doc = ['<div class="lyricbox">Hey You</div>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
print(soup.find('div', {'class': 'lyricbox'}).string)
You can use Python's urllib to grab content from the URL directly. The Beautiful Soup documentation is also helpful if you want to do more parsing.
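If you cannot install anything, roughly the same thing can be sketched with the standard library only, using urllib.request to fetch the page and html.parser in place of BeautifulSoup. The names LyricBoxParser, extract_lyrics and fetch_lyrics are invented for this example, and the UTF-8 decoding is an assumption about the page, so treat it as a starting point and not a drop-in solution.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LyricBoxParser(HTMLParser):
    """Collect the text inside <div class="lyricbox"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside the lyricbox; 0 = outside
        self.chunks = []  # pieces of text found inside the lyricbox

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # A nested div inside the lyricbox: track it so we know when we leave.
            self.depth += 1
        elif "lyricbox" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_lyrics(html):
    """Return the stripped text of the lyricbox div, or '' if there is none."""
    parser = LyricBoxParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()

def fetch_lyrics(url):
    """Download a page and extract the lyricbox text."""
    # Assumption: the page is UTF-8; real pages may declare another encoding.
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_lyrics(html)

print(extract_lyrics('<p>intro</p><div class="lyricbox">Hey You</div>'))  # Hey You
```

BeautifulSoup is still the more forgiving choice for messy real-world markup; this version only shows that the fetch-then-parse flow does not strictly need it.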
def foo(s, leader, trailer):
    end_of_leader = s.index(leader) + len(leader)
    start_of_trailer = s.index(trailer, end_of_leader)
    return s[end_of_leader:start_of_trailer]
This raises ValueError if the leader is not present in string s, or if the trailer is not present after it. You have not specified what behavior you want in such anomalous conditions; raising an exception is a pretty natural and Pythonic thing to do, letting the caller handle it with a try/except if it knows what to do in such cases.
A RE-based approach is also possible, but I think this pure-string approach is simpler.
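To show what the caller-side handling might look like, here is a small sketch. foo is repeated so the snippet runs on its own; foo_or_default is a hypothetical wrapper, and returning a default value is only one possible choice for the missing-marker case.

```python
def foo(s, leader, trailer):
    # Position just past the leader; s.index raises ValueError if it is absent.
    end_of_leader = s.index(leader) + len(leader)
    # Search for the trailer only after the leader.
    start_of_trailer = s.index(trailer, end_of_leader)
    return s[end_of_leader:start_of_trailer]

def foo_or_default(s, leader, trailer, default=None):
    """Hypothetical wrapper: return default when a marker is missing."""
    try:
        return foo(s, leader, trailer)
    except ValueError:
        return default

print(foo("IWantToMasterPython", "Master", "thon"))           # Py
print(foo_or_default("IWantToMasterPython", "Guru", "thon"))  # None
```

The same call works for the lyrics case, for example foo(page, '<div class="lyricbox">', '</div>'), as long as the opening tag in the page matches the leader character for character.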
If you're extracting any data from an HTML page, I'd strongly suggest using the BeautifulSoup library. I have also used it for extracting data from HTML, and it works great.