Python extracting data from HTML using split

Tags: python, html-parsing

A page retrieved from a URL has the following structure:

<p>
    <strong>Name:</strong> Pasan <br/>
    <strong>Surname: </strong> Wijesingher <br/>                    
    <strong>Former/AKA Name:</strong> No Former/AKA Name <br/>                    
    <strong>Gender:</strong> Male <br/>
    <strong>Language Fluency:</strong> ENGLISH <br/>                    
</p>

I want to extract the values of Name, Surname, etc. (I have to repeat this task for many pages.)

For that I tried using the following code:

import urllib2

url = 'http://www.my.lk/details.aspx?view=1&id=%2031'
source = urllib2.urlopen(url)

start = '<p><strong>Given Name:</strong>'
end = '<strong>Surname'
givenName=(source.read().split(start))[1].split(end)[0]

start = 'Surname: </strong>'
end = 'Former/AKA Name'
surname=(source.read().split(start))[1].split(end)[0]

print(givenName)
print(surname)

When I call source.read().split(...) only once it works fine, but when I use it twice I get a "list index out of range" error.

Can someone suggest a solution?

asked Feb 23 '13 05:02

Pasan W.

2 Answers

You can use BeautifulSoup to parse the HTML string.

Here is some code you might try. It uses BeautifulSoup to get the text rendered by the HTML, then parses that text to extract the data.

from bs4 import BeautifulSoup as bs

dic = {}
data = \
"""
    <p>
        <strong>Name:</strong> Pasan <br/>
        <strong>Surname: </strong> Wijesingher <br/>                    
        <strong>Former/AKA Name:</strong> No Former/AKA Name <br/>                    
        <strong>Gender:</strong> Male <br/>
        <strong>Language Fluency:</strong> ENGLISH <br/>                    
    </p>
"""

# Name the parser explicitly so bs4 does not have to guess one
soup = bs(data, 'html.parser')
# Get the text on the html through BeautifulSoup
text = soup.get_text()

# parsing the text
lines = text.splitlines()
for line in lines:
    # check if line has ':', if it doesn't, move to the next line
    if ':' not in line:
        continue
    # split the string at the first ':' only, so a value may itself contain ':'
    parts = line.split(':', 1)

    # You can add more tests here like
    # if not parts[1].strip():
    #     continue

    # stripping whitespace
    parts = [part.strip() for part in parts]
    # adding the values to a dictionary
    dic[parts[0]] = parts[1]
    # printing the data after processing (this form works on Python 2 and 3)
    print('%16s %20s' % (parts[0], parts[1]))
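If installing BeautifulSoup is not an option, the same label/value idea can be sketched with the standard library alone. The block below is only an illustration, not part of the original answer: it assumes Python 3 (where the module is html.parser), it assumes every label sits inside a strong tag as in the sample page, and the class name FieldParser is made up for this example.

```python
from html.parser import HTMLParser  # Python 3 standard library


class FieldParser(HTMLParser):
    """Hypothetical helper: collect label/value pairs where each label is in <strong>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}
        self._in_strong = False
        self._label = None

    def handle_starttag(self, tag, attrs):
        if tag == 'strong':
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == 'strong':
            self._in_strong = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_strong:
            # Text inside <strong> is the label; drop the trailing colon.
            self._label = text.rstrip(':').strip()
        elif self._label is not None:
            # The first non-empty text after a label is taken as its value.
            self.fields[self._label] = text
            self._label = None


sample = """
<p>
    <strong>Name:</strong> Pasan <br/>
    <strong>Surname: </strong> Wijesingher <br/>
    <strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
    <strong>Gender:</strong> Male <br/>
    <strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""

parser = FieldParser()
parser.feed(sample)
parser.close()
fields = parser.fields
print(fields['Name'], fields['Surname'])
```

Because the parser keys on the tag structure rather than on line breaks, it should keep working if the page puts several fields on one line, although it still depends on the strong/value layout staying the same.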

A tip: if you are going to use BeautifulSoup to parse HTML, it helps when the elements you want carry identifying attributes such as class="input" or id="10". That is, tags holding the same kind of data share the same class (or have a known id), so you can select them directly.

Update
In response to your comment, see the code below. It applies the tip above, which makes life (and coding) a lot easier.

from bs4 import BeautifulSoup as bs

c_addr = []
id_addr = []
data = \
"""
<h2>Primary Location</h2>
<div class="address" id="10">
    <p>
       No. 4<br>
       Private Drive,<br>
       Sri Lanka&nbsp;ON&nbsp;&nbsp;K7L LK <br>
"""
# Name the parser explicitly so bs4 does not have to guess one
soup = bs(data, 'html.parser')

for i in soup.find_all('div'):
    # get data using "class" attribute
    addr = ""
    # "class" is multi-valued: bs4 returns a list, or None if the attribute is absent
    if 'address' in (i.get("class") or []):
        text = i.get_text()
        for line in text.splitlines(): # line-wise
            line = line.strip() # remove whitespace
            addr += line # add to address string
        c_addr.append(addr)

    # get data using "id" attribute
    addr = ""
    # attribute values are strings, and a div without an id gives None
    if i.get("id") == "10":
        text = i.get_text()
        # same processing as above
        for line in text.splitlines():
            line = line.strip()
            addr += line
        id_addr.append(addr)

# print(...) with a single argument works on Python 2 and 3
print("id_addr")
print(id_addr)
print("c_addr")
print(c_addr)

answered Sep 26 '22 21:09

pradyunsg

You are calling read() twice. That is the problem. Instead of doing that you want to call read once, store the data in a variable, and use that variable where you were calling read(). Something like this:

fetched_data = source.read()

Then later...

givenName=(fetched_data.split(start))[1].split(end)[0]

and...

surname=(fetched_data.split(start))[1].split(end)[0]

That should work. Your code didn't work because the first read() call consumes the whole response and leaves the stream positioned at the end of the content. The next read() call has nothing left to read and returns an empty string, so split(start) produces a one-element list and indexing it with [1] raises the "list index out of range" error.
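The behaviour is easy to reproduce without a network connection. In this small sketch an io.BytesIO object stands in for the object returned by urlopen() (an assumption made purely for illustration; both are file-like and share the same read() semantics), and the payload is one line from the sample page.

```python
import io

# io.BytesIO is a stand-in for the response object returned by urlopen().
source = io.BytesIO(b'<strong>Surname: </strong> Wijesingher <br/>')

first = source.read()    # reads the whole payload
second = source.read()   # the stream is already at its end, so this is empty

print(len(first))
print(repr(second))

try:
    # Splitting an empty string gives a one-element list, so [1] does not exist.
    second.split(b'Surname: </strong>')[1]
except IndexError as exc:
    error_message = str(exc)
    print('IndexError:', error_message)
```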

Check out the docs for urllib2 and for methods on file objects.
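Since the task has to be repeated for many pages, the read-once approach can be wrapped in a small helper. This is a sketch only: extract_between is a hypothetical name, the '<br/>' end marker is an assumption based on the sample page, and the string literal stands in for the result of source.read() (on Python 3 the bytes returned by read() would need decoding first). It returns None instead of raising when a marker is missing.

```python
def extract_between(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    value, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return value.strip()


# Stand-in for: fetched_data = source.read()
fetched_data = (
    '<p><strong>Name:</strong> Pasan <br/>'
    '<strong>Surname: </strong> Wijesingher <br/>'
    '<strong>Former/AKA Name:</strong> No Former/AKA Name <br/></p>'
)

given_name = extract_between(fetched_data, 'Name:</strong>', '<br/>')
surname = extract_between(fetched_data, 'Surname: </strong>', '<br/>')
# The marker used in the question does not occur in the sample shown, so this is None.
missing = extract_between(fetched_data, 'Given Name:</strong>', '<br/>')
print(given_name, surname, missing)
```

Returning None for a missing marker makes it possible to skip or log pages whose layout differs, rather than stopping a long run with an IndexError.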

answered Sep 24 '22 21:09

Matt
