
<br> tag screwing up my scraped data when using Beautiful Soup and Python

I am trying to get a detailed list of golf courses from a given website. I wrote a scraper to collect the name and address of golf courses in the US.

My problem is with the addresses I scrape: in my CSV file there is no space between the first line of the address and the second line. In the HTML, the two lines of text are separated by a <br> tag.

How do I handle that in my code so that the two lines of text end up separated by a space when written to the CSV?

Here is what the HTML I am trying to scrape looks like:

<div class="location">10924 Verterans Memorial Dr<br>Abbeville, Louisiana, United States</div>

And the output of my code that scraped this looks like this:

10924 Verterans Memorial DrAbbeville, Louisiana, United States

Notice that there is no space between "Memorial Dr" and "Abbeville". How do I change my code so that a space is inserted there when scraping?

Here is my code:

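import csv
import requests
from bs4 import BeautifulSoup
# Assumption: ArcGIS is geopy's geocoder class; the original snippet omitted this import.
from geopy.geocoders import ArcGIS

courses_list = []
geolocator = ArcGIS()  # not used in the code shown here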

for i in range(1):
    url="http://sites.garmin.com/clsearch/courses/search?course=&location=&country=US&state=&holes=&radius=&lang=en&search_submitted=1&per_page={}".format(i*20)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    #print soup
    g_data2 = soup.find_all("div",{"class":"result"})
    #print g_data2
    for item in g_data2:
        try:
            name = item.find_all("div",{"class":"name"})[0].text
            print name
        except:
            name=''
            print "No Name found!"
        try:
            address= item.find_all("div",{"class":"location"})[0].text
            print address
        except:
            address=''
            print "No Address found!"

        # Append inside the loop so every course is saved, not only the last one.
        course=[name,address]
        courses_list.append(course)

with open ('geotest.csv','wb') as file:
     writer=csv.writer(file)
     for row in courses_list:
         writer.writerow(row)
Gonzalo68 asked Nov 20 '25 13:11


1 Answer

The text attribute of a BeautifulSoup tag returns a string composed of all child strings of the tag, concatenated using the default separator (an empty string). To substitute a different separator, you can use the get_text() method.

Taking address_tag to be the <div> in question:

>>> print address_tag.get_text(separator=' ')
## 10924 Verterans Memorial Dr Abbeville, Louisiana, United States

or to recreate the multiple lines:

>>> print address_tag.get_text(separator='\n')
## 10924 Verterans Memorial Dr
## Abbeville, Louisiana, United States

You can also accomplish the last result by extracting the strings separately:

>>> for s in address_tag.strings:
...     print s
...
## 10924 Verterans Memorial Dr
## Abbeville, Louisiana, United States
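
For anyone on Python 3, here is a minimal, self-contained sketch of the same idea applied to the markup from the question. A few things in it are assumptions for illustration only: the helper name extract_address, the course name "Example Golf Course", the use of the built-in html.parser instead of lxml, and an in-memory buffer standing in for geotest.csv.

```python
import csv
import io

from bs4 import BeautifulSoup

# The location <div> is copied from the question; the name <div> is a
# hypothetical placeholder so that there is a full row to write.
HTML = (
    '<div class="result">'
    '<div class="name">Example Golf Course</div>'
    '<div class="location">10924 Verterans Memorial Dr<br>'
    'Abbeville, Louisiana, United States</div>'
    '</div>'
)


def extract_address(tag, separator=" "):
    """Return the tag's text with `separator` placed between child strings."""
    # strip=True trims whitespace around each child string before joining.
    return tag.get_text(separator=separator, strip=True)


soup = BeautifulSoup(HTML, "html.parser")
address_tag = soup.find("div", {"class": "location"})

address = extract_address(address_tag)
print(address)

# Joining the stripped child strings by hand gives the same text.
joined = " ".join(address_tag.stripped_strings)

# Write one row to an in-memory CSV; the csv module quotes the address
# automatically because it contains commas.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Example Golf Course", address])
csv_line = buffer.getvalue()
print(csv_line)
```

In the scraper from the question, the equivalent change would be to call get_text(separator=' ') on the location <div> instead of reading its .text attribute.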
tegancp answered Nov 22 '25 03:11


