
Unable to print names in the right way in another function

I've written a Python script that scrapes all the names, and the links associated with them, from the landing page of a website using a get_links() function. I then created another function, get_info(), which visits each linked page (using the links collected by the first function) in order to scrape the phone number from there.

I wouldn't need the second function at all if my goal were only to parse those two items, because both are already available on the landing page.

However, I would like my parser to print the names (carried over from the first function) inside the second function, along with the phone numbers found there. Most importantly, I do not want to remove the for loop defined within the second function. If the for loop were not in the second function, the problem would not have arisen: without the for loop I can already get the desired output.

This is my script so far:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://potguide.com/alaska/marijuana-dispensaries/"

def get_links(link):
    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0'
    r = session.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("#StateStores .basic-listing"):
        name = items.select_one("h4 a").text
        namelink = urljoin(link,items.select_one("h4 a").get("href"))  ##making it a fully qualified url
        get_info(session,name,namelink)          ##passing session in order to reuse it

def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("ul.list-unstyled"):  ##if I did not use for loop I could get the output as desired.
        try:
            phone = items.select_one("a[href^='tel:']").text
        except:
            phone = ""
        print(title,phone)

if __name__ == '__main__':
    get_links(url)

The output I'm getting:

AK Frost 
AK Frost 
AK Frost 
AK Frost 
AK Frost 
AK Frost (907) 563-9333
AK Frost 
AK Frost 
AK Frost (907) 563-9333
AK Frost  
AK Fuzzy Budz 
AK Fuzzy Budz (907) 644-2838
AK Fuzzy Budz 
AK Fuzzy Budz 
AK Fuzzy Budz (907) 644-2838

My expected output:

AK Frost (907) 563-9333
AK Fuzzy Budz (907) 644-2838
SIM asked May 23 '18 15:05


4 Answers

If the goal is only to get the expected output, this should work:

def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("ul.list-unstyled"):
        try:
            phone = items.select_one("a[href^='tel:']").text
        except AttributeError:
            # select_one() returned None: no phone link in this list, try the next one
            continue
        else:
            # no exception was raised, so we have the phone
            print(title,phone)
            break
Artem Nepo answered Sep 22 '22 15:09


In my opinion, you should utilise the underlying JavaScript dictionary, which already holds your data (and much more) in a structured format.

You can use yaml to convert the JavaScript dictionary to a Python dict object. From that dictionary you can easily access fields such as id, name, address, city, state, etc.

Here's a working example:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import yaml

url = "https://potguide.com/alaska/marijuana-dispensaries/"

def get_links(link):
    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0'
    r = session.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select("#StateStores .basic-listing"):
        name = items.select_one("h4 a").text
        namelink = urljoin(link,items.select_one("h4 a").get("href"))
        get_info(session, name, namelink)

def get_info(session, title, url):
    # reuse the session (and its User-Agent header) passed in by get_links()
    response = session.get(url)
    soup = BeautifulSoup(response.content, "lxml")
    script = next((i for i in map(str, soup.find_all("script", type="text/javascript"))
                   if 'mapOptions' in i), None)
    if script:
        js_dict = script.split('__mapOptions = ')[1].split(';\n')[0]
        # safe_load: yaml.load() without a Loader argument fails on PyYAML 6 and later
        d = yaml.safe_load(js_dict)
        print(title, d['mapStore']['phone'])

get_links(url)

Result:

AK Frost (907) 563-9333
AK Fuzzy Budz (907) 644-2838
AK Joint (907) 522-5222
AK Slow Burn (907) 868-1450
Alaska Fireweed (907) 258-9333
...
Bad Gramm3r (907) 357-0420
Green Degree (907) 376-3155
Green Jar (907) 631-3800
Rosebuds Shatter House (907) 376-9334
Happy Cannabis (907) 305-0292
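
The string-splitting step in get_info() can be tried in isolation. Below is a minimal sketch on a made-up script body: sample_script, its fields, and the helper name extract_map_options are assumptions for illustration, not the real page content. The sample literal is deliberately valid JSON so that the standard-library json module can parse it; the literal on the real page may not be strict JSON, which is why the code above goes through yaml instead.

```python
import json

# Hypothetical <script> body; the real page's content will differ.
sample_script = (
    'var __mapOptions = {"mapStore": {"name": "AK Frost", '
    '"phone": "(907) 563-9333"}};\n'
    'initMap(__mapOptions);\n'
)

def extract_map_options(script_text):
    """Return the object assigned to __mapOptions, or None if it is absent."""
    marker = '__mapOptions = '
    if marker not in script_text:
        return None
    # Keep what follows the marker, up to the first ";\n" that ends the statement.
    literal = script_text.split(marker, 1)[1].split(';\n', 1)[0]
    return json.loads(literal)

options = extract_map_options(sample_script)
print(options['mapStore']['phone'])
```

Returning None when the marker is missing mirrors the "if script:" guard above, so a page without the map data is skipped instead of raising IndexError.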
jpp answered Sep 21 '22 15:09


I think the selection of ul.list-unstyled on the subpage is too broad: there are too many of them, with content that you don't actually want.

If you really only want the phone numbers, you could directly search for the a tags whose href starts with "tel:". The remaining problem is that these pages list multiple numbers this way, usually two, one of which is not visible. The visible one always seems to be underneath div.col-md-3. I tried this:

def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for a_phone in soup.select("div.col-md-3 a[href^='tel:']"):        
        print(title, a_phone.text)

and got the following result:

AK Frost (907) 563-9333
AK Fuzzy Budz (907) 644-2838
AK Joint (907) 522-5222
AK Slow Burn (907) 868-1450
Alaska Fireweed (907) 258-9333
Alaskabuds (907) 334-6420
Alaskan Leaf (907) 770-0262
Alaska's Green Light District (907) 644-2839
AM Delight (907) 229-1730
Arctic Herbery (907) 222-1466
Cannabaska (907) 375-9333
Catalyst Cannabis Company (907) 344-0668
Dankorage (907) 279-3265
Enlighten Alaska (907) 290-8559
Great Northern Cannabis (907) 929-9333
Hillside Natural Wellness (907) 868-8639
Hollyweed 907 (907) 929-3331
Raspberry Roots (907) 522-2450
Satori (907) 222-5420
The House of Green (907) 929-3105
Uncle Herb's (907) 561-4372
The Green Spot (907) 354-7044
Denali's Cannabis Cache (907) 683-2633
GOOD (907) 452-5463
Goodsinse (907) 347-7689
Grass Station 49 (907) 374-4420
Green Life Supply (907) 374-4769
One Hit Wonder (844) 420-1448
Pakalolo Supply Company (907) 479-9000
Rebel Roots (907) 455-4055
True Dank (907) 451-4516
The Herbal Cache (907) 783-0420
Denali 420 Recreationals (907) 892-9333
Glacier Valley Shoppe (907) 419-7943
Green Elephant (907) 290-8400
Rainforest Farms (907) 209-2670
The Fireweed Factory (907) 957-2670
Red Run Cannabis Company (907) 283-0800
Cannabis Corner (907) 225-4420
Rainforest Cannabis (907) 247-9333
The Stoney Moose (907) 617-8973
Chena Cannabis (907) 488-0489
The 420 (907) 772-3673
Green Leaf (907) 623-0332
Weed Dudes (907) 623-0605
Remedy Shoppe (907) 983-3345
Fat Tops (907) 953-2470
High Bush Buds (907) 953-9393
Pine Street Cannabis Company (907) 260-3330
Permafrost Distributors (907) 260-7584
Hilltop Premium Green (907) 745-4425
The High Expedition Company (907) 733-0911
Herbal Outfitters (907) 835-4201
Bad Gramm3r (907) 357-0420
Green Degree (907) 376-3155
Green Jar (907) 631-3800
Rosebuds Shatter House (907) 376-9334
Happy Cannabis (907) 305-0292
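
To see why scoping the search to div.col-md-3 drops the hidden duplicate, here is a rough standard-library sketch of the same idea. It swaps the CSS selector for html.parser.HTMLParser, and the sample markup (including the hidden-xs and sidebar class names) is invented for illustration, so treat it as an approximation of the page structure rather than a copy of it.

```python
from html.parser import HTMLParser

class VisiblePhoneParser(HTMLParser):
    """Collect the text of <a href="tel:..."> links nested inside <div class="col-md-3">."""

    def __init__(self):
        super().__init__()
        self.div_stack = []       # one bool per open <div>: is it a col-md-3 div?
        self.in_tel_link = False  # True while inside a matching <a>
        self.phones = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            self.div_stack.append("col-md-3" in classes)
        elif tag == "a" and any(self.div_stack):
            # Only count tel: links that have a col-md-3 div as an ancestor.
            if (attrs.get("href") or "").startswith("tel:"):
                self.in_tel_link = True
                self.phones.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == "a":
            self.in_tel_link = False

    def handle_data(self, data):
        if self.in_tel_link:
            self.phones[-1] += data

# Made-up markup: the same number appears twice, once outside div.col-md-3.
sample_html = """
<div class="hidden-xs"><a href="tel:9075639333">(907) 563-9333</a></div>
<div class="col-md-3 sidebar">
  <ul class="list-unstyled"><li><a href="tel:9075639333">(907) 563-9333</a></li></ul>
</div>
"""

parser = VisiblePhoneParser()
parser.feed(sample_html)
print("AK Frost", parser.phones)
```

With BeautifulSoup available, the one-line selector above is the simpler choice; the sketch only makes the "ancestor must be div.col-md-3" condition explicit.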
Jeronimo answered Sep 24 '22 15:09


You already got enough good answers, but you might try this also:

def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    phone = "N/A"  # default; also covers pages with no ul.list-unstyled at all
    for items in soup.select("ul.list-unstyled"):
        links = items.select("a[href^='tel:']")
        if links:
            phone = links[0].text
            break
    print(title, phone)

or with some kind of one-liner :)

def get_info(session,title,url):
    r = session.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    phone = ([items.select("a[href^='tel:']")[0].text for items in soup.select("ul.list-unstyled") 
              if len(items.select("a[href^='tel:']"))] + ["N/A"])[0]
    print(title, phone) 

Note that "N/A" is assigned if no phone number is found (e.g. Northern Lights Indoor Gardens N/A).
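
The same "first match, else a fallback" logic can also be written with the built-in next() and a default value. The sketch below runs on plain Python lists standing in for what select() would return for each ul.list-unstyled; first_phone and the sample data are hypothetical names used only for illustration.

```python
def first_phone(groups, default="N/A"):
    """Return the first phone found across groups, or default if none has one.

    Each group stands in for the tel: link texts found in one <ul>.
    """
    return next((group[0] for group in groups if group), default)

# Lists standing in for the per-<ul> results on two pages.
ak_frost = [[], [], ["(907) 563-9333"], [], ["(907) 563-9333"]]
no_phone = [[], []]

print("AK Frost", first_phone(ak_frost))
print("Northern Lights Indoor Gardens", first_phone(no_phone))
```

Because next() stops at the first non-empty group, the remaining lists are never inspected, which is the same effect as the break in the loop version.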

Andersson answered Sep 23 '22 15:09