
Python on Appengine using BeautifulSoup ImportError: No module named bs4

EDIT2: SOLVED! See the answer below regarding proper importing: use from lib.bs4 import BeautifulSoup instead of just from bs4 import BeautifulSoup.

EDIT: Putting bs4 in the root of the project seems to resolve the issue; however, it isn't an ideal structure. So, I am leaving this question active to try and get to a more robust solution.

A variation of this question has been asked in the past, but the solutions there do not seem to work. I'm unsure if that is because of a change with BeautifulSoup or with Appengine, to be honest.

See: Python 2.7 : How to use BeautifulSoup in Google App Engine?, How to include third party Python libraries in Google App Engine?, and Which version of BeautifulSoup works with GAE (python 2.5)?

The solution proposed by Lipis seems to be adding the 3rd party library to a libs folder in the root of the project then adding the following to the main application:

import sys
sys.path.insert(0, 'libs')

Currently, my structure is this:

ntj-test
├── lib
│   └── bs4 
├── templates
├── main.py
├── get_data.py 
└── app.yaml

Here is my app.yaml:

application: ntj-test
version: 1
runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: /favicon\.ico
  static_files: favicon.ico
  upload: favicon\.ico

- url: .*
  script: main.app

libraries:
- name: webapp2
  version: latest
- name: jinja2
  version: latest

Here is my main.py:

import webapp2
import jinja2
import get_data
import sys

sys.path.insert(0, 'lib')

JINJA_ENVIRONMENT = jinja2.Environment(
    loader=jinja2.FileSystemLoader('templates'),
    extensions=['jinja2.ext.autoescape'],
    autoescape=True,
)


class MainHandler(webapp2.RequestHandler):
    def get(self):

        # Call all_coach_data() once (each call fetches pages over the network);
        # it returns [teamKey, teamName, headCoach].
        teamKey, teamName, coachName = get_data.all_coach_data()

        values = {
            'coachName': coachName,
            'teamName': teamName,
            'teamKey': teamKey,
        }

        template = JINJA_ENVIRONMENT.get_template('index.html')
        self.response.write(template.render(values))

app = webapp2.WSGIApplication([
    ('/', MainHandler)
], debug=True)

get_data.py returns the correct data to my variables for populating values, which I have verified in the debugger.

The problem comes when launching main.py in my development environment (I haven't uploaded to gcloud yet). Without fail, regardless of the nifty tricks I've discovered through the above links or throughout my Google searching, the terminal always returns:

ImportError: No module named bs4

In one of the SO links from above, a commenter says "GAE support only Pure Python Modules. bs4 is not pure because some parts were written in C." I am not sure if this is true or not, and I'm unsure how to verify it. I don't have enough reputation to comment to find out. :(
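One rough way to check a claim like the commenter's would be to look for compiled extension files inside the vendored package directory. The sketch below is only a heuristic under stated assumptions: the function name find_compiled_files and the suffix list are illustrative choices, not part of bs4 or App Engine, and an empty result merely suggests that the package is pure Python.

```python
import os

# File suffixes that typically indicate compiled (non-pure-Python) modules.
# This list is an assumption for illustration and is not exhaustive.
COMPILED_SUFFIXES = ('.so', '.pyd', '.dll', '.dylib')


def find_compiled_files(package_dir):
    """Return sorted paths (relative to package_dir) of compiled files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(package_dir):
        for filename in filenames:
            if filename.lower().endswith(COMPILED_SUFFIXES):
                full_path = os.path.join(dirpath, filename)
                found.append(os.path.relpath(full_path, package_dir))
    return sorted(found)
```

With the structure shown above, it could be called as find_compiled_files(os.path.join('lib', 'bs4')) from the project root.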

I have been through the bs4 docs on Crummy's website, I have read all of the related SO questions and answers, and I have tried to glean hints from Appengine's documentation. However, I have been unable to find a solution that doesn't involve using the deprecated version of BeautifulSoup, which doesn't have the functionality I need.

I'm a beginner to programming and using StackOverflow, so if I have left out some important piece of information or not followed good practices with the question, please let me know. I will edit and add additional information where necessary.

Thank you!

EDITS: I wasn't sure if the get_data code would be overkill, but here it is:

from bs4 import BeautifulSoup
import urllib2, re

teamKeys = {
    'ATL': 'Atlanta Falcons',
    'HOU': 'Houston Texans',
}

def get_all_coaches():
    for key in teamKeys:
        page = urllib2.urlopen("http://www.nfl.com/teams/coaches?coaType=head&team=" + key)
        soup = BeautifulSoup(page)
        return(head_coach(soup))

def head_coach(soup):
    head = soup.select('.coachprofiletext p')[0].text
    position, name = re.split(': ', head)
    return name

def export_coach_data():
    testList = []
    for key in teamKeys:
        page = urllib2.urlopen("http://www.nfl.com/teams/coaches?coaType=head&team=" + key)
        soup = BeautifulSoup(page)
        teamKey = key
        teamName = teamKeys[key]
        headCoach = head_coach(soup)

        t = [
            teamKey,
            teamName,
            str(headCoach),
        ]

        testList.append(t)

    return(testList)

def all_coach_data():
    # export_coach_data is defined in this module, so call it directly.
    results = export_coach_data()

    ATL = results[0]
    HOU = results[1]

    return ATL

I'd like to point out that this is probably littered with poor execution (I've only been developing in earnest for a couple months in my spare time), but it does return the correct values to my variables in main.

Here is the Appengine Launcher log:

2014-11-05 15:36:53 Running command: "['C:\\Python27\\pythonw.exe', 'C:\\Program Files\\Google\\Cloud SDK\\google-cloud-sdk\\platform\\google_appengine\\dev_appserver.py', '--skip_sdk_update_check=yes', '--port=11080', '--admin_port=8003', u'G:\\projects\\coaches']"
INFO     2014-11-05 15:37:00,119 devappserver2.py:725] Skipping SDK update check.
WARNING  2014-11-05 15:37:00,157 api_server.py:383] Could not initialize images API; you are likely missing the Python "PIL" module.
INFO     2014-11-05 15:37:00,190 api_server.py:171] Starting API server at: http://localhost:19713
INFO     2014-11-05 15:37:00,210 dispatcher.py:183] Starting module "default" running at: http://localhost:11080
INFO     2014-11-05 15:37:00,216 admin_server.py:117] Starting admin server at: http://localhost:8003
ERROR    2014-11-05 20:37:48,726 wsgi.py:262] 

Traceback (most recent call last):
  File "C:\Program Files\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\runtime\wsgi.py", line 239, in Handle
    handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
  File "C:\Program Files\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\runtime\wsgi.py", line 298, in _LoadHandler
    handler, path, err = LoadObject(self._handler)
  File "C:\Program Files\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\runtime\wsgi.py", line 84, in LoadObject
    obj = __import__(path[0])
  File "G:\projects\coaches\main.py", line 3, in <module>
    import get_data
  File "G:\projects\coaches\get_data.py", line 1, in <module>
    from bs4 import BeautifulSoup
ImportError: No module named bs4

INFO     2014-11-05 15:37:48,762 module.py:652] default: "GET / HTTP/1.1" 500 -
nicholas asked Nov 05 '14 19:11


2 Answers

EDIT: It has been pointed out that this is a bit of a hack. If so, how can this solution be modified to not require renaming of modules inside BS4?

A couple users over at http://www.reddit.com/r/learnpython helped me solve this problem.

By expanding on the solution proposed by Lipis, we added the following to main.py:

import os, sys

rootdir = os.path.dirname(os.path.abspath(__file__))
lib = os.path.join(rootdir, 'lib')
sys.path.append(lib)

Then, and here's what no one ever mentioned here or in any of the other SO answers, I added "lib.bs4" to all of my import statements, as such:

from lib.bs4 import BeautifulSoup

But, not only that, there were references to bs4 within the bs4 library itself, so I searched for and replaced all of those with lib.bs4.<something>.

Now, finally, my app runs, and the structure is organized. All the credit goes to /u/invalidusemame and /u/prohulaelk.

Hopefully, this post helps someone else stuck in a similar situation. Maybe it should have been obvious that all the imports would need the lib. prefix added to the import statement, but it wasn't immediately obvious from all of the answers.

Thank you to everyone who helped troubleshoot!
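Regarding the EDIT at the top of this answer: the traceback in the question shows main.py failing at line 3 (import get_data), which runs before the sys.path.insert(0, 'lib') call further down the file. One possible way to avoid renaming anything inside bs4, then, is to adjust sys.path before any import that depends on it. Below is a minimal sketch of that ordering effect. It uses a hypothetical throwaway package (fakepkg_demo) in place of bs4 and a temporary directory in place of the real project, so it illustrates the mechanism and is not a tested App Engine fix.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway project with a vendored package under lib/.
# "fakepkg_demo" is a hypothetical stand-in for bs4.
project_root = tempfile.mkdtemp()
package_dir = os.path.join(project_root, 'lib', 'fakepkg_demo')
os.makedirs(package_dir)
with open(os.path.join(package_dir, '__init__.py'), 'w') as f:
    f.write("NAME = 'vendored'\n")


def try_import(name):
    """Return the imported module, or None if the import fails."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# 1. lib/ is not on sys.path yet, so the import fails.
before = try_import('fakepkg_demo')

# 2. Put lib/ on sys.path first, then import: no "lib." prefix is needed.
sys.path.insert(0, os.path.join(project_root, 'lib'))
after = try_import('fakepkg_demo')

print(before is None, after is not None)
```

Applied to the question's main.py, that would presumably mean moving the sys.path lines above import get_data and leaving from bs4 import BeautifulSoup unchanged. As far as I know, the App Engine SDK for the python27 runtime also provides google.appengine.ext.vendor, whose vendor.add('lib') call in an appengine_config.py file serves the same purpose.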

nicholas answered Nov 15 '22 01:11


I believe your issue is a typo in main.py:

sys.path.insert(0, 'lib')

Your directory is libs, not lib.

GAEfan answered Nov 15 '22 01:11