I am currently working on a project to build a web scraper in Python and then dockerize it so that the application can be run on any machine. I have already built the Python app, using Selenium to load the webpage I am scraping. I am unsure how to package the project in Docker along with a web driver (like geckodriver) so that it can be run. Do I need to create a container with the application and link it to another Selenium container? Thanks for any help!
My code takes in a list of zip codes from a text file I have compiled and uses these codes to scrape a particular location on a map. Once it has scraped the data, it appends the data to a CSV file. I need to be able to run the application and then have it output the CSV file to the host machine.
Edit: I have never used Docker before, but have done some research on how it works. Please ELI5.
First of all, you need a Docker image with all the required packages installed. Let's create a Dockerfile for this.
FROM ubuntu:bionic
RUN apt-get update && apt-get install -y \
    python3 python3-pip \
    fonts-liberation libappindicator3-1 libasound2 libatk-bridge2.0-0 \
    libnspr4 libnss3 lsb-release xdg-utils libxss1 libdbus-glib-1-2 \
    curl unzip wget \
    xvfb
# install geckodriver and firefox
RUN GECKODRIVER_VERSION=`curl https://github.com/mozilla/geckodriver/releases/latest | grep -Po 'v[0-9]+.[0-9]+.[0-9]+'` && \
    wget https://github.com/mozilla/geckodriver/releases/download/$GECKODRIVER_VERSION/geckodriver-$GECKODRIVER_VERSION-linux64.tar.gz && \
    tar -zxf geckodriver-$GECKODRIVER_VERSION-linux64.tar.gz -C /usr/local/bin && \
    chmod +x /usr/local/bin/geckodriver && \
    rm geckodriver-$GECKODRIVER_VERSION-linux64.tar.gz
RUN FIREFOX_SETUP=firefox-setup.tar.bz2 && \
    apt-get purge -y firefox && \
    wget -O $FIREFOX_SETUP "https://download.mozilla.org/?product=firefox-latest&os=linux64" && \
    tar xjf $FIREFOX_SETUP -C /opt/ && \
    ln -s /opt/firefox/firefox /usr/bin/firefox && \
    rm $FIREFOX_SETUP
# install chromedriver and google-chrome
RUN CHROMEDRIVER_VERSION=`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE` && \
    wget https://chromedriver.storage.googleapis.com/$CHROMEDRIVER_VERSION/chromedriver_linux64.zip && \
    unzip chromedriver_linux64.zip -d /usr/bin && \
    chmod +x /usr/bin/chromedriver && \
    rm chromedriver_linux64.zip
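# Install the .deb through apt-get so that any missing dependencies are resolved
# in the same step (a bare `dpkg -i` exits with an error when dependencies are missing).
RUN CHROME_SETUP=google-chrome.deb && \
    wget -O $CHROME_SETUP "https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb" && \
    apt-get install -y ./$CHROME_SETUP && \
    rm $CHROME_SETUP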
# install phantomjs
RUN wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    tar -jxf phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    cp phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin/phantomjs && \
    rm -rf phantomjs-2.1.1-linux-x86_64 phantomjs-2.1.1-linux-x86_64.tar.bz2
RUN pip3 install selenium
RUN pip3 install pyvirtualdisplay
RUN pip3 install Selenium-Screenshot
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
ENV PYTHONUNBUFFERED=1
ENV APP_HOME=/usr/src/app
WORKDIR $APP_HOME
COPY . $APP_HOME/
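# Only the last CMD in a Dockerfile takes effect. To keep the container idle
# for debugging instead of running the script, use: CMD tail -f /dev/null
CMD python3 example.py
The CMD line is what runs when the container starts. In my case it runs example.py.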
Now place example.py in the same directory as the Dockerfile. An example covering Firefox, Chrome and PhantomJS is given below.
import os
import logging

from pyvirtualdisplay import Display
from selenium import webdriver

logging.getLogger().setLevel(logging.INFO)

BASE_URL = 'http://www.example.com/'


def chrome_example():
    # Start a virtual X display (Xvfb) so the browser can run without a screen.
    display = Display(visible=0, size=(800, 600))
    display.start()
    logging.info('Initialized virtual display..')

    chrome_options = webdriver.ChromeOptions()
    # Chrome's sandbox does not work when running as root inside a container.
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_experimental_option('prefs', {
        'download.default_directory': os.getcwd(),
        'download.prompt_for_download': False,
    })
    logging.info('Prepared chrome options..')

    # `options=` replaces the deprecated `chrome_options=` keyword.
    browser = webdriver.Chrome(options=chrome_options)
    logging.info('Initialized chrome browser..')

    browser.get(BASE_URL)
    logging.info('Accessed %s ..', BASE_URL)
    logging.info('Page title: %s', browser.title)

    browser.quit()
    display.stop()


def firefox_example():
    display = Display(visible=0, size=(800, 600))
    display.start()
    logging.info('Initialized virtual display..')

    firefox_profile = webdriver.FirefoxProfile()
    firefox_profile.set_preference('browser.download.folderList', 2)
    firefox_profile.set_preference('browser.download.manager.showWhenStarting', False)
    firefox_profile.set_preference('browser.download.dir', os.getcwd())
    firefox_profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
    logging.info('Prepared firefox profile..')

    browser = webdriver.Firefox(firefox_profile=firefox_profile)
    logging.info('Initialized firefox browser..')

    browser.get(BASE_URL)
    logging.info('Accessed %s ..', BASE_URL)
    logging.info('Page title: %s', browser.title)

    browser.quit()
    display.stop()


def phantomjs_example():
    display = Display(visible=0, size=(800, 600))
    display.start()
    logging.info('Initialized virtual display..')

    browser = webdriver.PhantomJS()
    logging.info('Initialized phantomjs browser..')

    browser.get(BASE_URL)
    logging.info('Accessed %s ..', BASE_URL)
    logging.info('Page title: %s', browser.title)

    browser.quit()
    display.stop()


if __name__ == '__main__':
    chrome_example()
    firefox_example()
    phantomjs_example()
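In example.py, browser.quit() and display.stop() are only reached when nothing fails, so an exception in browser.get() would leave the browser and the virtual display running. A minimal sketch of one way to guarantee cleanup, using contextlib.ExitStack from the standard library; the helper name run_with_cleanup and its parameters are hypothetical and not part of the original example:

```python
from contextlib import ExitStack


def run_with_cleanup(display, make_browser, work):
    """Start display, create a browser, run work(browser), always clean up.

    display      -- object with start() and stop(), e.g. a pyvirtualdisplay Display
    make_browser -- zero-argument callable returning an object with quit()
    work         -- callable that receives the browser and does the scraping
    (All three parameter names are assumptions made for this sketch.)
    """
    with ExitStack() as stack:
        display.start()
        # Registered callbacks run in reverse order when the block exits,
        # even if an exception is raised: browser.quit() first, then display.stop().
        stack.callback(display.stop)
        browser = make_browser()
        stack.callback(browser.quit)
        return work(browser)
```

With the names from example.py, a call could look like run_with_cleanup(Display(visible=0, size=(800, 600)), webdriver.Firefox, lambda browser: browser.get(BASE_URL)).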
Finally, create a docker-compose.yml to simplify building and running the container.
selenium:
  build: .
  ports:
    - 4000:4000
  volumes:
    - ./data/:/data/
  privileged: true
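The volumes entry maps ./data on the host to /data inside the container, so any file the scraper writes under /data also appears on the host. A rough sketch of how the zip-code-to-CSV part of the question could use this, with standard-library code only; the function names, the column names and the scrape_zip_code placeholder are assumptions for illustration and should be replaced with the real Selenium logic:

```python
import csv
from pathlib import Path


def read_zip_codes(path):
    """Return the non-empty lines of a text file, one zip code per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def append_rows(csv_path, rows, header):
    """Append rows to csv_path; write the header only if the file is new or empty."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)


def scrape_zip_code(zip_code):
    """Hypothetical placeholder: return the rows scraped for one zip code."""
    return [(zip_code, "example result")]


def scrape_all(zip_file, output_csv):
    """Scrape every zip code in zip_file and append the results to output_csv."""
    for zip_code in read_zip_codes(zip_file):
        append_rows(output_csv, scrape_zip_code(zip_code),
                    header=("zip_code", "result"))
```

Calling scrape_all("zipcodes.txt", "/data/results.csv") inside the container would then leave results.csv in ./data on the host; both file names are examples, not names from the original project.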
Build and run with the following command:
docker-compose build && docker-compose up -d
You can also build and run it with plain docker commands, without docker-compose:
docker build -t selenium_docker .
docker run --privileged -p 4000:4000 -v "$(pwd)/data:/data" -d -it selenium_docker
Source:
https://github.com/dimmg/dockselpy