
Storing the results of Web Scraping into Database

I have written code for web scraping in Python. The code extracts MacBook data from Amazon using Selenium. Now I want to store these values in Excel or MySQL. There are various HTML/CSS classes in a particular product row, and one parent class that includes all the parameters of the product. To be precise, the code is:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import xlwt 
from xlwt import Workbook 
option = webdriver.ChromeOptions()
option.add_argument("--incognito")
browser = webdriver.Chrome(executable_path='/home/mukesh/Desktop/backup/Programminghub/whatsapp_python_scripts/chromedriver_linux64/chromedriver', chrome_options=option)
# go to website of interest
browser.get("https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=macbook")
# wait up to 10 seconds for page to load
timeout = 10
try:
    WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, "//img[@class='s-access-image cfMarker']")))
except TimeoutException:
    print("Timed out waiting for page to load")
    browser.quit()

titles_element = browser.find_elements_by_xpath("//div[@class='s-item-container']")
titles = []
for x in titles_element:
    value=x.text
    value=value.encode('ascii', 'ignore')
    titles.append(value)
print(titles)

Now the output that I get is highly unstructured and contains some parameters which are present only on certain products. For instance, the parameters "Maximum Resolution" or "CPU Model Manufacturer" appear only on certain laptops, not on all of them. I don't want such parameters. I want only these parameters, which are present on all the laptops: product name (title of the row), price, operating system, CPU model family, computer memory size and display size.

I am unable to split the titles list into these sub-lists. I also tried a naive approach where I split the products by accessing the individual classes of every parameter, but the values didn't match up correctly: the price of one laptop was shown against another, and sponsored ads caused problems as well.

Link of website: Amazon Macbook Scraping

I just want these parameters, either in my list or in Excel or a MySQL database: product name (title of the row), price, operating system, CPU model family, computer memory size and display size (6 columns).

asked Sep 29 '18 by mozilla-firefox



1 Answer

Well, you have two different problems here, as I see it:

  1. Fetching all the details you want for every item and putting them into a data structure.
  2. Saving that data with a DB or an Excel file (CSV for example).

So let's assume all you're interested in getting about a product is its name and price (just for the sake of explanation); we'll create a simple class called Product:

class Product(object):
    def __init__(self, name, price):
        self.name = name
        self.price = price

And then, for every item found, we get its name and price and create a Product instance:

from selenium.common.exceptions import WebDriverException

titles_element = browser.find_elements_by_xpath("//div[@class='s-item-container']")
products = []
for x in titles_element:
    try:
        name = x.find_element_by_class_name("s-access-title").text
        price = x.find_element_by_class_name("s-price").text
        products.append(Product(name, price))
    except WebDriverException:
        # some rows (sponsored ads, banners) lack these elements; skip them
        pass

You can, of course, get any other data you are interested in using the right CSS/XPath selector, or even regular expressions.
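For fields without a dedicated class, a regular expression over the row's raw text can work. A minimal sketch; the row text and pattern below are illustrative, not Amazon's actual markup:

```python
import re

# Hypothetical row text, shaped like what x.text might return
row_text = "Apple MacBook Air\n1,29,900\nmacOS\n8 GB"

# Match the first comma-grouped number in the row and treat it as the price
price_match = re.search(r'\b\d{1,3}(?:,\d{2,3})+\b', row_text)
price = price_match.group(0) if price_match else None
print(price)  # 1,29,900
```

Note that the plain "8" in "8 GB" is not matched, since the pattern requires at least one comma group; a real scraper would need a pattern tuned to the site's actual price format.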

After that, you will have the data you want, and it will be much easier to save it to a DB, JSON, CSV or any other kind of storage you'd like. Let's take a look at saving that data to a CSV file, for example:

import csv

def save_products_to_csv_file(product_list, file_name):
    # Python 3: open in text mode with newline='' so the csv module
    # controls line endings itself
    with open(file_name, 'w', newline='') as csvfile:
        fieldnames = ['name', 'price']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        for product in product_list:
            writer.writerow({'name': product.name, 'price': product.price})
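Calling it is straightforward; a quick round trip with csv.DictReader confirms the rows land as expected. The sample products below are made up, and the helper is repeated here so the sketch is self-contained:

```python
import csv

class Product(object):
    def __init__(self, name, price):
        self.name = name
        self.price = price

def save_products_to_csv_file(product_list, file_name):
    # text mode with newline='' so the csv module controls line endings
    with open(file_name, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['name', 'price'])
        writer.writeheader()
        for product in product_list:
            writer.writerow({'name': product.name, 'price': product.price})

products = [Product("MacBook Air", "65,990"), Product("MacBook Pro", "1,19,900")]
save_products_to_csv_file(products, 'products.csv')

# Read the file back to verify the contents
with open('products.csv', newline='') as f:
    rows = list(csv.DictReader(f))
print(rows[0]['name'])  # MacBook Air
```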

And here is another example of storing your data into a SQLite DB using SQLAlchemy:

from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy import Column, Integer, String

Base = declarative_base()


# Creating a DB model class that represents our Product object.
class Product(Base):
    __tablename__ = 'products'

    # Here we define columns for the product
    id = Column(Integer, primary_key=True)
    name = Column(String)
    price = Column(String)


engine = create_engine('sqlite:///sqlalchemy_example.db')
Base.metadata.create_all(engine)


Session = sessionmaker(bind=engine)
session = Session()

titles_element = browser.find_elements_by_xpath("//div[@class='s-item-container']")
for x in titles_element:
    try:
        name = x.find_element_by_class_name("s-access-title").text
        price = x.find_element_by_class_name("s-price").text
        new_product = Product(name=name, price=price)
        session.add(new_product)
    except WebDriverException:
        pass

session.commit()
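Once committed, the same session can read the rows back like any SQL query. A self-contained sketch using an in-memory SQLite database and made-up data (it uses the `sqlalchemy.orm` import path, which is where `declarative_base` lives in SQLAlchemy 1.4+):

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Product(Base):
    __tablename__ = 'products'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    price = Column(String)

# In-memory DB so the example leaves no file behind
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(Product(name="MacBook Air", price="65,990"))
session.commit()

# Read the stored rows back, ordered by name
for product in session.query(Product).order_by(Product.name):
    print(product.name, product.price)  # MacBook Air 65,990
```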
answered Sep 23 '22 by Moshe