I have written code for web scraping in Python. The code extracts MacBook data from Amazon using Selenium. Now I want to store these values in Excel or MySQL. Each product row has several HTML/CSS classes, plus one parent class that contains all of the product's parameters. To be precise, the code is:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import xlwt
from xlwt import Workbook

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
browser = webdriver.Chrome(executable_path='/home/mukesh/Desktop/backup/Programminghub/whatsapp_python_scripts/chromedriver_linux64/chromedriver', chrome_options=option)

# go to website of interest
browser.get("https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=macbook")

# wait up to 10 seconds for page to load
timeout = 10
try:
    WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, "//img[@class='s-access-image cfMarker']")))
except TimeoutException:
    print("Timed out waiting for page to load")
    browser.quit()

titles_element = browser.find_elements_by_xpath("//div[@class='s-item-container']")
titles = []
for x in titles_element:
    value = x.text
    value = value.encode('ascii', 'ignore')
    titles.append(value)
print(titles)
The output I get is highly unstructured and contains parameters that exist only on certain products. For instance, "Maximum Resolution" or "CPU Model Manufacturer" appear only on some laptops and not on all. I don't want such parameters. I want only these parameters, which are present on every laptop: product name (the title of the row), price, operating system, CPU model family, computer memory size and display size. I am unable to split the titles list into these sub-lists. I also tried a naive approach where I split the products by accessing the individual class of every parameter, but the values didn't match up: the price of one laptop was shown against a different laptop, and sponsored ads caused problems as well. I just want these parameters, either in my list, in Excel, or in a MySQL database: product name (title of the row), price, operating system, CPU model family, computer memory size and display size (6 columns).
Well, you have two different problems here as I see it: extracting the fields you want from each product row, and then saving them in a structured form. Let's assume all you're interested in getting about a product is its name and price (just for the sake of explanation); we'll create a simple class called Product:
class Product(object):
    def __init__(self, name, price):
        self.name = name
        self.price = price
And then, for every item we find, we will get its name and price and create an instance of Product:
from selenium.common.exceptions import WebDriverException

titles_element = browser.find_elements_by_xpath("//div[@class='s-item-container']")
products = []
for x in titles_element:
    try:
        name = x.find_element_by_class_name("s-access-title").text
        price = x.find_element_by_class_name("s-price").text
        products.append(Product(name, price))
    except WebDriverException:
        # NoSuchElementException (a subclass of WebDriverException) is raised
        # when a row is missing one of the fields, e.g. a sponsored ad block.
        pass
You can, of course, get any other data you are interested in using the right CSS/XPath selector, or even regular expressions.
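For instance, the six columns from the question could be modelled by widening the class. This is only a sketch: the field names are placeholders I chose, and the per-field extraction selectors would still need to be worked out against Amazon's current markup inside the scraping loop.

```python
# A wider Product covering the six columns asked for in the question.
# The scraping loop stays the same; it just fills in more fields.
class Product(object):
    FIELDS = ['name', 'price', 'operating_system', 'cpu_family',
              'memory_size', 'display_size']

    def __init__(self, name, price, operating_system=None,
                 cpu_family=None, memory_size=None, display_size=None):
        self.name = name
        self.price = price
        self.operating_system = operating_system
        self.cpu_family = cpu_family
        self.memory_size = memory_size
        self.display_size = display_size

    def as_dict(self):
        # Handy for csv.DictWriter or building a DataFrame later.
        return {field: getattr(self, field) for field in self.FIELDS}
```

Fields a given row doesn't expose simply stay None, which keeps every row at a fixed six columns when you export.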
After that, you will have the data you want, and it will be much easier to save it to a DB, JSON, CSV or any other kind of storage you'd like. Let's take a look at saving that data to a CSV file, for example:
import csv

def save_products_to_csv_file(product_list, file_name):
    # In Python 3, open CSV files in text mode with 'w' and newline=''
    # (the old 'wb' mode is a Python 2 idiom and breaks under Python 3).
    with open(file_name, 'w', newline='') as csvfile:
        fieldnames = ['name', 'price']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for product in product_list:
            writer.writerow({'name': product.name, 'price': product.price})
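As a quick sanity check, here is a self-contained round trip through that CSV approach, with hard-coded sample rows standing in for scraped data (the names and prices below are made up for the demo):

```python
import csv
import os
import tempfile

# Minimal stand-in for the scraped objects, just for this round-trip demo.
class Product(object):
    def __init__(self, name, price):
        self.name = name
        self.price = price

def save_products_to_csv_file(product_list, file_name):
    with open(file_name, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['name', 'price'])
        writer.writeheader()
        for product in product_list:
            writer.writerow({'name': product.name, 'price': product.price})

products = [Product("MacBook Air", "92,990"), Product("MacBook Pro", "1,29,990")]
path = os.path.join(tempfile.gettempdir(), "products_demo.csv")
save_products_to_csv_file(products, path)

# Read it back to confirm the structure survived the trip.
with open(path, newline='') as f:
    rows = list(csv.DictReader(f))
print(rows[0]['name'])  # → MacBook Air
```

The resulting CSV opens directly in Excel, which also covers the "Excel" half of the question without any extra library.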
And here is another example of storing your data into a SQLite DB using SQLAlchemy:
from sqlalchemy import create_engine
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

# Creating a DB model class that represents our Product object.
class Product(Base):
    __tablename__ = 'products'

    # Here we define columns for the product
    id = Column(Integer, primary_key=True)
    name = Column(String)
    price = Column(String)

engine = create_engine('sqlite:///sqlalchemy_example.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()
titles_element = browser.find_elements_by_xpath("//div[@class='s-item-container']")
for x in titles_element:
    try:
        name = x.find_element_by_class_name("s-access-title").text
        price = x.find_element_by_class_name("s-price").text
        new_product = Product(name=name, price=price)
        session.add(new_product)
    except WebDriverException:
        pass
session.commit()
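If you'd rather avoid the SQLAlchemy dependency, the standard-library sqlite3 module covers the same insert-and-query flow. This sketch uses hard-coded sample rows in place of the scraped values; the same SQL shape carries over to MySQL via a connector library, just with its own placeholder syntax.

```python
import sqlite3

# In-memory DB for the demo; pass a file path (or switch to a MySQL
# connector) to persist the data for real.
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        price TEXT
    )
""")

# Sample rows standing in for the scraped (name, price) pairs.
rows = [("MacBook Air", "92,990"), ("MacBook Pro", "1,29,990")]
conn.executemany("INSERT INTO products (name, price) VALUES (?, ?)", rows)
conn.commit()

for name, price in conn.execute("SELECT name, price FROM products ORDER BY id"):
    print(name, price)
```

Parameterised queries (the `?` placeholders) also keep you safe if a scraped title happens to contain quotes or other SQL-significant characters.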