Scrapy

Question

I'm trying to write a small script that extracts Steam game tags and stores them in a CSV file. My problem is that I don't know how to remove the HTML tags from my output. My code is below:

from __future__ import absolute_import
import scrapy
from Example.items import SteamItem
from scrapy.selector import HtmlXPathSelector


class SteamSpider(scrapy.Spider):
    name = 'steamspider'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['store.steampowered.com']
    start_urls = ["https://store.steampowered.com/app/578080/PLAYERUNKNOWNS_BATTLEGROUNDS/",]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        tags = hxs.xpath('//*[@id="game_highlights"]/div[1]/div/div[4]/div/div[2]')
        for sel in tags:
            item = SteamItem()
            item['gametags'] = sel.xpath('.//a/text()').extract()
            item['gametitle'] = sel.xpath('//html/body/div[1]/div[7]/div[3]/div[1]/div[2]/div[2]/div[2]/div/div[3]/text()').extract()
            # yield inside the loop so every matched node produces an item
            yield item

My Item class:

class SteamItem(scrapy.Item):
    #defining item fields
    url = scrapy.Field()
    gametitle = scrapy.Field()
    gametags = scrapy.Field()

My output then looks like this:

[u'
												Survival												',
 u'
												Shooter												',
 u'
												Multiplayer												',
 u'
												PvP												',
 u'
												Third-Person Shooter												',
 u'
												FPS												',
 u'
												Action												',
 u'
												Battle Royale												',
 u'
												Online Co-Op												',
 u'
												Tactical												',
 u'
												Co-op												',
 u'
												Early Access												',
 u'
												First-Person												',
 u'
												Violent												',
 u'
												Strategy												',
 u'
												Third Person												',
 u'
												Competitive												',
 u'
												Team-Based												',
 u'
												Difficult												',
 u'
												Simulation												'],

My objective is to remove all the tags "u' .....

Any ideas?

Thanks!

Len Lin · Accepted Answer

Since you are using the Scrapy framework, you can use w3lib, a library that is installed together with Scrapy:

import w3lib.html

html = '<a href="#">Survival</a>'  # sample input; use your own HTML string
output = w3lib.html.remove_tags(html)
print(output)

scrapy.utils.markup was deprecated in 2019, so please use w3lib instead.

You can refer to https://w3lib.readthedocs.io/en/latest/index.html for more info.

JB.py · Answer

Simply use remove_tags:

from scrapy.utils.markup import remove_tags
ToRemove = remove_tags(YourOutPut)
print(ToRemove)

This will solve your problem.
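It may also be worth noting that the list in the question comes from text() nodes, so it appears to contain no markup at all: only line breaks and tabs around each tag name, plus the u prefix that Python 2 prints for unicode strings. If that is the case, plain string methods might be enough. This is a rough sketch using only the standard library; clean_tags and the sample list are hypothetical names and data.

```python
def clean_tags(raw_tags):
    """Trim surrounding whitespace and drop entries that end up empty."""
    stripped = (tag.strip() for tag in raw_tags)
    return [tag for tag in stripped if tag]


# Hypothetical sample that imitates the padded strings shown in the question.
sample = ['\r\n\t\t\tSurvival\t\t\t', '\n\t\t\tBattle Royale\t\t\t', '\n\t']
print(clean_tags(sample))
```

In the spider from the question, this could be used as item['gametags'] = clean_tags(sel.xpath('.//a/text()').extract()).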

Scrapy - removing html tags in a list output

Tags:

python

web-scraping

r_user
