To start, I'm very new at all this, so get ready for some jacked-up code from me copying and pasting from all kinds of sources.
I'm looking to remove any HTML that Scrapy returns. Everything is storing in MySQL with no issues, but the thing I can't get to work yet is removing the '<td>' and other HTML tags. I initially just ran with /text().extract(), but every so often it would come across a cell that was formatted like the first one here:
<td> <span class="caps">TEXT</span> </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
There isn't a pattern to it that would let me just choose between using /text() or not. I'm looking for the easiest way, one that a beginner can implement, to strip all of that off.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
import html2text
from scraper.items import LivingSocialDeal

class CFBDVRB(BaseSpider):
    name = "cfbdvrb"
    allowed_domains = ["url"]
    start_urls = [
        "url",
    ]
    deals_list_xpath = '//table[@class="tbl data-table"]/tbody/tr'
    item_fields = {
        'title': './/td[1]',
        'link': './/td[2]',
        'location': './/td[3]',
        'original_price': './/td[4]',
        'price': './/td[5]',
    }

    def parse(self, response):
        selector = HtmlXPathSelector(response)
        for deal in selector.xpath(self.deals_list_xpath):
            loader = XPathItemLoader(LivingSocialDeal(), selector=deal)
            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            converter = html2text.HTML2Text()
            converter.ignore_links = True
            yield loader.load_item()
The converter = html2text lines were my last attempt at removing the tags that way. I'm not entirely sure I implemented it correctly, but it didn't work.
Thanks in advance for any help you would like to give and I also apologize if I'm missing something easy that a quick search could pull up.
The authors of Scrapy use a bunch of this functionality in their w3lib library, which is part of/included with Scrapy.
Based on your code, you're using a pretty dated version of Scrapy (pre 0.22). I'm not sure exactly what's available to you, so you may need to import from scrapy.utils.markup instead.
If you have the variable my_text that has your HTML text in it, do the following:
>>> from w3lib.html import remove_tags
>>> my_text
'<td> <span class="caps">TEXT</span> </td>\n<td> Text </td>\n<td> Text </td>\n<td> Text </td>\n<td> Text </td>'
>>> remove_tags(my_text)
u' TEXT \n Text \n Text \n Text \n Text '
There's a lot of additional functionality for fixing up/converting HTML/markup in w3lib (code available here).
As this is just a function, it will be pretty easy to incorporate into your item loader, and will be more lightweight than using BS4.
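To make the "incorporate it into your item loader" part a bit more concrete, here is a rough sketch of the cleaning step on its own, written for Python 3. I can't assume w3lib is importable wherever you try this, so strip_tags below is a hypothetical stand-in built on the standard library's html.parser, and clean_cell is likewise just an illustrative name. It may not match remove_tags in every edge case (for example, it decodes entities such as &amp;).

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Keeps only the text nodes of a fragment and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Hypothetical stand-in for w3lib.html.remove_tags (an assumption)."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return "".join(collector.parts)


def clean_cell(fragment):
    # Same order a loader input processor would use:
    # drop the tags first, then trim the leftover whitespace.
    return strip_tags(fragment).strip()


cells = [
    '<td> <span class="caps">TEXT</span> </td>',
    '<td> Text </td>',
]
cleaned = [clean_cell(cell) for cell in cells]
print(cleaned)
```

In the loader from the question, the equivalent should be to put remove_tags in front of the existing strip step, i.e. MapCompose(remove_tags, unicode.strip), since MapCompose runs its functions in order on each extracted value.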
The easiest way to do it is with BeautifulSoup. Even the Scrapy documentation covers using the two together.
Imagine you have a variable called "html_text" with this html code inside:
<td> <span class="caps">TEXT</span> </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
Then you could use this to remove all the HTML tags:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text, 'html.parser')
just_text = soup.get_text()
Then the variable "just_text" will contain just the text, one line per cell (plus whatever whitespace surrounded it inside each cell):
TEXT
Text
Text
Text
Text
I hope this solves your problem.
You can see more examples and the guide to install it (easier than Scrapy) at: BeautifulSoup
Good Luck!
EDIT:
Here is a working example with the HTML you posted:
from bs4 import BeautifulSoup
html_text = """
<td> <span class="caps">TEXT</span> </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
"""
soup = BeautifulSoup(html_text, 'html.parser')
list_of_tds = soup.find_all('td')
for td_element in list_of_tds:
    print(td_element.get_text())
Please note that you need to be using BeautifulSoup 4, which you can install by following these instructions. Once you have it, you can just copy and paste that code, see what it does to other HTML, and modify it to suit your needs.
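If installing BeautifulSoup isn't an option, the same per-cell loop can be approximated with the standard library alone. This is only a sketch under a few assumptions: it targets Python 3, CellTextParser is a made-up name, it swaps BeautifulSoup for html.parser, and it does not handle a td nested inside another td. The field names are the ones from item_fields in the question, taken in column order.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Gathers the text of each <td>, including text nested in child tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []       # finished, stripped cell texts
        self._current = None  # pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <td>.
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._current is not None:
            self.cells.append("".join(self._current).strip())
            self._current = None


html_text = """
<td> <span class="caps">TEXT</span> </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
"""

parser = CellTextParser()
parser.feed(html_text)
parser.close()

# Pair each cleaned cell with the field it would populate.
fields = ["title", "link", "location", "original_price", "price"]
row = dict(zip(fields, parser.cells))
print(row)
```

Because the text is collected per cell rather than per text node, the span-wrapped cell and the plain cells come out the same way, which is the case that /text() was tripping over.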