Removing HTML tags without /text().extract()

Tags: python, python-2.7, scrapy

To start, I'm very new at all this so get ready for some jacked up code from me copying/pasting from all kinds of sources.

I want to strip any HTML that Scrapy returns. Everything is being stored in MySQL with no issues, but I can't yet get rid of the '<td>' and other HTML tags. I initially just used /text().extract(), but every so often it would hit a cell formatted like the first one here:

<td>    <span class="caps">TEXT</span>  </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>

There isn't a pattern that would let me decide when to use /text() and when not to, so I'm looking for the easiest way a beginner can implement to strip all of that off.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
import html2text
from scraper.items import LivingSocialDeal


class CFBDVRB(BaseSpider):
    name = "cfbdvrb"
    allowed_domains = ["url"]
    start_urls = [
        "url",
    ]

    deals_list_xpath = '//table[@class="tbl data-table"]/tbody/tr'
    item_fields = {
        'title': './/td[1]',
        'link': './/td[2]',
        'location': './/td[3]',
        'original_price': './/td[4]',
        'price': './/td[5]',
    }

    def parse(self, response):
        selector = HtmlXPathSelector(response)

        for deal in selector.xpath(self.deals_list_xpath):
            loader = XPathItemLoader(LivingSocialDeal(), selector=deal)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)

            converter = html2text.HTML2Text()
            converter.ignore_links = True
            yield loader.load_item()

The converter = html2text part was my last attempt at removing the tags. I'm not entirely sure I implemented it correctly, but it didn't work.

Thanks in advance for any help you would like to give and I also apologize if I'm missing something easy that a quick search could pull up.

asked Oct 23 '15 16:10

Parab00n

2 Answers

The authors of Scrapy keep a lot of this functionality in w3lib, a separate library that is installed alongside Scrapy as a dependency.

Based on your code, you're using a pretty dated version of Scrapy (pre-0.22). I'm not sure exactly what's available to you, so you may need to import from scrapy.utils.markup instead.

If you have the variable my_text that has your HTML text in it, do the following:

>>> from w3lib.html import remove_tags
>>> my_text
'<td>    <span class="caps">TEXT</span>  </td>\n<td>    Text    </td>\n<td>    Text    </td>\n<td>    Text    </td>\n<td>    Text    </td>'
>>> remove_tags(my_text)
u'    TEXT  \n    Text    \n    Text    \n    Text    \n    Text    '

w3lib has a lot of additional functionality for fixing up and converting HTML/markup (the code is available in the w3lib repository).

As this is just a function, it will be pretty easy to incorporate into your item loader, and will be more lightweight than using BS4.
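As a rough sketch of what "incorporating it" could look like, the snippet below chains remove_tags with a whitespace cleanup in one small function. The name clean_cell and the sample cells are made up for illustration. A function like this could then be passed to MapCompose in place of unicode.strip in the question's loader, though that wiring is not shown or tested here. The regex fallback is only a crude stand-in so the sketch still runs where w3lib is not installed; it is not how w3lib works internally.

```python
import re

try:
    from w3lib.html import remove_tags
except ImportError:
    # Assumption for illustration only: a crude regex stand-in so this
    # sketch runs without w3lib. It is NOT w3lib's implementation.
    def remove_tags(text):
        return re.sub(r"<[^>]+>", "", text)


def clean_cell(html):
    """Strip the tags, then collapse whitespace runs and trim both ends."""
    text = remove_tags(html)
    return " ".join(text.split())


# Sample cells taken from the question (hypothetical variable name).
cells = [
    '<td>    <span class="caps">TEXT</span>  </td>',
    '<td>    Text    </td>',
]

cleaned = [clean_cell(cell) for cell in cells]
print(cleaned)
```

Because clean_cell takes one string and returns one string, it has the shape a per-value input processor expects, so the XPath expressions can keep selecting whole td elements without worrying about which cells contain a span.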

answered Sep 23 '22 18:09

Rejected

The easiest way to do it is with BeautifulSoup. Even the Scrapy documentation recommends it.

Imagine you have a variable called "html_text" with this html code inside:

<td>    <span class="caps">TEXT</span>  </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>

Then you could use this to remove all the HTML tags:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text, 'html.parser')
just_text = soup.get_text()

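Then the variable "just_text" will contain just the text. Note that get_text() keeps the whitespace around each value, so strip it if you need clean values. Ignoring that whitespace, the result is:

TEXT
Text
Text
Text
Text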

I hope this solves your problem.

You can see more examples, and the installation guide (it's easier to install than Scrapy), in the BeautifulSoup documentation.

Good Luck!

EDIT:

Here is a working example using the HTML you posted:

from bs4 import BeautifulSoup


html_text = """
<td>    <span class="caps">TEXT</span>  </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>
"""

soup = BeautifulSoup(html_text, 'html.parser')

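# find_all is the BeautifulSoup 4 name for the older findAll
list_of_tds = soup.find_all('td')

# print the text of each cell; print(...) works on Python 2 and 3
for td_element in list_of_tds:
    print(td_element.get_text())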

Please note that you need BeautifulSoup 4, which you can install by following its installation instructions. Once you have it, you can copy and paste that code, see what it does with other HTML, and modify it to suit your needs.
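If installing an extra package is not an option, the same per-cell extraction can be sketched with Python 3's standard html.parser module. This is a different technique from the BeautifulSoup code above, and the class name CellTextParser is made up for illustration. It assumes the td elements are not nested inside one another.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the stripped text of each <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []      # finished cell texts, in document order
        self._parts = None   # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        # Start collecting text when a cell opens; other tags are ignored.
        if tag == "td":
            self._parts = []

    def handle_data(self, data):
        # Keep text only while we are inside a cell.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        # When the cell closes, join its fragments and trim the whitespace.
        if tag == "td" and self._parts is not None:
            self.cells.append("".join(self._parts).strip())
            self._parts = None


html_text = """
<td>    <span class="caps">TEXT</span>  </td>
<td>    Text    </td>
"""

parser = CellTextParser()
parser.feed(html_text)
parser.close()
print(parser.cells)
```

The span inside the first cell needs no special handling: its tags are skipped and only its text is collected, which is the behavior the question asks for.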

answered Sep 22 '22 18:09

davidRA
