
Scrapy: Sitemap spider and gzipped files

Tags: scrapy, sitemap

I tried running the sitemap spider, but it refused to crawl gzipped sitemaps. It logged the following warning:

[scrapy] WARNING: Ignoring non-XML sitemap 

Is there a setting that needs to be enabled to allow parsing of gzipped sitemaps?

I am using Scrapy version 0.15.

Sanket Gupta asked Sep 02 '25 09:09



1 Answer

Scrapy should automatically unzip the gzipped content.

The relevant code is in contrib/spiders/sitemap.py:

        if isinstance(response, XmlResponse):
            body = response.body
        elif is_gzipped(response):
            body = gunzip(response.body)
        else:
            log.msg("Ignoring non-XML sitemap: %s" % response, log.WARNING)
            return

I think either the XML is malformed, or the file isn't served as gzip with the proper headers, so neither of the first two branches matches and the spider falls through to the warning. I suggest trying the same spider on a sitemap whose formatting you are sure of.
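To narrow down which of the two it is, you could save the sitemap to disk and inspect the raw bytes outside of Scrapy. The following is only a rough sketch using the Python standard library, not Scrapy's own check: `diagnose_sitemap` is a hypothetical helper name, and it tests for the gzip magic bytes and for well-formed XML rather than looking at HTTP headers.

```python
import gzip
import zlib
import xml.etree.ElementTree as ET

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def diagnose_sitemap(raw):
    """Describe what the raw sitemap bytes look like (hypothetical helper)."""
    if raw[:2] == GZIP_MAGIC:
        try:
            body = gzip.decompress(raw)
        except (OSError, EOFError, zlib.error) as exc:
            return "gzip header present but decompression failed: %s" % exc
        kind = "gzipped"
    else:
        body = raw
        kind = "plain"

    # Check that the (possibly decompressed) body is well-formed XML.
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        return "%s, but not well-formed XML: %s" % (kind, exc)
    return "%s XML with root element %s" % (kind, root.tag)
```

Calling it as `diagnose_sitemap(open("sitemap.xml.gz", "rb").read())` (the file name is just a placeholder) would tell you whether the file itself is fine. If it reports gzipped, well-formed XML, the problem is more likely in how the server labels the response than in the file contents.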

If you want, I can run a test of my own if you provide your current code. That would allow me to give you a better answer :-).

Sjaak Trekhaak answered Sep 13 '25 07:09
