
Scrapy parse javascript

I have some JavaScript on the page, like this:

new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",

I want to extract the id "185310341". I have been searching Google for a few hours but couldn't find anything. How can I scrape that JavaScript and get the id?

I tried this code:

id = sel.search('"id":(.*?),',text).group(1)
print id

but I got:

exceptions.AttributeError: 'Selector' object has no attribute 'search'
Muhammet Arslan asked May 14 '14 18:05





2 Answers

Scrapy selectors have built-in support for regular expressions:

sel.xpath('<xpath_to_find_the_element_text>').re(r'"id":(\d+)')

A demo of this regular expression at work:

>>> import re
>>> s = 'new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",'
>>> re.search(r'"id":(\d+)', s).group(1)
'185310341' 
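
For the original attempt, the AttributeError arises because search is a function of the re module, not a method of the selector, so the pattern has to be applied to a plain string. A minimal sketch using only the standard library, assuming the page body is already available as text (the PAGE sample and the extract_product_id helper are hypothetical names used for illustration):

```python
import re

# Hypothetical stand-in for the page body; in a spider this would be the
# text of the downloaded response.
PAGE = '''<html><body>
<script>
new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design"} });
</script>
</body></html>'''

# Anchor on "product:" so the id that is captured belongs to the product
# object rather than to some other object that also has an "id" key.
PRODUCT_ID_RE = re.compile(r'product:\s*\{\s*"id"\s*:\s*(\d+)')


def extract_product_id(text):
    """Return the product id as an int, or None when the pattern is absent."""
    match = PRODUCT_ID_RE.search(text)
    return int(match.group(1)) if match else None


print(extract_product_id(PAGE))
```

Anchoring on product: is a design choice: a bare "id":(\d+) pattern returns whichever id appears first in the text, which on a differently laid out page could be a variant id instead.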
alecxe answered Sep 22 '22 02:09



An alternative to the regex approach is to use a JavaScript parser, convert the parser's output to an XML document, and query that document with XPath.

That is what js2xml implements, using slimit and lxml (disclaimer: I wrote js2xml; warning: it is not stable).

For your case, see this sample Scrapy shell session, which uses js2xml.jsonlike.getall():

paul:~$ scrapy shell http://2loom.com/products/2loom-design-siyah-beyaz-kalpli
2014-05-19 16:12:00+0200 [scrapy] INFO: Scrapy 0.23.0 started (bot: scrapybot)
2014-05-19 16:12:00+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-05-19 16:12:00+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled item pipelines: 
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-19 16:12:00+0200 [default] INFO: Spider opened
2014-05-19 16:12:01+0200 [default] DEBUG: Crawled (200) <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f8552946610>
[s]   item       {}
[s]   request    <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s]   response   <200 http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x7f8552384b90>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
/usr/local/lib/python2.7/dist-packages/IPython/frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated. All its subpackages have been moved to the top `IPython` level.
  warn("The top-level `frontend` package has been deprecated. "

In [1]: scripts = response.selector.xpath('//script/text()').extract()

In [2]: import js2xml, js2xml.jsonlike

In [3]: js = js2xml.parse(scripts[-1])

In [4]: js2xml.jsonlike.getall(js)
Out[4]: 
[{'onVariantSelected': 'selectCallback',
  'product': {'available': True,
   'compare_at_price': None,
   'compare_at_price_max': 0,
   'compare_at_price_min': 0,
   'compare_at_price_varies': False,
   'content': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
   'created_at': '2013-11-29T13:37:11+02:00',
   'description': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
   'featured_image': '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
   'handle': '2loom-design-siyah-beyaz-kalpli',
   'id': 185310341,
   'images': ['//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
    '//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwarte_hartjes_ak_girls.jpg?v=1389259259',
    '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_boys.jpg?v=1389259264',
    '//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwartje_hartjes_ak_boys.jpg?v=1389259264'],
   'options': ['Size'],
   'price': 15900,
   'price_max': 15900,
   'price_min': 15900,
   'price_varies': False,
   'published_at': '2013-11-29T13:34:20+02:00',
   'tags': [u'2\xb7Loom',
    'Beyaz',
    'Design',
    'Ekrek',
    u'Kad\u0131n',
    'Kalpli',
    'Lacivert'],
   'title': '10. Design | Siyah & beyaz kalpli',
   'type': '2 Loom Limiteds',
   'variants': [{'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584985,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 3,
     'option1': 'XS (34-36: 1.60m-1.70m)',
     'option2': None,
     'option3': None,
     'options': ['XS (34-36: 1.60m-1.70m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-XS',
     'taxable': True,
     'title': 'XS (34-36: 1.60m-1.70m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584989,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 3,
     'option1': 'S (36-38: 1.65m-1.75m)',
     'option2': None,
     'option3': None,
     'options': ['S (36-38: 1.65m-1.75m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-S',
     'taxable': True,
     'title': 'S (36-38: 1.65m-1.75m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584997,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 7,
     'option1': 'M (38-40: 1.70m-1.80m)',
     'option2': None,
     'option3': None,
     'options': ['M (38-40: 1.70m-1.80m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-M',
     'taxable': True,
     'title': 'M (38-40: 1.70m-1.80m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424585001,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 7,
     'option1': 'L (40-42: 1.75m-1.85m)',
     'option2': None,
     'option3': None,
     'options': ['L (40-42: 1.75m-1.85m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-L',
     'taxable': True,
     'title': 'L (40-42: 1.75m-1.85m)',
     'weight': 0}],
   'vendor': u'2\xb7Loom'}}]

In [5]: 
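
When the object literal inside the script happens to be valid JSON, as the product object in the question appears to be, the standard library can decode it without a JavaScript parser. A minimal sketch under that assumption; the SCRIPT sample is a shortened, hypothetical version of the page's script, and decode_object_after is a hypothetical helper, not part of Scrapy or js2xml:

```python
import json

# Shortened sample of the inline script; ids and title are taken from the
# question, the rest of the product object is omitted.
SCRIPT = ('new Shopify.OptionSelectors("product-select", { product: '
          '{"id":185310341,"title":"10. Design | Siyah \\u0026 beyaz kalpli",'
          '"variants":[{"id":424584985},{"id":424584989}]}, '
          'onVariantSelected: selectCallback });')


def decode_object_after(script, marker="product:"):
    """Decode the JSON object that follows `marker` in a script string.

    Raises ValueError if the marker or an opening brace is missing, or if
    the text after the brace is not valid JSON.
    """
    start = script.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so move to the brace.
    start = script.index("{", start)
    # raw_decode parses one JSON value and ignores whatever follows it.
    obj, _end = json.JSONDecoder().raw_decode(script, start)
    return obj


product = decode_object_after(SCRIPT)
print(product["id"])
print([variant["id"] for variant in product["variants"]])
```

This only works while the embedded literal stays strict JSON (double-quoted keys, no trailing commas, no function values); for looser JavaScript a real parser such as the one shown above is the safer route.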
paul trmbrth answered Sep 23 '22 02:09
