Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting underlying data via RSelenium with embedded leaflet svg, and more

I would like to extract information about each ad in this link. Now, I got to the stage where I can automatically click See Ad Details, but there is much underlying data that is not straightforward to wrangle into a neat dataframe.

library(RSelenium)
rs <- rsDriver()
remote <- rs$client
remote$navigate(
  paste0(
    "https://www.facebook.com/ads/library/?", 
    "active_status=all&ad_type=political_and_issue_ads&country=US&", 
    "impression_search_field=has_impressions_lifetime&", 
    "q=actblue&view_all_page_id=38471053686"
  )
)

test <- remote$findElement(using = "xpath", "//*[@class=\"_7kfh\"]")
test$clickElement()
## Manually figured out element
test <- remote$findElement(using = "xpath", "//*[@class=\"_7lq0\"]")
test$getElementText()

The output text is messy itself but I believe with some time and effort, it can be wrangled into something useful. The problem is wrangling the underlying data in

  1. the graph, which seems to be just an image, and
  2. leaflet svg, which displays data when a cursor hovers over it.

I am at a loss to how to systematically extract this image and especially the leaflet svg. How would I take each ad and then extract the full data available in the details in this case?

like image 767
Kim Avatar asked Feb 09 '20 08:02

Kim


2 Answers

Age and gender graphic are a canva element. To get them as an images, you can take a screenshot of the element. Python example:

driver.find_element_by_tag_name('canvas').screenshot("age_and_gender.png")

Where this ad was shown is a SVG and you can save it as image in same way. Result will be not very accurate, because visible part of SVG and actual is different. But you can crop the image after. Python example:

driver.find_element_by_tag_name('svg').screenshot("where_this_ad_was_shown.png")

To extract full data from it, you cannot use Selenium. The way you can get the data is to config proxy server, catch API request, and get data that will be in JSON format. And yes it's possible.


Easy way is to use some requests to get ADs and details without Selenium. Python working example:

import json
import requests

params = (
    ('q', 'actblue'),
    ('count', '1000'), # default is 30, for 38471053686 it will return about 300 results.
    ('active_status', 'all'),
    ('ad_type', 'political_and_issue_ads'),
    ('countries/[0/]', 'US'),
    ('impression_search_field', 'has_impressions_lifetime'),
    ('view_all_page_id', '38471053686'),
)

data = {'__a': '1', }

with requests.session() as s:
    response = s.post('https://www.facebook.com/ads/library/async/search_ads/', params=params, data=data)
    ads = json.loads(response.text.replace('for (;;);', ''))['payload']['results']
    for ad in ads:
        ad_details_params = (
            ('ad_archive_id', ad[0]['adArchiveID']),
            ('country', 'US'),
        )
        response = s.post('https://www.facebook.com/ads/library/async/insights/', params=ad_details_params, data=data)
        print('parse json from response')

Not: Facebook not allows for automated data collection without written permission https://www.facebook.com/apps/site_scraping_tos_terms.php

But as we all know, Facebook does not refuse to collect our data.

Response for each AD detail will be like:

{
  "__ar": 1,
  "payload": {
    "ageGenderData": [
      {
        "age_range": "18-24",
        "female": 0.03,
        "male": 0.05,
        "unknown": 0
      },
      {
        "age_range": "25-34",
        "female": 0.12,
        "male": 0.12,
        "unknown": 0.01
      },
      {
        "age_range": "35-44",
        "female": 0.16,
        "male": 0.09,
        "unknown": 0
      },
      {
        "age_range": "45-54",
        "female": 0.11,
        "male": 0.05,
        "unknown": 0
      },
      {
        "age_range": "55-64",
        "female": 0.09,
        "male": 0.04,
        "unknown": 0
      },
      {
        "age_range": "65+",
        "female": 0.09,
        "male": 0.03,
        "unknown": 0
      }
    ],
    "currency": "USD",
    "currencyMatched": true,
    "impressions": "35\u00a0B - 40\u00a0B",
    "locationData": [
      {
        "reach": 0,
        "region": "Alabama"
      },
      {
        "reach": 0,
        "region": "Utah"
      },
      {
        "reach": 0,
        "region": "Maine"
      },
      {
        "reach": 0,
        "region": "Louisiana"
      },
      {
        "reach": 0,
        "region": "Kentucky"
      },
      {
        "reach": 0,
        "region": "Kansas"
      },
      {
        "reach": 0,
        "region": "Idaho"
      },
      {
        "reach": 0,
        "region": "Delaware"
      },
      {
        "reach": 0,
        "region": "Connecticut"
      },
      {
        "reach": 0,
        "region": "Arkansas"
      },
      {
        "reach": 0,
        "region": "Hawaii"
      },
      {
        "reach": 0,
        "region": "Alaska"
      },
      {
        "reach": 0,
        "region": "Montana"
      },
      {
        "reach": 0,
        "region": "West Virginia"
      },
      {
        "reach": 0,
        "region": "Vermont"
      },
      {
        "reach": 0,
        "region": "Mississippi"
      },
      {
        "reach": 0,
        "region": "Wyoming"
      },
      {
        "reach": 0,
        "region": "Oklahoma"
      },
      {
        "reach": 0,
        "region": "North Dakota"
      },
      {
        "reach": 0,
        "region": "New Mexico"
      },
      {
        "reach": 0,
        "region": "New Hampshire"
      },
      {
        "reach": 0,
        "region": "Nebraska"
      },
      {
        "reach": 0,
        "region": "Rhode Island"
      },
      {
        "reach": 0,
        "region": "South Dakota"
      },
      {
        "reach": 0.01,
        "region": "Wisconsin"
      },
      {
        "reach": 0.01,
        "region": "Missouri"
      },
      {
        "reach": 0.01,
        "region": "Oregon"
      },
      {
        "reach": 0.01,
        "region": "Minnesota"
      },
      {
        "reach": 0.01,
        "region": "Maryland"
      },
      {
        "reach": 0.01,
        "region": "New Jersey"
      },
      {
        "reach": 0.01,
        "region": "Tennessee"
      },
      {
        "reach": 0.01,
        "region": "Washington, District of Columbia"
      },
      {
        "reach": 0.01,
        "region": "Indiana"
      },
      {
        "reach": 0.02,
        "region": "Michigan"
      },
      {
        "reach": 0.02,
        "region": "Iowa"
      },
      {
        "reach": 0.02,
        "region": "North Carolina"
      },
      {
        "reach": 0.02,
        "region": "Georgia"
      },
      {
        "reach": 0.02,
        "region": "Colorado"
      },
      {
        "reach": 0.02,
        "region": "Ohio"
      },
      {
        "reach": 0.02,
        "region": "Arizona"
      },
      {
        "reach": 0.02,
        "region": "Pennsylvania"
      },
      {
        "reach": 0.02,
        "region": "Virginia"
      },
      {
        "reach": 0.03,
        "region": "Washington"
      },
      {
        "reach": 0.03,
        "region": "Massachusetts"
      },
      {
        "reach": 0.04,
        "region": "Illinois"
      },
      {
        "reach": 0.04,
        "region": "Florida"
      },
      {
        "reach": 0.06,
        "region": "New York"
      },
      {
        "reach": 0.13,
        "region": "California"
      },
      {
        "reach": 0.19,
        "region": "Texas"
      }
    ],
    "singleCountry": "US",
    "spend": "$500 - $599",
    "pageSpend": {
      "currentWeek": null,
      "isPoliticalPage": true,
      "weeklyByDisclaimer": {
        "WARREN FOR PRESIDENT, INC.": 270970
      },
      "lifetimeByDisclaimer": {
        "Elizabeth for MA": 781272,
        "Warren for President": 3396973,
        "": 13584,
        "WARREN FOR PRESIDENT, INC.": 4081618,
        "the Elizabeth Warren Presidential Exploratory Committee": 219471
      },
      "hasPoliticalSpendInAnyCountry": true
    },
    "pageBlurb": "United States Senator from Massachusetts, former teacher, and candidate for President of the United States. (official campaign account)"
  },
  "bootloadable": {},
  "ixData": {},
  "bxData": {},
  "gkxData": {},
  "qexData": {},
  "lid": "6796246259692811543"
}

Finally, to run this python code from R, use reticulate, and simply run the entire python script as a string - note that if the python script doesn't contain any " characters, it makes it very convenient to drop straight into R, like so

library(reticulate)
py_run_string("import json
import requests
rest of script etc 
etc 
etc")

Also that you will need to install the two python libraries the script uses. This can be done by opening terminal on mac, and typing pip install json to install the json python library, and pip install requests for the requests library)

like image 170
Sers Avatar answered Oct 05 '22 22:10

Sers


This is a not a complete answer, but hopefully it may help.

I had a go scraping/parsing, but couldn't make sense of the graph data as it seems to be located in complex locations across many files accessed through the 'network' tab in chrome dev tools (I found patches of the data, by using command+f from inside the network tab and searching for words contained in the graphs e.g. 'Women', 'Unknown' etc)

Someone who is familiar with ReactJS may have more luck!

What may work

You could try a totally different method using optical character recognition (OCR).

That is, take a screenshot (i.e. remote$screenshot()), convert from base64 to image, read it, extract the relevant area (i.e. the locations of the specific data you're after), and use methods described here to convert the areas containing the data you're after into text! (I will update if I get a chance to try it, but not looking likely, keen to hear how you go)

like image 29
stevec Avatar answered Oct 05 '22 22:10

stevec