
xpath in apps script?

I made a formula to extract some Wikipedia data in Google Sheets, and it works fine. Here is the formula:

=regexreplace(join("",flatten(IMPORTXML(D2,".//p[preceding-sibling::h2[1][contains(., 'Geography')]]"))),"\[[^\]]+\]","")&char(10)&char(10)&iferror(regexreplace(join("",flatten(IMPORTXML(D2,".//p[preceding-sibling::h2[1][contains(., 'Education')]]"))),"\[[^\]]+\]",""))

Where D2 is a URL like https://en.wikipedia.org/wiki/Abbeville,_Alabama

This extracts some Geography and Education data from the Wikipedia page. The trouble is that IMPORTXML only runs a few times before it fails because of quota limits.

So I thought it might be better to use Apps Script, where the limits on fetching and parsing are much higher. However, I could not find a good way to use XPath in Apps Script. Older posts on the web discuss a deprecated service called Xml, but it no longer seems to work. There is a service called XmlService that looks like it may do the job, but you can't just plug in an XPath expression, and getting to the result looks like a lot of work. Are there any solutions where you can just plug in XPath?

michaeldon asked May 01 '26 05:05


1 Answer

Here is an alternative approach that I actually use in cases like this.

I have used XmlService, but only for parsing the content, not for XPath. The script relies on the element tags, and it has been fairly consistent in my tests. It might need tweaks when other tags show up in the result, in which case you would add them to the exclusion condition.

I tested the code below against both links:

  • https://en.wikipedia.org/wiki/Abbeville,_Alabama#Geography
  • https://en.wikipedia.org/wiki/Montgomery,_Alabama#Education

In my test, the formula above did not return the proper output for the 2nd link, while the code does (maybe because the content was too long).

Code:

function getGeoAndEdu(path) {
  var data = UrlFetchApp.fetch(path).getContentText();
  // "." does not match line breaks, so each match is one line of the HTML
  // (at most 100000 characters); if output is cut, increase the number
  var regex = /.{1,100000}/g;
  var results = [];
  // flag to determine if matches should be added
  var foundFlag = false;
  var m;

  // stop cleanly when there are no more matches (exec returns null)
  while ((m = regex.exec(data)) !== null) {
    var line = m[0];
    if (foundFlag) {
      // if another header is found during generation of data, stop appending the matches
      if (matchTag(line, "<h2>")) {
        foundFlag = false;
      // exclude tables, sub-headers and divs containing image description
      } else if (!(matchTag(line, "<div") || matchTag(line, "<h3") ||
                   matchTag(line, "<td")  || matchTag(line, "<th"))) {
        results.push(line);
      }
    }
    // start capturing if either ID is found
    if (matchTag(line, "id=\"Geography\"") ||
        matchTag(line, "id=\"Education\"")) {
      foundFlag = true;
    }
  }

  var output = results.map(function (str) {
    // clean tags for XmlService
    str = str.replace(/<[^>]*>/g, '').trim();
    var decode = XmlService.parse('<d>' + str + '</d>');
    // convert html entity codes (e.g. &#160;) to text
    return decode.getRootElement().getText();
  })
    // filter blank results due to cleaning and empty sections
    .filter(result => result.trim().length > 1)
    // separate data and remove citations before returning output
    .join("\n").replace(/\[\d+\]/g, '');

  return output;
}

// check if the literal tag text is found in string
function matchTag(string, tag) {
  return string.indexOf(tag) !== -1;
}
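
The tweak mentioned earlier (adding tags to the exclusion condition) can be easier to manage if the tags are kept in a list. The following is only a rough sketch of that idea in plain JavaScript, not a replacement for the script above: `collectSections`, its parameters, and the sample markup are all hypothetical, and it splits on newlines, which is roughly what the regex in the script does.

```javascript
// Hypothetical helper: collects the lines that follow a heading whose id is in
// `startIds`, skips lines containing any of `excludedTags`, and stops
// collecting at the next line that contains `stopTag`.
function collectSections(html, startIds, excludedTags, stopTag) {
  var results = [];
  var capturing = false;
  html.split("\n").forEach(function (line) {
    if (capturing) {
      if (line.indexOf(stopTag) !== -1) {
        capturing = false;
      } else if (!excludedTags.some(function (tag) { return line.indexOf(tag) !== -1; })) {
        results.push(line);
      }
    }
    // start capturing when one of the section ids appears on the line
    if (startIds.some(function (id) { return line.indexOf('id="' + id + '"') !== -1; })) {
      capturing = true;
    }
  });
  return results;
}

// Made-up sample markup, for illustration only.
var sampleHtml = [
  '<h2><span id="Geography">Geography</span></h2>',
  '<p>Abbeville is in Henry County.</p>',
  '<div class="thumb">map caption</div>',
  '<table class="wikitable"><tr><td>cell</td></tr></table>',
  '<h2><span id="History">History</span></h2>',
  '<p>Not collected.</p>'
].join("\n");

// "<table" is an example of an extra tag added to the exclusion list.
var kept = collectSections(sampleHtml, ["Geography"],
                           ["<div", "<h3", "<td", "<th", "<table"], "<h2>");
```

With this shape, excluding another tag means adding one string to the array instead of extending the `else if` chain.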

Output:

[Screenshot: script output]

Difference:

  • Formula ending output: [screenshot]
  • Script ending output: [screenshot]
  • Education section ending on Wikipedia: [screenshot]
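
Separately from the endings compared above, the formula and the script also strip citation markers with different patterns: the formula uses `\[[^\]]+\]` and the script uses `\[\d+\]`. A small sketch of what that means in practice, using a made-up sample sentence (the variable names are only for illustration):

```javascript
// Made-up text as it might look after the HTML tags have been stripped.
var sample = "Abbeville is a city.[1] It has a school.[citation needed][a]";

// Pattern from the script: removes numeric markers such as [1] only.
var numericOnly = sample.replace(/\[\d+\]/g, '');

// Pattern from the sheet formula: removes anything in square brackets.
var anyBracket = sample.replace(/\[[^\]]+\]/g, '');
```

Here `numericOnly` still contains `[citation needed][a]`, while `anyBracket` does not. If the script should match the formula in this respect, swapping in the broader pattern is one option, at the cost of also removing any legitimate bracketed text.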

Note:

  • UrlFetchApp is still subject to a quota, but depending on your account type it should be more generous than IMPORTXML's limit.
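
One way to spend less of that quota is to cache results so the same URL is not fetched repeatedly. This is a minimal sketch under assumptions, not something from the answer: `cachedLookup` is a hypothetical helper, and it only assumes a cache object with `get(key)` and `put(key, value, seconds)`, which is the shape of the object returned by `CacheService.getScriptCache()` in Apps Script. An in-memory stand-in is used here so the sketch runs outside Apps Script.

```javascript
// Hypothetical helper: returns the cached value for `key` if there is one,
// otherwise calls `compute`, stores its result for `seconds`, and returns it.
function cachedLookup(cache, key, seconds, compute) {
  var hit = cache.get(key);
  if (hit !== null && hit !== undefined) {
    return hit;
  }
  var value = compute();
  cache.put(key, value, seconds);
  return value;
}

// In-memory stand-in for the Apps Script cache, for illustration only.
var memoryStore = {};
var memoryCache = {
  get: function (key) {
    return Object.prototype.hasOwnProperty.call(memoryStore, key) ? memoryStore[key] : null;
  },
  put: function (key, value, seconds) {
    memoryStore[key] = value; // expiry is ignored in this stand-in
  }
};

var fetchCount = 0;
function fakeFetch() {
  fetchCount += 1;
  return "section text";
}

var first = cachedLookup(memoryCache, "https://en.wikipedia.org/wiki/Abbeville,_Alabama", 21600, fakeFetch);
var second = cachedLookup(memoryCache, "https://en.wikipedia.org/wiki/Abbeville,_Alabama", 21600, fakeFetch);
```

In Apps Script the call might look like `cachedLookup(CacheService.getScriptCache(), url, 21600, function () { return getGeoAndEdu(url); })`. The script cache has limits on key length, value size, and expiration time, so very long URLs or very large sections may need extra handling.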

Reference:

  • Apps Script Quotas
NightEye answered May 02 '26 21:05


