How to parse an HTML string in Google Apps Script without using XmlService? [duplicate]

I want to create a scraper using Google Spreadsheets with Google Apps Script. I know it is possible and I have seen some tutorials and threads about it.

The main idea is to use:

    var html = UrlFetchApp.fetch('http://en.wikipedia.org/wiki/Document_Object_Model').getContentText();
    var doc = XmlService.parse(html);

And then get and work with the elements. However, XmlService.parse() does not work for some pages. For example, if I try:

    function test(){
      var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
      var parse = XmlService.parse(html);
    }

I get the following error:

Error on line 225: The entity name must immediately follow the '&' in the entity reference. (line 3, file "") 

I've tried to use string.replace() to eliminate the characters that are apparently causing the error, but it does not work; all sorts of other errors appear. The following code, for example:

    function test(){
      var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
      var regExp = new RegExp("&", "gi");
      html = html.replace(regExp,"");
      var parse = XmlService.parse(html);
    }

Gives me the following error:

Error on line 358: The content of elements must consist of well-formed character data or markup. (line 6, file "") 

I believe this is a problem with the XmlService.parse() method.
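Both errors above are consistent with the page simply not being well-formed XML, rather than with a bug in the parser. Deleting every `&` also destroys valid entities such as `&amp;`. As a minimal sketch, assuming the only goal is to get past the first error, one could escape just the bare ampersands instead. The helper name `escapeBareAmpersands` and the sample string are hypothetical. Named HTML entities that XML does not define (for example `&nbsp;`), unclosed tags, and inline scripts would still make `XmlService.parse()` fail.

```javascript
// Hypothetical helper (not part of Apps Script): escape only the ampersands
// that do not already start an entity such as &amp;, &#225; or &#xE1;.
function escapeBareAmpersands(html) {
  return html.replace(
    /&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)/g,
    '&amp;'
  );
}

// Made-up sample markup, for illustration only.
var sample = '<a href="/p?a=1&b=2">Caf&eacute; & ch&#225;</a>';
var escaped = escapeBareAmpersands(sample);

// In Apps Script one might then try: XmlService.parse(escaped);
// That call can still fail for the other reasons listed above.
```

Existing entities are left untouched because the negative lookahead skips any `&` that is already followed by a name or number and a semicolon.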

I've read in the threads "Google App Script parse table from messed html" and "What is the best way to parse html in google apps script" that one can use a deprecated method called Xml.parse(), which accepts a second parameter that allows parsing HTML. However, as I've mentioned, it is deprecated and I cannot find any documentation on it anywhere. Xml.parse() seems to parse the string, but I have trouble working with the elements due to the lack of documentation. It is also not a safe long-term solution, because it could be deactivated at any time.

So, how do I parse this HTML in Google Apps Script?

I also tried:

    function test(){
      var html = UrlFetchApp.fetch("https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-preto-lima-neon-c60-220v").getContentText();
      var htmlOutput = HtmlService.createHtmlOutput(html).getContent();
      var parse = XmlService.parse(htmlOutput);
    }

But it does not work, I get this error:

Malformed HTML content:

I thought about using an open-source library to parse the HTML, but I could not find any.

My ultimate goal is to get some information from a set of pages, such as the price, link, and name of each product. I've managed to do this using a series of regular expressions:

    var ss = SpreadsheetApp.getActiveSpreadsheet();
    var linksSheet = ss.getSheetByName("Links");
    var resultadosSheet = ss.getSheetByName("Resultados");

    function scrapyLoco(){
      var links = linksSheet.getRange(1, 1, linksSheet.getLastRow(), 1).getValues();
      var arrayGrandao = [];
      for (var row = 0, len = links.length; row < len; row++){
        var link = links[row];
        var arrayDeResultados = pegarAsCoisas(link[0]);
        Logger.log(arrayDeResultados);
        arrayGrandao.push(arrayDeResultados);
      }
      resultadosSheet.getRange(2, 1, arrayGrandao.length, arrayGrandao[0].length).setValues(arrayGrandao);
    }

    function pegarAsCoisas(linkDoProduto) {
      var resultadoArray = [];

      var html = UrlFetchApp.fetch(linkDoProduto).getContentText();
      var regExp = new RegExp("<h1([^]*)h1>", "gi");
      var h1Html = regExp.exec(html);
      var h1Parse = XmlService.parse(h1Html[0]);
      var h1Output = h1Parse.getRootElement().getText();
      h1Output = h1Output.replace(/(\r\n|\n|\r|(^( )*))/gm,"");

      regExp = new RegExp("Ref.: ([^(])*", "gi");
      var codeHtml = regExp.exec(html);
      var codeOutput = codeHtml[0].replace("Ref.: ","").replace(" ","");

      regExp = new RegExp("margin-top: 5px; margin-bottom: 5px; padding: 5px; background-color: #699D15; color: #fff; text-align: center;([^]*)/div>", "gi");
      var descriptionHtml = regExp.exec(html);
      var regExp = new RegExp("<p([^]*)p>", "gi");
      var descriptionHtml = regExp.exec(descriptionHtml);
      var regExp = new RegExp("^[^.]*", "gi");
      var descriptionHtml = regExp.exec(descriptionHtml);
      var descriptionOutput = descriptionHtml[0].replace("<p>","");
      descriptionOutput = descriptionOutput+".";

      regExp = new RegExp("ecom(.+?)Main.png", "gi");
      var imageHtml = regExp.exec(html);
      var comecoDaURL = "https://www.nespresso.com/";
      var imageOutput = comecoDaURL+imageHtml[0];

      var regExp = new RegExp("nes_l-float nes_big-price nes_big-price-with-out([^]*)p>", "gi");
      var precoHtml = regExp.exec(html);
      var regExp = new RegExp("[0-9]*,", "gi");
      precoHtml = regExp.exec(precoHtml);
      var precoOutput = "BRL "+precoHtml[0].replace(",","");

      resultadoArray = [codeOutput,h1Output,descriptionOutput,"Home & Garden > Kitchen & Dining > Kitchen Appliances > Coffee Makers & Espresso Machines",
                        "Máquina",linkDoProduto,imageOutput,"new","in stock",precoOutput,"","","","Nespresso",codeOutput];

      return resultadoArray;
    }

But this is very time-consuming to program, very hard to change dynamically, and not very reliable.

I need a way to parse this HTML and easily access its elements. It's actually not an add-on, just a simple Google Apps Script.
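If the regular-expression route has to stay for now, part of the maintenance cost can be reduced by keeping the patterns in one table and running them through a single helper. The following is only a sketch: `firstMatch`, `extractFields`, the `FIELD_PATTERNS` table, and the sample markup are all hypothetical and are not taken from the real page.

```javascript
// Hypothetical helper: return the first capture group of `pattern` in `text`,
// trimmed, or `fallback` when nothing matches (instead of failing on null).
function firstMatch(text, pattern, fallback) {
  var m = pattern.exec(text);
  if (m && m[1] !== undefined) {
    return m[1].trim();
  }
  return fallback === undefined ? '' : fallback;
}

// Assumed field patterns, for illustration only; real pages need their own.
// They are deliberately non-global, so exec() keeps no state between calls.
var FIELD_PATTERNS = {
  name: /<h1[^>]*>([\s\S]*?)<\/h1>/i,
  code: /Ref\.:\s*([^(<\s]+)/i,
  price: /class="price"[^>]*>\s*R\$\s*([0-9.]+),/i
};

// Apply every pattern in the table to the same HTML string.
function extractFields(html, patterns) {
  var result = {};
  Object.keys(patterns).forEach(function (key) {
    result[key] = firstMatch(html, patterns[key]);
  });
  return result;
}

// Made-up markup standing in for UrlFetchApp.fetch(url).getContentText().
var sampleHtml = '<h1 class="t">\n  Pixie Clips </h1>' +
  '<p>Ref.: C60-BR (x)</p><span class="price">R$ 899,00</span>';
var fields = extractFields(sampleHtml, FIELD_PATTERNS);
```

With this layout, changing a field means editing one entry in the table, and a pattern that stops matching yields an empty string rather than an exception on a null match.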

user3347814 asked Nov 24 '15 11:11



2 Answers

I made cheeriogs for this problem. It works on GAS like cheerio, which has a jQuery-like API. You can use it like this:

    const content = UrlFetchApp.fetch('https://example.co/').getContentText();
    const $ = Cheerio.load(content);
    Logger.log($('p .blah').first().text()); // blah blah blah ...

See also https://github.com/asciian/cheeriogs
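For the goal described in the question (name, image, and similar fields), the same jQuery-like calls could be grouped into one function. This is a rough sketch only: `readProduct` is a hypothetical name, and the selectors `h1` and `img` are placeholders that would have to be replaced with ones matching the real page. It relies only on the `first()`, `text()`, and `attr()` methods of the cheerio API.

```javascript
// `$` is whatever Cheerio.load(html) returns.
// The selectors below are placeholders, not taken from the real page.
function readProduct($) {
  return {
    name: $('h1').first().text().trim(),
    image: $('img').first().attr('src') || ''
  };
}

// In Apps Script, with the cheeriogs library added to the project:
//   var html = UrlFetchApp.fetch(url).getContentText();
//   Logger.log(readProduct(Cheerio.load(html)));
```

Passing `$` in as a parameter keeps the field logic separate from the fetch, so the selectors can be adjusted without touching the spreadsheet code.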

asciian answered Oct 05 '22 12:10


This has been discussed before - see this Q&A.

Unlike the deprecated Xml service, XmlService is not very forgiving of malformed HTML. The trick in the answer by Justin Bicknell does the job. Even though the Xml service has been deprecated, it still continues to work.

Sujay Phadke answered Oct 05 '22 12:10