Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Character encoding issue when using Google Apps Script to extract data from web page

I have written a script using Google Apps Script to extract text from a web page into Google Sheets. I only need this script to work with a specific web page, so it does not need to be versatile. The script works almost exactly as I want it to except that I have run into a character encoding problem. I am extracting both Hebrew and English text. The meta tag in the HTML has charset=Windows-1255. The English extracts perfectly, but the Hebrew displays as black diamonds containing a question mark.

I found this question that says to pass the data into a blob then use the getDataAsString method to convert to another encoding. I tried converting to different encodings and got different results. UTF-8 displays the black diamonds with question marks, UTF-16 displays Korean, ISO 8859-8 returns an error and says it's not a valid parameter, and the original Windows-1255 displays one Hebrew character but a bunch of other gibberish.

However, I am able to copy and paste the Hebrew text into Google Sheets manually and it displays correctly.

I have even tested passing Hebrew directly from Google Apps Script code like so:

function passHebrew() {
  return "וַיְדַבֵּר";
}

This displays the Hebrew text properly on Google Sheets.

Hebrew displayed as each of the encodings I mentioned

My code is as follows:

function parseText(book, chapter) {
  //var bk = book;
  //var ch = chapter;
  var bk = '04'; //hard-coded for testing purposes
  var ch = '01'; //hard-coded for testing purposes
  var url = 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';

  var xml = UrlFetchApp.fetch(url).getContentText();

  //I had to "fix" these xml errors for XmlService.parse(xml) below
  //to function.
  xml = xml.replace('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">', '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">');
  xml = xml.replace('<LINK REL="stylesheet" HREF="p.css" TYPE="text/css">', '<LINK REL="stylesheet" HREF="p.css" TYPE="text/css"></LINK>');
  xml = xml.replace('<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255">', '<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255"></meta>');
  xml = xml.replace(/ALIGN=CENTER/gi, 'ALIGN="CENTER"');
  xml = xml.replace(/<BR>/gi, '<BR></BR>');
  xml = xml.replace(/class=h/gi, 'class="h"');

  //This section is the specific route to the table in the page I want
  var document = XmlService.parse(xml);
  var body = document.getRootElement().getChildren("BODY");
  var maintable = body[0].getChildren("TABLE");
  var maintablechildren = maintable[0].getChildren();

  //This creates a two-dimensional array so that I can store the Hebrew
  //in the first column and the English in the second column
  var array = new Array(maintablechildren.length);
  for (var i = 0; i < maintablechildren.length; i++) {
    array[i] = new Array(2);
  }

  //This is where the table gets parsed into the array
  for (var i = 0; i < maintablechildren.length; i++) {
    var verse = maintablechildren[i].getChildren();

    //This is where the encoding problem occurs.
    //I originally tried verse[0].getText() but it didn't work.
    array[i][0] = Utilities.newBlob(verse[0].getText()).getDataAsString('UTF-8');
    //This array receives the English text and works fine.
    array[i][1] = verse[1].getText();
  }

  return array;
}

What am I overlooking, misunderstanding, or doing wrong? I don't have a very good understanding of how encoding works so I don't understand why converting it to UTF-8 isn't working.

like image 348
Iceape Avatar asked Nov 18 '15 23:11

Iceape


1 Answers

Your problem occurs before the lines you've commented as an encoding problem: because the default encoding for UrlFetchApp is munging the unicode text from the start.

You should use the variation of the .getContentText() method that Returns the content of an HTTP response encoded as a string of the given charset. For your case:

var xml = UrlFetchApp.fetch(url).getContentText("Windows-1255");

That should be all you need to change, although the blob() work-around is no longer needed. (It's harmless, though.) Other comments:

  • The logical OR operator (||) is very helpful for setting default values. I've tweaked the first few lines to enable testing but still let the function operate normally with arguments.

  • The way you're setting up an empty array before populating it with strings is Bad JavaScript; it's complex code that isn't needed, so toss it. Instead, we'll declare the array Array, then push() rows onto it.

  • The .replace() functions can be reduced with more clever RegExp use; I've included the URLs for demos of the really tricky ones.

  • There were \n newline characters in the text which I guessed were unnecessary for your purposes, so added a replace() for them as well.

Here's what you're left with:

function parseText(book, chapter) {
  var bk = book || '04'; //hard-coded for testing purposes
  var ch = chapter || '01'; //hard-coded for testing purposes
  var url = 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';

  var xml = UrlFetchApp.fetch(url).getContentText("Windows-1255");

  //I had to "fix" these xml errors for XmlService.parse(xml) below
  //to function.
  xml = xml.replace(/(<!DOCTYPE.*EN")>/gi, '$1 "">')
           .replace(/(<(LINK|meta).*>)/gi,'$1</$2>')        // https://regex101.com/r/nH3pU8/1
           .replace(/(<.*?=)([^"']*?)([ >])/gi,'$1"$2"$3')  // https://regex101.com/r/eP7wO7/1
           .replace(/<BR>/gi, '<BR/>')
           .replace(/\n/g, '')

  //This section is the specific route to the table in the page I want
  var document = XmlService.parse(xml);
  var body = document.getRootElement().getChildren("BODY");
  var maintable = body[0].getChildren("TABLE");
  var maintablechildren = maintable[0].getChildren();

  //This is where the table gets parsed into the array
  var array = [];
  for (var i = 0; i < maintablechildren.length; i++) {
    var verse = maintablechildren[i].getChildren();

    //I originally tried verse[0].getText() but it didn't work.** It does now!
    var hebrew = verse[0].getText();
    //This array receives the English text and works fine.
    var english = verse[1].getText();
    array.push([hebrew,english]);
  }

  return array;
}

Results

 [
  [
    "  וַיְדַבֵּר יְהוָה אֶל-מֹשֶׁה בְּמִדְבַּר סִינַי, בְּאֹהֶל מוֹעֵד:  בְּאֶחָד לַחֹדֶשׁ הַשֵּׁנִי בַּשָּׁנָה הַשֵּׁנִית, לְצֵאתָם מֵאֶרֶץ מִצְרַיִם--לֵאמֹר.",
    " And the LORD spoke unto Moses in the wilderness of Sinai, in the tent of meeting, on the first day of the second month, in the second year after they were come out of the land of Egypt, saying:"
  ],
  [
    "  שְׂאוּ, אֶת-רֹאשׁ כָּל-עֲדַת בְּנֵי-יִשְׂרָאֵל, לְמִשְׁפְּחֹתָם, לְבֵית אֲבֹתָם--בְּמִסְפַּר שֵׁמוֹת, כָּל-זָכָר לְגֻלְגְּלֹתָם.",
    " 'Take ye the sum of all the congregation of the children of Israel, by their families, by their fathers' houses, according to the number of names, every male, by their polls;"
  ],
  [
    "  מִבֶּן עֶשְׂרִים שָׁנָה וָמַעְלָה, כָּל-יֹצֵא צָבָא בְּיִשְׂרָאֵל--תִּפְקְדוּ אֹתָם לְצִבְאֹתָם, אַתָּה וְאַהֲרֹן.",
    " from twenty years old and upward, all that are able to go forth to war in Israel: ye shall number them by their hosts, even thou and Aaron."
  ],
  ...
like image 192
Mogsdad Avatar answered Sep 28 '22 16:09

Mogsdad