
Fetching non-utf8 data with XMLHttpRequest

I want to fetch a document from the web with XMLHttpRequest. However, the text in question isn't UTF-8 (in this case it's windows-1251, but in the general case I wouldn't know that for sure).

However, if I use responseType="text" it treats it as though the string is utf8, ignoring the charset in the content-type (resulting in a nasty mess).

If I used 'blob' (probably the nearest thing to what I want), could I then convert that to a DomString taking into account the encoding?

asked Oct 15 '17 08:10 by Tom Tanner




2 Answers

I actually found an API which does what I want, from here:

https://developers.google.com/web/updates/2014/08/Easier-ArrayBuffer-String-conversion-with-the-Encoding-API

Basically, use responseType="arraybuffer", pick the encoding from the returned headers, and use DataView and TextDecoder. It does exactly what is required.

const xhr = new XMLHttpRequest();
xhr.responseType = "arraybuffer";
xhr.onload = function() {
  // Read the charset parameter from the Content-Type header; fall back to UTF-8 if it is absent.
  const contentType = xhr.getResponseHeader("content-type") || "";
  const match = /charset=["']?([^;"'\s]+)/i.exec(contentType);
  const charset = match ? match[1] : "utf-8";
  const dataView = new DataView(xhr.response);
  // TextDecoder throws a RangeError if the label is not a known encoding.
  const decoder = new TextDecoder(charset);
  console.log(decoder.decode(dataView));
};
xhr.open("GET", "https://people.w3.org/mike/tests/windows-1251/test.txt");
xhr.send(null);

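The same approach with fetch():

fetch("https://people.w3.org/mike/tests/windows-1251/test.txt")
  .then(response => {
    // Read the charset parameter from the Content-Type header; fall back to UTF-8 if it is absent.
    const contentType = response.headers.get("content-type") || "";
    const match = /charset=["']?([^;"'\s]+)/i.exec(contentType);
    const charset = match ? match[1] : "utf-8";
    // Return the inner promise so that errors propagate down the chain.
    return response.arrayBuffer()
      .then(ab => {
        const dataView = new DataView(ab);
        const decoder = new TextDecoder(charset);
        console.log(decoder.decode(dataView));
      });
  })
  .catch(err => console.error(err));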
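
Since both snippets need the same two steps (find the charset in the Content-Type header, then decode the bytes), they could be factored out. The following is only a sketch: charsetFromContentType and decodeBytes are hypothetical helper names, not part of any API, and falling back to UTF-8 when the charset is missing or unrecognised is an assumption you may want to change.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type value.
// Handles a missing header, a missing charset, quoted values and trailing parameters.
function charsetFromContentType(contentType, fallback = "utf-8") {
  const match = /;\s*charset\s*=\s*("?)([^";\s]+)\1/i.exec(contentType || "");
  return match ? match[2].toLowerCase() : fallback;
}

// Hypothetical helper: decode an ArrayBuffer (or typed array) using the given label.
function decodeBytes(buffer, label, fallback = "utf-8") {
  let decoder;
  try {
    decoder = new TextDecoder(label);
  } catch (e) {
    // The TextDecoder constructor throws for labels it does not recognise,
    // so retry with the fallback encoding (an assumption of this sketch).
    decoder = new TextDecoder(fallback);
  }
  return decoder.decode(buffer);
}
```

With these, the body of either handler reduces to something like decodeBytes(xhr.response, charsetFromContentType(xhr.getResponseHeader("content-type"))).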
answered Oct 13 '22 01:10 by Tom Tanner


> If I used 'blob' (probably the nearest thing to what I want), could I then convert that to a DomString taking into account the encoding?

https://medium.com/programmers-developers/convert-blob-to-string-in-javascript-944c15ad7d52 outlines a general approach you can use. To apply that to the case of fetching a remote document:

  • Read the fetch response as a Blob, and create a FileReader for it
  • Use FileReader.readAsText(), passing the encoding name as the second argument, to get back text from that Blob in the right encoding

Like this:

const reader = new FileReader()
reader.addEventListener("loadend", function() {
  console.log(reader.result)
})
fetch("https://people.w3.org/mike/tests/windows-1251/test.txt")
  .then(response => response.blob())
  .then(blob => reader.readAsText(blob, "windows-1251"))

Or if you instead really want to use XHR:

const reader = new FileReader()
reader.addEventListener("loadend", function() {
  console.log(reader.result)
})
const xhr = new XMLHttpRequest()
xhr.responseType = "blob"
xhr.onload = function() {
  reader.readAsText(xhr.response, "windows-1251")
}
xhr.open("GET", "https://people.w3.org/mike/tests/windows-1251/test.txt", true)
xhr.send(null)
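
If you would rather chain the result than log it from an event listener, the FileReader step can be wrapped in a Promise. This is a minimal sketch for browsers only (FileReader is a browser API); readBlobAsText is a hypothetical helper name, and the encoding still has to be supplied by the caller.

```javascript
// Hypothetical helper: wrap FileReader.readAsText() in a Promise.
function readBlobAsText(blob, encoding) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result);
    reader.onerror = () => reject(reader.error);
    // The second argument names the encoding used to decode the Blob's bytes.
    reader.readAsText(blob, encoding);
  });
}

// Usage in a browser (the encoding is hard-coded, as in the examples above):
// fetch("https://people.w3.org/mike/tests/windows-1251/test.txt")
//   .then(response => response.blob())
//   .then(blob => readBlobAsText(blob, "windows-1251"))
//   .then(text => console.log(text));
```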

> However, if I use responseType="text" it treats it as though the string is utf8, ignoring the charset in the content-type

Yes. That’s what’s required by the Fetch spec (which for this is what the XHR spec relies on too):

Objects implementing the Body mixin also have an associated package data algorithm, given bytes, a type and a mimeType, switches on type, and runs the associated steps:

text
    Return the result of running UTF-8 decode on bytes.
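
To see why that step produces a mess for non-UTF-8 input, here is a small illustration. It uses TextDecoder("utf-8") as a stand-in for the spec's "UTF-8 decode" step, and the byte values are an illustrative sample (the word "Привет" in windows-1251), not data from the test URL.

```javascript
// Bytes for the Cyrillic word "Привет" encoded as windows-1251 (illustrative input).
const bytes = new Uint8Array([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);

// Decode as UTF-8 regardless of the real encoding, as the quoted "text" step does.
const asUtf8 = new TextDecoder("utf-8").decode(bytes);

// None of these bytes form a valid UTF-8 sequence, so the output contains
// U+FFFD replacement characters instead of the original letters.
const mangled = asUtf8.includes("\uFFFD");
console.log(mangled);
```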

answered Oct 13 '22 02:10 (4 revs)