 

DOMParser for large HTML

I have a large amount of HTML clipboard data from Excel, about 250MB (most of that is formatting, so the data that actually gets pasted in is much, much smaller).

Currently I am using DOMParser, which is just one line of code, with everything happening behind the scenes:

const parser = new DOMParser();
const doc3 = parser.parseFromString(htmlString, "text/html");

However, it takes ~18s to parse this, and during this time the page blocks entirely until it finishes. If offloaded to a web worker, the action gives no progress and just 'waits' for 18s until something eventually happens, which I would argue is almost the same as freezing, even though, yes, the user can technically still interact with the page.

Is there an alternative way to parse a large HTML/XML file? Perhaps something that doesn't load everything at once and so can stay responsive? What might be a good solution for this? I suppose something like the following might be in line with it, but I'm not really sure: https://github.com/isaacs/sax-js.
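
For example, something like this sketch is what I have in mind (assuming sax is loaded via a script tag or require; non-strict mode, since Excel clipboard HTML is not well-formed XML):

// Minimal sketch with sax-js (assumes a sax global from a script tag, or
// const sax = require('sax') in Node/a worker). Names here are illustrative.
const saxParser = sax.parser(false, { lowercase: true }); // non-strict: tolerates HTML
let rowCount = 0;
saxParser.onopentag = node => {
    if (node.name === 'tr') rowCount++;        // one event per opened tag
};
saxParser.onerror = err => { console.error(err); saxParser.resume(); };
saxParser.onend = () => console.log(`saw ${rowCount} rows`);

// Feed the big string in 1 MB slices; between writes a worker could
// postMessage progress back to the page.
const CHUNK = 1 << 20;
for (let i = 0; i < htmlString.length; i += CHUNK) {
    saxParser.write(htmlString.slice(i, i + CHUNK));
}
saxParser.close();

Would feeding chunks like this from a worker be a reasonable way to keep things responsive and report progress?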


Update: here is a sample Excel file: https://drive.google.com/file/d/1GIK7q_aU5tLuDNBVtlsDput8Oo1Ocz01/view?usp=sharing. You can download the file, open it in Excel, press Cmd-A (Select All) and Cmd-C (Copy), and the data will be placed on your clipboard. For me, the text/html flavor in the clipboard comes to 249MB.

Yes, it is also available in text/plain (which we use as a backup), but the point of grabbing the text/html is to capture the formatting (both data formatting, for example numberType=Percent with 3 decimals, and stylistic, for example background color=red). Please use that as a test for any sample code. Here is the actual text/html content (in ASCII) as it sits in the clipboard: https://drive.google.com/file/d/1ZUL2A4Rlk3KPqO4vSSEEGBWuGXj7j5Vh/view?usp=sharing
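
For context, this is how we grab both clipboard flavors in a paste handler (handleHtml/handlePlain are placeholders for our own code):

// Grab the formatted flavor, falling back to plain text.
document.addEventListener('paste', e => {
    const html = e.clipboardData.getData('text/html');    // keeps formatting
    const plain = e.clipboardData.getData('text/plain');  // backup, data only
    if (html) handleHtml(html);          // placeholder for our html path
    else if (plain) handlePlain(plain);  // placeholder for our backup path
});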

asked Mar 15 '21 by carl.hiass




1 Answer

The problem here is not the HTML file size but the large number of DOM nodes it contains. For 900,000 rows and 8 columns in your HTML file we have these figures:

900,000 (TR elements) × (8 TD elements + 8 text nodes) ≈ 14.4 million DOM nodes!
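
To get a feel for that cost, here is a rough micro-benchmark you can run in the console (numbers are illustrative and vary by machine); merely creating the nodes is expensive, before any layout or styling happens:

// Time the creation of one million detached elements.
console.time('create 1M elements');
const frag = document.createDocumentFragment();
for (let i = 0; i < 1000000; i++)
    frag.appendChild(document.createElement('td'));
console.timeEnd('create 1M elements');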

I didn't manage to load it with DOMParser; the browser tab crashes after a while (FF, Chrome, 16GB RAM), though it would be interesting to see how the browser behaves on a successful load. Anyway, I had a similar challenge of handling millions of records in the browser, and the solution I came up with was to build table rows for only one screen at a time.

Considering the structure of your text/html file, the approach could be as follows:

  1. use FileReader to load the HTML file as raw text
  2. grab the rows, save them as a text array, and remove them from the output
  3. parse the resulting output and insert the table and style into the DOM
  4. use a view/paging: render the current batch of rows on paging/scroll or search
  5. attach events for mouse/keyboard control

Below is a simple implementation which provides basic controls such as sizing the view, paginating/scrolling, and filtering rows with regular expressions. Note that filtering is done on the raw row HTML; for a text-only search you can uncomment the line "//text: text.match...", though in that case the file parsing time will increase a bit.

let tbody, style;
let rows = [], view = [], viewSize = 20, page = 0, time = 0;

const load = fRead => {
    console.timeEnd('FILE LOAD');
    console.time('GRAB ROWS');
    let thead, trows = '', table = fRead.result
        // Greedily grab everything from the first <tr to the last </tr>;
        // the callback stashes it in trows and returns '' to strip it out.
        .replace(/<tr[^]+<\/tr>/i, text => (trows += text) && '');
    console.timeEnd('GRAB ROWS');
    console.time('PARSE/INSERT TABLE & STYLE');
    const html = document.createElement('div');
    html.innerHTML = table;
    table = html.querySelector('table');
    if (!table || !trows) {
        setInfo('NO DATA FOUND');
        return;
    }
    if (style = html.querySelector('style'))
        document.head.appendChild(style);
    table.textContent = '';
    el('viewport').appendChild(table);
    console.timeEnd('PARSE/INSERT TABLE & STYLE');
    console.time('PREPARE ROWS ARRAY');
    // Split the grabbed block on '<tr' and rebuild each row, keeping the raw
    // text as well so the filter can run regular expressions against it.
    rows = trows.split('<tr').slice(1).map(text => ({
        html: '<tr' + text, text,
        //text: text.match(/>.*<\/td>/gi).map(s => s.slice(1, -5)).join(' '),
    }));
    console.timeEnd('PREPARE ROWS ARRAY');
    console.time('RENDER TABLE');
    table.appendChild(thead = document.createElement('thead'));
    table.appendChild(tbody = document.createElement('tbody'));
    thead.innerHTML = rows[0].html;
    view = rows = rows.slice(1);
    renew();
    console.timeEnd('RENDER TABLE');
    console.timeEnd('INIT');
};

const reset = info => {
    el('info').textContent = info ?? '';
    el('viewport').textContent = '';
    style?.remove();
    style = null;
    tbody = null;
    view = rows = [];
};

// Index of the last page (pages are zero-based).
const pages = () => Math.ceil(view.length / viewSize) - 1;

const renew = () => {
    if (!tbody)
        return;
    console.time('RENDER VIEW');
    const i = page * viewSize;
    tbody.innerHTML = view.slice(i, i + viewSize)
        .map(row => row.html).join('');
    console.timeEnd('RENDER VIEW');
    setInfo(`
        rows total: ${rows.length},
        rows match: ${view.length},
        pages: ${pages()}, page: ${page}
    `);
};

const gotoPage = num => {
    el('page').value = page = Math.max(0, Math.min(pages(), num));
    renew();
};

const fileInput = () => {
    reset('LOADING...');
    const fRead = new FileReader();
    fRead.onload = load.bind(null, fRead);
    console.time('INIT');
    console.time('FILE LOAD');
    fRead.readAsText(el('file').files[0]);
};

const fileReset = () => {
    reset();
    el('file').files = new DataTransfer().files;
};

const setInfo = text => el('info').innerHTML = text;

const setView = e => {
    let value = +e.target.value;
    // value * 0 is NaN for both NaN and ±Infinity, so this rejects either.
    value = Number.isNaN(value * 0) ? 20 : value;
    e.target.value = viewSize = Math.max(1, Math.min(value, 100));
    renew();
};

const setPage = e => {
    const page = +e.target.value;
    gotoPage(Number.isNaN(page * 0) ? 0 : page);
};

const setFilter = e => {
    const filter = e.target.value;
    let match;
    try {
        match = new RegExp(filter);
    } catch (e) {
        setInfo(e);
        return;
    }
    view = rows.filter(row => match.test(row.text));
    page = 0;
    renew();
};

const keys = {'PageUp': -1, 'PageDown': 1};

const scroll = e => {
    // PageUp/PageDown arrive via keydown; for wheel, map deltaY to a page
    // direction (wheel down advances a page, mirroring PageDown).
    const dir = e.key ? keys[e.key] ?? 0 : Math.sign(e.deltaY);
    if (!dir)
        return;
    e.preventDefault();
    gotoPage(page += dir);
};

const el = id => document.getElementById(id);

el('file').addEventListener('input', fileInput);
el('reset').addEventListener('click', fileReset);
el('view').addEventListener('input', setView);
el('page').addEventListener('input', setPage);
el('filter').addEventListener('input', setFilter);
el('viewport').addEventListener('keydown', scroll);
el('viewport').addEventListener('wheel', scroll);
div {
    display: flex;
    flex: 1;
    align-items: center;
    white-space: nowrap;
}
thead td,
tbody tr td:first-child {
    background: grey;
    color: white;
}
td { padding: 0 .5em; }
#menu > * { margin: 0 .25em; }
#file { min-width: 16em; }
#view, #page { width: 8em; }
#filter { flex: 1; }
#info { padding: .5em; color: red; }
<div id="menu">
    <span>FILE:</span>
        <input id="file" type="file" accept="text/html">
        <button id="reset">RESET</button>
    <span>VIEW:</span><input id="view" type="number" value="20">
    <span>PAGE:</span><input id="page" type="number" value="0">
    <span>FILTER:</span><input id="filter">
</div>
<div id="info"></div>
<div id="viewport" tabindex="0"></div>

As a result, for a 262 MB HTML file (900,000 table rows) we get the following timings in Chromium:

FILE LOAD: 352.57421875 ms
GRAB ROWS: 700.1943359375 ms
PARSE/INSERT TABLE & STYLE: 0.78125 ms
PREPARE ROWS ARRAY: 755.763916015625 ms
RENDER VIEW: 0.926025390625 ms
RENDER TABLE: 4.317138671875 ms
INIT: 1814.19287109375 ms
RENDER VIEW: 5.275146484375 ms
RENDER VIEW: 4.6318359375 ms

So the time until the first batch of rows renders (time to screen) is ~1.8 s, i.e. an order of magnitude lower than the ~18 s the OP reports for DOMParser, and rendering subsequent pages is almost instant at ~5 ms.

answered Oct 06 '22 by syduki