I have a large amount of HTML clipboard data from Excel, about 250 MB (it contains a lot of formatting, so the actual data pasted in is much, much smaller than that).
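For context, the data comes in through a paste event, roughly like the simplified sketch below (not our exact handler; parseAndImport stands in for whatever downstream code consumes the string):

document.addEventListener('paste', e => {
  // text/html carries the formatting we care about; text/plain is the small fallback
  const htmlString = e.clipboardData.getData('text/html');   // ~250 MB for the sheet described below
  const plainString = e.clipboardData.getData('text/plain');
  e.preventDefault();
  parseAndImport(htmlString, plainString);   // hypothetical downstream function
});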
Currently I am using the following DOMParser call, which is just one line of code and everything happens behind the scenes:
const doc3 = parser.parseFromString(htmlString, "text/html");
However, it takes ~18 s to parse this, and during that time the page blocks entirely until it finishes -- or, if the work is offloaded to a web worker, the user gets an action that shows no progress and just 'waits' for 18 s until something eventually happens -- which I would argue is almost the same as freezing, even though the user can technically still interact with the page.
Is there an alternative way to parse a large html/xml file? Perhaps something that doesn't load everything at once and so can stay responsive, or what might be a good solution for this? I suppose something like sax-js might be in line with it, but I'm not really sure: https://github.com/isaacs/sax-js.
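For illustration, this is roughly the kind of chunked, event-driven parse I have in mind, as a worker script (just a sketch: it assumes sax-js's browser build can be loaded with importScripts and will tolerate Excel's HTML in loose mode, neither of which I have verified; it also only counts rows to report progress):

// worker.js
importScripts('sax.js');                         // sax-js browser build exposes a global `sax`
const parser = sax.parser(false, { lowercase: true });  // loose mode, lowercase tag names

let inRow = false, rowCount = 0;
parser.onopentag = node => { if (node.name === 'tr') inRow = true; };
parser.onclosetag = name => {
  if (name === 'tr' && inRow) {
    inRow = false;
    rowCount++;
    if (rowCount % 10000 === 0) postMessage({ progress: rowCount });  // periodic progress
  }
};
parser.onend = () => postMessage({ done: rowCount });

onmessage = e => {
  // Feed the clipboard HTML in chunks so the worker can yield between writes.
  const html = e.data, chunk = 1 << 20;
  let i = 0;
  const step = () => {
    parser.write(html.slice(i, i += chunk));
    i < html.length ? setTimeout(step) : parser.close();
  };
  step();
};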
Update: here is a sample Excel file: https://drive.google.com/file/d/1GIK7q_aU5tLuDNBVtlsDput8Oo1Ocz01/view?usp=sharing. You can download the file, open it in Excel, press Cmd-A (select all) and Cmd-C (copy), and the data will be placed into your clipboard. For me the copy takes up 249 MB for the text/html format in the clipboard.
Yes, it is also available in text/plain (which we use as a backup), but the point of grabbing it from the text/html is to capture the formatting (both data formatting, for example numberType=Percent, 3 decimals, and stylistic, for example background color=red). Please use that as a test for any sample code. Here is the actual text/html content (in ASCII) as it appears in the clipboard: https://drive.google.com/file/d/1ZUL2A4Rlk3KPqO4vSSEEGBWuGXj7j5Vh/view?usp=sharing
The problem here is not the html file size but the large number of DOM nodes it contains. For 900000 rows and 8 columns in your html file we have these figures:
900000 TR elements * (8 TD elements + 8 text nodes) = ~14.4 million DOM nodes!
I didn't manage to load it with DOMParser: the browser tab crashes after a while (Firefox and Chrome, 16 GB RAM), though it would be interesting to see how the browser behaves on a successful load.
Anyway, I had a similar challenge of handling millions of records in the browser, and the solution I came up with was to build table rows for only one screen at a time.
Considering the structure of your text/html file, the approach could be as follows:
- use FileReader to load the html file as raw text
- grab the rows, save them as a text array, and remove them from the output
- parse the resulting output and insert the table and style into the DOM
- use a view / paging: render the current batch of rows on paging/scroll or search
- attach events for mouse/keyboard control
Below is a simple implementation which provides basic controls such as sizing the view, paginating/scrolling, and filtering rows with regular expressions. Note that filtering is done on the row html; for a text-only search you can uncomment the line "//text: text.match...", though in this case the file parsing time will increase a bit.
// Shared state: the table body, the injected <style>, all parsed rows and the filtered view.
let tbody, style;
let rows = [], view = [], viewSize = 20, page = 0;
// Runs when FileReader finishes: cut the <tr>…</tr> block out of the raw text,
// let the browser parse only the small remainder (table shell + style),
// then split the rows into an array of strings and render the first page.
const load = fRead => {
  console.timeEnd('FILE LOAD');
  console.time('GRAB ROWS');
  // Grab everything from the first <tr to the last </tr> as plain text
  // and remove it from the markup that will actually be parsed.
  let thead, trows = '', table = fRead.result
    .replace(/<tr[^]+<\/tr>/i, text => (trows += text) && '');
  console.timeEnd('GRAB ROWS');
  console.time('PARSE/INSERT TABLE & STYLE');
  const html = document.createElement('div');
  html.innerHTML = table;
  table = html.querySelector('table');
  if (!table || !trows) {
    setInfo('NO DATA FOUND');
    return;
  }
  if (style = html.querySelector('style'))
    document.head.appendChild(style);
  table.textContent = '';
  el('viewport').appendChild(table);
  console.timeEnd('PARSE/INSERT TABLE & STYLE');
  console.time('PREPARE ROWS ARRAY');
  // Keep each row as a string; nothing is turned into DOM nodes yet.
  rows = trows.split('<tr').slice(1).map(text => ({
    html: '<tr' + text, text,
    //text: text.match(/>.*<\/td>/gi).map(s => s.slice(1, -5)).join(' '),
  }));
  console.timeEnd('PREPARE ROWS ARRAY');
  console.time('RENDER TABLE');
  table.appendChild(thead = document.createElement('thead'));
  table.appendChild(tbody = document.createElement('tbody'));
  thead.innerHTML = rows[0].html;
  view = rows = rows.slice(1);
  renew();
  console.timeEnd('RENDER TABLE');
  console.timeEnd('INIT');
};
const reset = info => {
  el('info').textContent = info ?? '';
  el('viewport').textContent = '';
  style?.remove();
  style = null;
  tbody = null;
  view = rows = [];
};
const pages = () => Math.ceil(view.length / viewSize) - 1;
// Render only the current page of the filtered view into <tbody>.
const renew = () => {
  if (!tbody)
    return;
  console.time('RENDER VIEW');
  const i = page * viewSize;
  tbody.innerHTML = view.slice(i, i + viewSize)
    .map(row => row.html).join('');
  console.timeEnd('RENDER VIEW');
  setInfo(`
    rows total: ${rows.length},
    rows match: ${view.length},
    pages: ${pages()}, page: ${page}
  `);
};
const gotoPage = num => {
  el('page').value = page = Math.max(0, Math.min(pages(), num));
  renew();
};
const fileInput = () => {
  reset('LOADING...');
  const fRead = new FileReader();
  fRead.onload = load.bind(null, fRead);
  console.time('INIT');
  console.time('FILE LOAD');
  fRead.readAsText(el('file').files[0]);
};
const fileReset = () => {
  reset();
  el('file').files = new DataTransfer().files;
};
const setInfo = text => el('info').innerHTML = text;
const setView = e => {
  let value = +e.target.value;
  value = Number.isNaN(value * 0) ? 20 : value;
  e.target.value = viewSize = Math.max(1, Math.min(value, 100));
  renew();
};
const setPage = e => {
  const page = +e.target.value;
  gotoPage(Number.isNaN(page * 0) ? 0 : page);
};
const setFilter = e => {
  const filter = e.target.value;
  let match;
  try {
    match = new RegExp(filter);
  } catch (e) {
    setInfo(e);
    return;
  }
  view = rows.filter(row => match.test(row.text));
  page = 0;
  renew();
};
const keys = {'PageUp': -1, 'PageDown': 1};
const scroll = e => {
  const dir = e.key ? keys[e.key] ?? 0 : Math.sign(-e.deltaY);
  if (!dir)
    return;
  e.preventDefault();
  gotoPage(page += dir);
};
const el = id => document.getElementById(id);
el('file').addEventListener('input', fileInput);
el('reset').addEventListener('click', fileReset);
el('view').addEventListener('input', setView);
el('page').addEventListener('input', setPage);
el('filter').addEventListener('input', setFilter);
el('viewport').addEventListener('keydown', scroll);
el('viewport').addEventListener('wheel', scroll);
div {
  display: flex;
  flex: 1;
  align-items: center;
  white-space: nowrap;
}
thead td,
tbody tr td:first-child {
  background: grey;
  color: white;
}
td { padding: 0 .5em; }
#menu > * { margin: 0 .25em; }
#file { min-width: 16em; }
#view, #page { width: 8em; }
#filter { flex: 1; }
#info { padding: .5em; color: red; }
<div id="menu">
  <span>FILE:</span>
  <input id="file" type="file" accept="text/html">
  <button id="reset">RESET</button>
  <span>VIEW:</span><input id="view" type="number" value="20">
  <span>PAGE:</span><input id="page" type="number" value="0">
  <span>FILTER:</span><input id="filter">
</div>
<div id="info"></div>
<div id="viewport" tabindex="0"></div>
As a result, for a 262 MB html file (900000 table rows) we get the following timings in Chromium:
FILE LOAD: 352.57421875 ms
GRAB ROWS: 700.1943359375 ms
PARSE/INSERT TABLE & STYLE: 0.78125 ms
PREPARE ROWS ARRAY: 755.763916015625 ms
RENDER VIEW: 0.926025390625 ms
RENDER TABLE: 4.317138671875 ms
INIT: 1814.19287109375 ms
RENDER VIEW: 5.275146484375 ms
RENDER VIEW: 4.6318359375 ms
So, the time until the first batch of rows is rendered (time to screen) is ~1.8 s, i.e. an order of magnitude lower than the DOMParser time reported by the OP, and each subsequent render of rows is almost instant: ~5 ms.