Using JavaScript to truncate text to a certain size (8 KB)

I'm using the Zemanta API, which accepts up to 8 KB of text per call. I'm extracting the text to send to Zemanta from Web pages using JavaScript, so I'm looking for a function that will truncate my text at exactly 8 KB.

Zemanta should do this truncation on its own (i.e., if you send it a larger string), but I need to shuttle this text around a bit before making the API call, so I want to keep the payload as small as possible.

Is it safe to assume that 8 KB of text is 8,192 characters, and to truncate accordingly? (1 byte per character; 1,024 characters per KB; 8 KB = 8,192 bytes/characters) Or, is that inaccurate or only true given certain circumstances?

Is there a more elegant way to truncate a string based on its actual file size?

asked Oct 04 '09 08:10

Bungle

2 Answers

If you are using a single-byte encoding, yes, 8192 characters = 8192 bytes. If you are using UTF-16, each character(*) takes two bytes, so 8192 bytes = 4096 characters.

(*) Actually UTF-16 code units, which is a slightly different thing from characters in the face of surrogates, but let's not worry about that because JavaScript doesn't: its string length and indexing count code units.
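As a quick check of that arithmetic, here is a minimal sketch that measures how many bytes a string occupies in UTF-8 and in UTF-16. It assumes a runtime that provides the standard TextEncoder (current browsers and Node.js); utf8ByteLength and utf16ByteLength are hypothetical helper names, not part of the original answer.

```javascript
// Number of bytes the string occupies when encoded as UTF-8.
// TextEncoder always encodes to UTF-8.
function utf8ByteLength(text) {
    return new TextEncoder().encode(text).length;
}

// Number of bytes the string occupies as UTF-16.
// String.length counts 16-bit code units, so each one is two bytes.
function utf16ByteLength(text) {
    return text.length * 2;
}
```

For plain ASCII text the UTF-8 byte count equals the character count, so the 8,192-character assumption only holds as long as the extracted page text contains nothing outside ASCII.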

If you are using UTF-8, there's a quick trick you can use to implement a UTF-8 encoder/decoder in JS with minimal code:

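// encodeURIComponent percent-encodes the UTF-8 bytes of the string;
// unescape turns each %XX back into one character with code 0-255,
// giving a "byte string" with one character per UTF-8 byte.
function toBytesUTF8(chars) {
    return unescape(encodeURIComponent(chars));
}
// The reverse: escape percent-encodes each byte, and
// decodeURIComponent reads the result as UTF-8.
function fromBytesUTF8(bytes) {
    return decodeURIComponent(escape(bytes));
}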

Now you can truncate with:

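function truncateByBytesUTF8(chars, n) {
    // Cut the byte string at n bytes; this may split a multibyte sequence.
    var bytes = toBytesUTF8(chars).substring(0, n);
    while (true) {
        try {
            return fromBytesUTF8(bytes);
        } catch (e) {
            // Invalid UTF-8 (URIError): drop the trailing byte and retry.
            bytes = bytes.substring(0, bytes.length - 1);
        }
    }
}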

(The reason for the try-catch there is that if you truncate the bytes in the middle of a multibyte character sequence you'll get an invalid UTF-8 stream and decodeURIComponent will complain.)
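In runtimes that provide the standard TextEncoder and TextDecoder (current browsers and Node.js), the same truncation can probably be done on real bytes without the try-catch loop. The following is a sketch under that assumption, not the method from the answer above; truncateUtf8 is a hypothetical name, and the input is assumed to be a well-formed string.

```javascript
// Truncate text so that its UTF-8 encoding is at most maxBytes long,
// without cutting a multibyte character in half.
function truncateUtf8(text, maxBytes) {
    const bytes = new TextEncoder().encode(text);
    if (bytes.length <= maxBytes) {
        return text; // already fits
    }
    let end = maxBytes;
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    // If the first excluded byte is a continuation byte, the cut falls
    // inside a character, so back up to the start of that character.
    while (end > 0 && (bytes[end] & 0xC0) === 0x80) {
        end--;
    }
    return new TextDecoder().decode(bytes.subarray(0, end));
}
```

For the 8 KB limit in the question, a call would look like truncateUtf8(pageText, 8192), where pageText stands for whatever text was extracted from the page.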

If it's another multibyte encoding such as Shift-JIS or Big5, you're on your own.

answered Oct 21 '22 02:10

bobince

No, it's not safe to assume that 8 KB of text is 8,192 characters, since in some character encodings each character takes up multiple bytes.

If you're reading the data from files, can't you just grab the file size? Or read it in 8 KB chunks?

answered Oct 21 '22 04:10

Dominic Rodger

Tags: javascript, text, byte, truncate, zemanta
