
Is there a JavaScript implementation of cl100k_base tokenizer?

OpenAI's new embeddings API uses the cl100k_base tokenizer. I'm calling it from the Node.js client, but I don't see any easy way of slicing my strings so they don't exceed the OpenAI limit of 8192 tokens.

This would be trivial if I could first encode the string, slice it to the limit, then decode it and send it to the API.

Daniel Patrick asked Jan 30 '26


1 Answer

Update: David Duong created a JavaScript port of openai/tiktoken with JS/WASM bindings. The package can be installed via npm:

npm install tiktoken

Credit to Lars Grammel's answer below for the discovery/update.


Original interim solution (before the aforementioned package was available):

There is a general rule of thumb that one token corresponds to approximately four characters of common English text, which works out to roughly 3/4 of a word per token. So in your case, a limit of 8,192 tokens ~= 6,144 words. You could therefore slice your strings so they don't exceed ~6,144 words (e.g., set a 6,100-word limit; if a request still fails, reduce the limit further until it goes through).
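
A rough sketch of that interim approach (the helper name `sliceByWords` and the 6,100-word default are illustrative, not part of any API):

```javascript
// Cap the input by word count as a crude proxy for the token limit.
// 6,100 leaves a margin under the ~6,144-word equivalent of 8,192 tokens.
function sliceByWords(text, maxWords = 6100) {
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length <= maxWords) return text;
  return words.slice(0, maxWords).join(" ");
}
```

Because the 4-characters-per-token ratio is only an average (code, non-English text, and unusual punctuation tokenize less efficiently), treat this as a heuristic and keep a safety margin.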

Kyle F. Hartzenberg answered Feb 01 '26