What is a token?
In the Content Generation API, we use the word “token” for a single element in the generation sequence. Tokens do not have to align with the beginning and end of a word: a token can be a few letters, a space, a sub-word, or a whole word. How the text splits into tokens depends on the language. For English text, 1 token is approximately 4 characters or 0.75 words. For example:
- The word “paraphrase” consists of ten characters but amounts to three tokens;
- “Phrase” is six characters and only one token;
- The phrase “How are you?” is three words, twelve characters, and four tokens.
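The rule of thumb above can be turned into a quick estimate. The following is a minimal sketch, not the tokenizer the API actually uses: the function names are hypothetical, and the two ratios are simply the English-text approximations quoted above.

```python
import math

# Approximations for English text quoted above; not exact tokenizer behavior.
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75


def estimate_tokens_by_chars(text: str) -> int:
    """Estimate the token count from the character count (spaces included)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_by_words(text: str) -> int:
    """Estimate the token count from the number of whitespace-separated words."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


if __name__ == "__main__":
    for sample in ("paraphrase", "Phrase", "How are you?"):
        print(
            sample,
            estimate_tokens_by_chars(sample),
            estimate_tokens_by_words(sample),
        )
```

The two estimates can disagree with each other and with the real count. For “Phrase”, both estimates give 2, while the actual count listed above is 1. Treat the result as a rough budget, not an exact number.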
Why don’t we use “word” instead?
The reason lies in generally accepted technical terminology. “Tokens” are precisely the elements that the system produces. A token can take different forms, and in some cases it coincides with a whole word, but the two are not the same thing. We use “word” when we mean the linguistic concept: a connected part of the text that carries certain information. That is why not all API fields contain “token” in their name, even though they also deal with text generation.
For instance, take a look at some Content Generation – Generate API parameters:
- max_new_tokens: the maximum number of new elements (tokens) the model can produce;
- avoid_words: words or phrases that the model should avoid because they can negatively affect the quality of the text as a way of conveying information.
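To show how the two parameters sit side by side in a request, here is a minimal sketch that only builds a JSON body and sends nothing. Only max_new_tokens and avoid_words come from the text above. The "text" field, the list-of-tasks wrapper, the list-of-strings format for avoid_words, and the helper name build_generate_task are assumptions made for illustration, so check the API documentation for the actual request format.

```python
import json
from typing import Optional


def build_generate_task(
    text: str,
    max_new_tokens: int,
    avoid_words: Optional[list] = None,
) -> dict:
    """Build one task object for a generation request (hypothetical helper)."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be a positive integer")

    task = {
        "text": text,                      # assumed field name for the prompt
        "max_new_tokens": max_new_tokens,  # limit counted in tokens, not words
    }
    if avoid_words:
        # Linguistic units (words or phrases), assumed to be a list of strings
        task["avoid_words"] = list(avoid_words)
    return task


# Assumed wrapper: the request body is a list of task objects.
payload = [
    build_generate_task(
        text="SEO is",
        max_new_tokens=100,
        avoid_words=["however", "in conclusion"],
    )
]
body = json.dumps(payload, indent=2)

if __name__ == "__main__":
    print(body)
```

Note the difference in units: max_new_tokens limits the output in tokens, so by the English-text approximation 100 tokens is roughly 75 words, while avoid_words lists whole words or phrases.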
How do I count the tokens in my text?
DataForSEO provides a tool for this. To count the number of tokens in any piece of content, use our Tokenizer tool. You can find it in the API Dashboard, in the Content Generation API Explorer section.
Click the “Switch to Tokenizer” button, paste the text into the field, and you will get the exact number.
You won’t be charged for using this tool.