Simple Introduction To LLM Tokens With Examples

This article is an overview to help you understand tokens and how our words and sentences turn into them.

For beginners, I hope this helps you pick up some LLM terminology and see how complex tokens can be for us humans. For software engineers or professionals who already work with LLMs but haven’t had time to dig into tokens and tokenizers, I hope this gives you some insight, or serves as a reference for how your tokenizer handles different texts.

With that said, let’s start!

What’s a token?

Tokens are words, character sets, or combinations of words and punctuation that are generated by large language models (LLMs) when they decompose text. Tokenization is the first step in training. The LLM analyzes the semantic relationships between tokens, such as how commonly they’re used together or whether they’re used in similar contexts. After training, the LLM uses those patterns and relationships to generate a sequence of output tokens based on the input sequence.

Taken from Understand Tokens — Microsoft

A token is a unit of text that the model processes. Depending on the context, it can be as short as one character or as long as a whole word. For example, the word “chat” might be one token, while “unhappiness” could be split into multiple tokens like [un] [happi] [ness].

Tokens are crucial because they determine how much text an LLM can handle at one time and impact costs associated with processing text.
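If you’d like a first taste of this in code, here’s a minimal sketch of that idea using OpenAI’s open-source tiktoken library (my choice for illustration; the exact splits you see depend on which encoding you load):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by the GPT-3.5/GPT-4 family of models
enc = tiktoken.get_encoding("cl100k_base")

for word in ["chat", "unhappiness"]:
    ids = enc.encode(word)                   # text -> list of token IDs
    pieces = [enc.decode([i]) for i in ids]  # each token ID -> its text piece
    print(f"{word!r} -> {len(ids)} token(s): {pieces}")
```

A short, common word usually comes out as a single token, while a longer word gets split into pieces; where the cuts fall is decided by the encoding, not by English spelling rules.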

Why are tokens so important?

So, tokens are fundamental to understanding how LLMs work. Knowing them helps with:

  • Understanding how LLMs (Large Language Models) work: as the text quoted above says, tokenization is the first step in training!
  • Estimating cost: most LLM models use token-based pricing
  • Creating better LLM apps
  • Debugging your LLM apps

With those reasons in mind, let’s start understanding this new term.

How are we going to understand tokens?

Well, it’s simple: we can see it for ourselves by working with the code that transforms your words into tokens. That code is called a tokenizer, and it’s used in the process of tokenization. So we are going to play around with one!

So I want you to open up the most well-known tokenizer, the ChatGPT tokenizer! I really recommend trying it on your own: have it side by side with this article as you continue! But if you can’t, you can just read along and relax. So let’s start decomposing tokens!
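And if you prefer code to a browser tab, you can count tokens locally too. A minimal sketch with the same tiktoken library from earlier, which, as far as I know, uses the same encodings as the web tokenizer:

```python
import tiktoken

# Look up the encoding for a specific model rather than hard-coding one
enc = tiktoken.encoding_for_model("gpt-4")

text = "Hello, world!"
ids = enc.encode(text)
print(f"{len(ids)} tokens: {ids}")
```

The numbers it prints are the token IDs: the model never sees your letters, only these integers.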

What does a token look like?

You can imagine it like the structure of the English language, which goes: Characters — Words — Sentences. Now, a token is a combination of “character(s)” and “words”, so you can think of it like this: Characters — Tokens — Words — Sentences.

However, despite what it looks like, tokens are specific to machines! Let me show you how you can see tokens for ChatGPT! Open up ChatGPT’s tokenizer and copy the following words.

You can use your own name, of course, to see how the LLM breaks it down!

Hello, world! My name is Irfandy. I’m learning about tokens, from irfandy.dev with articles titled “LLM Tokens Simplified: Modern Samples That Made You Think in Tokens” with the hope of understanding tokens simply by reading a simple article. I’m also using ChatGPT’s tokenizer to count how many tokens I have created.

You know, at first, I thought that tokens were like syllables, but I was wrong! Each color in the tokenizer represents one token. Notice the following words:

  • [Hello] [,] [ world] [!] → that’s 4 tokens. Notice how [ world] has a space! Even the space has meaning to the LLM (the sketch after this list makes those spaces visible)!
  • [Ir] [f] [andy] → my name is 3 tokens! It doesn’t follow syllables! If it did, it would be [Ir] [fan] [dy].
  • [Simpl] [ified] → that’s 2 tokens! If you think -ified looks like a suffix, you’re right! Try out the words terrified or mortified. However, notice that the words simple and simply are 1 token each! And although they end in ple and ply, that pattern doesn’t carry over to words like multiple and multiply, proving the tokenizer doesn’t simply peel suffixes off everything.
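Here’s the sketch promised above: decoding each token ID one at a time makes those leading spaces visible (repr() is just there so the spaces show up in the output).

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for i in enc.encode("Hello, world! My name is Irfandy."):
    # repr() prints quotes around each piece, so a leading space is visible
    print(i, repr(enc.decode([i])))
```

To the tokenizer, the space belongs to the token itself; it isn’t a separator between tokens.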

The question is, what else is treated differently? Let’s give it more words!

While you are learning, I wanted to tell you some of my complaints. 😤

How many times do I have to tell you all that Skibidi Toilets is a Gen Alpha thing! I’m a Gen Z!!! Most likely Zillennial because I’m a bit in between the transition y’know. I DON’T EVEN KNOW WHAT THAT IS. STOP COMPLAINING ABOUT GEN Z FOR IT! You know, you guys could’ve searched first what that is! I’m complaining for the sake of the lesson!

Notice the following words:

  • [ �] [�] → This is the 😤 emoji! It counts as 2 tokens. The tokenizer doesn’t treat the emoji as one symbol; it splits the emoji’s raw UTF-8 bytes across tokens, and those partial bytes are why the tool renders them as �. Some emojis are worth 4 tokens though! Try 🤚🏾!
  • [Sk] [ib] [idi] [ Toile] [ts] → Not only is this a new word, it isn’t even an established brand yet. But the tokenizer can still handle brand-new words by breaking them into smaller known pieces like this. Also, I thought toilets would come out normally, like [toilets] or [toilet] [s].
  • [ complaints], [ articles] → At first I thought these would be something like [ complaint] [s], but some words are so common that the tokenizer takes them, plural s and all, as 1 token.
  • [COM] [PL] [AIN] [ING] → Surprisingly, I thought this was going to be 1 token because it’s a common word. Hmmm… could raging customers typing in all caps cost you more?
  • [ complaining] and [COM] [PL] [AIN] [ING] → are treated differently! Capitalization matters to the tokenizer. More on that a little later.
  • [y] [’] [know] and [ You] [ know] → the first is 3 tokens, the latter is 2. Using actual proper English can save you money!

There is a whole sentence in all capital letters there, and although the same words/characters look identical to us, they mean something different to the tokenizer, just as those sentences read like a rant to us! Behind the scenes, though, it’s all numbers to them, as the sketch below shows!
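Literally numbers: here’s a small sketch that prints the raw bytes behind each token ID (decode_single_token_bytes is tiktoken’s way of peeking under the hood), including how the 😤 emoji gets split into byte fragments:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for i in enc.encode("some of my complaints 😤"):
    # Each token ID maps to a run of raw UTF-8 bytes; the emoji's four
    # bytes are split across more than one token, which is why the web
    # tokenizer renders them as the � placeholder.
    print(i, enc.decode_single_token_bytes(i))
```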

Let’s continue to the next part!

I mean why toilets?

Why couldn’t it be 3000 polar bears attacking 3,000,000 kangaroos for territory or something? Maybe the polar bears decided they’d had enough of global warming and moved to a hot climate, where they start infecting the kangaroos and turning them into something like, uh… 250000 beangaroos? Imagine the bottom agility of a kangaroo and the top strength of a polar bear!

Oh, that’s something Gen Y or X animators would do.

I hope everyone understands that this text is just a demonstration. The truth is I LUV EVERYBADEY.

Notice the following words:

  • [ toilets] → Well, we’ve seen a change! The same word can tokenize differently; it depends on capitalization and what surrounds it, unlike the [ Toile] [ts] above.
  • [300] [0], [3] [,] [000] [,] [000], [250] [000] → Did you see how it reads numbers? This is how the LLM sees numbers: roughly a new token every 3 digits (see the sketch after this list). This chunking was one of the model’s limitations, although it has largely been fixed. Somewhere around 2023 there were memes about ChatGPT not being able to count; if you try the newer models, you can see that they can count now.
  • [ I] [ L] [UV] [ EVERY] [BA] [DE] [Y] [.] → Apparently reading text like this costs 8 tokens, unlike the normal version, which costs you 4 (“I LOVE EVERYBODY”). Eye dialect, it seems, is harder for them to digest.
  • [ be] [angaro] [os] → Just like Skibidi, this is a made-up word. Try asking ChatGPT about words like these and see whether it can figure out what you’re trying to say!
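Here’s the sketch mentioned in the numbers bullet above, so you can watch the digit chunking yourself (exact chunks vary by encoding, but cl100k_base groups digits in runs of up to three):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for number in ["3000", "3,000,000", "250000"]:
    ids = enc.encode(number)
    print(number, "->", [enc.decode([i]) for i in ids])
```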

Bonus: Writing in all caps can cost you more!

One more interesting thing before we get to the next point: while writing this, I found something funny! Writing in all caps does cost you more tokens! Developers building customer service chatbots, be wary!

Compare these 2 sentences; you can try it out yourself with other words.

WHY IS IT SO HARD TO COMPLAIN ABOUT YOUR HORRIBLE INTERNET CONNECTION SERVICES?

Why is it so hard to complain about your horrible internet connection services?

Did you see the difference? Try it out yourself!
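If you want to measure it rather than eyeball the colors, here’s a minimal sketch (counts depend on the encoding, but the shouting version should come out more expensive):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
shout = "WHY IS IT SO HARD TO COMPLAIN ABOUT YOUR HORRIBLE INTERNET CONNECTION SERVICES?"
calm = shout.capitalize()  # lowercases everything after the first letter

print("ALL CAPS:", len(enc.encode(shout)), "tokens")
print("Normal  :", len(enc.encode(calm)), "tokens")
```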

And that’s the end of my lesson about tokens. Let’s conclude this in a few sentences.

Conclusion: What can we learn from this?

Let’s take a look again at the quoted sentence from the top!

Tokens are words, character sets, or combinations of words and punctuation that are generated by large language models (LLMs) when they decompose text …

Now, I hope you can see this sentence differently, or in fact visualize it a bit, since you have more examples in mind! We can conclude that an LLM doesn’t use characters or words, but tokens: a language structure unique to the model, and one that can look complex to us.

And aside from that sentence, you might also consider what we’ve learned so far, like the following:

  • The tokenizer plays an important role in turning your words into tokens.
  • Numbers are read differently (this will be important stuff!).
  • Uncommon words cost more tokens, which means they will cost you a little more. You can see it with the words “Irfandy”, “Skibidi”, and “beangaroos”.
  • Writing in all caps costs you more tokens!

After learning what tokens look like and how abstract they can be for us to count, hopefully you will be able to expand your knowledge about tokens and understand them better. While there is more to it, hopefully this opens up a new window to let you see how the LLM sees your words, and I hope you can even tell your friends what tokens are and show them using the tokenizer.

But wait, there’s more!

Hi, my name is Irfandy, and I wanted to share my knowledge while learning about LLMs.

I hope I can help expand your knowledge and help you grow in this field. I bet there are even more questions in your mind now that you know how weird a token can be. But no worries! While this article ends here, it doesn’t have to be the end of your learning.

Let me get you started with more questions to understand tokens!

  • Why does knowing tokens matter?
  • How are these tokens fed to the LLM?
  • What happens when an LLM meets unknown tokens? What if it meets another language?
  • How do these tokens connect with the development of LLMs?
  • How do models handle numbers when they scan them so differently?
  • Can lowercasing or uppercasing your messages affect the LLM’s response?
  • Can lowercasing the messages you send make them more cost-effective?

If you like my writing and want to support me, send me a message on X or at hi@irfandy.dev and tell me how much you love my articles!

References:

  1. Understand Tokens — Microsoft Learn: https://learn.microsoft.com/en-us/dotnet/ai/conceptual/understanding-tokens
