Tokens

The easy summary

You've got Microsoft Copilot open. You send it the message "Where can I see the horses?"

The AI breaks the message apart and maps each piece to a number:

{
  "Where": 1,
  "can": 2,
  "I": 3,
  "see": 4,
  "the": 5,
  "horses": 6,
  "?": 7
}

Each individual piece (here, every word plus the question mark) is a token. When you send a new sentence, the AI uses this token-to-number mapping to work out what you told it.
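The mapping above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Copilot works internally: the function names (split_words, build_vocab, encode) and the choice of 0 as the number for unknown words are assumptions made for this example.

```python
import re

def split_words(text):
    # Split into runs of letters/digits, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    # Give each distinct token the next free number, starting at 1.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1
    return vocab

def encode(text, vocab, unknown_id=0):
    # Look up each token; tokens never seen before get unknown_id (an assumption).
    return [vocab.get(tok, unknown_id) for tok in split_words(text)]

message = "Where can I see the horses?"
vocab = build_vocab(split_words(message))

ids = encode(message, vocab)
new_ids = encode("can I see the cows?", vocab)

print(vocab)
print(ids)
print(new_ids)
```

A new sentence reuses the numbers already assigned to "can", "I", "see", "the" and "?", while a word that was never seen ("cows") has no number of its own and falls back to the unknown value.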

The medium summary

AI systems like Microsoft Copilot are powered by Large Language Models (LLMs). When you send a message in Microsoft Copilot, that message is broken into individual pieces before it reaches the LLM. Each piece is a token: it may be a whole word, a single character, or even a combination of words and punctuation.

Each message is broken into tokens by a tokenizer, a piece of software designed to read text and decide where to split it.

Each LLM is built around a specific tokenization method. Some common methods:

Word tokenization, where text is split into individual words based on a delimiter (spaces, punctuation, etc.)
Character tokenization, where text is split into individual characters, including the spaces and punctuation
Subword tokenization, where text is split into partial words or sets of characters (e.g., using prefixes, suffixes, and letter groupings shared between words to establish relationships between subwords)
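To make the difference between the three methods concrete, here is a rough Python sketch that applies each one to the same sentence. The SUBWORD_PIECES list is a hand-picked, hypothetical vocabulary chosen for this example; real subword tokenizers learn their pieces from large amounts of text, and the greedy longest-match rule used here is just one simple way to apply such a list.

```python
import re

def word_tokenize(text):
    # Word tokenization: split on spaces, keep punctuation as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Character tokenization: every character, including spaces, is a token.
    return list(text)

# Hypothetical piece list for illustration only.
SUBWORD_PIECES = ["Where", "can", "I", "see", "the", "horse", "s", "?"]

def subword_tokenize(text, pieces=SUBWORD_PIECES):
    # Subword tokenization: within each word, repeatedly take the longest
    # known piece; fall back to a single character if nothing matches.
    by_length = sorted(pieces, key=len, reverse=True)
    tokens = []
    for word in word_tokenize(text):
        i = 0
        while i < len(word):
            match = next((p for p in by_length if word.startswith(p, i)), word[i])
            tokens.append(match)
            i += len(match)
    return tokens

text = "Where can I see the horses?"
print(word_tokenize(text))
print(char_tokenize(text))
print(subword_tokenize(text))
```

In this sketch "horses" stays whole under word tokenization, becomes six separate letters under character tokenization, and becomes "horse" plus "s" under subword tokenization, which lets the plural share a piece with the singular.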

Microsoft's page on tokens includes a table that gives a brief overview of the pros and cons of each tokenizer method.

Once a sentence is tokenized, the LLM references its internal store of mappings to figure out what you've sent it, how it should respond, and then how to build the response, token by token.
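The "token by token" part can be pictured as a loop. The Python sketch below is a toy under heavy assumptions: VOCAB, the "<end>" marker, and toy_next_token are all made up for this example, and toy_next_token simply replays a hard-coded answer where a real LLM would choose each next token based on everything it has seen so far.

```python
# Hypothetical vocabulary: token text mapped to a number.
VOCAB = {"<end>": 0, "Where": 1, "can": 2, "I": 3, "see": 4, "the": 5,
         "horses": 6, "?": 7, "At": 8, "stable": 9, ".": 10}
ID_TO_TOKEN = {number: token for token, number in VOCAB.items()}

# Stand-in for the model's answer: "At the stable ." followed by the end marker.
CANNED_REPLY = [8, 5, 9, 10, 0]

def toy_next_token(prompt_ids, response_ids):
    # A real model would use prompt_ids and response_ids to pick the next token.
    # This toy ignores them and returns the next canned number.
    return CANNED_REPLY[len(response_ids)]

def respond(prompt_ids, max_tokens=10):
    # Build the response one token at a time until the end marker appears.
    response_ids = []
    while len(response_ids) < max_tokens:
        next_id = toy_next_token(prompt_ids, response_ids)
        if next_id == VOCAB["<end>"]:
            break
        response_ids.append(next_id)
    return response_ids

prompt_ids = [1, 2, 3, 4, 5, 6, 7]  # "Where can I see the horses?"
reply_ids = respond(prompt_ids)
reply = " ".join(ID_TO_TOKEN[i] for i in reply_ids)
print(reply_ids)
print(reply)
```

The point of the loop is the shape of the process: numbers go in, one new number comes out per step, and only at the end are the numbers turned back into text.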