
LLM Practitioner's Guide: How Multilingual Are Falcon, Mistral, Smaug, and Other LLMs?

So, you want to fine-tune a large language model for your own language. What model should you use?

The world of open-access LLMs is expanding rapidly, with hardly a week passing without the announcement of a new benchmark leader. At the time of writing, the Hugging Face Open LLM Leaderboard has crowned a new champion: Smaug. The field is incredibly dynamic, reminiscent of the bustling IBM PC clone market of the late 1980s and early 1990s.

Despite the variety, many of these models share common origins. Smaug, for example, is based on Qwen-72B. While fine-tuning allows for significant adjustments to a model's behavior, one aspect you can't alter is its tokenizer.


We've been deeply involved with customizing, fine-tuning, and deploying various open-access LLMs. Through this series of posts, we aim to share our insights on what they can and can't do, along with detailed instructions and best practices for fine-tuning.


In a previous post, we explored the Llama-2 tokenizer and discussed why it struggles with certain languages. Today, we're conducting comprehensive tokenizer tests on six of the most popular LLMs across 12 languages. Our goal is to determine which models are the most adaptable for each language. We'll also share the source code for our experiments, enabling you to include any language you wish in this benchmark.

All large language models (LLMs), whether proprietary like ChatGPT or open-source, process information using numbers. To interpret text, these models initially convert it into numerical form. The main challenge in natural language processing (NLP) lies in finding the most efficient method to represent languages as numbers.

A simple approach assigns each letter or symbol a unique number. For example, "a" might be 1, "b" might be 2, and so on. The model then attempts to predict the next number in a sequence. However, this approach can turn a brief sentence like "Predicting the next word with a large language model is easy." into a lengthy sequence of 61 numbers. Any error in predicting a single letter can affect all subsequent predictions, making this method both complex and prone to mistakes.

Another strategy assigns each word a unique number. This way, the sentence is reduced to a sequence of just 11 numbers, much shorter than encoding each letter individually. While simpler for the model, this method faces its own challenge: the English vocabulary alone contains over 200,000 words. For models that support multiple languages, this could mean navigating through millions of numbers, which increases the chance of error.

To address these challenges, researchers have developed a method known as tokenization. Tokenization breaks down words into smaller units, or tokens, that are common across various words. For example, "predicting" might be divided into "pred," "ict," and "ing." This approach means the model doesn't have to predict each letter or entire word but rather smaller, more manageable sequences of tokens. By significantly reducing the number of options the model needs to consider—from an entire dictionary to roughly 2,000 tokens for English—it makes the process both more efficient and accurate.
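
To make the difference concrete, here is a small sketch comparing the three granularities on the example sentence above. The character and word counts are exact; the subword count depends on which tokenizer you load (we use the Mistral tokenizer that appears later in this post purely as an illustration).

from transformers import AutoTokenizer

sentence = "Predicting the next word with a large language model is easy."

# Character-level encoding: one number per character
print(len(sentence))          # 61

# Word-level encoding: one number per word
print(len(sentence.split()))  # 11

# Subword tokenization: somewhere in between, without needing a dictionary of whole words
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
print(tokenizer.tokenize(sentence))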

Let's examine how today's most popular LLM tokenizers handle various languages. We will be testing the following models:

  • Mixtral-8x7B-Instruct-v0.1

  • Mistral-7B-Instruct-v0.2

  • falcon-7b-instruct

  • Llama-2-7b-chat-hf

  • Smaug-72B-v0.1

  • mt5

We included the mt5 LLM in our lineup because it boasts one of the best multilingual tokenizers, making it an excellent benchmark for comparing the other models on our list.

First, we need to import the tokenizers:

from transformers import AutoTokenizer

tokenizers = {
    'Mixtral-8x7B-Instruct-v0.1': AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1"),
    'Mistral-7B-Instruct-v0.2': AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2"),
    'falcon-7b-instruct': AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct"),
    'Llama-2-7b-chat-hf': AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf"),
    'Smaug-72B-v0.1': AutoTokenizer.from_pretrained("abacusai/Smaug-72B-v0.1"),
    'mt5': AutoTokenizer.from_pretrained("google/mt5-small"),
}
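
Note that meta-llama/Llama-2-7b-chat-hf is a gated repository on the Hugging Face Hub. If the call above raises an authorization error, accept the license on the model page and authenticate first; the snippet below is a generic hint about your environment, not part of the original setup.

from huggingface_hub import login

# Paste a Hugging Face access token that has been granted access to the gated repository
login()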

Next, we'll select languages and their well-known sentences for testing. If you're interested in adding another language to the test, simply include it as a new entry in the dictionary below.

sentences = {
    'English': 'Call me Ishmael. Some years ago — never mind how long precisely — having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.',
    'German': 'Da steh ich nun, ich armer Tor! Und bin so klug als wie zuvor.',
    'French': "On ne voit bien qu'avec le cœur. L'essentiel est invisible pour les yeux.",
    'Spanish': "En un lugar de la Mancha, de cuyo nombre no quiero acordarme...",
    'Polish': 'Litwo, ojczyzno moja! Ty jesteś jak zdrowie; Ile cię trzeba cenić, ten tylko się dowie, kto cię stracił.',
    'Ukrainian': 'Як умру, то поховайте мене на могилі серед степу широкого на Вкраїні милій.',
    'Greek': 'ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ',
    'Hebrew': 'בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ',
    'Arabic': "حدثني أبي، عن جدي، عن أجداده",
    'Hindi': "कर्मण्येवाधिकारस्ते मा फलेषु कदाचन। मा कर्मफलहेतुर्भूर्मा ते संगोऽस्त्वकर्मणि।",
    'Chinese': "道可道,非常道。名可名,非常名。",
    'Japanese': "私は、その男の写真を三葉、見たことがある。"
}

Next, let's define the testing process:

def tokenizer_test(sentences):
    # For every model, tokenize each sentence and record the token count and the tokens themselves.
    results = {}
    for model_name, tokenizer in tokenizers.items():
        results[model_name] = {}
        for language, sentence in sentences.items():
            token_list = tokenizer.tokenize(sentence)
            results[model_name][language] = {'length': len(token_list), 'tokens': token_list}
    return results

With this function in place, we're all set to conduct our test.

results = tokenizer_test(sentences)
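
Before formatting the full report, it can help to peek at a single record and confirm the structure looks as expected (an optional sanity check, not part of the original write-up):

print(results['mt5']['English']['length'])       # token count for one model/language pair
print(results['mt5']['English']['tokens'][:10])  # first few tokens of that sentence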

To display these results in a format that's easy to digest for our blog readers, we'll perform some additional transformations:

from collections import defaultdict

# Regroup the results by language so each language lists every model's output side by side.
language_map = defaultdict(dict)
for model, result in results.items():
    for language, record in result.items():
        language_map[language][model] = record

full_records = []
for language, record in language_map.items():
    language_records = []
    for model, result in record.items():
        language_records.append(f'Model: {model}. Length: {result["length"]}\n{result["tokens"]}')
    language_records_str = '\n\n'.join(language_records)
    full_records.append(f'LANGUAGE: {language}. TEXT: {sentences[language]}\n\n{language_records_str}')

print('\n\n'.join(full_records))

Here are the results. It's important to note that a smaller number of tokens generally indicates a more efficient tokenizer: for the same input, fewer tokens mean each token carries more of the text, the sequence takes up less of the model's context window, and there are fewer steps at which a prediction can go wrong.

LANGUAGE: English. TEXT: Call me Ishmael. Some years ago — never mind how long precisely — having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.

Model: Mixtral-8x7B-Instruct-v0.1. Length: 54
['▁Call', '▁me', '▁I', 'sh', 'ma', 'el', '.', '▁Some', '▁years', '▁ago', '▁—', '▁never', '▁mind', '▁how', '▁long', '▁precisely', '▁—', '▁having', '▁little', '▁or', '▁no', '▁money', '▁in', '▁my', '▁pur', 'se', ',', '▁and', '▁nothing', '▁particular', '▁to', '▁interest', '▁me', '▁on', '▁shore', ',', '▁I', '▁thought', '▁I', '▁would', '▁sail', '▁about', '▁a', '▁little', '▁and', '▁see', '▁the', '▁water', 'y', '▁part', '▁of', '▁the', '▁world', '.']

Model: Mistral-7B-Instruct-v0.2. Length: 54
['▁Call', '▁me', '▁I', 'sh', 'ma', 'el', '.', '▁Some', '▁years', '▁ago', '▁—', '▁never', '▁mind', '▁how', '▁long', '▁precisely', '▁—', '▁having', '▁little', '▁or', '▁no', '▁money', '▁in', '▁my', '▁pur', 'se', ',', '▁and', '▁nothing', '▁particular', '▁to', '▁interest', '▁me', '▁on', '▁shore', ',', '▁I', '▁thought', '▁I', '▁would', '▁sail', '▁about', '▁a', '▁little', '▁and', '▁see', '▁the', '▁water', 'y', '▁part', '▁of', '▁the', '▁world', '.']

Model: falcon-7b-instruct. Length: 53
['Call', 'Ġme', 'ĠIs', 'hma', 'el', '.', 'ĠSome', 'Ġyears', 'Ġago', 'Ġ', 'âĢĶ', 'Ġnever', 'Ġmind', 'Ġhow', 'Ġlong', 'Ġprecisely', 'Ġ', 'âĢĶ', 'Ġhaving', 'Ġlittle', 'Ġor', 'Ġno', 'Ġmoney', 'Ġin', 'Ġmy', 'Ġpurse', ',', 'Ġand', 'Ġnothing', 'Ġparticular', 'Ġto', 'Ġinterest', 'Ġme', 'Ġon', 'Ġshore', ',', 'ĠI', 'Ġthought', 'ĠI', 'Ġwould', 'Ġsail', 'Ġabout', 'Ġa', 'Ġlittle', 'Ġand', 'Ġsee', 'Ġthe', 'Ġwatery', 'Ġpart', 'Ġof', 'Ġthe', 'Ġworld', '.']

Model: Llama-2-7b-chat-hf. Length: 54
['▁Call', '▁me', '▁I', 'sh', 'ma', 'el', '.', '▁Some', '▁years', '▁ago', '▁—', '▁never', '▁mind', '▁how', '▁long', '▁precisely', '▁—', '▁having', '▁little', '▁or', '▁no', '▁money', '▁in', '▁my', '▁pur', 'se', ',', '▁and', '▁nothing', '▁particular', '▁to', '▁interest', '▁me', '▁on', '▁shore', ',', '▁I', '▁thought', '▁I', '▁would', '▁sail', '▁about', '▁a', '▁little', '▁and', '▁see', '▁the', '▁wat', 'ery', '▁part', '▁of', '▁the', '▁world', '.']

Model: Smaug-72B-v0.1. Length: 52
['Call', 'Ġme', 'ĠIsh', 'ma', 'el', '.', 'ĠSome', 'Ġyears', 'Ġago', 'ĠâĢĶ', 'Ġnever', 'Ġmind', 'Ġhow', 'Ġlong', 'Ġprecisely', 'ĠâĢĶ', 'Ġhaving', 'Ġlittle', 'Ġor', 'Ġno', 'Ġmoney', 'Ġin', 'Ġmy', 'Ġpurse', ',', 'Ġand', 'Ġnothing', 'Ġparticular', 'Ġto', 'Ġinterest', 'Ġme', 'Ġon', 'Ġshore', ',', 'ĠI', 'Ġthought', 'ĠI', 'Ġwould', 'Ġsail', 'Ġabout', 'Ġa', 'Ġlittle', 'Ġand', 'Ġsee', 'Ġthe', 'Ġwat', 'ery', 'Ġpart', 'Ġof', 'Ġthe', 'Ġworld', '.']

Model: mt5. Length: 59
['▁Call', '▁me', '▁I', 'shma', 'el', '.', '▁Some', '▁years', '▁ago', '▁—', '▁never', '▁mind', '▁how', '▁long', '▁precis', 'ely', '▁—', '▁', 'having', '▁little', '▁or', '▁no', '▁money', '▁in', '▁my', '▁', 'purse', ',', '▁and', '▁', 'nothing', '▁particular', '▁to', '▁interest', '▁me', '▁on', '▁', 'shore', ',', '▁I', '▁thought', '▁I', '▁', 'would', '▁sail', '▁about', '▁', 'a', '▁little', '▁and', '▁see', '▁the', '▁water', 'y', '▁part', '▁of', '▁the', '▁world', '.']

LANGUAGE: German. TEXT: Da steh ich nun, ich armer Tor! Und bin so klug als wie zuvor.

Model: Mixtral-8x7B-Instruct-v0.1. Length: 21
['▁Da', '▁ste', 'h', '▁ich', '▁nun', ',', '▁ich', '▁ar', 'mer', '▁Tor', '!', '▁Und', '▁bin', '▁so', '▁kl', 'ug', '▁als', '▁wie', '▁zu', 'vor', '.']

Model: Mistral-7B-Instruct-v0.2. Length: 21
['▁Da', '▁ste', 'h', '▁ich', '▁nun', ',', '▁ich', '▁ar', 'mer', '▁Tor', '!', '▁Und', '▁bin', '▁so', '▁kl', 'ug', '▁als', '▁wie', '▁zu', 'vor', '.']

Model: falcon-7b-instruct. Length: 21
['Da', 'Ġste', 'h', 'Ġich', 'Ġnun', ',', 'Ġich', 'Ġar', 'mer', 'ĠTor', '!', 'ĠUnd', 'Ġbin', 'Ġso', 'Ġkl', 'ug', 'Ġals', 'Ġwie', 'Ġzu', 'vor', '.']

Model: Llama-2-7b-chat-hf. Length: 21
['▁Da', '▁ste', 'h', '▁ich', '▁nun', ',', '▁ich', '▁ar', 'mer', '▁Tor', '!', '▁Und', '▁bin', '▁so', '▁kl', 'ug', '▁als', '▁wie', '▁zu', 'vor', '.']

Model: Smaug-72B-v0.1. Length: 21
['Da', 'Ġst', 'eh', 'Ġich', 'Ġnun', ',', 'Ġich', 'Ġar', 'mer', 'ĠTor', '!', 'ĠUnd', 'Ġbin', 'Ġso', 'Ġkl', 'ug', 'Ġals', 'Ġwie', 'Ġzu', 'vor', '.']

Model: mt5. Length: 20
['▁Da', '▁steh', '▁ich', '▁nun', ',', '▁ich', '▁arm', 'er', '▁Tor', '!', '▁Und', '▁bin', '▁so', '▁k', 'lug', '▁als', '▁wie', '▁zu', 'vor', '.']

LANGUAGE: French. TEXT: On ne voit bien qu'avec le cœur. L'essentiel est invisible pour les yeux.

Model: Mixtral-8x7B-Instruct-v0.1. Length: 25
['▁On', '▁ne', '▁vo', 'it', '▁bien', '▁qu', "'", 'ave', 'c', '▁le', '▁c', 'œur', '.', '▁L', "'", 'ess', 'ent', 'iel', '▁est', '▁invisible', '▁pour', '▁les', '▁ye', 'ux', '.']

Model: Mistral-7B-Instruct-v0.2. Length: 25
['▁On', '▁ne', '▁vo', 'it', '▁bien', '▁qu', "'", 'ave', 'c', '▁le', '▁c', 'œur', '.', '▁L', "'", 'ess', 'ent', 'iel', '▁est', '▁invisible', '▁pour', '▁les', '▁ye', 'ux', '.']

Model: falcon-7b-instruct. Length: 21
['On', 'Ġne', 'Ġvoit', 'Ġbien', 'Ġqu', "'", 'avec', 'Ġle', 'ĠcÅĵur', '.', 'ĠL', "'", 'ess', 'ent', 'iel', 'Ġest', 'Ġinvisible', 'Ġpour', 'Ġles', 'Ġyeux', '.']

Model: Llama-2-7b-chat-hf. Length: 23
['▁On', '▁ne', '▁voit', '▁bien', '▁qu', "'", 'ave', 'c', '▁le', '▁c', 'œur', '.', '▁L', "'", 'ess', 'ent', 'iel', '▁est', '▁invisible', '▁pour', '▁les', '▁yeux', '.']

Model: Smaug-72B-v0.1. Length: 23
['On', 'Ġne', 'Ġvo', 'it', 'Ġbien', 'Ġqu', "'", 'avec', 'Ġle', 'ĠcÅĵur', '.', 'ĠL', "'", 'ess', 'ent', 'iel', 'Ġest', 'Ġinvisible', 'Ġpour', 'Ġles', 'Ġye', 'ux', '.']

Model: mt5. Length: 24
['▁On', '▁ne', '▁voit', '▁bien', '▁qu', "'", 'avec', '▁le', '▁c', 'œ', 'ur', '.', '▁L', "'", 'essentiel', '▁est', '▁', 'invisible', '▁pour', '▁les', '▁', 'y', 'eux', '.']

LANGUAGE: Spanish. TEXT: En un lugar de la Mancha, de cuyo nombre no quiero acordarme...

Model: Mixtral-8x7B-Instruct-v0.1. Length: 20
['▁En', '▁un', '▁lugar', '▁de', '▁la', '▁Man', 'cha', ',', '▁de', '▁cu', 'yo', '▁nombre', '▁no', '▁qu', 'iero', '▁ac', 'ord', 'ar', 'me', '...']

Model: Mistral-7B-Instruct-v0.2. Length: 20
['▁En', '▁un', '▁lugar', '▁de', '▁la', '▁Man', 'cha', ',', '▁de', '▁cu', 'yo', '▁nombre', '▁no', '▁qu', 'iero', '▁ac', 'ord', 'ar', 'me', '...']

Model: falcon-7b-instruct. Length: 17
['En', 'Ġun', 'Ġlugar', 'Ġde', 'Ġla', 'ĠMan', 'cha', ',', 'Ġde', 'Ġc', 'uyo', 'Ġnombre', 'Ġno', 'Ġquiero', 'Ġacord', 'arme', '...']

Model: Llama-2-7b-chat-hf. Length: 20
['▁En', '▁un', '▁lugar', '▁de', '▁la', '▁Man', 'cha', ',', '▁de', '▁cu', 'yo', '▁nombre', '▁no', '▁qu', 'iero', '▁ac', 'ord', 'ar', 'me', '...']

Model: Smaug-72B-v0.1. Length: 18
['En', 'Ġun', 'Ġlugar', 'Ġde', 'Ġla', 'ĠMan', 'cha', ',', 'Ġde', 'Ġc', 'uyo', 'Ġnombre', 'Ġno', 'Ġquiero', 'Ġac', 'ord', 'arme', '...']

Model: mt5. Length: 18
['▁En', '▁un', '▁lugar', '▁de', '▁la', '▁Manch', 'a', ',', '▁de', '▁cu', 'yo', '▁nombre', '▁no', '▁quier', 'o', '▁acord', 'arme', '...']

LANGUAGE: Polish. TEXT: Litwo, ojczyzno moja! Ty jesteś jak zdrowie; Ile cię trzeba cenić, ten tylko się dowie, kto cię stracił.

Model: Mixtral-8x7B-Instruct-v0.1. Length: 48
['▁Lit', 'wo', ',', '▁o', 'j', 'czy', 'z', 'no', '▁mo', 'ja', '!', '▁Ty', '▁j', 'este', 'ś', '▁jak', '▁zd', 'row', 'ie', ';', '▁I', 'le', '▁ci', 'ę', '▁tr', 'z', 'eb', 'a', '▁c', 'en', 'ić', ',', '▁ten', '▁t', 'yl', 'ko', '▁się', '▁d', 'owie', ',', '▁k', 'to', '▁ci', 'ę', '▁str', 'aci', 'ł', '.']

Model: Mistral-7B-Instruct-v0.2. Length: 48
['▁Lit', 'wo', ',', '▁o', 'j', 'czy', 'z', 'no', '▁mo', 'ja', '!', '▁Ty', '▁j', 'este', 'ś', '▁jak', '▁zd', 'row', 'ie', ';', '▁I', 'le', '▁ci', 'ę', '▁tr', 'z', 'eb', 'a', '▁c', 'en', 'ić', ',', '▁ten', '▁t', 'yl', 'ko', '▁się', '▁d', 'owie', ',', '▁k', 'to', '▁ci', 'ę', '▁str', 'aci', 'ł', '.']

Model: falcon-7b-instruct. Length: 39
['Lit', 'wo', ',', 'Ġo', 'j', 'czy', 'z', 'no', 'Ġmo', 'ja', '!', 'ĠTy', 'Ġjeste', 'ÅĽ', 'Ġjak', 'Ġzdrow', 'ie', ';', 'ĠI', 'le', 'Ġci', 'ÄĻ', 'Ġtrzeba', 'Ġcen', 'iÄĩ', ',', 'Ġten', 'Ġtylko', 'ĠsiÄĻ', 'Ġd', 'owie', ',', 'Ġkto', 'Ġci', 'ÄĻ', 'Ġstr', 'aci', 'ÅĤ', '.']

Model: Llama-2-7b-chat-hf. Length: 47
['▁Lit', 'wo', ',', '▁o', 'j', 'czy', 'z', 'no', '▁mo', 'ja', '!', '▁Ty', '▁j', 'este', 'ś', '▁jak', '▁zd', 'row', 'ie', ';', '▁I', 'le', '▁ci', 'ę', '▁tr', 'z', 'eb', 'a', '▁c', 'eni', 'ć', ',', '▁ten', '▁tyl', 'ko', '▁się', '▁d', 'owie', ',', '▁k', 'to', '▁ci', 'ę', '▁stra', 'ci', 'ł', '.']

Model: Smaug-72B-v0.1. Length: 42
['Lit', 'wo', ',', 'Ġo', 'j', 'czy', 'z', 'no', 'Ġmo', 'ja', '!', 'ĠTy', 'ĠjesteÅĽ', 'Ġjak', 'Ġzd', 'row', 'ie', ';', 'ĠI', 'le', 'Ġci', 'ÄĻ', 'Ġtr', 'ze', 'ba', 'Ġc', 'eni', 'Äĩ', ',', 'Ġten', 'Ġtylko', 'ĠsiÄĻ', 'Ġd', 'owie', ',', 'Ġkto', 'Ġci', 'ÄĻ', 'Ġstr', 'aci', 'ÅĤ', '.']

Model: mt5. Length: 39
['▁Li', 'two', ',', '▁', 'oj', 'czyzn', 'o', '▁moja', '!', '▁Ty', '▁jest', 'eś', '▁jak', '▁', 'zdrowie', ';', '▁Ile', '▁', 'cię', '▁', 'trzeb', 'a', '▁cen', 'ić', ',', '▁ten', '▁', 'tylko', '▁się', '▁d', 'owie', ',', '▁', 'kto', '▁', 'cię', '▁stra', 'cił', '.']

LANGUAGE: Ukrainian. TEXT: Як умру, то поховайте мене на могилі серед степу широкого на Вкраїні милій.

Model: Mixtral-8x7B-Instruct-v0.1. Length: 32
['▁Я', 'к', '▁у', 'м', 'ру', ',', '▁то', '▁по', 'хо', 'ва', 'й', 'те', '▁м', 'ене', '▁на', '▁мо', 'ги', 'лі', '▁серед', '▁сте', 'пу', '▁ши', 'ро', 'кого', '▁на', '▁В', 'краї', 'ні', '▁ми', 'лі', 'й', '.']

Model: Mistral-7B-Instruct-v0.2. Length: 32
['▁Я', 'к', '▁у', 'м', 'ру', ',', '▁то', '▁по', 'хо', 'ва', 'й', 'те', '▁м', 'ене', '▁на', '▁мо', 'ги', 'лі', '▁серед', '▁сте', 'пу', '▁ши', 'ро', 'кого', '▁на', '▁В', 'краї', 'ні', '▁ми', 'лі', 'й', '.']

Model: falcon-7b-instruct. Length: 79
['Ð', '¯', 'к', 'ĠÑĥ', 'Ð', '¼', 'ÑĢÑĥ', ',', 'Ġ', 'ÑĤÐ', '¾', 'ĠпÐ', '¾', 'Ñħ', 'Ð', '¾', 'Ð', '²', 'аÐ', '¹', 'ÑĤ', 'е', 'ĠÐ', '¼', 'еÐ', '½', 'е', 'ĠÐ', '½', 'а', 'ĠÐ', '¼', 'Ð', '¾', 'Ð', '³', 'и', 'л', 'Ñĸ', 'ĠÑģ', 'еÑĢ', 'еÐ', '´', 'ĠÑģ', 'ÑĤ', 'еÐ', '¿', 'Ñĥ', 'ĠÑ', 'Ī', 'и', 'ÑĢÐ', '¾', 'кÐ', '¾', 'Ð', '³', 'Ð', '¾', 'ĠÐ', '½', 'а', 'ĠÐ', 'Ĵ', 'к', 'ÑĢа', 'Ñ', 'Ĺ', 'Ð', '½', 'Ñĸ', 'ĠÐ', '¼', 'и', 'л', 'Ñĸ', 'Ð', '¹', '.']

Model: Llama-2-7b-chat-hf. Length: 29
['▁Я', 'к', '▁у', 'м', 'ру', ',', '▁то', '▁по', 'х', 'ова', 'й', 'те', '▁мене', '▁на', '▁мо', 'ги', 'лі', '▁серед', '▁сте', 'пу', '▁широ', 'кого', '▁на', '▁В', 'краї', 'ні', '▁ми', 'лій', '.']

Model: Smaug-72B-v0.1. Length: 34
['Я', 'к', 'ĠÑĥ', 'м', 'ÑĢÑĥ', ',', 'ĠÑĤо', 'Ġпо', 'Ñħ', 'ов', 'айÑĤе', 'Ġмен', 'е', 'Ġна', 'Ġмог', 'ил', 'Ñĸ', 'ĠÑģеÑĢ', 'ед', 'ĠÑģÑĤеп', 'Ñĥ', 'ĠÑĪиÑĢ', 'ок', 'ого', 'Ġна', 'ĠÐĴ', 'кÑĢа', 'ÑĹ', 'н', 'Ñĸ', 'Ġмил', 'Ñĸ', 'й', '.']

Model: mt5. Length: 23
['▁Як', '▁умр', 'у', ',', '▁то', '▁похо', 'вайте', '▁мене', '▁на', '▁', 'моги', 'лі', '▁серед', '▁сте', 'пу', '▁широк', 'ого', '▁на', '▁В', 'країні', '▁мил', 'ій', '.']

LANGUAGE: Greek. TEXT: ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ

Model: Mixtral-8x7B-Instruct-v0.1. Length: 53
['▁', 'ἄ', 'ν', 'δ', 'ρ', 'α', '▁', 'μ', 'ο', 'ι', '▁', 'ἔ', 'ν', 'ν', 'ε', 'π', 'ε', ',', '▁', 'Μ', 'ο', 'ῦ', 'σ', 'α', ',', '▁', 'π', 'ο', 'λ', 'ύ', 'τ', 'ρ', 'ο', 'π', 'ο', 'ν', ',', '▁', '<0xE1>', '<0xBD>', '<0x83>', 'ς', '▁', 'μ', 'ά', 'λ', 'α', '▁', 'π', 'ο', 'λ', 'λ', 'ὰ']

Model: Mistral-7B-Instruct-v0.2. Length: 53
['▁', 'ἄ', 'ν', 'δ', 'ρ', 'α', '▁', 'μ', 'ο', 'ι', '▁', 'ἔ', 'ν', 'ν', 'ε', 'π', 'ε', ',', '▁', 'Μ', 'ο', 'ῦ', 'σ', 'α', ',', '▁', 'π', 'ο', 'λ', 'ύ', 'τ', 'ρ', 'ο', 'π', 'ο', 'ν', ',', '▁', '<0xE1>', '<0xBD>', '<0x83>', 'ς', '▁', 'μ', 'ά', 'λ', 'α', '▁', 'π', 'ο', 'λ', 'λ', 'ὰ']

Model: falcon-7b-instruct. Length: 60
['á', '¼', 'Ħ', 'Î', '½', 'δ', 'Ïģ', 'α', 'ĠÎ', '¼', 'οÎ', '¹', 'Ġá', '¼', 'Ķ', 'Î', '½', 'Î', '½', 'ε', 'ÏĢ', 'ε', ',', 'ĠÎ', 'ľ', 'ο', 'á', '¿', '¦', 'Ïĥ', 'α', ',', 'ĠÏĢ', 'ο', 'λ', 'Ï', 'į', 'ÏĦ', 'Ïģ', 'ο', 'ÏĢ', 'οÎ', '½', ',', 'Ġá', '½', 'ĥ', 'ÏĤ', 'ĠÎ', '¼', 'ά', 'λ', 'α', 'ĠÏĢ', 'ο', 'λ', 'λ', 'á', '½', '°']

Model: Llama-2-7b-chat-hf. Length: 53
['▁', 'ἄ', 'ν', 'δ', 'ρ', 'α', '▁', 'μ', 'ο', 'ι', '▁', 'ἔ', 'ν', 'ν', 'ε', 'π', 'ε', ',', '▁', 'Μ', 'ο', 'ῦ', 'σ', 'α', ',', '▁', 'π', 'ο', 'λ', 'ύ', 'τ', 'ρ', 'ο', 'π', 'ο', 'ν', ',', '▁', '<0xE1>', '<0xBD>', '<0x83>', 'ς', '▁', 'μ', 'ά', 'λ', 'α', '▁', 'π', 'ο', 'λ', 'λ', 'ὰ']

Model: Smaug-72B-v0.1. Length: 48
['á¼Ħ', 'ν', 'δ', 'Ïģ', 'α', 'Ġμ', 'ο', 'ι', 'Ġá', '¼', 'Ķ', 'ν', 'ν', 'ε', 'ÏĢ', 'ε', ',', 'ĠÎ', 'ľ', 'ο', 'ῦ', 'Ïĥ', 'α', ',', 'ĠÏĢ', 'ο', 'λ', 'Ïį', 'ÏĦ', 'Ïģ', 'ο', 'ÏĢ', 'ο', 'ν', ',', 'Ġá', '½', 'ĥ', 'ÏĤ', 'Ġμ', 'ά', 'λ', 'α', 'ĠÏĢ', 'ο', 'λ', 'λ', 'á½°']

Model: mt5. Length: 26
['▁', 'ἄ', 'νδρ', 'α', '▁', 'μοι', '▁ἔ', 'ν', 'ν', 'επε', ',', '▁Μ', 'οῦ', 'σα', ',', '▁πολύ', 'τροπ', 'ον', ',', '▁', 'ὃ', 'ς', '▁μά', 'λα', '▁πολλ', 'ὰ']

LANGUAGE: Hebrew. TEXT: בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ

Model: Mixtral-8x7B-Instruct-v0.1. Length: 59
['▁', 'ב', 'ְ', 'ּ', 'ר', 'ֵ', 'א', 'ש', 'ִ', 'ׁ', 'י', 'ת', '▁', 'ב', 'ָ', 'ּ', 'ר', 'ָ', 'א', '▁', 'א', '<0xD6>', '<0xB1>', 'ל', 'ֹ', 'ה', 'ִ', 'י', 'ם', '▁', 'א', 'ֵ', 'ת', '▁', 'ה', 'ַ', 'ש', 'ָ', 'ּ', 'ׁ', 'מ', 'ַ', 'י', 'ִ', 'ם', '▁', 'ו', 'ְ', 'א', 'ֵ', 'ת', '▁', 'ה', 'ָ', 'א', 'ָ', 'ר', 'ֶ', 'ץ']

Model: Mistral-7B-Instruct-v0.2. Length: 59
['▁', 'ב', 'ְ', 'ּ', 'ר', 'ֵ', 'א', 'ש', 'ִ', 'ׁ', 'י', 'ת', '▁', 'ב', 'ָ', 'ּ', 'ר', 'ָ', 'א', '▁', 'א', '<0xD6>', '<0xB1>', 'ל', 'ֹ', 'ה', 'ִ', 'י', 'ם', '▁', 'א', 'ֵ', 'ת', '▁', 'ה', 'ַ', 'ש', 'ָ', 'ּ', 'ׁ', 'מ', 'ַ', 'י', 'ִ', 'ם', '▁', 'ו', 'ְ', 'א', 'ֵ', 'ת', '▁', 'ה', 'ָ', 'א', 'ָ', 'ר', 'ֶ', 'ץ']

Model: falcon-7b-instruct. Length: 96
['×', 'ij', 'Ö', '°', 'Ö', '¼', '×', '¨', 'Ö', 'µ', '×', 'IJ', '×', '©', 'Ö', '´', '×', 'ģ', '×Ļ×', 'ª', 'Ġ×', 'ij', 'Ö', '¸', 'Ö', '¼', '×', '¨', 'Ö', '¸', '×', 'IJ', 'Ġ×', 'IJ', 'Ö', '±', '×', 'ľ', 'Ö', '¹', '×Ķ', 'Ö', '´', '×Ļ×', 'Ŀ', 'Ġ×', 'IJ', 'Ö', 'µ', '×', 'ª', 'Ġ×', 'Ķ', 'Ö', '·', '×', '©', 'Ö', '¸', 'Ö', '¼', '×', 'ģ', '×', 'ŀ', 'Ö', '·', '×Ļ', 'Ö', '´', '×', 'Ŀ', 'Ġ×', 'ķ', 'Ö', '°', '×', 'IJ', 'Ö', 'µ', '×', 'ª', 'Ġ×', 'Ķ', 'Ö', '¸', '×', 'IJ', 'Ö', '¸', '×', '¨', 'Ö', '¶', '×', '¥']

Model: Llama-2-7b-chat-hf. Length: 59
['▁', 'ב', 'ְ', 'ּ', 'ר', 'ֵ', 'א', 'ש', 'ִ', 'ׁ', 'י', 'ת', '▁', 'ב', 'ָ', 'ּ', 'ר', 'ָ', 'א', '▁', 'א', '<0xD6>', '<0xB1>', 'ל', 'ֹ', 'ה', 'ִ', 'י', 'ם', '▁', 'א', 'ֵ', 'ת', '▁', 'ה', 'ַ', 'ש', 'ָ', 'ּ', 'ׁ', 'מ', 'ַ', 'י', 'ִ', 'ם', '▁', 'ו', 'ְ', 'א', 'ֵ', 'ת', '▁', 'ה', 'ָ', 'א', 'ָ', 'ר', 'ֶ', 'ץ']

Model: Smaug-72B-v0.1. Length: 72
['×ij', 'Ö', '°', 'Ö', '¼', 'ר', 'Ö', 'µ', '×IJ', 'ש', 'Ö', '´', '×', 'ģ', '×Ļת', 'Ġ×ij', 'Ö', '¸', 'Ö', '¼', 'ר', 'Ö', '¸', '×IJ', 'Ġ×IJ', 'Ö', '±', '׾', 'Ö', '¹', '×Ķ', 'Ö', '´', '×Ļ×Ŀ', 'Ġ×IJ', 'Ö', 'µ', 'ת', 'Ġ×Ķ', 'Ö', '·', 'ש', 'Ö', '¸', 'Ö', '¼', '×', 'ģ', '×ŀ', 'Ö', '·', '×Ļ', 'Ö', '´', '×Ŀ', 'Ġ×ķ', 'Ö', '°', '×IJ', 'Ö', 'µ', 'ת', 'Ġ×Ķ', 'Ö', '¸', '×IJ', 'Ö', '¸', 'ר', 'Ö', '¶', '×¥']

Model: mt5. Length: 35
['▁ב', 'ְּ', 'רֵ', 'אש', 'ִ', 'ׁ', 'ית', '▁ב', 'ָּ', 'ר', 'ָא', '▁א', 'ֱ', 'ל', 'ֹ', 'ה', 'ִים', '▁', 'אֵ', 'ת', '▁הַ', 'ש', 'ָּ', 'ׁ', 'מ', 'ַיִ', 'ם', '▁וְא', 'ֵ', 'ת', '▁ה', 'ָ', 'אָר', 'ֶ', 'ץ']

LANGUAGE: Arabic. TEXT: حدثني أبي، عن جدي، عن أجداده

Model: Mixtral-8x7B-Instruct-v0.1. Length: 29
['▁', 'ح', 'د', 'ث', 'ن', 'ي', '▁', 'أ', 'ب', 'ي', '،', '▁', 'ع', 'ن', '▁', 'ج', 'د', 'ي', '،', '▁', 'ع', 'ن', '▁', 'أ', 'ج', 'د', 'ا', 'د', 'ه']

Model: Mistral-7B-Instruct-v0.2. Length: 29
['▁', 'ح', 'د', 'ث', 'ن', 'ي', '▁', 'أ', 'ب', 'ي', '،', '▁', 'ع', 'ن', '▁', 'ج', 'د', 'ي', '،', '▁', 'ع', 'ن', '▁', 'أ', 'ج', 'د', 'ا', 'د', 'ه']

Model: falcon-7b-instruct. Length: 33
['Ø', 'Ń', 'د', 'Ø', '«', 'ÙĨ', 'ÙĬ', 'ĠØ', '£', 'ب', 'ÙĬ', 'Ø', 'Į', 'ĠØ', '¹', 'ÙĨ', 'ĠØ', '¬', 'د', 'ÙĬ', 'Ø', 'Į', 'ĠØ', '¹', 'ÙĨ', 'ĠØ', '£', 'Ø', '¬', 'د', 'اØ', '¯', 'Ùĩ']

Model: Llama-2-7b-chat-hf. Length: 29
['▁', 'ح', 'د', 'ث', 'ن', 'ي', '▁', 'أ', 'ب', 'ي', '،', '▁', 'ع', 'ن', '▁', 'ج', 'د', 'ي', '،', '▁', 'ع', 'ن', '▁', 'أ', 'ج', 'د', 'ا', 'د', 'ه']

Model: Smaug-72B-v0.1. Length: 13
['ØŃدث', 'ÙĨÙĬ', 'ĠأبÙĬ', 'ØĮ', 'ĠعÙĨ', 'Ġج', 'دÙĬ', 'ØĮ', 'ĠعÙĨ', 'ĠØ£', 'جد', 'اد', 'Ùĩ']

Model: mt5. Length: 12
['▁', 'حدث', 'ني', '▁', 'أبي', '،', '▁عن', '▁جدي', '،', '▁عن', '▁أج', 'داده']

LANGUAGE: Hindi. TEXT: कर्मण्येवाधिकारस्ते मा फलेषु कदाचन। मा कर्मफलहेतुर्भूर्मा ते संगोऽस्त्वकर्मणि।

Model: Mixtral-8x7B-Instruct-v0.1. Length: 81
['▁', 'क', 'र', '्', 'म', 'ण', '्', 'य', 'े', 'व', 'ा', 'ध', 'ि', 'क', 'ा', 'र', 'स', '्', 'त', 'े', '▁', 'म', 'ा', '▁', 'फ', 'ल', 'े', 'ष', 'ु', '▁', 'क', 'द', 'ा', 'च', 'न', '।', '▁', 'म', 'ा', '▁', 'क', 'र', '्', 'म', 'फ', 'ल', 'ह', 'े', 'त', 'ु', 'र', '्', 'भ', 'ू', 'र', '्', 'म', 'ा', '▁', 'त', 'े', '▁', 'स', 'ं', 'ग', 'ो', '<0xE0>', '<0xA4>', '<0xBD>', 'स', '्', 'त', '्', 'व', 'क', 'र', '्', 'म', 'ण', 'ि', '।']

Model: Mistral-7B-Instruct-v0.2. Length: 81
['▁', 'क', 'र', '्', 'म', 'ण', '्', 'य', 'े', 'व', 'ा', 'ध', 'ि', 'क', 'ा', 'र', 'स', '्', 'त', 'े', '▁', 'म', 'ा', '▁', 'फ', 'ल', 'े', 'ष', 'ु', '▁', 'क', 'द', 'ा', 'च', 'न', '।', '▁', 'म', 'ा', '▁', 'क', 'र', '्', 'म', 'फ', 'ल', 'ह', 'े', 'त', 'ु', 'र', '्', 'भ', 'ू', 'र', '्', 'म', 'ा', '▁', 'त', 'े', '▁', 'स', 'ं', 'ग', 'ो', '<0xE0>', '<0xA4>', '<0xBD>', 'स', '्', 'त', '्', 'व', 'क', 'र', '्', 'म', 'ण', 'ि', '।']

Model: falcon-7b-instruct. Length: 127
['à¤', 'ķ', 'र', 'à¥į', 'à¤', '®', 'à¤', '£', 'à¥į', 'à¤', '¯', 'à¥', 'ĩ', 'à¤', 'µ', 'à¤', '¾', 'à¤', '§', 'à¤', '¿', 'à¤', 'ķ', 'à¤', '¾', 'र', 'à¤', '¸', 'à¥į', 'à¤', '¤', 'à¥', 'ĩ', 'Ġà¤', '®', 'à¤', '¾', 'Ġà¤', '«', 'à¤', '²', 'à¥', 'ĩ', 'à¤', '·', 'à¥', 'ģ', 'Ġà¤', 'ķ', 'à¤', '¦', 'à¤', '¾', 'à¤', 'ļ', 'à¤', '¨', 'à¥', '¤', 'Ġà¤', '®', 'à¤', '¾', 'Ġà¤', 'ķ', 'र', 'à¥į', 'à¤', '®', 'à¤', '«', 'à¤', '²', 'à¤', '¹', 'à¥', 'ĩ', 'à¤', '¤', 'à¥', 'ģ', 'र', 'à¥į', 'à¤', 'Ń', 'à¥', 'Ĥ', 'र', 'à¥į', 'à¤', '®', 'à¤', '¾', 'Ġà¤', '¤', 'à¥', 'ĩ', 'Ġà¤', '¸', 'à¤', 'Ĥ', 'à¤', 'Ĺ', 'à¥', 'ĭ', 'à¤', '½', 'à¤', '¸', 'à¥į', 'à¤', '¤', 'à¥į', 'à¤', 'µ', 'à¤', 'ķ', 'र', 'à¥į', 'à¤', '®', 'à¤', '£', 'à¤', '¿', 'à¥', '¤']

Model: Llama-2-7b-chat-hf. Length: 85
['▁', 'क', 'र', '्', 'म', 'ण', '्', 'य', 'े', 'व', 'ा', 'ध', 'ि', 'क', 'ा', 'र', 'स', '्', 'त', 'े', '▁', 'म', 'ा', '▁', '<0xE0>', '<0xA4>', '<0xAB>', 'ल', 'े', 'ष', 'ु', '▁', 'क', 'द', 'ा', 'च', 'न', '।', '▁', 'म', 'ा', '▁', 'क', 'र', '्', 'म', '<0xE0>', '<0xA4>', '<0xAB>', 'ल', 'ह', 'े', 'त', 'ु', 'र', '्', 'भ', 'ू', 'र', '्', 'म', 'ा', '▁', 'त', 'े', '▁', 'स', 'ं', 'ग', 'ो', '<0xE0>', '<0xA4>', '<0xBD>', 'स', '्', 'त', '्', 'व', 'क', 'र', '्', 'म', 'ण', 'ि', '।']

Model: Smaug-72B-v0.1. Length: 75
['à¤ķ', 'र', 'à¥įà¤', '®', 'ण', 'à¥įà¤', '¯', 'à¥ĩà¤', 'µ', 'ाà¤', '§', 'िà¤', 'ķ', 'ाà¤', '°', 'स', 'à¥įà¤', '¤', 'à¥ĩ', 'Ġम', 'ा', 'Ġà¤', '«', 'ल', 'à¥ĩà¤', '·', 'à¥ģ', 'Ġà¤ķ', 'द', 'ाà¤', 'ļ', 'न', '।', 'Ġम', 'ा', 'Ġà¤ķ', 'र', 'à¥įà¤', '®', 'फ', 'ल', 'ह', 'à¥ĩà¤', '¤', 'à¥ģ', 'र', 'à¥įà¤', 'Ń', 'à¥', 'Ĥ', 'र', 'à¥įà¤', '®', 'ा', 'Ġà¤', '¤', 'à¥ĩ', 'Ġस', 'à¤Ĥ', 'à¤Ĺ', 'à¥ĭ', 'à¤', '½', 'स', 'à¥įà¤', '¤', 'à¥įà¤', 'µ', 'à¤ķ', 'र', 'à¥įà¤', '®', 'ण', 'ि', '।']

Model: mt5. Length: 34
['▁कर्म', 'ण्य', 'े', 'व', 'ाधिकार', 'स्', 'ते', '▁मा', '▁', 'फल', 'ेष', 'ु', '▁क', 'दा', 'चन', '।', '▁मा', '▁कर्म', 'फल', 'हे', 'तु', 'र्भ', 'ू', 'र्मा', '▁', 'ते', '▁सं', 'गो', 'ऽ', 'स्', 'त्व', 'कर्म', 'णि', '।']

LANGUAGE: Chinese. TEXT: 道可道,非常道。名可名,非常名。

Model: Mixtral-8x7B-Instruct-v0.1. Length: 17
['▁', '道', '可', '道', ',', '非', '常', '道', '。', '名', '可', '名', ',', '非', '常', '名', '。']

Model: Mistral-7B-Instruct-v0.2. Length: 17
['▁', '道', '可', '道', ',', '非', '常', '道', '。', '名', '可', '名', ',', '非', '常', '名', '。']

Model: falcon-7b-instruct. Length: 18
['éģĵ', 'åı¯', 'éģĵ', 'ï', '¼', 'Į', 'éĿŀ常', 'éģĵ', 'ãĢĤ', 'åIJį', 'åı¯', 'åIJį', 'ï', '¼', 'Į', 'éĿŀ常', 'åIJį', 'ãĢĤ']

Model: Llama-2-7b-chat-hf. Length: 17
['▁', '道', '可', '道', ',', '非', '常', '道', '。', '名', '可', '名', ',', '非', '常', '名', '。']

Model: Smaug-72B-v0.1. Length: 14
['éģĵ', 'åı¯', 'éģĵ', 'ï¼Į', 'éĿŀ常', 'éģĵ', 'ãĢĤ', 'åIJį', 'åı¯', 'åIJį', 'ï¼Į', 'éĿŀ常', 'åIJį', 'ãĢĤ']

Model: mt5. Length: 15
['▁', '道', '可', '道', ',', '非常', '道', '。', '名', '可', '名', ',', '非常', '名', '。']

LANGUAGE: Japanese. TEXT: 私は、その男の写真を三葉、見たことがある。

Model: Mixtral-8x7B-Instruct-v0.1. Length: 22
['▁', '私', 'は', '、', 'そ', 'の', '男', 'の', '写', '真', 'を', '三', '葉', '、', '見', 'た', 'こ', 'と', 'が', 'あ', 'る', '。']

Model: Mistral-7B-Instruct-v0.2. Length: 22
['▁', '私', 'は', '、', 'そ', 'の', '男', 'の', '写', '真', 'を', '三', '葉', '、', '見', 'た', 'こ', 'と', 'が', 'あ', 'る', '。']

Model: falcon-7b-instruct. Length: 26
['ç§ģ', 'ãģ¯', 'ãĢģ', 'ãģ', 'Ŀ', 'ãģ®', 'çĶ·', 'ãģ®', 'åĨĻ', '羣', 'ãĤĴ', 'ä¸ī', 'èij', 'ī', 'ãĢģ', 'è¦', 'ĭ', 'ãģŁ', 'ãģ', 'ĵ', 'ãģ¨', 'ãģĮ', 'ãģ', 'Ĥ', 'ãĤĭ', 'ãĢĤ']

Model: Llama-2-7b-chat-hf. Length: 24
['▁', '<0xE7>', '<0xA7>', '<0x81>', 'は', '、', 'そ', 'の', '男', 'の', '写', '真', 'を', '三', '葉', '、', '見', 'た', 'こ', 'と', 'が', 'あ', 'る', '。']

Model: Smaug-72B-v0.1. Length: 14
['ç§ģãģ¯', 'ãĢģ', 'ãģĿãģ®', 'çĶ·', 'ãģ®', 'åĨĻ', '羣', 'ãĤĴ', 'ä¸ī', 'èijī', 'ãĢģ', 'è¦ĭãģŁ', 'ãģĵãģ¨ãģĮãģĤãĤĭ', 'ãĢĤ']

Model: mt5. Length: 11
['▁私は', '、', 'その', '男の', '写真を', '三', '葉', '、', '見たこと', 'がある', '。']

Now, let's compile and visualize the results to make them easier to understand.

import matplotlib.pyplot as plt

models = list(results.keys())
languages = list(results[models[0]].keys())

# Token counts per model, kept in the same language order for every model
data = {model: [results[model].get(lang, {}).get('length', 0) for lang in languages] for model in models}

# Plotting: one stacked bar per model, one segment per language
fig, ax = plt.subplots(figsize=(14, 8))

model_indices = range(len(models))
width = 0.35

for i, lang in enumerate(languages):
    lengths = [data[model][i] for model in models]
    if i == 0:
        ax.bar(model_indices, lengths, width, label=lang)
    else:
        # Stack this language's segment on top of those already plotted
        prev_values = [sum(data[model][:i]) for model in models]
        ax.bar(model_indices, lengths, width, bottom=prev_values, label=lang)

ax.set_ylabel('Token Length')
ax.set_title('Token Length by Model and Language')
ax.set_xticks(model_indices)
ax.set_xticklabels(models, rotation=45, ha="right")
ax.legend(title="Languages", bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()
Tokenizer performance for Falcon, Mistral, Mixtral, Smaug, and other open-source LLMs

Each tokenizer comes with a different vocabulary size, which influences the amount of data required for fine-tuning the LLM. Let's compare these vocabulary sizes.

vocab_sizes = {name: tokenizer.vocab_size for name, tokenizer in tokenizers.items()}

plt.figure(figsize=(10, 6))
plt.bar(range(len(vocab_sizes)), list(vocab_sizes.values()), tick_label=list(vocab_sizes.keys()))
plt.xticks(rotation=45, ha="right")
plt.ylabel('Vocabulary Size')
plt.title('Vocabulary Sizes of Various Tokenizers')
plt.tight_layout()
plt.show()
Vocabulary sizes for Falcon, Mistral, Mixtral, Smaug, and other open-source LLMs
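
One caveat worth keeping in mind: vocab_size reports only the base vocabulary and ignores tokens added on top of it. If you want the full count, len(tokenizer) is the safer measure; the check below is a small supplement to the original comparison.

# vocab_size covers only the base vocabulary; len(tokenizer) also counts added special tokens,
# so the two can differ slightly for some of these models.
full_sizes = {name: len(tokenizer) for name, tokenizer in tokenizers.items()}
print(full_sizes)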

The remarkable quality of mt5's tokenization can largely be attributed to its extensive vocabulary size. However, the effectiveness of a tokenizer involves more than just its size. For instance, even though Smaug's vocabulary size is significantly larger than that of Llama-2, their tokenization performance is relatively similar.

It's clear that none of the popular LLMs match the efficiency of the mt5 tokenizer, with falcon-7b-instruct showing particular inefficiency on average. However, for certain languages, especially those within the Romance and Germanic families, the models perform comparably well. Yet, when we explore languages outside these families, the performance of popular LLMs in multilingual tasks drops significantly. An inferior multilingual tokenizer renders fine-tuning for such tasks nearly impossible. In our projects, we attempted to fine-tune Llama-2 for multilingual tasks using a vast and carefully curated dataset of over a billion tokens, only to switch to mt5 partway through the project.
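
One rough way to quantify this gap is to normalize every model's token count against mt5 for each language. The sketch below reuses the results dictionary built earlier and is our illustration rather than part of the original experiments; ratios well above 1.0 mean the tokenizer needs considerably more tokens than mt5 for the same text.

baseline = 'mt5'
for language in sentences:
    # How many more tokens each model needs compared to the mt5 baseline
    ratios = {model: results[model][language]['length'] / results[baseline][language]['length']
              for model in results if model != baseline}
    worst = max(ratios, key=ratios.get)
    print(f"{language}: {worst} needs {ratios[worst]:.1f}x the tokens of {baseline}")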

Since mt5 is only available as a base model, teaching it to follow instructions takes substantial effort and a large, high-quality dataset. Our clients were fortunate to have such a resource, but data of that kind is rare and generally not publicly available.

Despite the impressive landscape of LLMs, there remains a gap for a high-performing multilingual model. The design of the tokenizers in the most powerful open-access LLMs inherently limits their effectiveness in certain languages, and this is not something additional training can overcome. Currently, training mt5 from the base model appears to be the most promising path to a truly multilingual model. Without it, creating a genuinely multilingual open-access LLM remains next to impossible.

Building your own custom LLM can enhance data security and compliance and give your product an AI competitive advantage. You can check our other posts for an extensive explanation of what the network effect is and how AI enables it, how to build an AI competitive advantage for your company, what culture helps you build the right AI products, what to avoid in your AI strategy and execution, and more.

If you need help building an AI product for your business, look no further. Our team of AI technology consultants and engineers has years of experience helping technology companies like yours build sustainable competitive advantages through AI technology. From data collection to algorithm development, we can help you stay ahead of the competition and secure your market share for years to come.

Contact us today to learn more about our AI technology consulting offering.

If you want to keep posted on how to build a sustainable competitive advantage with AI technologies, please subscribe to our blog post updates below.
