In July 2023, Meta released Llama-2, a family of open-access large language models that quickly became a favorite for teams concerned about data security and interested in building their own custom models. Unlike generic third-party models accessed through an API, Llama-2 can be adapted to your needs, and it has become a leading base for fine-tuned and specialized models that often perform strongly on standard benchmarks.
Available through Huggingface, Llama-2 comes in three sizes: 7B, 13B, and 70B parameters. Each size is offered both as a base foundation model and as a chat model tuned for dialogue and question answering. Meta also introduced CodeLlama, a variant of Llama-2 trained specifically to help with coding tasks.
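As a point of reference, here is a minimal sketch of how one of the chat checkpoints can be loaded with the Transformers library; the meta-llama repositories are gated, so this assumes you have requested access on Huggingface and are authenticated locally:
> import torch
> from transformers import AutoModelForCausalLM, AutoTokenizer
> tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
> model = AutoModelForCausalLM.from_pretrained(
>     'meta-llama/Llama-2-7b-chat-hf',
>     torch_dtype=torch.float16,  # half precision keeps the 7B weights around 14 GB
>     device_map='auto',          # requires the accelerate package
> )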
If you're involved in AI and need a custom large language model (LLM), Llama-2 is likely on your radar as a potential starting point. We've been deeply involved with customizing, fine-tuning, and deploying Llama-2. Through this series of posts, we aim to share our insights on what Llama-2 can and can't do, along with detailed instructions and best practices for fine-tuning.
LLM Practitioner's Guide:
How Multilingual Is Llama-2, Actually?
How Multilingual Are Falcon, Mistral, Smaug, and Other LLMs?
In today's post, we will explore the most significant limitation of Llama-2: its restricted language support. Despite its capabilities, extending Llama-2 to languages beyond a handful of the most widely spoken European ones is a challenging task. Let us see why.
The Paths It Won't Tread: Understanding Llama-2's Language Limitations
At the core of Llama-2's limitations is its tokenizer, a crucial component in how it processes language.
Large Language Models (LLMs) like Llama-2 operate using numbers. To understand text, these models first convert it into a numerical format. The significant challenge in natural language processing is figuring out the most effective way to represent languages numerically.
One basic method is to assign each letter or symbol a unique number, for instance, "a" becomes 1, "b" becomes 2, and so forth, and have the model predict the next number in the sequence. However, this turns even a short sentence like "Predicting the next word with a large language model is easy." into a long sequence of 61 numbers, one per character. A mistake in predicting any single character affects everything that follows, which makes this approach inefficient and error-prone.
Another method is to represent each whole word with a number. In this system, the sentence becomes a series of just 11 numbers, far shorter than the letter-by-letter sequence. This is easier for the model, but it creates a different problem: English alone has over 200,000 words, so a model covering multiple languages would have to choose among millions of numbers at every step, which increases the likelihood of errors.
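A quick sanity check in plain Python makes the difference between these two naïve encodings concrete for the example sentence:
> sentence = 'Predicting the next word with a large language model is easy.'
> print(len(sentence))          # length of a character-level encoding
61
> print(len(sentence.split()))  # length of a word-level encoding
11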
To tackle these issues, researchers developed a compromise called tokenization. Tokenization splits words into smaller parts, called tokens, that are shared across many words. For instance, "predicting" might be split into "pred," "ict," and "ing." Instead of guessing each letter or each entire word, the model predicts these smaller, more manageable pieces. Tokens dramatically reduce the number of options the model has to consider: a couple of thousand subword tokens are enough to cover English text, rather than an entire dictionary, making the process more efficient and accurate.
Let's delve into the Llama-2 tokenizer.
First, we import it from the Huggingface Transformers library:
> from transformers import AutoTokenizer
> tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', use_fast=True)
The Llama-2 tokenizer has a vocabulary of 32,000 tokens:
> print(tokenizer.vocab_size)
32000
Now, let's observe it in action. Consider this sentence:
> print(tokenizer.tokenize('Call me Ishmael. Some years ago — never mind how long precisely — having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.'))
['▁Call', '▁me', '▁I', 'sh', 'ma', 'el', '.', '▁Some', '▁years', '▁ago', '▁—', '▁never', '▁mind', '▁how', '▁long', '▁precisely', '▁—', '▁having', '▁little', '▁or', '▁no', '▁money', '▁in', '▁my', '▁pur', 'se', ',', '▁and', '▁nothing', '▁particular', '▁to', '▁interest', '▁me', '▁on', '▁shore', ',', '▁I', '▁thought', '▁I', '▁would', '▁sail', '▁about', '▁a', '▁little', '▁and', '▁see', '▁the', '▁wat', 'ery', '▁part', '▁of', '▁the', '▁world', '.']
The output demonstrates efficient tokenization: common words become single tokens, and rarer words are broken into a few meaningful pieces. For example, "watery" splits neatly into "wat" and "ery," while the proper name "Ishmael" is represented roughly syllable by syllable.
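A handy way to quantify this efficiency is the average number of characters covered by a single token: the higher it is, the denser the tokenization. The helper below is our own convenience function, not part of the Transformers API; it continues the session started above and will be reused later in the post:
> def chars_per_token(text):
>     # Average number of characters covered by one Llama-2 token (higher is denser).
>     return len(text) / len(tokenizer.tokenize(text))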
Now, let's test it with a German sentence. Even if you do not speak German or the other languages mentioned below, the rule of thumb is simple: effective tokenization produces a smaller number of longer tokens, while weak tokenization produces a larger number of shorter ones.
> print(tokenizer.tokenize('Da steh ich nun, ich armer Tor! Und bin so klug als wie zuvor.'))
['▁Da', '▁ste', 'h', '▁ich', '▁nun', ',', '▁ich', '▁ar', 'mer', '▁Tor', '!', '▁Und', '▁bin', '▁so', '▁kl', 'ug', '▁als', '▁wie', '▁zu', 'vor', '.']
The tokenizer performs admirably, reflecting the structure of the language well, especially given that German, with its long compounds and richer inflection, is harder to tokenize than English.
Let us see how Llama-2 deals with French.
> print(tokenizer.tokenize("On ne voit bien qu'avec le cœur. L'essentiel est invisible pour les yeux."))
['▁On', '▁ne', '▁voit', '▁bien', '▁qu', "'", 'ave', 'c', '▁le', '▁c', 'œur', '.', '▁L', "'", 'ess', 'ent', 'iel', '▁est', '▁invisible', '▁pour', '▁les', '▁yeux', '.']
Again, we see a very good language representation.
If we continue with Romance and Germanic languages, we will see that Llama-2 does quite well with them. So, without further ado, let us move on to Slavic ones and start with Polish.
> print(tokenizer.tokenize('Litwo, ojczyzno moja! Ty jesteś jak zdrowie; Ile cię trzeba cenić, ten tylko się dowie, kto cię stracił.'))
['▁Lit', 'wo', ',', '▁o', 'j', 'czy', 'z', 'no', '▁mo', 'ja', '!', '▁Ty', '▁j', 'este', 'ś', '▁jak', '▁zd', 'row', 'ie', ';', '▁I', 'le', '▁ci', 'ę', '▁tr', 'z', 'eb', 'a', '▁c', 'eni', 'ć', ',', '▁ten', '▁tyl', 'ko', '▁się', '▁d', 'owie', ',', '▁k', 'to', '▁ci', 'ę', '▁stra', 'ci', 'ł', '.']
Polish, with its rich inflection, is notoriously hard for language models, and you can immediately spot the decline in tokenization quality: the tokens become much shorter, and their number increases.
Now, let us look at Ukrainian.
> print(tokenizer.tokenize('Як умру, то поховайте мене на могилі серед степу широкого на Вкраїні милій.'))
['▁Я', 'к', '▁у', 'м', 'ру', ',', '▁то', '▁по', 'х', 'ова', 'й', 'те', '▁мене', '▁на', '▁мо', 'ги', 'лі', '▁серед', '▁сте', 'пу', '▁широ', 'кого', '▁на', '▁В', 'краї', 'ні', '▁ми', 'лій', '.']
Here, we see a similar case: even if you do not read Ukrainian, you will notice lots of short tokens. Not good.
Yet, it is still something. Now, let us take a look at Greek.
> print(tokenizer.tokenize('ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ'))
['▁', 'ἄ', 'ν', 'δ', 'ρ', 'α', '▁', 'μ', 'ο', 'ι', '▁', 'ἔ', 'ν', 'ν', 'ε', 'π', 'ε', ',', '▁', 'Μ', 'ο', 'ῦ', 'σ', 'α', ',', '▁', 'π', 'ο', 'λ', 'ύ', 'τ', 'ρ', 'ο', 'π', 'ο', 'ν', ',', '▁', '<0xE1>', '<0xBD>', '<0x83>', 'ς', '▁', 'μ', 'ά', 'λ', 'α', '▁', 'π', 'ο', 'λ', 'λ', 'ὰ']
Here, we see the worst-case scenario: the tokenizer falls back to the naïve character-by-character representation, and for some characters it goes even further, down to raw UTF-8 bytes.
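The '<0xE1>', '<0xBD>', '<0x83>' entries in the output are byte-fallback tokens: the character ὃ is not in the 32,000-token vocabulary at all, so the tokenizer emits its three raw UTF-8 bytes instead. A quick check in plain Python confirms which character they encode:
> print(bytes([0xE1, 0xBD, 0x83]).decode('utf-8'))
ὃ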
Now, how does it read Hebrew?
> print(tokenizer.tokenize('בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ'))
['▁', 'ב', 'ְ', 'ּ', 'ר', 'ֵ', 'א', 'ש', 'ִ', 'ׁ', 'י', 'ת', '▁', 'ב', 'ָ', 'ּ', 'ר', 'ָ', 'א', '▁', 'א', '<0xD6>', '<0xB1>', 'ל', 'ֹ', 'ה', 'ִ', 'י', 'ם', '▁', 'א', 'ֵ', 'ת', '▁', 'ה', 'ַ', 'ש', 'ָ', 'ּ', 'ׁ', 'מ', 'ַ', 'י', 'ִ', 'ם', '▁', 'ו', 'ְ', 'א', 'ֵ', 'ת', '▁', 'ה', 'ָ', 'א', 'ָ', 'ר', 'ֶ', 'ץ']
Again, the representation is character-by-character, with byte fallback creeping in for one of the vowel points.
How about Chinese?
> print(tokenizer.tokenize("道可道,非常道。名可名,非常名。"))
['▁', '道', '可', '道', ',', '非', '常', '道', '。', '名', '可', '名', ',', '非', '常', '名', '。']
Same character-wise representation here.
If we keep doing this with other languages, we will see the same pattern: the Llama-2 tokenizer does well with Romance and Germanic languages while quickly degrading to character-wise (or even byte-wise) representation for the rest.
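The chars_per_token helper we defined earlier makes this pattern easy to see side by side. Reusing the example sentences from above (the exact figures depend on the text, but the ordering is stable), a rough comparison looks like this:
> samples = {
>     'English': 'Call me Ishmael. Some years ago — never mind how long precisely — having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.',
>     'German': 'Da steh ich nun, ich armer Tor! Und bin so klug als wie zuvor.',
>     'French': "On ne voit bien qu'avec le cœur. L'essentiel est invisible pour les yeux.",
>     'Polish': 'Litwo, ojczyzno moja! Ty jesteś jak zdrowie; Ile cię trzeba cenić, ten tylko się dowie, kto cię stracił.',
>     'Ukrainian': 'Як умру, то поховайте мене на могилі серед степу широкого на Вкраїні милій.',
>     'Greek': 'ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ',
>     'Hebrew': 'בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ',
>     'Chinese': '道可道,非常道。名可名,非常名。',
> }
> for language, text in samples.items():
>     # Characters per token: around four for English, close to one for the character-wise cases.
>     print(f'{language}: {chars_per_token(text):.2f}')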
Admittedly, Meta has indicated that Llama-2 was primarily trained on a text corpus that is about 90% English, suggesting the model might not be as effective with other languages. This warning is important, but it can also be misleading: the model was not merely trained mostly on English, it was designed with English in mind, starting with its tokenizer.
Reading the model's description, some may believe that training Llama-2 on non-English data could significantly improve its performance in other languages. Fine-tuning does help to a degree, but it cannot address the inherent limitations of the tokenizer design. Consequently, Llama-2 naturally performs better with some languages, such as English, German, Spanish, or French, than with others, like Polish or Ukrainian, and extending it to Greek, Hebrew, Chinese, and many other languages is harder still. Since the tokenizer is baked into the model and cannot simply be swapped out without relearning the embeddings, there is no easy way around this issue.
In one of our projects, we attempted to fine-tune Llama-2 for multilingual tasks on an enormous, carefully curated supervised dataset of more than a billion tokens, only to switch to another model halfway through.
For those needing to fine-tune a language model in any of these more challenging languages, Llama-2 may not be the optimal choice. Selecting a smaller model with a more suitable tokenizer can lead to better results. Our experience in developing multilingual models suggests that opting for smaller, older models with better tokenizers and customizing them for specific tasks and languages can be far more effective than starting from Llama-2.
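Before committing to a base model, it is worth running the same kind of check on the candidates' tokenizers with text in your target language. Here is a rough sketch; the checkpoints named below, such as xlm-roberta-base and google/mt5-small, are simply examples of models with vocabularies built from heavily multilingual corpora, not specific recommendations:
> from transformers import AutoTokenizer
> candidates = ['meta-llama/Llama-2-7b-hf', 'xlm-roberta-base', 'google/mt5-small']
> sample = 'Як умру, то поховайте мене на могилі серед степу широкого на Вкраїні милій.'
> for name in candidates:
>     tok = AutoTokenizer.from_pretrained(name)
>     # More characters per token means the vocabulary covers the language better.
>     print(f'{name}: {len(sample) / len(tok.tokenize(sample)):.2f} characters per token')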
Building your custom LLM could enhance data security and compliance and enable an AI competitive advantage for your product. You can check our other posts to get an extensive explanation of what the network effect is and how AI enables it, how to build an AI competitive advantage for your company, what culture helps you build the right AI products, what to avoid in your AI strategy and execution, and more.
If you need help building an AI product for your business, look no further. Our team of AI technology consultants and engineers has years of experience helping technology companies like yours build sustainable competitive advantages through AI. From data collection to algorithm development, we can help you stay ahead of the competition and secure your market share for years to come.
Contact us today to learn more about our AI technology consulting offering.
If you want to stay posted on how to build a sustainable competitive advantage with AI technologies, please subscribe to our blog post updates below.