On February 21, 2024, Google unveiled its new open-access Large Language Model (LLM), Gemma. This release is a significant milestone, much more important than it might initially seem.
LLM Practitioner's Guide:
How Multilingual Are Falcon, Mistral, Smaug, and Other LLMs?
Gemma, a Game-Changing Multilingual LLM
Ever since ChatGPT burst onto the scene, the realm of open-access LLMs has seen explosive growth. From the start, we've been deeply involved in training and customizing open-access chat LLMs. Despite the surrounding excitement, those of us working with multilingual models have been acutely aware of a critical issue. For languages outside the Romance and Germanic families, inefficient tokenizers have severely limited the performance of open-access LLMs.
What's a tokenizer, you might ask? It's an algorithm that breaks down text into smaller pieces, known as tokens. These tokens can be words, parts of words, individual characters, or other units, depending on the tokenizer's design. Tokenization is a crucial step in processing text for LLMs, enabling applications like text classification, question answering, and machine translation.
Tokenizers are vital to the natural language processing (NLP) pipeline, serving a singular goal: converting text into a format that the model can understand. Since models operate on numerical data, tokenizers transform our textual inputs into numbers that the model can process.
All LLMs, whether proprietary like ChatGPT or open-source, interpret the world through numbers. To make sense of text, these models first transform it into a numerical format. A major challenge in NLP is determining the most efficient way to represent languages as numbers.
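To make this concrete, here is a minimal sketch of that round trip, from text to numbers and back, using the Hugging Face transformers library and the openly available GPT-2 tokenizer (both chosen purely for illustration, not because the models discussed here use them):

from transformers import AutoTokenizer

# Load a publicly available tokenizer; any model's tokenizer exposes the same interface.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Predicting the next word with a large language model is easy."
token_ids = tokenizer.encode(text)                    # text -> numbers the model can process
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # the subword pieces behind those numbers

print(token_ids)
print(tokens)
print(tokenizer.decode(token_ids))                    # numbers -> text, recovering the sentence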
One simple approach is to assign a unique number to each letter or symbol. For example, "a" might be assigned the number 1, "b" the number 2, and so on. The model then tries to predict the next number in a sequence. However, this method can turn even a short sentence like "Predicting the next word with a large language model is easy." into a lengthy sequence of 61 numbers. A single error in predicting a letter can throw off all subsequent predictions, making this method both complicated and prone to mistakes.
An alternative strategy assigns each word a unique number. Using this method, the sentence is condensed into a sequence of just 11 numbers, far fewer than the letter-by-letter approach. This makes the task simpler for the model, but it's not without its own challenges. With more than 200,000 words in the English language alone, a model that works across multiple languages may need to differentiate among millions of numbers, significantly increasing the chance of making errors.
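The difference between the two representations is easy to see in a few lines of Python. The toy vocabularies below are built from the sentence itself purely for illustration; real systems use fixed, pre-built vocabularies:

sentence = "Predicting the next word with a large language model is easy."

# Character-level: one number per character, spaces and punctuation included.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(sentence)))}
char_ids = [char_vocab[ch] for ch in sentence]
print(len(char_ids))   # 61 numbers to predict, one per character

# Word-level: one number per whitespace-separated word.
word_vocab = {w: i for i, w in enumerate(dict.fromkeys(sentence.split()))}
word_ids = [word_vocab[w] for w in sentence.split()]
print(len(word_ids))   # 11 numbers, but the vocabulary must cover every possible word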
To address these challenges, researchers have devised a method known as tokenization. Tokenization breaks words down into smaller pieces, or tokens, that are commonly found across various words. For example, the word "predicting" might be split into tokens like "pred," "ict," and "ing." This approach allows the model to predict smaller, more manageable sequences instead of each letter or whole words. As a result, the number of choices the model needs to consider is dramatically reduced—from an entire dictionary to around 2,000 tokens for the English language. This makes the prediction process both more efficient and accurate.
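Here is a hand-rolled sketch of the idea, using a tiny hypothetical vocabulary and greedy longest-match segmentation. Real tokenizers (BPE, SentencePiece) learn their vocabularies from data, but the segmentation principle is similar:

vocab = ["pred", "ict", "ing"]  # hypothetical subword pieces

def tokenize(word, vocab):
    tokens = []
    i = 0
    while i < len(word):
        # Take the longest vocabulary piece that matches at position i;
        # fall back to a single character if nothing matches.
        match = next(
            (piece for piece in sorted(vocab, key=len, reverse=True) if word.startswith(piece, i)),
            word[i],
        )
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("predicting", vocab))   # ['pred', 'ict', 'ing']
print(tokenize("interesting", vocab))  # unmatched spans fall back to single characters, plus 'ing'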
However, there's a significant issue: all the well-known open-access chat LLMs, such as Llama, Llama-2, Mistral, Mixtral, Falcon, Smaug, and others, have tokenizers that are primarily designed for Romance and Germanic languages. These tokenizers are optimized for languages with a large number of short, syllable-like parts and tend to reduce other languages—non-Germanic, non-Romance, and non-Slavic languages such as Greek, Chinese, Hebrew, Arabic, Hindi, Korean, Japanese, etc.—to characters.
In simpler terms, this means that while these LLMs offer support for Romance and Germanic languages at the cutting-edge level of 2024 technology, they handle all other languages as if the technology were still stuck in 2015.
Moreover, while you can enhance a model's performance by training it on more or higher-quality data, or on data tailored to your task, the tokenizer poses a problem that cannot be fixed the same way. There is no way to repair the tokenization of models like Llama-2, Falcon, or Mistral, because the tokenizer is a fixed, deterministic algorithm rather than a trainable component: you cannot train or modify it. The model is also intricately tied to its tokenizer; swapping out the tokenizer of a model like Llama-2 would render the model non-functional.
There have been models with effective multilingual tokenizers, such as Google’s mT5, released in 2020. However, these were "base" models, not specifically trained for conversational tasks.
This situation has effectively limited the practical applications of LLMs primarily to Romance and Germanic languages. For those of us engaged in multilingual LLM projects, this presented a difficult decision. To achieve a high-quality conversational LLM, you had two choices:
1. Fine-tune modern LLMs, such as Llama-2 or Mistral, in a way reminiscent of 2015, utilizing character-based text representation.
2. Start with a base model equipped with a robust tokenizer and train it to be conversational.
Both options demand extensive and highly accurate datasets. For most languages and companies, this made the fine-tuning of modern multilingual conversational LLMs practically out of reach.
Until now.
Gemma
Gemma represents a new generation of lightweight, cutting-edge open models that build upon the research and technology behind the Gemini models. Developed by Google DeepMind and various Google teams, Gemma draws inspiration from Gemini, with its name stemming from the Latin word "gemma," meaning "precious stone." Alongside the release of the model weights, Google also provides tools designed to spark developer innovation, encourage collaboration, and ensure the responsible use of Gemma models.
Here's what you need to know:
Google has unveiled the Gemma model weights in two versions: Gemma 2B and Gemma 7B. Each version comes in pre-trained and instruction-tuned variants.
The newly introduced Responsible Generative AI Toolkit offers guidance and vital tools for developing safer AI applications using Gemma.
Google supports a range of toolchains for inference and supervised fine-tuning (SFT) across all the major frameworks: JAX, PyTorch, and TensorFlow (through native Keras 3.0).
To facilitate easy adoption, Google has made available ready-to-use Colab and Kaggle notebooks, as well as integration with widely-used tools like Hugging Face, MaxText, NVIDIA NeMo, and TensorRT-LLM (a minimal loading sketch via Hugging Face follows this list).
The pre-trained and instruction-tuned Gemma models are designed to run smoothly on various platforms, from your personal laptop to workstations, or on Google Cloud, with straightforward deployment options on Vertex AI and Google Kubernetes Engine (GKE).
Gemma models are optimized for top performance across a range of AI hardware platforms, including NVIDIA GPUs and Google Cloud TPUs.
The terms of use allow for responsible commercial use and distribution by organizations of any size.
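As a quick illustration of the Hugging Face integration mentioned above, here is a minimal sketch for running the instruction-tuned 7B model. The model id google/gemma-7b-it and the need to accept Gemma's terms on the Hub before downloading are assumptions on our part; check the model card for the exact requirements:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"  # assumed Hub id for the instruction-tuned 7B variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # device_map="auto" requires the accelerate package

# The instruction-tuned variant expects a chat-style prompt; the chat template
# shipped with the tokenizer inserts the required special tokens.
messages = [{"role": "user", "content": "Summarize why tokenizers matter for multilingual LLMs."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))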
We previously highlighted mT5, Google's 2020 LLM renowned for its exceptional multilingual tokenizer, capable of supporting all major languages. Now, meet Gemma: equipped with an even more advanced tokenizer. In our latest analysis, we conducted thorough tests on the tokenizers of major LLMs across a variety of languages, and here's how Gemma's tokenizer stands out against the competition:
And here is a direct comparison of tokenization performance between Gemma and Llama-2:
LANGUAGE: English. TEXT: Call me Ishmael. Some years ago — never mind how long precisely — having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
Model: Llama-2-7b-chat-hf. Length: 54
['▁Call', '▁me', '▁I', 'sh', 'ma', 'el', '.', '▁Some', '▁years', '▁ago', '▁—', '▁never', '▁mind', '▁how', '▁long', '▁precisely', '▁—', '▁having', '▁little', '▁or', '▁no', '▁money', '▁in', '▁my', '▁pur', 'se', ',', '▁and', '▁nothing', '▁particular', '▁to', '▁interest', '▁me', '▁on', '▁shore', ',', '▁I', '▁thought', '▁I', '▁would', '▁sail', '▁about', '▁a', '▁little', '▁and', '▁see', '▁the', '▁wat', 'ery', '▁part', '▁of', '▁the', '▁world', '.']
Model: gemma-7b-it. Length: 50
['Call', '▁me', '▁Ish', 'mael', '.', '▁Some', '▁years', '▁ago', '▁—', '▁never', '▁mind', '▁how', '▁long', '▁precisely', '▁—', '▁having', '▁little', '▁or', '▁no', '▁money', '▁in', '▁my', '▁purse', ',', '▁and', '▁nothing', '▁particular', '▁to', '▁interest', '▁me', '▁on', '▁shore', ',', '▁I', '▁thought', '▁I', '▁would', '▁sail', '▁about', '▁a', '▁little', '▁and', '▁see', '▁the', '▁watery', '▁part', '▁of', '▁the', '▁world', '.']
LANGUAGE: German. TEXT: Da steh ich nun, ich armer Tor! Und bin so klug als wie zuvor.
Model: Llama-2-7b-chat-hf. Length: 21
['▁Da', '▁ste', 'h', '▁ich', '▁nun', ',', '▁ich', '▁ar', 'mer', '▁Tor', '!', '▁Und', '▁bin', '▁so', '▁kl', 'ug', '▁als', '▁wie', '▁zu', 'vor', '.']
Model: gemma-7b-it. Length: 20
['Da', '▁ste', 'h', '▁ich', '▁nun', ',', '▁ich', '▁ar', 'mer', '▁Tor', '!', '▁Und', '▁bin', '▁so', '▁k', 'lug', '▁als', '▁wie', '▁zuvor', '.']
LANGUAGE: French. TEXT: On ne voit bien qu'avec le cœur. L'essentiel est invisible pour les yeux.
Model: Llama-2-7b-chat-hf. Length: 23
['▁On', '▁ne', '▁voit', '▁bien', '▁qu', "'", 'ave', 'c', '▁le', '▁c', 'œur', '.', '▁L', "'", 'ess', 'ent', 'iel', '▁est', '▁invisible', '▁pour', '▁les', '▁yeux', '.']
Model: gemma-7b-it. Length: 19
['On', '▁ne', '▁voit', '▁bien', '▁qu', "'", 'avec', '▁le', '▁cœur', '.', '▁L', "'", 'essentiel', '▁est', '▁invisible', '▁pour', '▁les', '▁yeux', '.']
LANGUAGE: Spanish. TEXT: En un lugar de la Mancha, de cuyo nombre no quiero acordarme...
Model: Llama-2-7b-chat-hf. Length: 20
['▁En', '▁un', '▁lugar', '▁de', '▁la', '▁Man', 'cha', ',', '▁de', '▁cu', 'yo', '▁nombre', '▁no', '▁qu', 'iero', '▁ac', 'ord', 'ar', 'me', '...']
Model: gemma-7b-it. Length: 15
['En', '▁un', '▁lugar', '▁de', '▁la', '▁Mancha', ',', '▁de', '▁cuyo', '▁nombre', '▁no', '▁quiero', '▁acord', 'arme', '...']
LANGUAGE: Polish. TEXT: Litwo, ojczyzno moja! Ty jesteś jak zdrowie; Ile cię trzeba cenić, ten tylko się dowie, kto cię stracił.
Model: Llama-2-7b-chat-hf. Length: 47
['▁Lit', 'wo', ',', '▁o', 'j', 'czy', 'z', 'no', '▁mo', 'ja', '!', '▁Ty', '▁j', 'este', 'ś', '▁jak', '▁zd', 'row', 'ie', ';', '▁I', 'le', '▁ci', 'ę', '▁tr', 'z', 'eb', 'a', '▁c', 'eni', 'ć', ',', '▁ten', '▁tyl', 'ko', '▁się', '▁d', 'owie', ',', '▁k', 'to', '▁ci', 'ę', '▁stra', 'ci', 'ł', '.']
Model: gemma-7b-it. Length: 32
['Lit', 'wo', ',', '▁oj', 'czy', 'zno', '▁moja', '!', '▁Ty', '▁jesteś', '▁jak', '▁zdrow', 'ie', ';', '▁Ile', '▁cię', '▁trzeba', '▁c', 'eni', 'ć', ',', '▁ten', '▁tylko', '▁się', '▁d', 'owie', ',', '▁kto', '▁cię', '▁stra', 'cił', '.']
LANGUAGE: Ukrainian. TEXT: Як умру, то поховайте мене на могилі серед степу широкого на Вкраїні милій.
Model: Llama-2-7b-chat-hf. Length: 29
['▁Я', 'к', '▁у', 'м', 'ру', ',', '▁то', '▁по', 'х', 'ова', 'й', 'те', '▁мене', '▁на', '▁мо', 'ги', 'лі', '▁серед', '▁сте', 'пу', '▁широ', 'кого', '▁на', '▁В', 'краї', 'ні', '▁ми', 'лій', '.']
Model: gemma-7b-it. Length: 24
['Як', '▁ум', 'ру', ',', '▁то', '▁похо', 'вайте', '▁мене', '▁на', '▁моги', 'лі', '▁серед', '▁сте', 'пу', '▁широ', 'кого', '▁на', '▁В', 'кра', 'ї', 'ні', '▁ми', 'лій', '.']
LANGUAGE: Greek. TEXT: ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ
Model: Llama-2-7b-chat-hf. Length: 53
['▁', 'ἄ', 'ν', 'δ', 'ρ', 'α', '▁', 'μ', 'ο', 'ι', '▁', 'ἔ', 'ν', 'ν', 'ε', 'π', 'ε', ',', '▁', 'Μ', 'ο', 'ῦ', 'σ', 'α', ',', '▁', 'π', 'ο', 'λ', 'ύ', 'τ', 'ρ', 'ο', 'π', 'ο', 'ν', ',', '▁', '<0xE1>', '<0xBD>', '<0x83>', 'ς', '▁', 'μ', 'ά', 'λ', 'α', '▁', 'π', 'ο', 'λ', 'λ', 'ὰ']
Model: gemma-7b-it. Length: 28
['ἄ', 'ν', 'δ', 'ρα', '▁μ', 'οι', '▁ἔ', 'ν', 'νε', 'πε', ',', '▁Μ', 'ο', 'ῦ', 'σα', ',', '▁πολύ', 'τρο', 'πον', ',', '▁', 'ὃ', 'ς', '▁μά', 'λα', '▁πολ', 'λ', 'ὰ']
LANGUAGE: Hebrew. TEXT: בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ
Model: Llama-2-7b-chat-hf. Length: 59
['▁', 'ב', 'ְ', 'ּ', 'ר', 'ֵ', 'א', 'ש', 'ִ', 'ׁ', 'י', 'ת', '▁', 'ב', 'ָ', 'ּ', 'ר', 'ָ', 'א', '▁', 'א', '<0xD6>', '<0xB1>', 'ל', 'ֹ', 'ה', 'ִ', 'י', 'ם', '▁', 'א', 'ֵ', 'ת', '▁', 'ה', 'ַ', 'ש', 'ָ', 'ּ', 'ׁ', 'מ', 'ַ', 'י', 'ִ', 'ם', '▁', 'ו', 'ְ', 'א', 'ֵ', 'ת', '▁', 'ה', 'ָ', 'א', 'ָ', 'ר', 'ֶ', 'ץ']
Model: gemma-7b-it. Length: 39
['ב', 'ְּ', 'ר', 'ֵ', 'א', 'ש', 'ִ', 'ׁ', 'ית', '▁ב', 'ָּ', 'ר', 'ָא', '▁א', 'ֱ', 'ל', 'ֹ', 'ה', 'ִים', '▁א', 'ֵ', 'ת', '▁הַ', 'ש', 'ָּ', 'ׁ', 'מ', 'ַי', 'ִ', 'ם', '▁וְ', 'א', 'ֵ', 'ת', '▁הָ', 'אָ', 'ר', 'ֶ', 'ץ']
LANGUAGE: Arabic. TEXT: حدثني أبي، عن جدي، عن أجداده
Model: Llama-2-7b-chat-hf. Length: 29
['▁', 'ح', 'د', 'ث', 'ن', 'ي', '▁', 'أ', 'ب', 'ي', '،', '▁', 'ع', 'ن', '▁', 'ج', 'د', 'ي', '،', '▁', 'ع', 'ن', '▁', 'أ', 'ج', 'د', 'ا', 'د', 'ه']
Model: gemma-7b-it. Length: 12
['حدث', 'ني', '▁أبي', '،', '▁عن', '▁ج', 'دي', '،', '▁عن', '▁أ', 'جد', 'اده']
LANGUAGE: Hindi. TEXT: कर्मण्येवाधिकारस्ते मा फलेषु कदाचन। मा कर्मफलहेतुर्भूर्मा ते संगोऽस्त्वकर्मणि।
Model: Llama-2-7b-chat-hf. Length: 85
['▁', 'क', 'र', '्', 'म', 'ण', '्', 'य', 'े', 'व', 'ा', 'ध', 'ि', 'क', 'ा', 'र', 'स', '्', 'त', 'े', '▁', 'म', 'ा', '▁', '<0xE0>', '<0xA4>', '<0xAB>', 'ल', 'े', 'ष', 'ु', '▁', 'क', 'द', 'ा', 'च', 'न', '।', '▁', 'म', 'ा', '▁', 'क', 'र', '्', 'म', '<0xE0>', '<0xA4>', '<0xAB>', 'ल', 'ह', 'े', 'त', 'ु', 'र', '्', 'भ', 'ू', 'र', '्', 'म', 'ा', '▁', 'त', 'े', '▁', 'स', 'ं', 'ग', 'ो', '<0xE0>', '<0xA4>', '<0xBD>', 'स', '्', 'त', '्', 'व', 'क', 'र', '्', 'म', 'ण', 'ि', '।']
Model: gemma-7b-it. Length: 40
['क', 'र्म', 'ण', '्य', 'े', 'वा', 'धिक', 'ार', 'स्त', 'े', '▁मा', '▁फ', 'ले', 'ष', 'ु', '▁क', 'दा', 'च', 'न', '।', '▁मा', '▁क', 'र्म', 'फल', 'हे', 'तु', 'र्भ', 'ू', 'र्', 'मा', '▁ते', '▁संग', 'ो', 'ऽ', 'स्त', '्व', 'क', 'र्म', 'णि', '।']
LANGUAGE: Chinese. TEXT: 道可道,非常道。名可名,非常名。
Model: Llama-2-7b-chat-hf. Length: 17
['▁', '道', '可', '道', ',', '非', '常', '道', '。', '名', '可', '名', ',', '非', '常', '名', '。']
Model: gemma-7b-it. Length: 14
['道', '可', '道', ',', '非常', '道', '。', '名', '可', '名', ',', '非常', '名', '。']
LANGUAGE: Japanese. TEXT: 私は、その男の写真を三葉、見たことがある。
Model: Llama-2-7b-chat-hf. Length: 24
['▁', '<0xE7>', '<0xA7>', '<0x81>', 'は', '、', 'そ', 'の', '男', 'の', '写', '真', 'を', '三', '葉', '、', '見', 'た', 'こ', 'と', 'が', 'あ', 'る', '。']
Model: gemma-7b-it. Length: 12
['私は', '、', 'その', '男', 'の写真', 'を', '三', '葉', '、', '見た', 'ことがある', '。']
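The per-language comparison above can be reproduced with a short script. Here is a sketch using the Hugging Face transformers library; the model ids are the ones we assume on the Hub, and both repositories are gated, so you need to have accepted their respective licenses:

from transformers import AutoTokenizer

samples = {
    "English": "Call me Ishmael. Some years ago — never mind how long precisely — having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.",
    "Greek": "ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ",
    "Japanese": "私は、その男の写真を三葉、見たことがある。",
}

for model_id in ["meta-llama/Llama-2-7b-chat-hf", "google/gemma-7b-it"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    for language, text in samples.items():
        tokens = tokenizer.tokenize(text)  # token strings, without special tokens
        print(f"Model: {model_id}. Language: {language}. Length: {len(tokens)}")
        print(tokens)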
At first glance, Gemma might seem like just another LLM – similar to the many others released by the open-source community each month. However, it's important not to be misled; this is huge news for the LLM field. With Gemma's introduction, companies worldwide now have the ability to fine-tune open-access LLMs for a variety of applications. This includes creating conversational translators, multilingual writing assistants, and Retrieval-Augmented Generation (RAG) systems. This release effectively bridges a significant, albeit previously overlooked, gap in the LLM landscape. It marks the arrival of the first truly multilingual chat LLM, a development we can all celebrate.
Building your custom LLM could enhance data security and compliance and enable an AI competitive advantage for your product. You can check our other posts to get an extensive explanation of what the network effect is and how AI enables it, how to build an AI competitive advantage for your company, what culture helps you build the right AI products, what to avoid in your AI strategy and execution, and more.
If you need help in building an AI product for your business, look no further. Our team of AI technology consultants and engineers has years of experience in helping technology companies like yours build sustainable competitive advantages through AI technology. From data collection to algorithm development, we can help you stay ahead of the competition and secure your market share for years to come.
Contact us today to learn more about our AI technology consulting offering.
If you want to keep posted on how to build a sustainable competitive advantage with AI technologies, please subscribe to our blog post updates below.