Many observers are puzzled by the fact that AI-driven chatbots can be smart and dumb at the same time. On the one hand, they are sometimes capable of solving very difficult medical diagnostic puzzles. On the other, they sometimes struggle with straightforward mathematical calculations that old-school computers and calculators can easily perform. This dichotomy is the result of a basic difference between the two types of technology: calculators and computers are deterministic by design, while chatbots that rely on generative AI are probabilistic. In plain English, the former rely on a set of predetermined rules (5 x 5 always equals 25; it doesn't probably equal 25), while chatbots make predictions based on probabilities, and those probabilities can vary depending on what is contained in the data set a chatbot uses and which sequences of words and sentences in that data set it chooses to focus on. That explains why you can occasionally get very different answers to the exact same question when you query the bot. A closer look at the technology behind large language models (LLMs) like ChatGPT can shed light on why these digital tools make these mistakes.
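The contrast can be sketched in a few lines of Python. This is a toy illustration, not how any real chatbot is built: a multiplication function returns the same answer every time, while a made-up next-word predictor samples from probabilities, so repeated calls can disagree.

```python
import random

def deterministic_multiply(a, b):
    # A calculator's rule: the same inputs always give the same output.
    return a * b

def probabilistic_next_word(context, counts):
    # A toy language model: sample the next word according to
    # frequencies seen in training data, so repeated calls can differ.
    words, freqs = zip(*counts[context].items())
    return random.choices(words, weights=freqs)[0]

# Hypothetical word-frequency data for illustration only.
counts = {"where are we": {"going": 7, "at": 3}}

print(deterministic_multiply(5, 5))                     # always 25
print(probabilistic_next_word("where are we", counts))  # usually "going", sometimes "at"
```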
As we pointed out in part 1 of this series, if you start with a simple sentence like "Where are we going," leave out the word "going," and then ask a computer to predict that missing word, the probability looks like this:
P(S), the probability of the completed sentence, equals P(where) x P(are | where) x P(we | where, are) x P(going | where, are, we). If the training data also included a different sentence, "Where are we at," the same calculation would apply to it. And if the data set consisted of only these two sentences, the probability of correctly predicting the missing word (going or at) would be 0.5. An LLM learns such probabilities on a massive scale. While a predictive model used to classify a skin lesion as melanoma or a normal mole will typically use a data set of hundreds or thousands of samples, LLMs use billions of data points. Some speculate that GPT-4, from OpenAI, used a trillion samples.
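Here is a minimal Python sketch of that calculation. Assuming a corpus of just the two sentences above, the conditional probability of "going" after "where are we" can be estimated by counting, and it comes out to 0.5:

```python
corpus = ["where are we going", "where are we at"]

def conditional_prob(word, context, sentences):
    # Estimate P(word | context) by counting how often `word`
    # continues `context` among sentences sharing that prefix.
    ctx = context.split()
    n = len(ctx)
    continuations = [s.split()[n] for s in sentences
                     if s.split()[:n] == ctx and len(s.split()) > n]
    if not continuations:
        return 0.0
    return continuations.count(word) / len(continuations)

# Both corpus sentences begin "where are we", but only one
# continues with "going", so the estimate is 0.5.
print(conditional_prob("going", "where are we", corpus))  # 0.5
```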
To understand how these chatbots work, it helps to deconstruct the term. The "chat" part is pretty obvious: these models talk to users with words, images, or even computer code. GPT stands for generative pre-trained transformer. These tools are generative in the sense that they generate or create new information, which can be accurate, inaccurate, or a complete fabrication. They are pre-trained because the model deliberately masks some of the words in the corpus of training data. The partially masked data is fed into a complex transformer program that uses various encoders and decoders; the chatbot is then told to predict the missing words, i.e., the data that had been masked.
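The masking step can be sketched in Python. The function name and masking rate here are illustrative, not taken from any real training pipeline: some words in a training sentence are hidden, and the hidden words become the targets the model must learn to predict.

```python
import random

def mask_words(sentence, mask_rate=0.3, seed=0):
    # Pre-training sketch: randomly hide some words in a training
    # sentence; the model's job is to predict what was hidden.
    rng = random.Random(seed)
    tokens = sentence.split()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return " ".join(masked), targets

masked, targets = mask_words("where are we going today")
print(masked)   # "where are we [MASK] today"
print(targets)  # {3: 'going'} — the answer the model must recover
```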
LLMs rely on transformers, a type of neural network, and on a technological innovation called self-attention. One of the best ways to understand self-attention is with an example of how LLMs do language translation. The original article that put transformers in the spotlight was written by Google and University of Toronto researchers. Entitled "Attention Is All You Need," it used two test sentences to show how accurately an LLM could translate English to German and English to French. The latter task looked like this:
English: “The economic situation in Europe remains challenging, but there are signs of recovery.”
French: “La situation économique en Europe reste difficile, mais il y a des signes de reprise.”
If you were to translate each English word into French and line the words up in the same order as they appear in the English sentence, the result wouldn't make sense, because French doesn't position parts of speech the same way English does. There are neural networks that can handle this problem by switching around the word order to find the sequence that makes sense in the target language. Recurrent neural networks (RNNs) can do that to a limited extent. But Dale Markowitz at Google has pointed out that RNNs can't handle really large sequences of text, such as essays, and they are slow to train because they can't be "parallelized." In plain English, that means their training can't be spread across the many graphics processing units, the GPUs that are stacked side by side in a computer.
LLMs that use transformer technology can be parallelized and thus can be trained on massive amounts of data, often many terabytes. The technology relies on positional encoding and the self-attention mechanism. RNNs look at words sequentially, but transformers assign a number to each of the words in the text being analyzed. In the English sentence above, "The" would be tagged 1, "economic" 2, "situation" 3, and so on. This is especially useful if you have an entire essay with hundreds of words. The neural network then learns how to interpret this encoding while it is being trained on the millions of sentences in its data set.
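The positional encoding described in the original paper can be sketched in a few lines of Python. In this simplified version, even-numbered dimensions of each position's vector use a sine and odd-numbered ones a cosine, at wavelengths that vary with the dimension index:

```python
import math

def positional_encoding(position, d_model):
    # Sinusoidal positional encoding from "Attention Is All You Need":
    # even dimensions use sine, odd dimensions use cosine, with
    # wavelengths that grow geometrically across dimensions.
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Each word in "The economic situation ..." gets a distinct vector
# based on its position (0, 1, 2, ...), so word order survives even
# when all positions are processed in parallel.
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))
```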
Transformers also know how to pay attention to certain words by analyzing thousands or millions of English/French sentence pairs. That teaches them the rules of grammar, usage, gender, and so on that are specific to each language.
Using the self-attention mechanism, these models gain a deeper understanding of how language works. They can pick up on synonyms, grasp verb tense, and so on. This process involves having the transformer turn its attention inward, looking at the input text that it is required to interpret. As Markowitz points out, self-attention allows the model to “understand a word by looking at the context of the words around it.”
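A stripped-down Python sketch of self-attention, omitting the learned query, key, and value weight matrices a real transformer would use: each word's new representation is a weighted average of every word vector in the sentence, with weights based on similarity. That is how a word's representation comes to reflect the context around it.

```python
import math

def softmax(xs):
    # Convert raw similarity scores into weights that sum to 1.
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    # Scaled dot-product self-attention (toy version): each word's
    # output is a weighted average of all word vectors, weighted by
    # how similar each is to it.
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, vectors))
                    for j in range(d)])
    return out

# Three toy word embeddings; attention blends each with its context.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(self_attention(emb))
```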
Of course, this rudimentary explanation of transformers glosses over many important features of the technology. The architecture for this innovation was graphically represented in a figure in the original Google paper, and there are several YouTube tutorials to help you decipher its meaning.
While the technology behind the curtain may be quite complex, one take home message remains the same: LLMs are math, not magic.
This piece was written by John Halamka, MD, President, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform. To view their blog, click here.