March 17, 2022

Direct Line with Saurabh Tiwary: How Do Large Language Models Work?



Working with M12 means access to some of the amazing minds at Microsoft. This has given us the privilege of naming thought leaders like Saurabh Tiwary, Corporate Vice President for Microsoft Turing, as scientific advisors to M12.
I recently had the chance to sit down with Saurabh to discuss his views on large language models and implications for natural language processing. In this second part of our three-part series, we focused on how large language models work.
Don’t miss the first part of our series on why large language models matter, or our conclusion with what the Microsoft Turing team sees as next for language learning models.  
Do you believe these models are learning systematically, or, as some schools of thought say, are they more like giant databases with complex query languages?

Definitely. There are different schools of thought relating to what these large language models are learning. I think from a practical perspective, what these models can do is a much more interesting question than what these models learn. More specifically, in the machine learning space we have been taught that memorization is relatively uninteresting compared to generalization. However, if we think about search engines like Google or Bing, as end users, do we really care whether in the back end they are memorizing or generalizing systems? For example, if we had a new system that could “memorize” all the right search results for all the past and future queries in the world, that system would be highly valuable compared to the current generation of search engines.

During their training phase, these large language models are able to see almost all of the digitized textual content available on the internet, and they have amazing interpolation powers. Hence, even though they technically might be bucketed as giant memorization machines, they are extremely valuable from a practical perspective because they can access the content that they have memorized.

On a more philosophical note, we often argue that humans are creative and different from such machines; but a large fraction of what we do and what we say are interpolations and extrapolations based on what we have read, observed, or experienced in the past. For example, how I am saying things, or the style of English that I am using to convey my thoughts in this response, is itself very likely an interpolation of past writings and conversations that I have encountered. There are definitely moments of creativity, but I would say they are not as frequent as you may expect 🙂. As the models become more powerful through their size and ingestion of mind-boggling quantities of data, they have started demonstrating capabilities that are both impressive and practical in terms of enabling people to do more.

Related question on what representations these models are learning – will they always be susceptible to perturbations in inputs that humans aren’t and thus adversarial attacks?

Perturbations or adversarial attacks are very popular mechanisms to make the models fail. However, I do not think these are major roadblocks. In deep learning, we have a general framework to identify loss functions and to optimize against them. If the goal is to limit the unintended outcomes from adversarial attacks, this can be captured as part of the training process and the models could be made more robust. Many production systems already incorporate such approaches. Now, another school of thought suggests that such attacks will be like the spam scenario, where the adversary continually invents new methods which pass through the gates of the trained system. Again, I am a bit more optimistic here. Just as with the spam problem, the systems would become generic and strong enough to limit such adversarial cases to the realm of one-off false positives. For example, we don’t fret on a daily basis about spam or furiously check our spam folders every day when we work with email. Deep learning models have this strong power of generality, which further helps in this domain beyond just direct injection of adversarial training data.
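
To make the idea of capturing adversarial robustness "as part of the training process" concrete, here is a minimal sketch. It is not the method used in any production system mentioned here; it assumes a toy logistic-regression classifier, a fast-gradient-sign (FGSM-style) perturbation as the attack, and hypothetical names and parameters (`fgsm`, `train`, `eps`, `adv_weight`) chosen only for illustration. The training loss is a weighted sum of the loss on the clean input and the loss on its perturbed copy.

```python
import math
from typing import List, Tuple

Vector = List[float]
Example = Tuple[Vector, int]


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w: Vector, b: float, x: Vector) -> float:
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def log_loss(w: Vector, b: float, x: Vector, y: int) -> float:
    p = predict(w, b, x)
    return -math.log(p if y == 1 else 1.0 - p)


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm(w: Vector, b: float, x: Vector, y: int, eps: float) -> Vector:
    """Shift each coordinate by eps in the direction that increases the loss."""
    p = predict(w, b, x)
    # For the logistic loss, d(loss)/dx_i = (p - y) * w_i.
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]


def train(data: List[Example], eps: float, adv_weight: float = 0.5,
          epochs: int = 200, lr: float = 0.1) -> Tuple[Vector, float]:
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = fgsm(w, b, x, y, eps)  # attack the current model
            grad_w = [0.0] * len(w)
            grad_b = 0.0
            # Objective: (1 - adv_weight) * clean loss + adv_weight * adversarial loss.
            for point, weight in ((x, 1.0 - adv_weight), (x_adv, adv_weight)):
                err = predict(w, b, point) - y
                grad_w = [g + weight * err * pi for g, pi in zip(grad_w, point)]
                grad_b += weight * err
            w = [wi - lr * g for wi, g in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b


def accuracy(w: Vector, b: float, data: List[Example], eps: float = 0.0) -> float:
    hits = 0
    for x, y in data:
        probe = fgsm(w, b, x, y, eps) if eps > 0 else x
        hits += int((predict(w, b, probe) >= 0.5) == (y == 1))
    return hits / len(data)


# Toy, hand-made data: two well-separated clusters (illustrative only).
POSITIVES = [[1.5, 1.5], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [2.5, 1.5]]
DATA = [(p, 1) for p in POSITIVES] + [([-a, -c], 0) for a, c in POSITIVES]

w, b = train(DATA, eps=0.5)
print("clean accuracy:", accuracy(w, b, DATA))
print("accuracy under attack:", accuracy(w, b, DATA, eps=0.5))
```

Setting `adv_weight` to 0 recovers ordinary training, so the same loop can be used to compare a plain model against one trained with the adversarial term.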

One practical challenge we might face is that as systems become better, our expectations grow as well. Let’s take the spam example. If tomorrow our email systems start leaking spam at a high rate into the primary mailbox, people will be infuriated, and if this persists, they might even start considering changing providers. This is a natural risk for large language models too. As they get more widely deployed and start helping us more in our productivity and social tasks, any failures, through adversarial attacks or otherwise, would get highlighted more.

Another tidbit is that spam classifiers would likely improve even further as they start incorporating these large language models. The big challenge in the space is inference cost since spammers can generate emails at very high volume and you may want predictions at extremely low dollar amounts per email. 

Related question – is there any systematic way we can test for systemic generalization ability of LLMs? Are the established NLP benchmarks good enough?

This is one area where the advancements in deep learning seem to be outpacing agreed-upon standards. Due to decades of interest in artificial intelligence, the academic community had developed a good set of benchmarks to measure and understand the progress of deep learning models. However, the jumps in model quality have come in such a short amount of time that most of these benchmarks have become saturated. There are some popular leaderboards, but models have made so much progress that moving to the top of a leaderboard means one needs to do custom things to fit the nuances of the dataset that the leaderboard is using as a surrogate for model quality. The analogy one can draw is with the 100-meter sprint at the Olympics. The intent of identifying the fastest man is that the winner would be a great sprinter. However, to win the gold medal, one has to focus on a lot of other things apart from being a good runner, like getting off the blocks quickly, pushing the neck and body forward near the finish line, adjusting to wind speed, etc. Similarly, for some of these leaderboards, models sometimes have to fit to the noise in the data to go to the top. In general, Goodhart’s law kicks in: when a measure becomes a target, it ceases to be a good measure.

We are seeing new benchmarks being produced at regular intervals. But very often they are overwhelmed by model advancements. There have been instances where human parity is achieved even before the dataset paper is officially published in a conference. In general, this is one area where we need serious investment in meaningful benchmarks so that progress can be measured and steered in the right direction. Some of the practical benchmarks that we use are real-world production datasets. For example, in question answering, real production datasets show challenges like the various flavors of spelling errors that users type, implicit and explicit questions for which they are seeking answers, missing punctuation, grammatical correctness that is not guaranteed, code mixing, etc. One could artificially generate such sets, similar to what SQuAD has done. But production datasets provide natural distributions across these challenging scenarios, and if a new model improves the related metrics, there is a high likelihood of real-world impact.
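
As a rough illustration of what "artificially generating" such challenge sets could look like, the sketch below perturbs clean questions with three of the noise types mentioned above: spelling errors, missing punctuation, and lost capitalization. This is not the procedure used by SQuAD or by any production team; the function names and the `typo_rate` parameter are assumptions made for this example.

```python
import random
import string


def drop_punctuation(text: str) -> str:
    """Remove ASCII punctuation, as users often omit it in queries."""
    return text.translate(str.maketrans("", "", string.punctuation))


def swap_adjacent(word: str, rng: random.Random) -> str:
    """Simulate a typo by transposing two neighbouring interior characters."""
    if len(word) < 4:
        return word  # too short to swap without touching the first or last letter
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def perturb_query(query: str, rng: random.Random, typo_rate: float = 0.3) -> str:
    """Lowercase, strip punctuation, and inject typos into a fraction of words."""
    words = drop_punctuation(query).lower().split()
    noisy = [swap_adjacent(w, rng) if rng.random() < typo_rate else w for w in words]
    return " ".join(noisy)


# Hypothetical clean questions, invented for illustration.
CLEAN = ["Who wrote Hamlet?", "What is the capital of Australia?"]

rng = random.Random(0)  # fixed seed so the generated set is reproducible
for question in CLEAN:
    print(question, "->", perturb_query(question, rng))
```

Note that uniform random noise like this does not reproduce the natural distribution of errors seen in production traffic, which is exactly why real production datasets remain the more trustworthy benchmark.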

We have seen examples of a technique called priming, where you give an LLM a bunch of examples of a task you want it to learn and then ask it to start generating examples of task completion. This is supposed to give us a way to probe what syntactic structure an LLM has actually learned. What are the limitations of this approach, and what new insights have you and your team gathered into why it works and when it won’t?

Few-shot set-ups, also known as priming, for large-scale generative models have become an interesting way to leverage these models. For other models, both classical machine learning based as well as deep learning based, it took a non-trivial effort to make the model usable. One had to custom train or fine-tune the model for the application scenario. However, for the pre-trained generative models, we can give a few examples to “encourage” the model to express the behavior that we want from it, and the models do reasonably well on the said task. This is akin to the no-code paradigm and speeds up the prototyping process. For example, if we wanted a tonal rewrite for user-generated content, let’s say we identify a YouTube comment as being harsh and we want to soften the tone to a more acceptable one, we could give a few examples of input-output pairs and then let the model rewrite the comment in question. I think the development is similar to WYSIWYG editors, whereby users don’t need detailed typographic knowledge to create printable content. A non-sophisticated user could directly interact with the system and benefit from it. The same trend is showing up with this priming behavior. This, in my view, will go a long way towards democratizing cutting-edge AI for the masses.
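
The tone-rewrite scenario can be sketched as a prompt-building step. The code below only assembles the text that would be sent to a generative model; it makes no model call, and the labels, instruction wording, and example comments are all hypothetical, chosen for illustration rather than taken from the MT-NLG set-up.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    query: str,
    input_label: str = "Original",
    output_label: str = "Rewrite",
) -> str:
    """Assemble a few-shot ("priming") prompt from input-output pairs.

    The labels and layout are illustrative assumptions; any consistent
    pattern that the model can continue would serve the same purpose.
    """
    if not examples:
        raise ValueError("few-shot priming needs at least one example pair")

    blocks = [instruction.strip()]
    for source, target in examples:
        blocks.append(
            f"{input_label}: {source.strip()}\n{output_label}: {target.strip()}"
        )
    # End with the new input and an empty output slot for the model to fill.
    blocks.append(f"{input_label}: {query.strip()}\n{output_label}:")
    return "\n\n".join(blocks)


# Hypothetical tone-softening pairs, invented for illustration.
EXAMPLES = [
    ("This video is garbage.", "This video didn't really work for me."),
    ("You clearly have no idea what you're talking about.",
     "I think some of the points here may be mistaken."),
]

prompt = build_few_shot_prompt(
    "Soften the tone of each comment while keeping its meaning.",
    EXAMPLES,
    "Worst tutorial I have ever seen.",
)
print(prompt)
```

The model is expected to continue the text after the final `Rewrite:` label, which is why the prompt ends with an open slot rather than a question. Swapping in a different task only requires changing the instruction and the example pairs, not retraining anything.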

One thing I would like to call out is that few-shot set-ups don’t show the best that the model could perform. If we were to fine-tune the model (using commonly known methods in this space), the quality of model output would improve compared to priming. However, priming helps with quick evaluation of whether a particular application of the model makes sense, and it speeds up the innovation process through fast prototyping.

With a priming set-up for our MT-NLG model, we found that the model could answer questions from the TV show Jeopardy out of the box at very high accuracy. In the past, when IBM did the Jeopardy challenge, they had a relatively large, dedicated team working on this grand challenge, with different people building different parts of the system. However, with the MT-NLG 530B model, one of our team members just tried the idea of phrasing some question-answer pairs from the Jeopardy TV show and then started asking questions from new episodes. We observed that the model did impressively well. We would not have discovered this capability if not for the priming feature. We are learning about other capabilities of the model every few days. For example, the model can not only solve riddles but also explain, for each line of the riddle, why its answer makes sense, expressing capabilities that we might label as reasoning and comprehension.