March 17, 2022

Direct Line with Saurabh Tiwary: Why Large Language Models Matter



Working with M12 means access to some of the amazing minds at Microsoft. This has given us the privilege of naming thought leaders like Saurabh Tiwary, Corporate Vice President for Microsoft Turing, as a scientific advisor to M12.
I recently had the chance to sit down with Saurabh to discuss his views on large language models and their implications for natural language processing. It was thought-provoking to hear Saurabh’s perspective on why they matter, how they work, and what Project Turing sees as lying ahead.
We’ve split the conversation into three parts, starting with why large language models matter, then how they work, then what the Microsoft Turing team sees as next for large language models.
Let’s start with Project Turing at Microsoft. Can you tell us a bit about how this effort came about and the strategic relevance of it to Microsoft?

Our goal at Project Turing is to realize Microsoft’s mission to empower people and organizations everywhere by helping them choose Microsoft for its AI. We approach this mission in two ways: 
First, as a center of AI excellence within Microsoft, we focus on building world-class AI for Natural Language Processing (NLP) and related technologies. These AI models are broadly known as “large language models,” and they form the foundation of most AI technologies in NLP these days. These foundational models fall into two broad categories: language representation models and language generation models. Within each category, a model can either support a single language (primarily English) or be universal, supporting 100+ languages.

Second, we are democratizing access to the capabilities of these models within and outside of Microsoft. We provide AI developers with direct access to these AI models through an Azure private preview program, and we work closely with some of the top academics in the NL community through our Microsoft Turing Academic Program (MS-TAP). 

To truly realize our mission, we must ensure that the capabilities of these models enrich and enhance the experiences of billions of Microsoft customers worldwide. With that challenge in mind, we build end-to-end solutions to improve experiences and services across a broad suite of Microsoft products. This includes experiences like text prediction, which is available in Microsoft Word, Teams, and Outlook; the question answering system on Microsoft Bing; semantic search capabilities for SharePoint; AI-powered experiences in the Microsoft Edge browser; the machine translation service and QnAMaker for Azure Cognitive Services; and more.

Large language models (LLMs) are generating a ton of buzz. What is your view on where LLMs will excel and, perhaps, where they will hit limitations?

In the natural language processing space, when people refer to large language models, they are actually referring to two distinct classes of models: representation models and generative models.

For most practical purposes, BERT, with its use of transformers for NLP, triggered the deep learning revolution for language. BERT and similar models, such as XLM-R from Facebook or T-NLR and T-ULR from Microsoft Turing, belong to the class of language representation models. At a very high level, representation models project language into a vector space, which gives any string of text a good representation that computers can reason over. More interestingly, these models now support not just English but 100+ languages, so the reasoning can happen across all of these languages. Most industrial applications need good language understanding, and hence these representation models are the ones heavily deployed in production scenarios, serving billions of customers every day. They help power capabilities like search, classification, sentiment analysis, intent detection, and entity detection. As an example, for sentiment analysis, the representation models project words, phrases, and sentences into the vector space. Using these projections and a few labeled examples, they can generalize when a new phrase or sentence is input. Based on the neighborhood in this vector space, they can determine whether the new sentence is “semantically similar” to a previously seen sentence and hence predict its sentiment.
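The neighborhood idea in the sentiment example can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `embed` function below is a hypothetical bag-of-words stand-in, not BERT or any Turing model, and the labeled sentences are made up for the example. A real representation model would produce dense vectors that capture meaning far better than word counts.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for a representation model.

    Maps text to a sparse word-count vector. A real model would return
    a dense vector that captures meaning, not just shared words.
    """
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# A few labeled examples (invented for this sketch).
LABELED = [
    ("i love this product", "positive"),
    ("what a great experience", "positive"),
    ("i hate this product", "negative"),
    ("what a terrible experience", "negative"),
]

def predict_sentiment(sentence):
    """Return the label of the nearest labeled example in the vector space."""
    query = embed(sentence)
    _, label = max(LABELED, key=lambda example: cosine(query, embed(example[0])))
    return label

print(predict_sentiment("i really love it"))
print(predict_sentiment("a terrible day"))
```

The design choice here is a simple nearest-neighbor lookup: the new sentence inherits the label of whichever previously seen sentence lies closest to it. Production systems typically train a classifier on top of the vectors instead, but the reliance on proximity in the vector space is the same.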

However, there is another class of language models that are generative in nature: given a prefix, they try to predict the next word or phrase. Even though this is a simple strategy, it can be repeated. In the next turn, we feed the prefix plus the generated word back in as the new prefix and ask the model to generate the next word. This strategy can be used to create free-form text, like writing a story or helping write emails in Microsoft Outlook or documents in Microsoft Word. The recent excitement in the NLP space is triggered by these applications. Because traditional machine learning models were very weak, there were not many good pre-existing application areas where these generative models could replace their traditional counterparts. Instead, modern generative models are opening up completely new scenarios, like source code generation or predicting protein structure. And this is just the tip of the iceberg. The quality of the models is such that I expect an inflection point within the next 2-3 years, where completely new experiences will be built and the way we work and do things might be redefined as new use cases are identified. The recent availability of the GPT-3 API through Azure might help democratize the exploration of these new experiences and business models.
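The "predict the next word, then feed it back in as part of the prefix" loop can be sketched as follows. This is a minimal illustration, not how GPT-3 or MT-NLG work internally: the model here is a hypothetical toy that only counts which word follows which in a tiny made-up corpus, and it always picks the most frequent continuation. Only the outer loop mirrors the strategy described above.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, used only for this sketch.
CORPUS = "the cat sat on the mat . the cat ate the fish ."

def train_bigram(text):
    """Count, for each word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    tokens = text.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word(counts, prefix):
    """Predict the next word from the prefix.

    This toy model looks only at the last word of the prefix; a large
    language model would condition on the whole prefix.
    """
    last = prefix[-1]
    if last not in counts:
        return None  # the toy model has never seen this word
    return counts[last].most_common(1)[0][0]

def generate(counts, prefix, max_new_words):
    """Repeatedly predict a word and append it to the prefix."""
    words = list(prefix)
    for _ in range(max_new_words):
        word = next_word(counts, words)
        if word is None:
            break
        words.append(word)  # the extended text becomes the new prefix
    return words

counts = train_bigram(CORPUS)
print(" ".join(generate(counts, ["on"], 2)))
```

Always taking the most frequent continuation (greedy decoding) keeps the sketch deterministic. Real systems often sample from the predicted distribution instead, which makes the generated text more varied.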

If I have to project it out, there are a few limitations that I foresee. The first is that we are running out of training data for these large language models. During the training of MT-NLG 530B, we were already close to exhausting all sources of publicly accessible, meaningful training data. One way around this is to start incorporating multi-modal training data, which can help push the “size implies quality” axis a couple more orders of magnitude further.

Second, the universal nature of these LLMs does help bridge the language divide, and we now have systems that work impressively well in low-resource languages. However, a divide, albeit a smaller one, would likely remain between the performance of these models in low-resource languages and in their high-resource counterparts.

In general, as models get stronger, the limitations of training data may start to show up in ways that we haven’t experienced yet. Think of it as interacting with a straight-A student who has missed the classes on permutations but knows everything else. You have a near-perfect conversation, and then suddenly you feel the interaction falling off a cliff, with the student doing their best to maintain the conversation without knowing the ground truth. Similarly, capturing social and cultural constructs will remain challenging for these models, as such constructs are hard to describe in an electronic medium.