In the natural language processing space, when people refer to large language models they actually refer to two distinct classes of models: representation models and generative models.
For most practical purposes, BERT, with its use of transformers for NLP, triggered the deep learning revolution for language. BERT and similar models, such as XLM-R from Facebook or T-NLR and T-ULR from Microsoft Turing, belong to the class of language representation models. At a very high level, representation models project language into a vector space, producing a representation of any string of text that computers can reason over. More interestingly, these models now support not just English but 100+ languages, so the reasoning can happen across all of them. Most industrial applications need good language understanding, and hence these representation models are the ones heavily deployed in production scenarios serving billions of customers every day. They help power capabilities like search, classification, sentiment analysis, intent detection, and entity detection. Take sentiment analysis as an example: the representation models project words, phrases, and sentences into the vector space. Using these projections and a few labeled examples, they can generalize when a new phrase or sentence is input. Based on the neighborhood in this vector space, they can predict whether the new sentence is “semantically similar” to a previously seen sentence and hence predict its sentiment.
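To make the neighborhood idea concrete, here is a minimal sketch of sentiment prediction by nearest neighbor in a vector space. It is an illustration under loud assumptions, not how BERT or any production system works: the `embed` function is a toy bag-of-words stand-in for a real representation model, and the labeled examples and all function names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a representation model (assumption):
    maps text to a sparse bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# A few labeled examples (hypothetical data).
LABELED = [
    ("i love this phone", "positive"),
    ("what a great movie", "positive"),
    ("i hate this phone", "negative"),
    ("what a terrible movie", "negative"),
]

def predict_sentiment(sentence, labeled=LABELED):
    """Return the label of the most similar labeled example."""
    query = embed(sentence)
    _, label = max(labeled, key=lambda ex: cosine(query, embed(ex[0])))
    return label

print(predict_sentiment("i love this movie"))  # positive
print(predict_sentiment("a terrible phone"))   # negative
```

A real representation model would replace `embed` with dense vectors that place paraphrases close together even when they share no words, which this word-overlap toy cannot do.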
However, there is another class of language models that are generative in nature: given a prefix, they try to predict the next word or phrase. Even though this is a simple strategy, it can be applied repeatedly: in the next turn, we feed in the prefix plus the generated word as the new prefix and ask the model to generate the next word. This strategy can be used to create free-form text, like writing a story or helping write emails in Microsoft Outlook or documents in Microsoft Word. The recent excitement in the NLP space is triggered by these applications. Since traditional machine learning models were very weak at text generation, there were not many pre-existing application areas where these generative models could replace traditional counterparts. Instead, modern generative models are opening up completely new scenarios like source code generation or predicting protein structure. And this is just the tip of the iceberg. The quality of the models is such that I expect an inflection point within the next 2-3 years, where completely new experiences will be built and the way we work and do things might be redefined as new use cases get identified. The recent availability of the GPT-3 API through Azure might help democratize the exploration of these new experiences and business models.
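The feed-the-output-back-in loop can be sketched in a few lines. This is only an illustration of the loop itself, assuming a toy bigram counter in place of a real generative model such as GPT-3; the corpus, the greedy choice of the most frequent next word, and all names are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny training text (hypothetical data).
CORPUS = "the model predicts the next word and the model repeats the loop"

def train_bigrams(text):
    """Count, for each word, which words follow it in the text."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word(counts, prefix):
    """Predict the next word from the last word of the prefix (greedy).
    A real model would condition on the whole prefix."""
    last = prefix[-1]
    if last not in counts:
        return None  # nothing ever followed this word in the corpus
    return counts[last].most_common(1)[0][0]

def generate(counts, prefix, max_new_words=5):
    """Repeatedly append the predicted word and reuse the result as the new prefix."""
    words = list(prefix)
    for _ in range(max_new_words):
        word = next_word(counts, words)
        if word is None:
            break
        words.append(word)
    return words

COUNTS = train_bigrams(CORPUS)
print(" ".join(generate(COUNTS, ["next"], max_new_words=4)))  # next word and the model
```

Large generative models follow the same outer loop, but they typically sample from a probability distribution over the next token rather than always taking the single most frequent one.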
If I have to project it out, there are a few limitations that I foresee. The first is that we are running out of training data for these large language models. During the training of MT-NLG 530B, we were already close to exhausting all sources of publicly accessible, meaningful training data. One way around this is to start incorporating multi-modal training data, which can help push the “size implies quality” axis a couple more orders of magnitude further.
Second, the universal nature of these LLMs does help bridge the language divide, and we now have systems that work impressively well in low-resource languages. However, a divide in performance, albeit a smaller one, would likely remain between low-resource languages and their high-resource counterparts.
In general, as models get stronger, the limitations of the training data may start showing up in ways that we haven’t experienced yet. Think of it as interacting with a straight-A student who has missed the classes on permutations but knows everything else. You have a near-perfect conversation, and then suddenly you feel the interaction falling off a cliff, with the student doing their best to maintain the conversation without knowing the ground truth. Similarly, capturing social and cultural constructs would remain challenging for these models, as such constructs are hard to describe in an electronic medium.