What is it?

GPT-3 is likely the most computationally expensive machine learning model ever trained. The neural network’s 175 billion parameters make it about ten times larger than the previous largest language model (Turing NLG, 17 billion parameters, released by Microsoft in February 2020). The 430GB of text GPT-3 was trained on was drawn widely from the internet and supplemented with text from books. The model works by taking some amount of preceding text (up to a maximum of about 2,000 words) and predicting the next word; repeating that process generates novel text.
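To make that generation loop concrete, here is a minimal sketch in Python. The `predict_next_word` function is a hypothetical stand-in for the model itself, and the real model operates on sub-word tokens rather than whole words, so treat this as an illustration of the idea rather than the actual mechanism:

```python
import random

def generate(prompt_words, predict_next_word, max_new_words=50):
    """Sketch of autoregressive generation: predict one word, append, repeat."""
    words = list(prompt_words)
    for _ in range(max_new_words):
        # The model only ever sees a bounded window of preceding text
        # (roughly 2,000 words in GPT-3's case).
        context = words[-2000:]
        # Hypothetical model call returning {word: probability} for the next word.
        distribution = predict_next_word(context)
        candidates, weights = zip(*distribution.items())
        # Sampling (rather than always taking the most likely word) is what
        # makes repeated runs produce different text.
        words.append(random.choices(candidates, weights=weights)[0])
    return " ".join(words)
```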

Users interact with the model by providing a prompt. An example prompt for a chatbot-style interaction from OpenAI (the organization that created GPT-3) is “The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly”. In addition to supplying a prompt, users can specify parameters for things like how long the output should be, how likely words are to be repeated, and the randomness of the output.
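As a rough illustration, a call to the beta API might look something like the sketch below. The `davinci` engine name and the exact parameter names are assumptions based on publicly shared beta examples, so treat this as a sketch rather than a reference:

```python
import openai  # OpenAI's Python client

openai.api_key = "YOUR_API_KEY"  # issued with beta access

response = openai.Completion.create(
    engine="davinci",  # assumed name of the full GPT-3 model in the beta
    prompt=(
        "The following is a conversation with an AI assistant. The assistant "
        "is helpful, creative, clever, and very friendly.\n\n"
        "Human: Hello, who are you?\nAI:"
    ),
    max_tokens=64,          # how long the output should be
    temperature=0.7,        # randomness of the output
    frequency_penalty=0.5,  # how likely words are to be repeated
)

print(response["choices"][0]["text"])
```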

What can it do?

GPT-3 demonstrates reasonable proficiency on almost all standard natural language processing benchmarks, including state-of-the-art performance on a few of them. The benchmarks include challenges such as using a paragraph of context to predict the last word of a related sentence and determining which noun a grammatically ambiguous but contextually unambiguous pronoun refers to. Other benchmarks involve translating between languages and answering general-knowledge questions. This proficiency was achieved without the task-specific fine-tuning that most cutting-edge models use. GPT-3 can be fine-tuned, and fine-tuning would almost certainly improve its results on each specific benchmark (at the expense of performance outside the task it was fine-tuned on).
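To give a feel for the pronoun benchmark, here is the classic Winograd-schema illustration (a standard example of the task format, not necessarily one drawn from the benchmark itself). Swapping a single word flips which noun the pronoun refers to, and only context can resolve it:

```
The trophy doesn't fit in the brown suitcase because it is too big.
  -> "it" refers to the trophy.

The trophy doesn't fit in the brown suitcase because it is too small.
  -> "it" refers to the suitcase.
```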

OpenAI also tested GPT-3 on some non-standard tasks:

Generating News Articles

A sample of around 80 people was asked to distinguish between real articles and articles whose last 200 words were generated by GPT-3. The participants were unable to reliably distinguish the real articles from those completed by GPT-3 (participants correctly categorized 52% of the articles they saw, and chance performance, 50%, was within the 95% confidence interval). The participants did not improve their accuracy when the amount of text generated by the model was increased to 500 words (accuracy stayed at 52%).
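As a back-of-the-envelope check on why 52% is statistically indistinguishable from guessing, the sketch below computes a normal-approximation confidence interval for a proportion. The total number of judgments is a made-up assumption for illustration; the study's exact count isn't given here:

```python
import math

def binomial_ci(p_hat, n, z=1.96):
    """95% normal-approximation confidence interval for a proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical: 52% accuracy over 1,000 total article judgments.
low, high = binomial_ci(0.52, 1000)
print(f"95% CI: {low:.3f} to {high:.3f}")  # ~0.489 to 0.551, which includes 0.50
```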

SAT Analogies

When asked to complete SAT analogy problems, the model correctly answered 14% more problems than an average college applicant.

Arithmetic

The chart below shows the accuracy of the model when it is prompted with several example math problems and then asked to answer one. The results for the model I’ve been referring to as GPT-3 are on the far right (175B). OpenAI created several versions of the model to test how performance varied across different model sizes, and larger models show a marked improvement.

[Figure: arithmetic accuracy across model sizes]

Overall, the model reliably answers two-digit addition and subtraction problems. For all other problem types, it cannot consistently give the correct answer but performs significantly better than chance.
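For context, the arithmetic evaluations work by showing the model worked examples before the test problem. An illustrative few-shot prompt (the specific problems are invented here, in the spirit of the paper’s format) might be:

```
Q: What is 48 plus 76?
A: 124

Q: What is 97 minus 39?
A: 58

Q: What is 65 plus 17?
A:
```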

Metrics are one thing, but the best way to get a feel for the capabilities of the model is to see the outputs. Many people are demonstrating potential use cases for GPT-3. Here are some highlights:

  • Creating layouts in JavaScript

  • Creating an API in Python

  • Creating functions in Python

  • Summarizing an NDA for a 2nd grader

  • Writing like an attorney

  • A “search engine” that isn’t searching anything; it just knows the answers

  • Writing poetry

  • More project links

Of course, it’s hard to judge the model based solely on a few cherry-picked examples. It seems to be relatively easy to demonstrate impressive capabilities; generating results that are reliably good enough to use in some sort of production setting (i.e., as a customer-service bot) is a very different story. The model will likely be most useful either in systems with a human in the loop (perhaps generating a suggested response for a human to approve or edit) or in applications that don’t require consistently good results (such as the fun fictional stories AI Dungeon generates).

How can I use it?

The model will be available through an API. OpenAI currently has a private beta release of the API with a waitlist you can sign up for here. Pricing for the API hasn’t been announced yet, but we know that the electricity cost of generating 100 pages of content from the model is a few cents. A price in the range of $0.50 to $5 per 100 pages generated would seem reasonable as a way to pay back the initial costs of creating the model, but it’s hard to say.

Alternatively, you can access the model through AI Dungeon. Note that the free tier of AI Dungeon uses text generated by GPT-2, not GPT-3. In order to use GPT-3, you will need to sign up for the paid version (though the first 7 days are free). After signing up, you will need to change the settings to use the “Dragon” model (aka GPT-3) as opposed to the “Griffin” model (aka GPT-2). The paid version also includes an option for custom prompts (“scenarios”) which means you don’t need to use the standard story prompts.

What’s new here?

Firstly, the wide-ranging capabilities of the model exceed those of anything previously available to the public. It’s difficult to predict what people will be able to make with the model, but it’s likely the model will be used in new ways and will improve results in areas where language models are already used.

In addition to the practical new uses of the model, there are some interesting takeaways from the research:

Bigger models are better

Perhaps the most important point is that larger models continue to perform better. Prior to GPT-3, researchers had observed a power-law relationship between model size and performance: there were diminishing returns to using additional computational power during training, but still significant performance gains from more expensive models. Despite the trend at lower levels of computation, there was some debate about how far it could be extrapolated. After GPT-3 it’s still not clear where the limits of that trend lie, but we haven’t reached them yet. Despite GPT-3 being ten times larger than the previous largest model, its performance is what the previously observed trend would predict.

[Figure: validation loss across model sizes and compute]

The above graph shows model performance (lower is better) across a range of model sizes and computational expenditures. GPT-3 is the yellow line, and the power law represented by the dotted line seems to hold at all the model sizes OpenAI tested.

There are multiple estimates of how much it cost OpenAI to train GPT-3. One estimate says $4.6 million; another says $12 million. Neither includes researcher compensation. Regardless of the true number, the takeaway doesn’t change: GPT-3 was extraordinarily cheap to produce given its potential applications, and larger models will likely follow. Google spent much more on food in 2008 than OpenAI just spent to create a state-of-the-art language model with commercial applications. There’s plenty of money to push towards larger models if that direction is deemed promising enough, and after GPT-3 it’s hard to argue against larger models being significantly more effective. Funding is not the only constraint on creating more powerful models; a significant amount of novel engineering is needed to train this kind of model, but OpenAI is not the only organization with the talent to accomplish that.

Meta-learning

The fact that GPT-3 can do arithmetic, when only very few of the specific problems it was tested on were likely to be in the training data, implies the model is somehow actually learning how to do the mathematical operations. That point is further supported by the paper’s authors, who state, “inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a ‘1’, suggesting it is actually attempting to perform the relevant computation rather than memorizing a table”. GPT-3 also correctly answers about 20% of single-digit combined-operation problems (for example, 9*(7+5)), a rate much better than random chance. It is remarkable that a model trained simply to predict the next word in a text appears to be learning how to do math in order to better predict the next word. These results raise questions about what new capabilities models might acquire at significantly larger scale. For example, could a sufficiently powerful language model read thousands of scientific papers and use that data to successfully predict the results of novel experiments?

Few-shot Learning

Most large, publicly available machine learning systems take the approach of doing a large amount of training on some sort of generalized data and then fine-tuning the model on domain-specific data. GPT-3 demonstrates proficiency in many domains by replacing the fine-tuning step with what OpenAI has dubbed “few-shot learning”: simply showing the model a few successful examples of what you want it to do within the prompt it is given. For example, a prompt to get the model to successfully answer general-knowledge questions might look like this, with the last question being the one you want GPT-3 to answer.

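For illustration, such a prompt might be (the questions here are invented for this example):

```
Q: Who wrote the novel "Pride and Prejudice"?
A: Jane Austen

Q: What is the chemical symbol for gold?
A: Au

Q: What year did the first person walk on the moon?
A:
```

Given the established pattern, the model continues the text with an answer to the final question.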

It is also possible to use the model by providing a prompt with no examples (“zero-shot”) or one example (“one shot”), but the model generally performs better the more examples it sees.

The few-shot learning approach has several benefits:

  • First, few-shot learning may make machine learning more accessible. The pool of people who would feel comfortable entering a prompt like the one above is MUCH larger than the pool of people who have the technical knowledge to fine-tune a model.

  • Second, prompting models in this way may enable machine-learning models to be used in domains where acquiring the large amounts of structured training data necessary for fine-tuning is infeasible.

  • Lastly, few-shot learning makes the model more flexible. With the typical fine-tuning approach, the underlying model weights are actually changed for a specific task, so fine-tuning sacrifices generalizable performance for better performance on a particular application of the model. By contrast, few-shot learning does not change the underlying model at all (the sketch after this list makes the contrast concrete).
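Here is the schematic contrast mentioned above; every name in it is hypothetical, and it is meant only to show where the task examples go in each approach:

```python
# Schematic only: `model` is a hypothetical object, not a real library API.

def fine_tune(model, task_examples, lr=0.001):
    """Fine-tuning: gradient updates permanently change the model's weights."""
    for inputs, target in task_examples:
        loss = model.loss(inputs, target)
        model.weights -= lr * model.gradient(loss)  # weights are mutated
    return model  # now specialized; generality is sacrificed

def few_shot(model, task_examples, query):
    """Few-shot: the examples go into the prompt; the model is untouched."""
    prompt = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in task_examples)
    prompt += f"\nInput: {query}\nOutput:"
    return model.complete(prompt)  # weights unchanged, model stays general
```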

As the graph below shows, few-shot learning works better the larger the model is. Few-shot learning is not just a viable alternative to fine-tuning with the current state of machine learning; it will continue to get more effective with larger future models. The increasing effectiveness of few-shot learning, combined with the direct performance gains from increasing model size, will likely cause a trend towards larger models that use few-shot learning.

[Figure: aggregate benchmark performance by model size for zero-, one-, and few-shot prompts]

How powerful can language models get?

This paper by OpenAI investigates the scaling of language models. The researchers treat model performance as a function of the size of the model, the amount of training data, and the computational power used to train the model. They find a power-law relationship where scaling up the inputs reliably leads to better performance. Although the paper was written prior to GPT-3, the new model is consistent with the relationship they found, even though it operates at a scale much greater than they were able to test. The researchers extrapolate the trend to find the point at which a model (using the optimal ratio of inputs) would reach the theoretical maximum performance of a similar language model, a point where all of the information has been extracted from the text. It’s entirely possible that this pattern will break for unforeseen reasons before reaching that point. If the trend holds, however, the researchers estimate that maximum performance would be reached by a model with about 1 trillion parameters, trained on 1 trillion tokens (about 1.4 terabytes), and using about 10,000 petaflop/s-days of compute (pg. 17).

The paper cautions, “the numerical values are highly uncertain, varying by an order of magnitude in either direction depending on the precise values of the exponents from the power-law fits. The most obvious interpretation is that our scaling laws break down at or before we reach this point, which is still many orders of magnitude away in both compute and model size”. That was written before GPT-3, and GPT-3 is now within an order of magnitude. The equation from that paper predicts training loss to be 1.75 with 10,000 petaflop/s-days of compute, while the updated equation from the GPT-3 paper predicts a training loss of 1.65. After updating the trend line with the newest data from GPT-3, the theoretical best language model appears more achievable than the previous paper (and the numbers quoted here) suggests.
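As a sanity check on those loss figures, the sketch below evaluates a compute-to-loss power law of the form L = a · C^(-b). The constants are my reading of the GPT-3 paper’s published fit (roughly a = 2.57 and b = 0.048, with compute measured in petaflop/s-days), so treat them as approximate:

```python
def predicted_loss(compute_pf_days, a=2.57, b=0.048):
    """Power-law fit L = a * C**(-b), with C in petaflop/s-days."""
    return a * compute_pf_days ** -b

# Evaluated at the 10,000 petaflop/s-days figure quoted above:
print(round(predicted_loss(10_000), 2))  # ~1.65, matching the GPT-3 paper's fit
```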

It’s worth noting that, assuming the trend doesn’t break down, it likely underestimates future performance. The relationship does not account for future improvements in training techniques. OpenAI has used a consistent process for training various versions of their GPT model, but other researchers have continued to improve the training process of similar models. GPT-3 was not trained in a cutting-edge way.

[Table: model versions and training details]

If a next-generation model scales as much as GPT-3 did, it will be well beyond the theoretical best model predicted by the power-law that has been observed so far. If the trend breaks, we’ll get important information about the limits of current approaches. If the trend doesn’t break, we’ll be living in a very different world.

Further exploration

  • GPT-3 paper

  • OpenAI blog post

  • Gwern’s post

  • Lambda Labs post

  • Lambda Labs aggregates and summarizes other content

  • Good overview

  • Slatestarcodex post

  • Analysis of potential constraints to scaling future models

  • Examples of uses, and details about API parameters

  • Computerphile video

  • Collection of more demos and articles