The Future of AI Is Expensive
Why large AI models have been costing more and more, and why they’ll only get more expensive
September 2023 | subscribe here
“What is the future of AI?”
Optimists will say anything’s possible. Once AGI is achieved, menial labor and world hunger will be solved problems.
Doomers agree that AI will become superpowerful, but they see society collapsing under mass unemployment and readily available bioweapon formulas.
Lastly, the engineers trying to build AI applications today only see an API that works kinda well, most of the time, but is expensive to run.
Waxing poetic about the distant future is easy, but no one can say what’s coming in the next 3 to 5 years. I wanted to understand where AI is going, and why. So here’s the culmination of a year of reading scaling laws papers, speaking with OpenAI research scientists, and lots of conversations with ChatGPT.
How good will AI models be? What are the main obstacles to getting much better? And how much will this all cost?
1) Why did AI explode in the last few years? Big models got even bigger
Let’s start this investigation in the past – how did OpenAI turn the tech world upside down?
The answer is compute, and it didn’t start with OpenAI.
- AlexNet (2012): One of the first “big models,” built for image classification. Its 60M parameters were big at the time (0.2% the size of the largest models today), and it made waves by beating the state of the art.
- ResNet (2015): An even bigger model, which improved accuracy by another 10% on top of AlexNet but cost 10x more to train – an early sign of compute’s diminishing returns on performance.
- Transformer (2017): Released by Google with their famous “Attention is All You Need” paper. TLDR – with a fancy new algorithm, NLP models could be much better at learning human language.
- GPT 2 (2019): OpenAI took note of transformers and decided the architecture was an “innovation” local maximum. Instead of inventing a more clever algorithm, OpenAI stuck with transformers but dumped in enormous compute to make models that were much, much bigger. GPT 2 was the result – it required 8x more training compute than Google’s Transformer model.
- GPT 3 (2020): OpenAI kept building bigger models, and released GPT 3, which needed a whopping 1,662x more compute than GPT 2.
And thus, the compute arms race begins.
Google, Meta, and even Nvidia / Microsoft made their own models, each bigger than the last.
How was this all possible? Compute costs decreased dramatically, enabled by (1) cheaper GPU hardware and (2) better training algorithms. In 2019, a v2 TPU cost $4.50/hour; a v4 now costs $2/hour. Overall, OpenAI showed that reaching AlexNet-level performance (that first “big model”) required 44x less compute in 2020 than it did in 2012.
But the cost curve couldn’t keep up with model size growth. By 2021 and 2022, the models exploded in cost. Meta’s LLaMA models cost between $5-8M, and Google’s enormous PaLM 1 and 2 models cost $25M and $75M, respectively. GPT 4 is rumored to cost between $100-200M, all for a single training run.
2) Performance gains have been costing more and more
Have these huge models been getting better over time? The short answer – yes, but it’s costing a lot more in compute.
The earlier models saw pretty crazy improvements, as measured by the MMLU benchmark.
But model improvements quickly started slowing.
- OpenAI’s GPT 3 vs. GPT 2: 70% improvement, but cost ~$3M more.
- Meta’s LLaMA vs. GPT 3: 17% improvement, but cost ~$3M more.
- Google’s PaLM vs. LLaMA: 20% improvement, but cost ~$25M more.
- GPT 4 vs. PaLM 2: 6% improvement, but cost ~$100M more.
To be fair – this is not a perfect apples-to-apples comparison. The MMLU is a very broad multiple-choice test, and a limited measure of a model’s “intelligence.” MMLU performance can’t capture generative ability, the huge increase in context window, the ability to respond to images with text, etc.
Regardless, even if this illustrative example is exaggerated, it’s undeniable that companies are dumping larger and larger sums of money into compute. We’re suffering from “compute inflation” – each generation of model is 10x more expensive than the last, and the incremental dollar of compute is improving performance less.
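To make the “compute inflation” point concrete, here’s a minimal back-of-the-envelope sketch in Python. The MMLU gains and incremental costs are the rough, illustrative figures from the list above, not official numbers:

```python
# Rough, illustrative figures from the comparisons above:
# (model pair, MMLU improvement in points, incremental training cost in $M)
generations = [
    ("GPT 2 -> GPT 3",   70,   3),
    ("GPT 3 -> LLaMA",   17,   3),
    ("LLaMA -> PaLM",    20,  25),
    ("PaLM 2 -> GPT 4",   6, 100),
]

for pair, gain, extra_cost_millions in generations:
    # Incremental dollars of training compute per point of MMLU gain
    cost_per_point = extra_cost_millions * 1e6 / gain
    print(f"{pair}: ~${cost_per_point:,.0f} per point of improvement")
```

Even with these fuzzy numbers, the marginal cost of one benchmark point grows from roughly $40k to nearly $17M – a ~400x increase across four generations.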
So, then, how do we improve performance from here? Turns out, there are three options:
1. Make the models bigger
2. Use more training data
3. Invent a better algorithm
Which method will work, and why? Let’s dive in.
3a) Why models can’t get bigger
Why can’t we just make the models bigger? Earlier on, that was exactly the strategy. There were four distinct eras of model sizing:
- 2020-2021: Models got much bigger, without using much more data. Compared to GPT 2, GPT 3 had a 100x increase in parameter size but only a 15x increase in training token data. And it worked back then – GPT 3 was much better than GPT 2.
- Late 2021: Models got enormous. MT-NLG, Nvidia and Microsoft’s overweight baby, weighed in at 530B parameters – roughly 350x GPT 2’s size – but used only 13x more training tokens. It was a goliath model trained on comparatively little data, and it underperformed across almost all benchmarks.
- 2022: Reset to smaller models with more data. Google’s Chinchilla paper showed that compute-optimal training requires far more data per parameter – roughly 20 training tokens per parameter (see the sketch after this list). This makes sense conceptually: an AI model is only as good as the training data you dump into it.
- 2023: Large models with proportional amounts of data. The model builders took note of these new scaling laws: the two largest recent iterations, GPT 4 and PaLM 2, are both huge, but they also trained on far more data.
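Here’s a minimal sketch of the Chinchilla-style arithmetic. The ~20 tokens-per-parameter ratio is the paper’s headline rule of thumb, and C ≈ 6·N·D is the standard approximation for training FLOPs (N parameters, D tokens); the model scales below are just illustrative inputs:

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard approximation for training compute: C ~= 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

for name, n in [("GPT 2-scale (1.5B)", 1.5e9), ("GPT 3-scale (175B)", 175e9)]:
    d = chinchilla_optimal_tokens(n)
    print(f"{name}: ~{d / 1e9:,.0f}B tokens, ~{training_flops(n, d):.1e} FLOPs")
```

For reference, GPT 3 was actually trained on roughly 300B tokens – an order of magnitude below the ~3,500B this rule of thumb suggests, which is exactly the undertraining Chinchilla called out.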
So, looks like we can’t make the models arbitrarily bigger to eke out better performance, at least not without more data. Even Sam Altman agrees.
3b) More data then, but where?
So, data, then. Exactly how much more data is out there for AI models to use? Lots of people have looked at this already (Villalobos 2022, Dynomight post). Turns out, quite a bit. The models currently rely mostly on book and Wikipedia data, and there’s a lot of both left, in addition to scientific papers. I’m personally not sure whether Twitter, text-message, or YouTube data will be that relevant for language-model intelligence.
How much of internet data is actually usable? And will it actually improve model performance? Unclear, because much of the internet is neither useful nor usable. There’s a lot of random text – menus, binary code, spam. How useful is the website of a local plumbing business, or a local middle school?
Researchers at the big foundation-model companies say there’s enough data for about three more years of models – so maybe one or two more iterations.
Then there’s all the private data in the world. This data would likely make OpenAI’s models a lot “smarter” and create more economic value. For example:
- Medical records
- Private M&A agreements
- Corporate presentations and memos
- Financial underwriting documents
- Conversations that aren’t even recorded as data!
But in a society already hyper-sensitive about data privacy, there’s no way OpenAI or Google will easily obtain any significant quantity of private data.
So, data could be the answer, but there are significant challenges and public data will run out at some point.
3c) A better algorithm?
The most recent advance in “a better algorithm” has been reinforcement learning from human feedback (RLHF).
- Why we needed this: “Regular” LLMs simply predict the next word of any given text, as if it appeared on the internet. To follow specific instructions (e.g., write a rap song in the voice of Scooby-Doo), a model needs to be “taught” to use its knowledge for useful tasks instead of pure prediction.
- How does it work? Have the model generate 10 responses, have a human rank them from best to worst, then use that feedback to train a separate “reward model” that learns human preferences and is used to fine-tune the base model (a minimal sketch follows).
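Here’s a minimal sketch of the reward-model piece, using PyTorch. This isn’t OpenAI’s code; it just shows the standard pairwise ranking loss (as in the InstructGPT paper) that teaches a reward model to score the human-preferred response higher:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: push the scalar reward of the human-preferred
    response above the reward of the rejected one."""
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scalar rewards the model assigned in three human comparisons
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
print(reward_model_loss(r_chosen, r_rejected))  # backprop this into the reward model
```

Once the reward model is trained, the base LLM is fine-tuned with reinforcement learning (PPO, in OpenAI’s case) to maximize that learned reward.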
The results are pretty powerful:
OpenAI originally trained GPT 3 just to predict the next few words (unsupervised learning). So to test RLHF, OpenAI trained InstructGPT, a smaller model fine-tuned with human feedback, and it performed much better. In fact, it performed so well that OpenAI trained GPT 3 with RLHF, which produced ChatGPT (aka GPT 3.5).
So, what are the drawbacks to using RLHF? Why won’t it scale to superintelligence?
- Cost: OpenAI’s InstructGPT is a ~100x smaller model than GPT 3 (1.3B vs. 175B parameters), and it still required 20,000 hours of human feedback. Assuming an average outsourced labor cost of $30/hr in the Philippines, that’s $600k – about 17% of GPT 3’s total training cost (~$3.6M). (A back-of-the-envelope version of this math follows the list.)
Scale AI, one of OpenAI’s largest data-labeling partners, is rumored to have ~$350-400M ARR, mostly data labeling for the ~6 foundation-model companies – meaning OpenAI could be spending up to ~$50M a year on labeling. In fact, OpenAI has begun insourcing data labeling from Scale AI; it’s a big enough financial cost that they don’t want to pay the ~50% margin Scale charges.
- Overfitting: Sometimes the model anchors too much on the reward model. When OpenAI over-optimized a model to focus on positive sentiment, it learned that wedding parties were overwhelmingly positive. The model’s response to any arbitrary prompt would inevitably end up describing a wedding party.
- Task evaluation: For models to get very smart, they’ll need to answer questions that are hard for humans to evaluate. For example, if a model produces an answer to a complicated scientific or mathematical question, it’ll be hard (and expensive) to find the right PhD to assess whether the model is correct.
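As a quick check on the cost bullet above, here’s the labeling math as a tiny sketch (the hours and hourly rate are the post’s own assumptions):

```python
def rlhf_labeling_cost(hours: float, hourly_rate: float) -> float:
    """Back-of-the-envelope cost of collecting human feedback."""
    return hours * hourly_rate

feedback_cost = rlhf_labeling_cost(hours=20_000, hourly_rate=30)  # assumed $30/hr
gpt3_train_cost = 3.6e6  # ~GPT 3 training cost from above

print(f"${feedback_cost:,.0f} = {feedback_cost / gpt3_train_cost:.0%} of training cost")
# -> $600,000 = 17% of training cost
```

And that’s for a 1.3B-parameter model; scale the feedback up to frontier-model size (or PhD-level evaluators) and the bill grows fast.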
It’s not hard to see why the current form of RLHF will have difficulty scaling to AI superintelligence.
So what’s coming down the pipe? A lot of new techniques to automate the “learning” process.
- Active learning: the model labels its own training data (weighing cost against benefit, according to a reward model).
- Self-distillation: a larger “teacher” model provides the training signal for a smaller “student” model.
- Recursive Reward Modeling: OpenAI is very excited about this one. The idea is to break down a complicated human evaluation task into smaller, stepwise chunks. An AI can then evaluate each smaller step, which is also easier for a human to evaluate.
First, OpenAI explored this technique with book summarization. Evaluating the summary of a whole book is hard for people who haven’t read it. But if humans can trust AI summaries of individual chapters, the whole-book summary becomes easier to evaluate.
Second, my friends Hunter and Bowen at OpenAI implemented this idea in the Let’s Verify Step by Step paper. They broke complex math problems down into individual steps. It’s easier for a human to evaluate each step, and it also improves the AI’s performance – so much so that the model can do very complicated math.
However, once again, cost is a problem. The dataset they used contains about 800,000 step-level labels across 75,000 solutions, and the humans here are more expensive too, since they need advanced math education. I wouldn’t be surprised if this dataset cost several million dollars.
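For intuition, here’s a heavily simplified sketch of the process-supervision idea: score each reasoning step with a step-level reward model and treat the solution score as the probability that every step is correct. The step scores below are made up, standing in for a trained process reward model:

```python
from math import prod

def solution_score(step_scores: list[float]) -> float:
    """Process supervision: each step gets its own correctness probability
    from a step-level reward model; the solution score is the probability
    that *every* step is correct."""
    return prod(step_scores)

# Toy example: a four-step math solution where the (hypothetical) reward
# model is confident in three steps but flags the third.
prm_scores = [0.98, 0.95, 0.40, 0.90]
print(solution_score(prm_scores))  # low overall score pinpoints the weak step
# Outcome supervision, by contrast, yields one label for the final answer only.
```

The payoff is localized feedback: instead of one bit of signal per solution, the model learns exactly which step went wrong.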
The race to automate human preferences for finding information is a story we’ve already seen play out, in 1990s and early-2000s web search. We all know how that turned out.
OpenAI sees the writing on the wall as well. They’ve recently dedicated 20% of their compute to a new Superalignment team, which is trying to create a model that can automatically train other models on human preferences (“alignment”). Spooky!
4) So, what does this all mean?
TLDR: The future of AI will be very expensive.
Where do we go from here? Here are some concrete hypotheses on what the future of AI looks like in the next 5 years –
1. Models will get better, but they’ll cost exponentially more. Slowing performance improvements, renewed focus on training data size vs. model size, and reliance on data labeling for RLHF all point to the training cost going up. $100M for one training run is not enough. The next wave of models will have training runs that cost $1B, $10B, and even more.
2. Private models will be better than open source models for generalized “superintelligence.” Which company is willing to dump $1B or $10B into training a model without clear ROI? Even Zuckerberg must answer to shareholders eventually.
3. Open-source models will be used, just in specific scenarios. First, when inference cost (and model size) is a concern. Second, when companies want to train on proprietary data.
4. AI capabilities (and thus possible applications) will get leapfrogged by each next biggest model. For example, one very “hot” category of AI startup in 2023 applies semantic search or classification to a specific vertical. As generation abilities improve, those companies will need to pivot their products.
5. Data labeling will become increasingly important and expensive, unless there’s an algorithm breakthrough. To make models smarter, you’ll need increasingly expert human feedback for RLHF. Everyone’s working on how to automate human feedback, which could dramatically reduce this cost bucket.
So, then, who will actually be the big players in this version of the future? Who actually can afford to compete?
As it turns out, it’s really only 3, maybe 4 companies – Google, Meta, Microsoft, and OpenAI.
$1B total raised for a foundational language model is not cutting it these days.
Again, I do think smaller models, customized for a specific use case, will exist. They’ll likely require a well-designed product, and some way to collect lots of niche industry data. However, they just won’t be the superintelligent AI that keeps us up at night.
Thanks for reading. If you liked it and want to read the next post, subscribe here.
Sources
Model Scaling + Performance
https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html
https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/
https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8
https://blog.heim.xyz/palm-training-cost/
https://dynomight.net/scaling/
RLHF
https://aligned.substack.com/p/ai-assisted-human-feedback
https://www.lesswrong.com/posts/d6DvuCKH5bSoT62DB/compendium-of-problems-with-rlhf
https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research
https://www.lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-of-mode-collapse-due-to-rlhf#Inescapable_wedding_parties
https://deepmindsafetyresearch.medium.com/scalable-agent-alignment-via-reward-modeling-bf4ab06dfd84
https://cdn.openai.com/improving-mathematical-reasoning-with-process-supervision/Lets_Verify_Step_by_Step.pdf
https://openai.com/research/summarizing-books
https://www.lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais#Self_distillation_
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf