Large Language Models: The Principles Behind GPT

Author: sun.ao, a programmer passionate about technology, focusing on AI and digital transformation.

Computing Through the Ages - This article is part of a series.

In June 2020, OpenAI released GPT-3.

This was a language model with 175 billion parameters.

Training it used:

  • 45TB of text data
  • Thousands of GPUs
  • Tens of millions of dollars

What can it do?

Give it an opening, and it continues an article.

Give it a question, and it gives an answer.

Give it a programming task, and it writes code.

It can even do mathematical reasoning, translate languages, simulate conversations…

People were surprised to find: If a model is big enough, it can exhibit unexpected capabilities.

What is a Language Model?

A language model’s task is simple: Predict the next word.

Given “The weather today is really”, the next word might be “good”, “nice”, “terrible”…

It seems simple, but doing it well requires understanding both language and the world.

To predict “The weather today is really nice, let’s go ___”, the model needs to know the association between “nice weather” and “go out”.

To predict “Xiao Ming failed the exam, he is very ___”, the model needs to understand human emotions.

Language models learn the patterns of language and knowledge of the world by learning massive text.
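"Predict the next word" can be sketched with a toy bigram model: count which word follows each word in a corpus, then predict the most frequent follower. Real language models learn probabilities with neural networks over far more context, but the training objective is the same idea. The corpus below is made up for illustration:

```python
from collections import Counter, defaultdict

# A tiny made-up corpus, tokenized by whitespace.
corpus = (
    "the weather today is really nice . "
    "the weather today is really good . "
    "the weather yesterday was really bad ."
).split()

# Count, for every word, which words follow it and how often.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word seen after `word`, or None if unseen."""
    counts = follows[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("weather"))  # "today" follows "weather" most often here
```

A bigram model only looks one word back; the leap from this sketch to GPT is learning to use hundreds or thousands of preceding words at once.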

Transformer: The Foundation of Large Models

Large language models are based on the Transformer architecture.

In 2017, Google published the paper “Attention Is All You Need,” proposing Transformer.

Before this, language models mainly used RNNs (recurrent neural networks). An RNN processes text one token at a time, which makes it slow and hard to parallelize.

The Transformer uses a self-attention mechanism and can process entire sequences in parallel, greatly improving efficiency.

Transformer became the standard architecture for large language models. GPT, BERT, and LLaMA are all based on Transformer.
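The core of self-attention can be shown in a few lines: every position scores its similarity against every other position, the scores are softmaxed, and each output is a weighted mix of the whole sequence. This sketch strips out the learned query/key/value projections and multiple heads that a real Transformer has, keeping only the parallel attention computation:

```python
import numpy as np

def self_attention(X):
    """Simplified scaled dot-product self-attention.

    X: (seq_len, d) array of token embeddings.
    Returns a (seq_len, d) array where each row is a weighted mix of
    all rows of X -- every position attends to every other position,
    so the whole sequence is processed at once, not token by token.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # (seq_len, seq_len) pairwise similarity
    # Row-wise softmax (subtracting the max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X  # mix the value vectors by attention weight

X = np.random.randn(5, 8)  # 5 tokens, 8-dimensional embeddings
out = self_attention(X)
print(out.shape)  # (5, 8)
```

Because the score matrix is one big matrix multiplication, GPUs can compute attention for all positions simultaneously; an RNN has no equivalent shortcut.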

GPT’s Evolution

GPT-1 (2018)

OpenAI released the first GPT model with 117 million parameters.

It was pre-trained on unlabeled text, then fine-tuned for specific tasks.

Results were good but didn’t attract much attention.

GPT-2 (2019)

Parameters increased to 1.5 billion. More training data.

OpenAI initially refused to release the full model, worried it would be used to generate fake news.

Later they changed their mind and released the full model.

GPT-2 could generate coherent long text but often went off-topic or repeated.

GPT-3 (2020)

Parameters increased to 175 billion, trained on 45TB of text.

GPT-3 demonstrated few-shot learning capability: just give a few examples and it can learn new tasks.

It could write articles, write code, answer questions, translate languages…

GPT-3 made people realize: Scale matters.
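Few-shot learning means putting the examples directly into the prompt text; the model picks up the task pattern from context, with no retraining. A minimal sketch of the prompt format (the example pairs and the "Input/Output" wording are illustrative conventions, not any official API):

```python
def few_shot_prompt(examples, query):
    """Build a few-shot prompt: worked examples followed by a new query.

    examples: list of (input, output) pairs demonstrating the task.
    query: the new input the model should complete.
    """
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Two English -> French pairs are enough to define the task in-context.
prompt = few_shot_prompt(
    [("cheese", "fromage"), ("house", "maison")],
    "cat",
)
print(prompt)
```

Fed a prompt like this, GPT-3 infers "translate English to French" from the pattern alone; that in-context ability is what made the 2020 paper's title, "Language Models are Few-Shot Learners," stick.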

GPT-4 (2023)

The parameter count was not disclosed; estimates put it in the trillions.

GPT-4 is a multimodal model: it can understand both images and text.

It performed excellently on various exams: top 10% on simulated bar exam, top 20% on SAT math.

Scaling Laws

OpenAI discovered a pattern: Scaling Laws.

Model capability grows with three factors:

  • Parameter count: Bigger models, stronger capabilities
  • Data volume: More training data, stronger capabilities
  • Compute: Longer training, stronger capabilities

When these three factors are scaled up together, the model's performance can be predicted in advance.

This explains why big companies compete to train bigger models.
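The scaling laws fit each factor with a power law. For parameter count alone, Kaplan et al. (2020) report test loss falling roughly as L(N) = (N_c / N)^α. A sketch using that paper's fitted constants (α ≈ 0.076, N_c ≈ 8.8e13), which should be read as illustrative of the trend rather than an exact predictor:

```python
def loss_from_params(n_params, n_c=8.8e13, alpha=0.076):
    """Predicted test loss as a power law of parameter count.

    Constants are the fits reported by Kaplan et al. (2020);
    treat the numbers as illustrative, not exact.
    """
    return (n_c / n_params) ** alpha

# Loss falls smoothly and predictably as models grow:
for name, n in [("GPT-1", 1.17e8), ("GPT-2", 1.5e9), ("GPT-3", 1.75e11)]:
    print(f"{name}: {n:9.2e} params -> predicted loss {loss_from_params(n):.2f}")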

Emergent Abilities

Even more interesting are Emergent Abilities.

Things small models can’t do, big models suddenly can.

For example:

  • Chain-of-thought reasoning: GPT-3 can’t, GPT-3.5 can
  • Mathematical reasoning: Small models are bad, big models suddenly get better
  • Programming ability: Small models generate garbage, big models write runnable code

This is like a phase transition in physics: heat water past its boiling point and it suddenly becomes steam.

Emergent abilities make large models more useful but also harder to predict.
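Chain-of-thought prompting, one of the abilities that emerges at scale, simply means asking the model to write out intermediate steps before its answer. A minimal sketch of the prompt format (the "let's think step by step" phrasing is the common convention from the prompting literature, not an official API):

```python
def chain_of_thought_prompt(question):
    """Wrap a question so the model is nudged to show its reasoning steps."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = chain_of_thought_prompt(
    "If I have 3 apples and buy 2 more, how many do I have?"
)
print(prompt)
```

The emergence finding is that this nudge does nothing for small models, yet reliably boosts accuracy once a model is large enough to actually produce useful intermediate steps.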

Applications of Large Models

Large language models can be used for:

Content creation

  • Write articles, emails, reports
  • Create novels, poetry, scripts
  • Generate marketing copy

Programming assistant

  • Write code, debug code
  • Explain code, refactor code
  • Convert between programming languages

Knowledge Q&A

  • Answer various questions
  • Explain complex concepts
  • Provide learning suggestions

Language translation

  • Multi-language translation
  • Real-time conversation translation

Chatbots

  • Customer service bots
  • Virtual assistants
  • Role-playing

Limitations of Large Models

Large models also have limitations:

Hallucination

Large models can state wrong information with complete confidence. They don’t know what they don’t know.

Knowledge cutoff

A model’s knowledge is frozen at its training data’s cutoff date. GPT-4’s cutoff, for example, is in 2023.

Bias

Models may inherit bias from training data.

Cost

Training large models costs tens of millions of dollars. Running large models also needs expensive GPUs.

Safety

Large models may be misused to generate fake news, assist cyberattacks, and serve other malicious purposes.

Open-Source Large Models

OpenAI released its early models openly, but its later models became closed-source.

Other companies released open-source large models:

  • LLaMA: Released by Meta, widely used by open-source community
  • Mistral: Released by the French company Mistral AI, excellent performance
  • Qwen: Released by Alibaba, strong Chinese capability
  • Yi: Released by 01.AI
  • DeepSeek: Released by the Chinese AI company DeepSeek

Open-source large models let more people use and improve large model technology.

Next Step: ChatGPT

In November 2022, OpenAI released ChatGPT.

This was a chatbot based on GPT-3.5 that could converse naturally with people.

It reached 100 million users in two months, making it the fastest-growing consumer application in history at the time.

ChatGPT brought large language models into the public eye.

Tomorrow, we’ll discuss the ChatGPT story.


Today’s Key Concepts

Large Language Model (LLM) A language model with a huge parameter count, like GPT, LLaMA, or Claude. Large language models learn from massive amounts of text, mastering language understanding and generation, and are used for conversation, writing, programming, and other tasks.

Transformer A neural network architecture proposed in 2017 that uses a self-attention mechanism to process sequence data. It computes in parallel, is efficient, and became the standard architecture for large language models.

Emergent Abilities New capabilities that appear suddenly in large models and that small models lack. Abilities like chain-of-thought and mathematical reasoning only appear once model scale passes a critical point.


Discussion Questions

  1. Large language models learned language and knowledge by “predicting the next word.” Do you think this is similar to how humans learn language?
  2. Large models produce “hallucinations,” confidently saying wrong information. Do you think this problem can be solved?

Tomorrow’s Preview: The ChatGPT Moment—how did AI enter the public eye and change human-computer interaction?
