Large Language Models: The Principles Behind GPT

Author: sun.ao, a programmer passionate about technology, focusing on AI and digital transformation.

Computing Through the Ages - This article is part of a series.

In June 2020, OpenAI released GPT-3.

This was a language model with 175 billion parameters.

Training it used:

  • 45TB of text data
  • Thousands of GPUs
  • Tens of millions of dollars

What can it do?

Give it an opening, and it continues an article.

Give it a question, and it gives an answer.

Give it a programming task, and it writes code.

It can even do mathematical reasoning, translate languages, simulate conversations…

People were surprised to find: If a model is big enough, it can exhibit unexpected capabilities.

What is a Language Model?

A language model’s task is simple: Predict the next word.

Given “The weather today is really”, the next word might be “good”, “nice”, “terrible”…

It seems simple, but doing it well requires understanding both language and the world.

To predict “The weather today is really nice, let’s go ___”, the model needs to know the association between “nice weather” and “go out”.

To predict “Xiao Ming failed the exam, he is very ___”, the model needs to understand human emotions.

Language models learn the patterns of language and knowledge of the world by learning massive text.
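"Predict the next word" can be sketched with a toy bigram model: count which word follows each word in a corpus, then predict the most frequent follower. Real language models learn probabilities with neural networks over far more context, but the training objective is the same idea. The corpus below is made up for illustration:

```python
from collections import Counter, defaultdict

# A tiny made-up corpus, tokenized by whitespace.
corpus = (
    "the weather today is really nice . "
    "the weather today is really good . "
    "the weather yesterday was really bad ."
).split()

# Count, for every word, which words follow it and how often.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word seen after `word`, or None if unseen."""
    counts = follows[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("weather"))  # "today" follows "weather" most often here
```

A bigram model only looks one word back; the leap from this sketch to GPT is learning to use hundreds or thousands of preceding words at once.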

Transformer: The Foundation of Large Models

Large language models are based on the Transformer architecture.

In 2017, Google published the paper “Attention Is All You Need,” proposing Transformer.

Before this, language models mainly used RNNs (recurrent neural networks). An RNN processes text one token at a time, which makes it slow and hard to parallelize.

The Transformer uses a self-attention mechanism and can process entire sequences in parallel, greatly improving efficiency.

Transformer became the standard architecture for large language models. GPT, BERT, and LLaMA are all based on Transformer.
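The core of self-attention can be shown in a few lines: every position scores its similarity against every other position, the scores are softmaxed, and each output is a weighted mix of the whole sequence. This sketch strips out the learned query/key/value projections and multiple heads that a real Transformer has, keeping only the parallel attention computation:

```python
import numpy as np

def self_attention(X):
    """Simplified scaled dot-product self-attention.

    X: (seq_len, d) array of token embeddings.
    Returns a (seq_len, d) array where each row is a weighted mix of
    all rows of X -- every position attends to every other position,
    so the whole sequence is processed at once, not token by token.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # (seq_len, seq_len) pairwise similarity
    # Row-wise softmax (subtracting the max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X  # mix the value vectors by attention weight

X = np.random.randn(5, 8)  # 5 tokens, 8-dimensional embeddings
out = self_attention(X)
print(out.shape)  # (5, 8)
```

Because the score matrix is one big matrix multiplication, GPUs can compute attention for all positions simultaneously; an RNN has no equivalent shortcut.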

GPT’s Evolution

GPT-1 (2018)

OpenAI released the first GPT model with 117 million parameters.

It was pre-trained on unlabeled text, then fine-tuned for specific tasks.

Results were good but didn’t attract much attention.

GPT-2 (2019)

Parameters increased to 1.5 billion. More training data.

OpenAI initially refused to release the full model, worried it would be used to generate fake news.

Later they changed their mind and released the full model.

GPT-2 could generate coherent long text but often went off-topic or repeated.

GPT-3 (2020)

Parameters increased to 175 billion, trained on 45TB of text.

GPT-3 demonstrated few-shot learning capability: just give a few examples and it can learn new tasks.

It could write articles, write code, answer questions, translate languages…

GPT-3 made people realize: Scale matters.
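Few-shot learning means putting the examples directly into the prompt text; the model picks up the task pattern from context, with no retraining. A minimal sketch of the prompt format (the example pairs and the "Input/Output" wording are illustrative conventions, not any official API):

```python
def few_shot_prompt(examples, query):
    """Build a few-shot prompt: worked examples followed by a new query.

    examples: list of (input, output) pairs demonstrating the task.
    query: the new input the model should complete.
    """
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Two English -> French pairs are enough to define the task in-context.
prompt = few_shot_prompt(
    [("cheese", "fromage"), ("house", "maison")],
    "cat",
)
print(prompt)
```

Fed a prompt like this, GPT-3 infers "translate English to French" from the pattern alone; that in-context ability is what made the 2020 paper's title, "Language Models are Few-Shot Learners," stick.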

GPT-4 (2023)

The parameter count was not disclosed; estimates put it in the trillions.

GPT-4 is a multimodal model: it can understand both images and text.

It performed excellently on various exams: top 10% on simulated bar exam, top 20% on SAT math.

Scaling Laws

OpenAI discovered a pattern: Scaling Laws.

Model capability grows with three factors:

  • Parameter count: Bigger models, stronger capabilities
  • Data volume: More training data, stronger capabilities
  • Compute: Longer training, stronger capabilities

When these three factors are scaled up together, the model's performance can be predicted in advance.

This explains why big companies compete to train bigger models.
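The scaling laws fit each factor with a power law. For parameter count alone, Kaplan et al. (2020) report test loss falling roughly as L(N) = (N_c / N)^α. A sketch using that paper's fitted constants (α ≈ 0.076, N_c ≈ 8.8e13), which should be read as illustrative of the trend rather than an exact predictor:

```python
def loss_from_params(n_params, n_c=8.8e13, alpha=0.076):
    """Predicted test loss as a power law of parameter count.

    Constants are the fits reported by Kaplan et al. (2020);
    treat the numbers as illustrative, not exact.
    """
    return (n_c / n_params) ** alpha

# Loss falls smoothly and predictably as models grow:
for name, n in [("GPT-1", 1.17e8), ("GPT-2", 1.5e9), ("GPT-3", 1.75e11)]:
    print(f"{name}: {n:9.2e} params -> predicted loss {loss_from_params(n):.2f}")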

Emergent Abilities

Even more interesting are Emergent Abilities.

Things small models can’t do, big models suddenly can.

For example:

  • Chain-of-thought reasoning: GPT-3 can’t, GPT-3.5 can
  • Mathematical reasoning: Small models are bad, big models suddenly get better
  • Programming ability: Small models generate garbage, big models write runnable code

This is like a phase transition in physics: heat water past its boiling point and it suddenly becomes steam.

Emergent abilities make large models more useful but also harder to predict.
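Chain-of-thought prompting, one of the abilities that emerges at scale, simply means asking the model to write out intermediate steps before its answer. A minimal sketch of the prompt format (the "let's think step by step" phrasing is the common convention from the prompting literature, not an official API):

```python
def chain_of_thought_prompt(question):
    """Wrap a question so the model is nudged to show its reasoning steps."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = chain_of_thought_prompt(
    "If I have 3 apples and buy 2 more, how many do I have?"
)
print(prompt)
```

The emergence finding is that this nudge does nothing for small models, yet reliably boosts accuracy once a model is large enough to actually produce useful intermediate steps.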

Applications of Large Models

Large language models can be used for:

Content creation

  • Write articles, emails, reports
  • Create novels, poetry, scripts
  • Generate marketing copy

Programming assistant

  • Write code, debug code
  • Explain code, refactor code
  • Convert between programming languages

Knowledge Q&A

  • Answer various questions
  • Explain complex concepts
  • Provide learning suggestions

Language translation

  • Multi-language translation
  • Real-time conversation translation

Chatbots

  • Customer service bots
  • Virtual assistants
  • Role-playing

Limitations of Large Models

Large models also have limitations:

Hallucination

Large models can state wrong information with complete confidence. They don’t know what they don’t know.

Knowledge cutoff

A model’s knowledge is frozen at its training data’s cutoff date. GPT-4’s cutoff, for example, is in 2023.

Bias

Models may inherit bias from training data.

Cost

Training large models costs tens of millions of dollars. Running large models also needs expensive GPUs.

Safety

Large models may be misused to generate fake news, assist cyberattacks, and serve other malicious purposes.

Open-Source Large Models

OpenAI released its early models openly, but its later models became closed-source.

Other companies released open-source large models:

  • LLaMA: Released by Meta, widely used by open-source community
  • Mistral: Released by the French company Mistral AI, excellent performance
  • Qwen: Released by Alibaba, strong Chinese capability
  • Yi: Released by 01.AI
  • DeepSeek: Released by the Chinese AI company DeepSeek

Open-source large models let more people use and improve large model technology.

Next Step: ChatGPT

In November 2022, OpenAI released ChatGPT.

This was a chatbot based on GPT-3.5 that could converse naturally with people.

It reached 100 million users in two months, making it the fastest-growing consumer application in history at the time.

ChatGPT brought large language models into the public eye.

Tomorrow, we’ll discuss the ChatGPT story.


Today’s Key Concepts

Large Language Model (LLM) A language model with a huge parameter count, like GPT, LLaMA, or Claude. Large language models learn from massive amounts of text, mastering language understanding and generation, and are used for conversation, writing, programming, and other tasks.

Transformer A neural network architecture proposed in 2017 that uses a self-attention mechanism to process sequence data. It computes in parallel, is efficient, and became the standard architecture for large language models.

Emergent Abilities New capabilities that appear suddenly in large models and that small models lack. Abilities like chain-of-thought and mathematical reasoning only appear once model scale passes a critical point.


Discussion Questions

  1. Large language models learned language and knowledge by “predicting the next word.” Do you think this is similar to how humans learn language?
  2. Large models produce “hallucinations,” confidently saying wrong information. Do you think this problem can be solved?

Tomorrow’s Preview: The ChatGPT Moment—how did AI enter the public eye and change human-computer interaction?
