In 2017, The Economist magazine’s cover headline was:
“The world’s most valuable resource is no longer oil, but data.”
This statement captured a defining truth of our era.
Like oil, which must be drilled, refined, and processed, data must be collected, cleaned, and analyzed.
But data differs from oil:
- Data can be infinitely copied, never depleted
- Data becomes more valuable the more it’s used, not less
- Data can be combined to produce new value
Mobile internet, social media, IoT… every moment generates data.
What happens on the internet every minute?
- Google processes 3.8 million searches
- Users upload 500 hours of video to YouTube
- Facebook users post 500,000 comments
- Instagram users upload 50,000 photos
- WeChat users send over 100 million messages
This data is the oil of the new era.
The Data Explosion#
Humanity has always generated data, but the growth rate keeps accelerating:
1980s: Global data totaled just a few GB
1990s: The internet appeared; data grew to the TB scale
2000s: Social media emerged; data grew to the PB scale
2010s: Mobile internet spread; data grew to the EB scale
2020s: IoT and AI took off; data grew to the ZB scale
1 ZB = 1 billion TB. Today, the world generates over 100 ZB of data annually.
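The jump from GB to ZB is easier to grasp in code. A quick sketch using decimal storage units, where each step up is a factor of 1000 (the unit list and function here are illustrative, not from any particular library):

```python
# Decimal storage units: each step is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def bytes_in(unit: str) -> int:
    """Number of bytes in one of the named decimal units."""
    return 1000 ** UNITS.index(unit)

# ZB is three steps above TB, hence a factor of 1000^3.
tb_per_zb = bytes_in("ZB") // bytes_in("TB")
print(f"1 ZB = {tb_per_zb:,} TB")  # 1 ZB = 1,000,000,000 TB
```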
Characteristics of Big Data#
Big data has four Vs:
Volume: Data too massive for traditional tools to process
Velocity: Data generated rapidly, demanding real-time processing
Variety: Many data types, including text, images, video, and sensor data
Veracity: Data quality varies, requiring cleaning and verification
Later a fifth V was added:
Value: Data itself has no value; only analysis produces value
Hadoop: The Tool for Processing Big Data#
Traditional databases couldn't cope with data at this scale.
In 2003 and 2004, Google published two papers describing its internal distributed systems: GFS (Google File System) and MapReduce.
Doug Cutting (together with Mike Cafarella) was inspired by them and developed Hadoop, an open-source distributed computing framework.
Hadoop includes:
HDFS: A distributed file system that stores data across multiple machines
MapReduce: A distributed computing framework that splits computation across multiple machines
Hadoop let ordinary companies process massive data without building expensive data centers like Google.
Later, Spark largely replaced MapReduce thanks to much faster in-memory processing, and Flink added stream processing for real-time data analysis.
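The MapReduce model itself is simple enough to sketch in a few lines. Here is a toy, single-machine word count showing the map → shuffle → reduce phases; real Hadoop runs each phase in parallel across many machines, but the logic is the same:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data is the new oil"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'the': 1, 'new': 1, 'oil': 1}
```

The key idea: because map runs independently on each document and reduce runs independently on each key, both phases can be spread over thousands of machines without the programmer writing any coordination code.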
The Value of Data#
What is data used for?
Business decisions
Amazon analyzes user purchase records to recommend related products; its recommendation system has been estimated to drive about 35% of sales.
Netflix analyzes user viewing records to recommend content and even decide what shows to produce.
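One simple mechanism behind such recommendations is "users who bought X also bought Y." A minimal sketch with a hypothetical purchase history (plain co-occurrence counting; actual Amazon and Netflix systems are vastly more sophisticated, and all names and items below are invented):

```python
# Hypothetical purchase history: user -> set of items bought.
purchases = {
    "alice": {"keyboard", "mouse", "monitor"},
    "bob":   {"keyboard", "mouse"},
    "carol": {"mouse", "monitor"},
}

def recommend(user: str) -> list:
    """Recommend items the user hasn't bought, ranked by how many
    overlapping users (those sharing at least one item) bought them."""
    mine = purchases[user]
    scores = {}
    for other, theirs in purchases.items():
        if other == user or not (mine & theirs):
            continue  # skip self and users with nothing in common
        for item in theirs - mine:
            scores[item] = scores.get(item, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("bob"))  # ['monitor']
```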
Precision advertising
Google and Facebook analyze user behavior to deliver targeted ads. This is their main revenue source.
Risk control
Banks analyze transaction data to detect fraud. Insurance companies analyze customer data to customize premiums.
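A toy version of transaction fraud detection is outlier flagging: mark amounts far from a customer's typical spending. The sketch below uses a simple mean-and-standard-deviation rule on invented data; real bank systems use far richer features and machine learning models:

```python
def flag_outliers(amounts, threshold=2.0):
    """Flag amounts more than `threshold` standard deviations from the mean.
    The low threshold suits this tiny toy dataset."""
    mean = sum(amounts) / len(amounts)
    variance = sum((a - mean) ** 2 for a in amounts) / len(amounts)
    std = variance ** 0.5
    return [a for a in amounts if std > 0 and abs(a - mean) > threshold * std]

# Hypothetical card history: small everyday purchases, then one huge charge.
history = [25.0, 30.0, 27.5, 22.0, 31.0, 29.0, 26.0, 28.0, 5000.0]
print(flag_outliers(history))  # [5000.0]
```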
Healthcare
Analyzing medical data to predict disease outbreaks. Analyzing genetic data for personalized treatment.
Urban management
Analyzing traffic data to optimize traffic lights. Analyzing energy data for smart scheduling.
Scientific research
Analyzing astronomical data to discover new celestial bodies. Analyzing physics experiment data to verify theories.
Data Collection#
Where does data come from?
User behavior: Clicks, browsing, purchases, searches…
Sensors: Temperature, location, speed, images…
Social media: Posts, comments, likes, shares…
Transaction records: Purchases, payments, transfers…
Device logs: Server logs, application logs…
Every moment, data is being generated. Collecting data is no longer the problem; the problem is how to use it.
Data Privacy#
But data collection brings problems: Privacy.
Tech companies collect our:
- Location information
- Search history
- Purchase history
- Social relationships
- Even conversation content
This data can be used to:
- Deliver targeted ads
- Influence our decisions
- Even manipulate elections (as in the Cambridge Analytica scandal)
In 2018, the EU's GDPR (General Data Protection Regulation) took effect, regulating how companies collect and process personal data; violators face heavy fines.
China likewise enacted the Personal Information Protection Law (in effect since 2021) to protect citizens' data rights.
Data Monopoly#
Data also has monopoly issues.
Big companies have more data, so they can train better AI models, offer better services, attract more users, and thereby generate even more data: a positive feedback loop.
Small companies can’t compete.
This produces a "winner-takes-all" dynamic in data: Google dominates search data, Facebook social data, Amazon shopping data.
Regulators are starting to pay attention to data monopoly issues, but solutions are still being explored.
The Future of Data#
What is the future of data?
More data: IoT will connect tens of billions of devices, generating more data.
Real-time processing: 5G and edge computing will enable real-time data processing.
AI analysis: Machine learning can discover patterns in data that humans can’t find.
Privacy protection: Technologies like differential privacy and federated learning can use data while protecting privacy.
Data trading: Data markets let companies buy and sell data, unlocking data value.
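Differential privacy, mentioned above, has a concrete core: add calibrated random noise so that no single person's record can be inferred from a published statistic. A minimal sketch of the Laplace mechanism for a counting query (`epsilon` is the privacy budget; smaller means more privacy but more noise):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via inverse-CDF of a uniform draw."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism for a counting query: sensitivity is 1 (adding or
    removing one person changes the count by at most 1), so noise with
    scale 1/epsilon gives epsilon-differential privacy."""
    return true_count + laplace_noise(1.0 / epsilon)

# Publishing "how many users searched for X today" without exposing anyone:
print(private_count(1000, epsilon=0.5))  # a noisy value near 1000
```

The published number stays useful in aggregate while the noise masks any individual's contribution; federated learning attacks the same problem from the other direction, by training models without centralizing the raw data at all.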
Next Step: Machine Learning#
Data itself has no value; only analysis produces value.
Traditional data analysis relied on humans, but data volumes have grown far beyond what humans can process.
Machine learning lets computers learn from data and automatically discover patterns.
Tomorrow, we’ll discuss machine learning.
Today’s Key Concepts#
**Big Data**: Massive, high-velocity, diverse data collections that traditional tools can't process. Its characteristics are the four Vs: Volume, Velocity, Variety, Veracity.
**Distributed Computing**: Splitting computing tasks across multiple machines for parallel execution. Hadoop and Spark are distributed computing frameworks for processing massive data.
**Data Privacy**: The protection of personal data. Tech companies collect massive amounts of personal data, raising privacy concerns; regulations like GDPR protect personal data rights.
Discussion Questions#
- “Data is the new oil”—do you agree? What are the similarities and differences between data and oil?
- Tech companies collect a lot of our data. How do you think we should balance data use and privacy protection?
Tomorrow’s Preview: Introduction to Machine Learning—how to teach computers to learn by themselves?
