In 2017, The Economist magazine’s cover headline was:
“The world’s most valuable resource is no longer oil, but data.”
This statement captured a defining truth of our era.
Like oil, which must be drilled, refined, and processed, data must be collected, cleaned, and analyzed.
But data differs from oil:
- Data can be infinitely copied, never depleted
- Data becomes more valuable the more it’s used, not less
- Data can be combined to produce new value
Mobile internet, social media, IoT… every moment generates data.
What happens on the internet every minute?
- Google processes 3.8 million searches
- Users upload 500 hours of video to YouTube
- Facebook users post 500,000 comments
- Instagram users upload 50,000 photos
- WeChat users send over 100 million messages
This data is the oil of the new era.
The Data Explosion#
Humanity has always generated data, but the growth rate keeps accelerating:
1980s: Global data totaled just a few GB
1990s: The internet appeared; data grew to the TB scale
2000s: Social media emerged; data grew to the PB scale
2010s: Mobile internet spread; data grew to the EB scale
2020s: IoT and AI took off; data grew to the ZB scale
1 ZB = 1 billion TB. Today, the world generates over 100 ZB of data annually.
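The jump from GB to ZB is easier to grasp in code. A quick sketch using decimal storage units, where each step up is a factor of 1000 (the unit list and function here are illustrative, not from any particular library):

```python
# Decimal storage units: each step is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def bytes_in(unit: str) -> int:
    """Number of bytes in one of the named decimal units."""
    return 1000 ** UNITS.index(unit)

# ZB is three steps above TB, hence a factor of 1000^3.
tb_per_zb = bytes_in("ZB") // bytes_in("TB")
print(f"1 ZB = {tb_per_zb:,} TB")  # 1 ZB = 1,000,000,000 TB
```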
Characteristics of Big Data#
Big data has four Vs:
Volume: Data too massive for traditional tools to process
Velocity: Data generated rapidly, demanding real-time processing
Variety: Many data types, including text, images, video, and sensor data
Veracity: Data quality varies, requiring cleaning and verification
Later a fifth V was added:
Value: Data itself has no value; only analysis produces value
Hadoop: The Tool for Processing Big Data#
Traditional databases couldn't cope with data at this scale.
In 2003 and 2004, Google published two papers describing its internal distributed systems: GFS (Google File System) and MapReduce.
Doug Cutting (together with Mike Cafarella) was inspired by them and developed Hadoop, an open-source distributed computing framework.
Hadoop includes:
HDFS: A distributed file system that stores data across multiple machines
MapReduce: A distributed computing framework that splits computation across multiple machines
Hadoop let ordinary companies process massive data without building expensive data centers like Google.
Later, Spark largely replaced MapReduce thanks to much faster in-memory processing, and Flink added stream processing for real-time data analysis.
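The MapReduce model itself is simple enough to sketch in a few lines. Here is a toy, single-machine word count showing the map → shuffle → reduce phases; real Hadoop runs each phase in parallel across many machines, but the logic is the same:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data is the new oil"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'the': 1, 'new': 1, 'oil': 1}
```

The key idea: because map runs independently on each document and reduce runs independently on each key, both phases can be spread over thousands of machines without the programmer writing any coordination code.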
The Value of Data#
What is data used for?
Business decisions
Amazon analyzes user purchase records to recommend related products; its recommendation system has been estimated to drive about 35% of sales.
Netflix analyzes user viewing records to recommend content and even decide what shows to produce.
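One simple mechanism behind such recommendations is "users who bought X also bought Y." A minimal sketch with a hypothetical purchase history (plain co-occurrence counting; actual Amazon and Netflix systems are vastly more sophisticated, and all names and items below are invented):

```python
# Hypothetical purchase history: user -> set of items bought.
purchases = {
    "alice": {"keyboard", "mouse", "monitor"},
    "bob":   {"keyboard", "mouse"},
    "carol": {"mouse", "monitor"},
}

def recommend(user: str) -> list:
    """Recommend items the user hasn't bought, ranked by how many
    overlapping users (those sharing at least one item) bought them."""
    mine = purchases[user]
    scores = {}
    for other, theirs in purchases.items():
        if other == user or not (mine & theirs):
            continue  # skip self and users with nothing in common
        for item in theirs - mine:
            scores[item] = scores.get(item, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("bob"))  # ['monitor']
```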
Precision advertising
Google and Facebook analyze user behavior to deliver targeted ads. This is their main revenue source.
Risk control
Banks analyze transaction data to detect fraud. Insurance companies analyze customer data to customize premiums.
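A toy version of transaction fraud detection is outlier flagging: mark amounts far from a customer's typical spending. The sketch below uses a simple mean-and-standard-deviation rule on invented data; real bank systems use far richer features and machine learning models:

```python
def flag_outliers(amounts, threshold=2.0):
    """Flag amounts more than `threshold` standard deviations from the mean.
    The low threshold suits this tiny toy dataset."""
    mean = sum(amounts) / len(amounts)
    variance = sum((a - mean) ** 2 for a in amounts) / len(amounts)
    std = variance ** 0.5
    return [a for a in amounts if std > 0 and abs(a - mean) > threshold * std]

# Hypothetical card history: small everyday purchases, then one huge charge.
history = [25.0, 30.0, 27.5, 22.0, 31.0, 29.0, 26.0, 28.0, 5000.0]
print(flag_outliers(history))  # [5000.0]
```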
Healthcare
Analyzing medical data to predict disease outbreaks. Analyzing genetic data for personalized treatment.
Urban management
Analyzing traffic data to optimize traffic lights. Analyzing energy data for smart scheduling.
Scientific research
Analyzing astronomical data to discover new celestial bodies. Analyzing physics experiment data to verify theories.
Data Collection#
Where does data come from?
User behavior: Clicks, browsing, purchases, searches…
Sensors: Temperature, location, speed, images…
Social media: Posts, comments, likes, shares…
Transaction records: Purchases, payments, transfers…
Device logs: Server logs, application logs…
Every moment, data is being generated. Collecting data is no longer the problem; the problem is how to use it.
Data Privacy#
But data collection brings problems: Privacy.
Tech companies collect our:
- Location information
- Search history
- Purchase history
- Social relationships
- Even conversation content
This data can be used to:
- Deliver targeted ads
- Influence our decisions
- Even manipulate elections (as in the Cambridge Analytica scandal)
In 2018, the EU's GDPR (General Data Protection Regulation) took effect, regulating how companies collect and process personal data; violators face heavy fines.
China likewise enacted the Personal Information Protection Law (in effect since 2021) to protect citizens' data rights.
Data Monopoly#
Data also has monopoly issues.
Big companies have more data, so they can train better AI models, offer better services, attract more users, and thereby generate even more data: a positive feedback loop.
Small companies can’t compete.
This produces a "winner-takes-all" dynamic in data: Google dominates search data, Facebook social data, Amazon shopping data.
Regulators are starting to pay attention to data monopoly issues, but solutions are still being explored.
The Future of Data#
What is the future of data?
More data: IoT will connect tens of billions of devices, generating more data.
Real-time processing: 5G and edge computing will enable real-time data processing.
AI analysis: Machine learning can discover patterns in data that humans can’t find.
Privacy protection: Technologies like differential privacy and federated learning can use data while protecting privacy.
Data trading: Data markets let companies buy and sell data, unlocking data value.
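Differential privacy, mentioned above, has a concrete core: add calibrated random noise so that no single person's record can be inferred from a published statistic. A minimal sketch of the Laplace mechanism for a counting query (`epsilon` is the privacy budget; smaller means more privacy but more noise):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via inverse-CDF of a uniform draw."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism for a counting query: sensitivity is 1 (adding or
    removing one person changes the count by at most 1), so noise with
    scale 1/epsilon gives epsilon-differential privacy."""
    return true_count + laplace_noise(1.0 / epsilon)

# Publishing "how many users searched for X today" without exposing anyone:
print(private_count(1000, epsilon=0.5))  # a noisy value near 1000
```

The published number stays useful in aggregate while the noise masks any individual's contribution; federated learning attacks the same problem from the other direction, by training models without centralizing the raw data at all.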
Next Step: Machine Learning#
Data itself has no value; only analysis produces value.
Traditional data analysis relied on humans, but data volumes have grown far beyond what humans can process.
Machine learning lets computers learn from data and automatically discover patterns.
Tomorrow, we’ll discuss machine learning.
Today’s Key Concepts#
**Big Data**: Massive, high-velocity, diverse data collections that traditional tools can't process. Its characteristics are the four Vs: Volume, Velocity, Variety, Veracity.
**Distributed Computing**: Splitting computing tasks across multiple machines for parallel execution. Hadoop and Spark are distributed computing frameworks for processing massive data.
**Data Privacy**: The protection of personal data. Tech companies collect massive amounts of personal data, raising privacy concerns; regulations like GDPR protect personal data rights.
Discussion Questions#
- “Data is the new oil”—do you agree? What are the similarities and differences between data and oil?
- Tech companies collect a lot of our data. How do you think we should balance data use and privacy protection?
Tomorrow’s Preview: Introduction to Machine Learning—how to teach computers to learn by themselves?
