Hadoop and friends
An introduction to the big data world
Created by Javi M. / @javiyt
Agenda
- What's big data? The three Vs
- Big Data evolution
- What's Hadoop?
- Hadoop Ecosystem
- Spark
- Real Time
- More to come
What's big data?
The three Vs
Big Data evolution
What's Hadoop?
- Framework for distributed storage and large-scale data processing
- Four modules/layers:
- Hadoop Common
- HDFS
- MapReduce
- YARN
What's Hadoop?
Hadoop Common
- Java libraries and scripts needed to start Hadoop
- Requires JRE 1.6 or higher
- Supports all the other modules
What's Hadoop?
HDFS
- Distributed file system spread across a cluster
- Designed for large files, from gigabytes to terabytes
- Optimized for high-throughput reads
- Files are accessed through the Java API (or Thrift from other languages)
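To make "distributed through a cluster" concrete, here is a minimal Python sketch of how a large file could be cut into fixed-size blocks, with each block copied to several nodes. It is a toy model, not HDFS code: the function name `place_blocks`, the node names, and the round-robin placement are assumptions made for illustration. The 128 MB block size and replication factor of 3 are the usual defaults in Hadoop 2.x.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual Hadoop 2.x default
REPLICATION = 3                 # usual default replication factor


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [node, ...]} for a file of file_size bytes.

    Hypothetical round-robin placement; real HDFS placement is rack-aware.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for b in range(n_blocks):
        # Replicas of one block always land on distinct nodes.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
layout = place_blocks(1024 * 1024 * 1024, nodes)  # a 1 GB file
```

Losing any single node still leaves two copies of every block, which is why a reader can always be pointed at a nearby replica.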
What's Hadoop?
MapReduce
- JobTracker + TaskTracker
- Separate JVM processes
- Tasks go to the node nearest the data (data locality), not to the least-loaded node
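As a reminder of what a MapReduce job actually computes, the classic word count can be sketched in plain Python. This runs in a single process and only imitates the three phases (map, shuffle, reduce). On a cluster the JobTracker would hand the map and reduce tasks to TaskTrackers running near the data. The function names are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data", "big cluster"])
```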
What's Hadoop?
YARN
- Resource manager
- Responsible for distributing user applications or MapReduce jobs across nodes
Hadoop Ecosystem
Hive
- Data warehouse that uses Hadoop as its storage layer
- SQL-like query language: HiveQL
- Can handle compressed data: gzip, bzip2, LZO, ...
- Metadata is stored in an RDBMS (the metastore)
Hadoop Ecosystem
Impala
- Runs SQL queries on top of HDFS/HBase
- Supports Hadoop file formats: Parquet, LZO, ...
- Typically faster than Hive for interactive queries
- Shares the metastore with Hive
Hadoop Ecosystem
HBase
- NoSQL database
- Column-oriented storage (column families)
- Uses HDFS as its storage layer
- Access via REST, Avro, or Thrift
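The column-oriented data model is easier to picture with a small example. The sketch below is a toy in-memory table, not the HBase client API: the class `ToyTable` and its methods are hypothetical. It shows the main ideas: rows are addressed by key, columns are named `family:qualifier`, families are fixed when the table is created, and each cell keeps timestamped versions.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical in-memory model of a column-family table."""

    def __init__(self, families):
        self.families = set(families)
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts):
        family = column.split(":", 1)[0]
        if family not in self.families:
            # Families are fixed up front; qualifiers can be added freely.
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][column].append((ts, value))

    def get(self, row, column):
        """Return the most recent version of a cell, or None if absent."""
        versions = self.rows.get(row, {}).get(column, [])
        if not versions:
            return None
        return max(versions, key=lambda v: v[0])[1]


table = ToyTable(["info"])
table.put("user1", "info:name", "Ana", ts=1)
table.put("user1", "info:name", "Anna", ts=2)
latest = table.get("user1", "info:name")
```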
Hadoop Ecosystem
Oozie
- Workflow scheduler
- Can execute:
- MapReduce operations
- HDFS operations
- …
- Workflows are defined in XML
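A workflow definition is essentially a graph of actions with "ok" and "error" transitions. The sketch below walks such a graph in Python. The XML is a simplified imitation of a workflow file (no namespace, empty action bodies), and `run_workflow` and `runner` are hypothetical names. Nothing here talks to a real Oozie server.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical workflow: real definitions carry a namespace
# and full action configuration.
WORKFLOW = """
<workflow-app name="demo-wf">
  <start to="clean-output"/>
  <action name="clean-output">
    <fs/>
    <ok to="count-words"/>
    <error to="fail"/>
  </action>
  <action name="count-words">
    <map-reduce/>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"/>
  <end name="end"/>
</workflow-app>
"""


def run_workflow(xml_text, runner):
    """Follow ok/error transitions; runner(name, kind) returns True on success."""
    root = ET.fromstring(xml_text.strip())
    actions = {a.get("name"): a for a in root.findall("action")}
    current = root.find("start").get("to")
    visited = []
    while current in actions:
        action = actions[current]
        kind = action[0].tag  # first child names the action type
        succeeded = runner(current, kind)
        visited.append((current, kind))
        current = action.find("ok" if succeeded else "error").get("to")
    return visited, current  # current is now the end or kill node


steps, final_node = run_workflow(WORKFLOW, lambda name, kind: True)
```

Passing a `runner` that fails on the `map-reduce` step sends the walk to the `fail` node instead of `end`.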
Hadoop Ecosystem
Sqoop
- Connect relational databases to Hadoop
- Import/export to/from:
- RDBMS: MySQL, PostgreSQL, SQL Server, Oracle, ...
- Hive
- HBase
Hadoop Ecosystem
Pig
- High-level platform for creating MapReduce programs
- Scripting language: Pig Latin
- Well suited to ETL
- Can store data at any point during execution
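The ETL style that Pig encourages, with the option of storing intermediate results, can be imitated in a few lines of Python. This is only an analogy: the steps mirror the Pig Latin operators LOAD, FILTER, GROUP and STORE, but the functions `load`, `store` and `pipeline`, the sample data, and the `sink` dictionary standing in for HDFS are all assumptions.

```python
import csv
import io
from collections import defaultdict

RAW = "user,bytes\nana,120\nbob,0\nana,30\n"  # hypothetical input data


def load(text):
    """LOAD: parse CSV text into rows with typed fields."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"user": r["user"], "bytes": int(r["bytes"])} for r in reader]


def store(rows, sink, name):
    """STORE: keep a snapshot under a name, then pass the rows along."""
    sink[name] = list(rows)
    return rows


def pipeline(text, sink):
    rows = load(text)
    # FILTER, then store the intermediate relation mid-pipeline.
    nonzero = store([r for r in rows if r["bytes"] > 0], sink, "nonzero")
    # GROUP by user and sum the bytes.
    totals = defaultdict(int)
    for r in nonzero:
        totals[r["user"]] += r["bytes"]
    return store(sorted(totals.items()), sink, "totals")


sink = {}
result = pipeline(RAW, sink)
```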
Hadoop Ecosystem
HUE
- Web interface to access Hadoop resources
- It contains:
- HDFS browser
- Hive query editor
- Impala query editor
- Oozie workflows manager
- Pig editor
- ...
Spark
- Cluster computing framework
- Cluster management support:
- Storage:
Spark
- Spark SQL
- Spark Streaming
- MLlib
- GraphX
Real Time
Apache Storm
- Processes streaming data in real time
- Fault-tolerant
- No data loss: every tuple is guaranteed to be processed
- Topology: spouts (data sources) and bolts (processing steps)
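A topology wires spouts into bolts. The idea can be sketched with Python generators: a spout emits tuples, and each bolt consumes one stream and emits another. This is a single-process analogy with a finite input. A real topology runs unbounded streams in parallel across the cluster, and none of the names below come from the Storm API.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream (finite here, unbounded in practice)."""
    for sentence in ["storm processes streams", "streams never end"]:
        yield sentence


def split_bolt(stream):
    """Bolt: split each incoming sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keep a running count and emit it after every word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# Wiring: spout -> split bolt -> count bolt
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
```

Unlike a batch job, the count bolt emits an updated result as each word arrives rather than waiting for the input to finish.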
Real Time
Apache Storm
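- Nimbus: master node, distributes work across the cluster
- Supervisor: runs worker processes on each node
- ZooKeeper: coordination between Nimbus and Supervisors
- UI: web interface for monitoring topologies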
Real Time
Heron
- Compatible with the Storm API
- Built at Twitter as a faster replacement for Storm
More to come
Lambda Architecture
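- Three layers:
- Batch layer: recomputes views from the full dataset
- Speed layer: covers recent data with low latency
- Serving layer: answers queries by combining both views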
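How the three layers cooperate can be sketched with a simple event counter. This is a toy model under stated assumptions: the class `LambdaCounter` and its methods are hypothetical, everything lives in one process, and a real system would typically back each layer with a separate tool (for example a batch framework for the batch layer and a stream processor for the speed layer).

```python
from collections import Counter


class LambdaCounter:
    """Hypothetical single-process model of the three layers."""

    def __init__(self):
        self.master = []              # append-only master dataset
        self.batch_view = Counter()   # batch layer output, rebuilt from scratch
        self.speed_view = Counter()   # speed layer: events since the last batch

    def ingest(self, event):
        """New data goes to both the master dataset and the speed layer."""
        self.master.append(event)
        self.speed_view[event] += 1

    def run_batch(self):
        """Batch layer: recompute the view over all data, then reset speed."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key):
        """Serving layer: merge the batch view with the speed view."""
        return self.batch_view[key] + self.speed_view[key]


lam = LambdaCounter()
lam.ingest("click")
lam.ingest("click")
before_batch = lam.query("click")
lam.run_batch()
lam.ingest("click")
after_batch = lam.query("click")
```

Queries stay correct both before and after a batch run because the speed view only ever holds what the batch view has not yet absorbed.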
Questions