Hadoop and friends

an introduction to the big data world

Created by Javi M. / @javiyt

Agenda

  • What's big data? The three Vs
  • Big Data evolution
  • What's Hadoop?
  • Hadoop Ecosystem
  • Spark
  • Real Time
  • More to come

What's big data?

The three Vs

Big Data evolution

What's Hadoop?

  • Framework for distributed storage and large-scale data processing
  • Four modules/layers
    • Hadoop Common
    • HDFS
    • MapReduce
    • YARN

What's Hadoop?

Hadoop Common

  • Java libraries and scripts needed to start Hadoop
  • Requires JRE 1.6 or later
  • Provides the utilities that support all the other modules

What's Hadoop?

HDFS

  • File system distributed across the nodes of a cluster
  • Can store files ranging from gigabytes to terabytes
  • Optimized for high-throughput reads
  • Files are accessed through the Java API (Thrift for other languages)
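
How a large file ends up spread across a cluster can be sketched as follows. This is a simplified illustration, not HDFS code: the 128 MB block size and replication factor of 3 are assumed values, and the round-robin placement is a stand-in for the rack-aware policy a real cluster uses.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy, chosen only to keep the sketch short.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }
```

A 300 MB file would be cut into two full blocks plus one smaller block, each stored on three different nodes.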

What's Hadoop?

MapReduce

  • JobTracker + TaskTracker
  • Separate JVM processes
  • Tasks run on the node nearest to the data (data locality), not balanced by load
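
The programming model itself can be shown with the classic word count. This is a single-process sketch of the map, shuffle, and reduce phases; the function names are illustrative and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a cluster, the map calls would run in parallel next to the data blocks, and the shuffle would move data between nodes.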

What's Hadoop?

YARN

  • Resource manager
  • Responsible for distributing user applications or MapReduce jobs across the nodes
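
The idea of a resource manager handing out capacity can be sketched as below. The "most free memory first" rule is a hypothetical policy chosen for brevity; it is not one of YARN's actual schedulers, and the names are illustrative.

```python
def allocate(requests, nodes):
    """Assign each (app, memory) request to a node with enough free memory.

    requests: list of (app_name, memory_needed)
    nodes:    dict of node_name -> free memory
    Returns a list of (app_name, node_name or None).
    """
    free = dict(nodes)  # work on a copy so the caller's dict is untouched
    assignments = []
    for app, memory in requests:
        candidates = [node for node in free if free[node] >= memory]
        if not candidates:
            assignments.append((app, None))  # request has to wait
            continue
        node = max(candidates, key=lambda name: free[name])
        free[node] -= memory
        assignments.append((app, node))
    return assignments
```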

Hadoop Ecosystem

Hive

  • Data warehouse that uses Hadoop as its storage layer
  • HiveQL
  • Can handle compressed data: gzip, bzip2, LZO, ...
  • RDBMS for storing metadata
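
The split between metadata and raw, possibly compressed, data can be sketched like this. The `METASTORE` dictionary, the `visits` table, and its columns are hypothetical; in Hive the schema would live in a relational database and the query would be written in HiveQL.

```python
import csv
import gzip
import io

# Hypothetical metastore: table name -> ordered (column, type) pairs.
METASTORE = {"visits": [("visitor", str), ("pages", int)]}


def write_table(rows):
    """Return gzip-compressed, comma-separated bytes for the given rows."""
    text = io.StringIO()
    csv.writer(text).writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


def scan(table, data):
    """Decompress the raw bytes and apply the table schema while reading."""
    schema = METASTORE[table]
    text = gzip.decompress(data).decode("utf-8")
    for record in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, record)}


def total_pages_by_visitor(data):
    """Roughly: SELECT visitor, SUM(pages) FROM visits GROUP BY visitor."""
    totals = {}
    for row in scan("visits", data):
        totals[row["visitor"]] = totals.get(row["visitor"], 0) + row["pages"]
    return totals
```

The stored bytes carry no type information; the schema is applied only when the data is read.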

Hadoop Ecosystem

Impala

  • Runs SQL queries on top of HDFS/HBase
  • Supports Hadoop file formats and codecs: Parquet, LZO, ...
  • Lower latency than Hive: queries run on its own engine instead of MapReduce jobs
  • Shares the metastore with Hive

Hadoop Ecosystem

HBase

  • NoSQL database
  • Column-oriented storage (data grouped in column families)
  • Uses HDFS as its storage layer
  • Access through REST, Avro, and Thrift
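
The data model can be pictured as a sorted map from row key to column to timestamped versions. The `TinyTable` class below is a hypothetical in-memory sketch of that model, not the HBase client API.

```python
class TinyTable:
    """Row key -> (family, qualifier) -> list of (timestamp, value)."""

    def __init__(self):
        self._rows = {}

    def put(self, row, family, qualifier, value, timestamp):
        """Store a new version of a cell, keeping the newest version first."""
        cells = self._rows.setdefault(row, {})
        versions = cells.setdefault((family, qualifier), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0], reverse=True)

    def get(self, row, family, qualifier):
        """Return the most recent value of a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get((family, qualifier))
        return versions[0][1] if versions else None

    def scan(self, start, stop):
        """Yield row keys in sorted order within [start, stop)."""
        for row in sorted(self._rows):
            if start <= row < stop:
                yield row
```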

Hadoop Ecosystem

Oozie

  • Workflow scheduler
  • Can execute:
    • MapReduce operations
    • HDFS operations
  • Workflows are defined in XML

Hadoop Ecosystem

Sqoop

  • Connect relational databases to Hadoop
  • Import/export to/from:
    • RDBMS: MySQL, PostgreSQL, SQL Server, Oracle, ...
    • Hive
    • HBase
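
A parallel import works by cutting the source table into key ranges, one per mapper. The sketch below shows that idea only; the function names, the table `orders`, and the generated SQL are illustrative and are not what Sqoop emits verbatim.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Cut the inclusive range [min_id, max_id] into contiguous sub-ranges."""
    total = max_id - min_id + 1
    if total <= 0:
        return []
    count = min(num_mappers, total)  # never more ranges than ids
    size, extra = divmod(total, count)
    ranges, low = [], min_id
    for index in range(count):
        # The first `extra` ranges take one additional id each.
        high = low + size + (1 if index < extra else 0) - 1
        ranges.append((low, high))
        low = high + 1
    return ranges


def range_queries(table, column, ranges):
    """Build one SELECT per range; each mapper would run one of them."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} <= {hi}"
        for lo, hi in ranges
    ]
```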

Hadoop Ecosystem

Pig

  • High-level platform for creating MapReduce programs
  • Pig Latin
  • ETL
  • Can store data at any point of a pipeline while it executes
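
An ETL data flow with an intermediate store can be sketched step by step. The function names mirror typical data flow steps (load, filter, group, store), but this is plain Python under assumed input of `name,amount` lines, not Pig Latin.

```python
from collections import defaultdict


def load(lines):
    """Load: parse 'name,amount' lines into (name, amount) tuples."""
    return [(name, int(amount)) for name, amount in (l.split(",") for l in lines)]


def filter_rows(rows, min_amount):
    """Filter: keep rows whose amount is at least min_amount."""
    return [row for row in rows if row[1] >= min_amount]


def group_and_sum(rows):
    """Group by name and sum the amounts of each group."""
    totals = defaultdict(int)
    for name, amount in rows:
        totals[name] += amount
    return dict(totals)


def run_pipeline(lines, stored):
    """Run the flow, storing both an intermediate and the final result."""
    rows = load(lines)
    large = filter_rows(rows, 10)
    stored["filtered"] = large  # intermediate store, mid-pipeline
    totals = group_and_sum(large)
    stored["totals"] = totals   # final store
    return totals
```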

Hadoop Ecosystem

HUE

  • Web interface to access Hadoop resources
  • It contains:
    • HDFS browser
    • Hive query editor
    • Impala query editor
    • Oozie workflow manager
    • Pig editor
    • ...

Spark

  • Cluster computing framework
  • Supported cluster managers:
    • Standalone (Spark's own)
    • YARN
    • Apache Mesos
  • Storage:
    • HDFS
    • Cassandra
    • S3
    • ...

Spark

  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX

Real Time

Apache Storm

  • Process data in real time
  • Fault-tolerant
  • No data loss: every tuple is guaranteed to be processed
  • Topology: spouts/bolts
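
A topology wires a source of tuples (spout) to processing steps (bolts). The generators below are a single-process sketch of that shape; the names are illustrative and none of this is the Storm API.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream, one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: split each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keep a running count and emit (word, count) per input tuple."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Wire spout -> split bolt -> count bolt and collect the output."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))
```

Each tuple flows through the whole chain as soon as it is emitted, which is what distinguishes this from a batch job.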

Real Time

Apache Storm

  • Nimbus
  • Supervisor
  • ZooKeeper
  • UI

Real Time

Heron

  • Storm API compatible
  • Higher throughput and lower latency than Storm, as reported by Twitter

More to come

Lambda Architecture

  • Three layers
    • Batch layer
    • Speed layer
    • Serving layer
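
How the three layers cooperate can be sketched with a page-view counter. The event shape (`{"page": ...}`) and the function names are assumptions made for this example.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute the view from the full historical dataset."""
    return Counter(event["page"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer: cover only the events the last batch run has not seen."""
    return Counter(event["page"] for event in recent_events)


def serve(page, batch, speed):
    """Serving layer: answer a query by merging both views."""
    return batch[page] + speed[page]
```

When the next batch run finishes, the events it absorbed are dropped from the speed view so that nothing is counted twice.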

Questions