HADOOP - BIG DATA

Hadoop Course Content

Apache Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. A wide variety of companies and organizations use Hadoop for both research and production. It provides a software framework for distributed storage and for processing of big data using the MapReduce programming model.

The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
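
As a concrete illustration of the MapReduce programming model described above, here is a minimal WordCount sketch written against the standard org.apache.hadoop.mapreduce API. It is an illustrative sketch rather than the course's exact material; the input and output HDFS paths are supplied on the command line.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
    // Mapper: emits (word, 1) for every token in an input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();
      @Override
      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reducer: sums the per-word counts produced by the mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();
      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) sum += val.get();
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class); // combiner pre-aggregates map output locally
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

When the job is submitted (for example with hadoop jar wordcount.jar WordCount <in> <out>), the map tasks are scheduled on the nodes that store the input blocks, which is the data locality described in the paragraph above.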

Module 1:

Introduction to Big Data and Hadoop

Components of Hadoop and Hadoop Architecture

HDFS, MapReduce & YARN Deep Dive

Installation & Configuration of Hadoop in a VM (Single Node)

Multinode Installation (3 Nodes)

  • On-Premise on Local Machines
  • Cloud

Performance Tuning, Advanced Administration Activities, and Monitoring the Hadoop Cluster

  • Hadoop Benchmarking (TeraGen & TeraSort on 10 GB of data)
  • Hadoop Web UI monitoring
  • Advanced Hadoop administration commands from the CLI
  • Tuning the Hadoop cluster by tweaking the performance tuning parameters of the HDFS & MapReduce frameworks
  • Node Commissioning (addition) and Decommissioning (removal)
  • Running the Balancer to redistribute data in Hadoop

Writing MapReduce programs in Java: Wordcount

  • Webserver Log Analysis
  • Recommendation Engine (Product Recommendation Generator)
  • Sentiment Analysis
  • Custom Record Readers, Partitioners, Combiners (a minimal partitioner sketch follows this list)
  • Distributed Copy
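
As referenced in the list above, here is a minimal custom Partitioner sketch. FirstLetterPartitioner is a hypothetical example class that routes keys to reducers by their first letter, so each reducer receives a contiguous alphabetical range:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Hypothetical partitioner: spreads keys 'a'..'z' evenly over the reducers.
  public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      if (key.getLength() == 0) return 0;
      char first = Character.toLowerCase(key.toString().charAt(0));
      if (first < 'a' || first > 'z') return 0; // non-alphabetic keys go to partition 0
      return ((first - 'a') * numPartitions) / 26;
    }
  }

It would be wired into a job with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(n).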

Introduction to Pig and Pig Latin: Installation & Wordcount

  • Webserver Log analysis
  • Sentiment Analysis
  • Processing JSON data in Pig using the Elephant Bird library
  • Advanced Pig processing using the Piggybank library
  • Building Pig UDFs and calling them from Pig scripts (see the sketch below)
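
A minimal Pig UDF sketch in Java; ToUpper is a hypothetical function that upper-cases its single string argument:

  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.Tuple;

  // Hypothetical UDF: returns the upper-cased form of its first argument.
  public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
      if (input == null || input.size() == 0 || input.get(0) == null) return null;
      return ((String) input.get(0)).toUpperCase();
    }
  }

Packaged into a jar, it could be called from a script with REGISTER udfs.jar; followed by B = FOREACH A GENERATE ToUpper(name); (the jar, relation, and field names are assumptions).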

Advanced Pig Concepts

  • Performance Tuning parameters
  • Controlling parallelism
  • Running Pig Scripts on Tez

Introduction to Hive: Installation & Wordcount

  • Webserver Log analysis
  • Sentiment Analysis
  • Recommendation Engine in Hive (Product-Based Recommendation)
  • Hive Performance Tuning Parameters
  • Loading CSV data, JSON data, etc. in Hive
  • Hive File Formats including Text, ORC, Parquet
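
To make the Hive topics above concrete, here is a hedged sketch that creates an ORC-backed table and queries it over JDBC. It assumes HiveServer2 is listening on localhost:10000, and the weblogs table with its columns is a hypothetical example:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveOrcDemo {
    public static void main(String[] args) throws Exception {
      // Older setups may need: Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection con = DriverManager.getConnection(
          "jdbc:hive2://localhost:10000/default", "hive", "");
      try (Statement stmt = con.createStatement()) {
        // Hypothetical table: web log records stored in the columnar ORC format.
        stmt.execute("CREATE TABLE IF NOT EXISTS weblogs "
            + "(ip STRING, url STRING, status INT) STORED AS ORC");
        ResultSet rs = stmt.executeQuery(
            "SELECT status, COUNT(*) FROM weblogs GROUP BY status");
        while (rs.next()) {
          System.out.println(rs.getInt(1) + "\t" + rs.getLong(2));
        }
      }
      con.close();
    }
  }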

Introduction to Sqoop

  • Advanced Sqoop import and export options using queries
  • Controlling Parallelism

Introduction to HBase: Installation and HBase Queries
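
A minimal sketch of basic HBase operations using the standard Java client API; the users table and info column family are assumptions for illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseDemo {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table table = conn.getTable(TableName.valueOf("users"))) {
        // Write one cell: row "row1", column family "info", qualifier "name".
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
        table.put(put);
        // Read it back.
        Result r = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
      }
    }
  }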

ZooKeeper for Coordination, HBase Multinode Installation with ZooKeeper

Cloudera and Hortonworks Distributions of Hadoop

Deploying a Multinode Hadoop Cluster using Ambari

Workflow Scheduling using Oozie for Automation

Module 2:

Other Components of the Hadoop ecosystem

Flume for Realtime Data Collection

Kafka for Realtime Log analysis: Log Filtering

Spark for Realtime In-Memory Analytics

Advanced Spark Concepts, Spark Programming APIs, Spark RDDs

  1. Spark Controlling Parallelism, Partitions & Persistence (see the sketch after this list)
  2. Spark SQL
  3. Spark Streaming
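
As referenced in item 1 above, here is a minimal sketch of controlling parallelism and persistence with Spark's Java RDD API; the local[4] master and the tiny dataset are assumptions for illustration:

  import java.util.Arrays;
  import java.util.List;

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.storage.StorageLevel;

  public class ParallelismDemo {
    public static void main(String[] args) {
      SparkConf conf = new SparkConf().setAppName("parallelism-demo").setMaster("local[4]");
      JavaSparkContext sc = new JavaSparkContext(conf);

      List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
      // Ask for 4 partitions explicitly instead of the default parallelism.
      JavaRDD<Integer> rdd = sc.parallelize(data, 4);

      JavaRDD<Integer> squares = rdd.map(x -> x * x);
      // Persist so the RDD is not recomputed by each action below.
      squares.persist(StorageLevel.MEMORY_ONLY());

      System.out.println("partitions = " + squares.getNumPartitions());
      System.out.println("sum = " + squares.reduce(Integer::sum));

      sc.stop();
    }
  }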

Scala Programming Basics to Advanced

Python Introduction & Python Spark programming using PySpark

Spark for Realtime Log analysis: Analytics

  1. Creating and Deploying an End-to-End Web Log Analysis Solution
  2. Realtime Log collection using Flume
  3. Filtering the Logs in Kafka
  4. Realtime Threat Detection in Spark using Logs from a Kafka Stream (see the sketch after this list)
  5. Click Stream analysis using Spark
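
As referenced in item 4 above, here is a hedged sketch of consuming log lines from Kafka with the spark-streaming-kafka-0-10 integration and filtering them in Spark; the broker address, the weblogs topic, and the 401/403 status filter are illustrative assumptions:

  import java.util.Arrays;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.common.serialization.StringDeserializer;
  import org.apache.spark.SparkConf;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.JavaDStream;
  import org.apache.spark.streaming.api.java.JavaInputDStream;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;
  import org.apache.spark.streaming.kafka010.ConsumerStrategies;
  import org.apache.spark.streaming.kafka010.KafkaUtils;
  import org.apache.spark.streaming.kafka010.LocationStrategies;

  public class LogThreatFilter {
    public static void main(String[] args) throws Exception {
      SparkConf conf = new SparkConf().setAppName("log-threat-filter").setMaster("local[2]");
      JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

      Map<String, Object> kafkaParams = new HashMap<>();
      kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumed broker address
      kafkaParams.put("key.deserializer", StringDeserializer.class);
      kafkaParams.put("value.deserializer", StringDeserializer.class);
      kafkaParams.put("group.id", "log-threat-filter");

      JavaInputDStream<ConsumerRecord<String, String>> stream =
          KafkaUtils.createDirectStream(
              jssc,
              LocationStrategies.PreferConsistent(),
              ConsumerStrategies.<String, String>Subscribe(Arrays.asList("weblogs"), kafkaParams));

      // Keep only log lines whose HTTP status suggests denied access.
      JavaDStream<String> lines = stream.map(ConsumerRecord::value);
      JavaDStream<String> suspicious =
          lines.filter(line -> line.contains(" 401 ") || line.contains(" 403 "));
      suspicious.print();

      jssc.start();
      jssc.awaitTermination();
    }
  }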

Hadoop MR2 (YARN) Deployment and Integration with Spark

Spark Machine Learning concepts and Lambda Architecture

Machine Learning using MLlib

Customer Churn Modeling using Spark MLlib
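
A hedged sketch of a churn model built with the spark.ml (MLlib) API; the churn.csv file, its column names, and the 80/20 split are assumptions for illustration:

  import org.apache.spark.ml.classification.LogisticRegression;
  import org.apache.spark.ml.classification.LogisticRegressionModel;
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
  import org.apache.spark.ml.feature.VectorAssembler;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class ChurnModel {
    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder().appName("churn").master("local[*]").getOrCreate();

      // Hypothetical input: one row per customer, with a 0/1 "churned" label.
      Dataset<Row> raw = spark.read()
          .option("header", "true").option("inferSchema", "true")
          .csv("churn.csv");

      // Assemble the assumed numeric columns into a single feature vector.
      VectorAssembler assembler = new VectorAssembler()
          .setInputCols(new String[]{"calls", "minutes", "complaints"})
          .setOutputCol("features");
      Dataset<Row> data = assembler.transform(raw).withColumnRenamed("churned", "label");

      Dataset<Row>[] splits = data.randomSplit(new double[]{0.8, 0.2}, 42L);
      LogisticRegressionModel model = new LogisticRegression().fit(splits[0]);

      double auc = new BinaryClassificationEvaluator().evaluate(model.transform(splits[1]));
      System.out.println("areaUnderROC = " + auc);
      spark.stop();
    }
  }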

Zeppelin for Data Visualization, Spark Programming in Zeppelin Notebooks

Case Studies & POC – Run Hadoop on a medium-sized dataset (~5 GB of data); the POC can be based on a realtime project from your company or Duratech’s live project

Course conclusion

Final Steps:

  1. Project evaluation and exit test
  2. Profile building to realign your profile for the Big Data industry

Placement assistance & Interview handling support
