Big Data Training

PROGRAMME HIGHLIGHTS

Experienced Faculty

Certification

Placement Assistance

Big data refers to data sets so large and complex that traditional data-processing software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sourcing.

Big Data is the enormous amount of data generated every split second the world over. Hadoop is a distributed processing technology used for Big Data analysis. The Hadoop market is expanding at a significant rate, as Hadoop provides cost-effective and quick solutions compared to traditional data-analysis tools such as RDBMS. The Hadoop market has strong future prospects in trade and transportation, BFSI, and the retail sector. The global Hadoop market was valued at $1.5 billion in 2012 and is expected to grow at a CAGR of 58.2% from 2013 to 2020, reaching $50.2 billion by 2020.

The major drivers of this market growth are the growing volume of structured and unstructured data, the increasing demand for big data analytics, and the quick and affordable data-processing services offered by Hadoop.

Intek’s Big Data and Hadoop is a custom-tailored programme that opens the doors for you to enter the Big Data era!

Key enablers for the growth of Big Data are:

  • Increase in storage capacities
  • Increase in processing power
  • Availability of data

Job roles this programme prepares you for include:

  • Hadoop Developer
  • Hadoop Consultant
  • Technical Lead – Big Data
  • Hadoop Engineer
  • Senior Hadoop Engineer
  • Hadoop Administrator
  • Custom Hadoop Application Developer
  • Business Analyst

Big Data Syllabus

Component 1

  • Introduction to Big Data
  • Characteristics
  • The Why, How, and What of Big Data
  • Existing OLTP, ETL, DWH, and OLAP systems

Component 2

  • Introduction to Hadoop Ecosystem
  • HDFS architecture
  • Sharding, Distribution, and Replication factor (SDR)
  • Daemons
  • MapReduce (MRv1) and YARN
  • Hadoop v1 and v2
  • Hadoop Data Federation

Component 3

  • Prerequisite for Installation
  • Single-node, pseudo-distributed, and multi-node clusters
  • Virtual machines running Linux (Ubuntu/CentOS)
  • Installation of Hadoop in the cloud (Azure/AWS)
  • Installation of Java, SSH, and Eclipse
  • Installation and configuration of Hadoop, HDFS daemons, and YARN daemons
  • High Availability (active and standby)
  • Automatic and manual failover
  • Hadoop fs shell commands
  • Writing data to HDFS
  • Reading data from HDFS (see the sketch after this list)
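The HDFS read/write flow above can be tried directly from the fs shell. As a minimal sketch, the Python script below drives the same hdfs dfs commands via subprocess; it assumes a running HDFS with the hdfs binary on the PATH, and the paths and file names are illustrative.

    import subprocess

    def hdfs(*args):
        """Run an `hdfs dfs` command and return its standard output."""
        return subprocess.run(["hdfs", "dfs", *args],
                              capture_output=True, text=True, check=True).stdout

    hdfs("-mkdir", "-p", "/user/demo")             # create a directory in HDFS
    hdfs("-put", "local_input.txt", "/user/demo")  # write a local file to HDFS
    print(hdfs("-cat", "/user/demo/local_input.txt"))  # read the file back
    print(hdfs("-ls", "/user/demo"))               # list the directory contents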

Component 4

  • Rack awareness policy and replica placement strategy
  • Failure handling
  • NameNode
  • DataNode
  • Blocks and safe mode
  • Rebalancing and load optimization
  • Troubleshooting and error rectification
  • Hadoop fs shell commands: Unix and Java basics
  • Assessment 1

Component 5

  • Introduction to MapReduce
  • Architecture of MapReduce
  • Executing MapReduce on YARN
  • ApplicationMaster, ResourceManager, and NodeManager
  • InputFormat, InputSplit, and key-value pairs
  • Classes and methods of the MapReduce paradigm
  • Mapper
  • Reducer
  • Partitioner
  • Custom and default partitioners
  • Shuffle and sort
  • Combiner and Scheduler
  • ApplicationMaster/manager
  • Containers and the NodeManager

Component 6

  • MapReduce hands-on
  • Word count program / log analytics (see the streaming sketch after this list)
  • Hadoop Streaming in R/Python
  • Data-processing transformations
  • Map-only jobs and uber jobs
  • Inverted index and searches
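As a minimal sketch of Hadoop Streaming in Python, the classic word count splits into a mapper and a reducer that read stdin and write stdout; Hadoop sorts the mapper output by key before it reaches the reducer. File names are illustrative.

    #!/usr/bin/env python3
    # mapper.py -- emit (word, 1) for each word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- sum counts per word; streaming delivers the mapper
    # output sorted by key, so equal words arrive together
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

A job like this is submitted with the hadoop-streaming JAR bundled with the distribution (the exact path varies), passing the two scripts via -mapper, -reducer, and -files.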

Component 7

  • MR Programs 2
  • Structured and unstructured data handling
  • Optimizing using Combiners
  • Partitioner
  • Single and multiple columns
  • Inverted index
  • XML (semi-structured data)
  • Map-side joins
  • Reduce-side joins

Component 8

  • Introduction to the Hive data warehouse
  • Installation of Hive and the metastore database
  • Configuring the metastore to use MySQL
  • HiveQL commands (see the sketch after this list)
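As a minimal sketch of running HiveQL from a client, the snippet below uses the PyHive library (one option among several; the course tooling may differ) against a HiveServer2 instance assumed to be at localhost:10000. Database, table, and column names are illustrative.

    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000)
    cur = conn.cursor()

    # Basic DDL and a query, as covered in this component
    cur.execute("CREATE DATABASE IF NOT EXISTS demo")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS demo.sales (
            id INT, region STRING, amount DOUBLE
        ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    """)
    cur.execute("SELECT region, SUM(amount) FROM demo.sales GROUP BY region")
    for row in cur.fetchall():
        print(row)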

Component 9

  • Manipulation and analytical functions in Hive
  • Managed and external tables
  • Partitioning and bucketing
  • Complex data types and unstructured data
  • Advanced HiveQL commands
  • UDF and UDAF
  • Integration with HBase
  • SerDe / regular expressions

Component 10

  • Introduction to Pig
  • Installation; bags and collections
  • Commands and scripts
  • Pig UDFs

File formats:

  • JSON to Avro file conversion
  • Converting compressed Parquet files to uncompressed
  • Avro schema and data files
  • ORC files
  • Assessment 2

Component 11

  • Introduction to NoSQL
  • ACID / CAP / BASE
  • Key-value stores
  • MapReduce
  • Column-family stores
  • HBase
  • Document stores
  • MongoDB (see the sketch after this list)
  • Graph databases
  • Neo4j
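As a minimal document-store sketch, the snippet below uses MongoDB's pymongo driver; it assumes a local mongod on the default port, and the database, collection, and field names are illustrative.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["demo"]

    # Documents are schemaless JSON-like records (contrast with RDBMS rows)
    db.users.insert_one({"name": "asha", "skills": ["hadoop", "spark"]})
    print(db.users.find_one({"skills": "spark"}))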

Component 12

  • Introduction to HBase and installation
  • The HBase Data Model
  • The HBase Shell
  • HBase Architecture
  • Schema Design
  • The HBase API (see the sketch after this list)
  • HBase Configuration and Tuning
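As a minimal sketch of the HBase data model and API, the snippet below uses the Python happybase client (an assumption; the course may use the Java API instead) and assumes the HBase Thrift server is running on localhost. Table, column-family, and row names are illustrative.

    import happybase

    conn = happybase.Connection("localhost")
    conn.create_table("users", {"info": dict()})   # one column family: info

    table = conn.table("users")
    table.put(b"row1", {b"info:name": b"asha", b"info:city": b"chennai"})
    print(table.row(b"row1"))                      # read the row back

    for key, data in table.scan():                 # full-table scan
        print(key, data)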

Component 13

  • Ingesting data from an RDBMS
  • Introduction to Sqoop and installation
  • Importing and exporting data from and to an RDBMS (see the sketch after this list)
  • Bulk loading, incremental loads, split-by, and conditional queries
  • Sqoop validation and jobs
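Sqoop is driven from the shell; as a minimal sketch, the snippet below launches an incremental import from MySQL into HDFS via Python's subprocess. The JDBC URL, credentials, table, and directories are illustrative.

    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://localhost/demo",
        "--username", "hadoop", "--password-file", "hdfs:///user/demo/.pw",
        "--table", "sales",
        "--split-by", "id",            # column used to split work across mappers
        "--incremental", "append",     # incremental load
        "--check-column", "id",
        "--target-dir", "/user/demo/sales",
    ], check=True)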

Component 14

  • Ingesting streaming data
  • Flume architecture
  • Agents, sources, sinks, and channels
  • Ingesting log files
  • Collecting data from Twitter for sentiment analysis
  • Assessment 3

Component 15

  • Integration with ETL
  • Talend Big Data edition – big data components
  • Big data analytics
  • Dimensional modelling
  • Data visualization
  • Tableau – Hive and Spark SQL connectors

Component 16

  • Spark Core and its components
  • The Spark shell
  • Creating RDDs from HDFS or local files
  • Creating new RDDs – transformations on RDDs
  • Lineage graph – DAG
  • Actions on RDDs
  • RDD concepts: persist and cache; lazy evaluation of RDDs
  • Hands-on and core concepts of the map() transformation
  • Hands-on and core concepts of the filter() transformation
  • Hands-on and core concepts of the flatMap() transformation; comparing map and flatMap
  • Hands-on and core concepts of the reduce() action
  • Hands-on and core concepts of the fold() and aggregate() actions
  • Basics of accumulators
  • Hands-on and core concepts of the collect() action
  • Hands-on and core concepts of the take() action (see the sketch after this list)
  • The Apache Spark execution model
  • How Spark executes a program
  • Concepts of RDD partitioning
  • RDD data shuffling and performance issues
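As a minimal PySpark sketch of the RDD operations listed above (assuming a local Spark installation; the data is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    lines = sc.parallelize(["a b a", "b c"])          # create an RDD from a local collection
    words = lines.flatMap(lambda line: line.split())  # flatMap: one line -> many words
    pairs = words.map(lambda w: (w, 1))               # map: word -> (word, 1)
    no_c = pairs.filter(lambda kv: kv[0] != "c")      # drop pairs keyed by "c"
    no_c.persist()                                    # cache before reusing the RDD

    print(no_c.reduceByKey(lambda a, b: a + b).collect())  # e.g. [('a', 2), ('b', 2)]
    print(words.take(2))                              # action: first two elements
    print(words.map(lambda w: 1).fold(0, lambda a, b: a + b))  # total word count: 5
    sc.stop()

Note that transformations such as map() and filter() are lazy; nothing executes until an action like collect() or take() forces evaluation.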

Component 17

  • DataFrames and Datasets
  • Spark SQL (see the sketch after this list)
  • PySpark
  • Spark jobs
  • Building Scala programs using SBT/Maven
  • spark-submit and Spark applications
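As a minimal DataFrame and Spark SQL sketch in PySpark (assuming a local Spark installation; column names and data are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

    df = spark.createDataFrame(
        [("asha", "south", 120.0), ("ravi", "north", 80.0)],
        ["name", "region", "amount"],
    )
    df.createOrReplaceTempView("sales")   # expose the DataFrame to SQL
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
    spark.stop()

Packaged as a script, the same application would be launched with spark-submit.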

Component 18

  • Kafka: publish/subscribe model (see the sketch after this list)
  • Consumers and producers
  • HUE
  • Monitoring and scheduling
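As a minimal publish/subscribe sketch, the snippet below uses the kafka-python client (an assumption; any Kafka client works) against a broker assumed to be at localhost:9092. The topic name is illustrative.

    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("demo-topic", b"hello big data")   # publish a message
    producer.flush()

    consumer = KafkaConsumer("demo-topic",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)
    for msg in consumer:                             # subscribe and poll
        print(msg.value)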

Component 19

  • Zeppelin
  • Oozie: workflows and coordinators
  • Distribution installation on the cloud or a sandbox
  • Cloudera – Cloudera Manager
  • Hortonworks – Ambari server
  • MapR – MCS

Component 20

  • Introduction to Data Science: machine learning, statistical analysis, and sentiment analysis
  • Multi-node cluster setup: High Availability, Hadoop Data Federation, commissioning and decommissioning, automatic and manual failover, ZooKeeper failover controller
  • Use cases, case studies, and proofs of concept; working on different distributions

Component 21 (Certification Guidance)

  • CCA Spark and Hadoop Developer (CCA175)
  • CCP Data Engineer (DE575)
  • HDPCD certification
  • HDP Certified Apache Spark Developer