Big data is a popular term for the rapid growth and availability of data, both structured and unstructured. It may become as important to business, and to society, as the Internet has.

Hadoop is 100% open source and pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and separate systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and a cluster can scale out by adding more servers. With Hadoop, no data is too big. In today's hyper-connected world, where more and more data is created every day, Hadoop's advantages mean that businesses and organizations can now find value in data that was until recently considered useless.

Students will work on a real-life Big Data Analytics project and gain hands-on experience.

Topics to be covered in the workshop

Day 1

Module 1. What is Big Data & Why Hadoop?

  • What is Big Data? 
  • Traditional data management systems and their limitations 
  • What is Hadoop? 
  • Why is Hadoop used? 
  • The Hadoop eco-system 
  • Big data/Hadoop use cases 

Module 2. HDFS (Hadoop Distributed File System) and installing Hadoop on a single node

  • HDFS Architecture 
  • HDFS internals and use cases 
  • HDFS Daemons 
  • Files and blocks 
  • Namenode memory concerns 
  • Secondary namenode 
  • HDFS access options 
  • Installing and configuring Hadoop 
  • Hadoop daemons 
  • Basic Hadoop commands 
  • Hands-on exercise 

Day 2 

Module 3. Advanced HDFS concepts 

  • HDFS workshop 
  • How to use configuration class 
  • Using HDFS in MapReduce and programmatically 
  • HDFS permission and security 
  • Additional HDFS tasks 
  • HDFS web-interface 
  • Hands-on exercise 

Day 3

Module 4. Cloud computing overview and installing Hadoop on multiple nodes 

  • Cloud computing overview 
  • SaaS/PaaS/IaaS 
  • Characteristics of cloud computing
  • Cluster configurations
  • Configuring Masters and Slaves

Module 5. Introduction to MapReduce

  • MapReduce basics 
  • Functional programming concepts 
  • List processing 
  • Mapping and reducing lists 
  • Putting them together in MapReduce 
  • Word Count example application 
  • Understanding the driver, mapper and reducer 
  • Closer look at MapReduce data flow 
  • Additional MapReduce functionality 
  • Fault tolerance 
  • Hands-on exercises
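
As a preview of the "Word Count example application" and the driver, mapper and reducer roles listed above, the data flow can be sketched in plain Python. This is only an illustration under stated assumptions: it runs in a single process, it does not use the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would do
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Driver: wire the map, shuffle and reduce phases together."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "hadoop processes big data"]
    print(word_count(sample))
```

In a real cluster the mapper and reducer would run on many nodes and the shuffle would move data over the network; the sketch only shows how the three steps fit together.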

Module 6. MapReduce workshop 

  • Hands-on work on MapReduce 

Module 7. Advanced MapReduce concepts 

  • Understand combiners & partitioners 
  • Understand input and output formats 
  • Distributed cache 
  • Understanding counters 
  • Chaining, listing and killing jobs 
  • Hands-On Exercise 
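
The combiner and partitioner topics above can be illustrated with a small Python sketch. It is a simplified stand-in rather than Hadoop code: the names combine and partition are hypothetical, and the CRC32 hash is an assumption chosen only because it gives the same result on every run.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: pre-aggregate (word, count) pairs on the mapper side
    so that fewer pairs have to be sent to the reducers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partitioner: choose a reducer index from a stable hash of the key,
    so every occurrence of a key goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    mapper_output = [("big", 1), ("data", 1), ("big", 1)]
    for word, count in combine(mapper_output):
        print(word, count, "-> reducer", partition(word, 4))
```

The combiner is safe here because addition is associative and commutative, so summing partial counts early does not change the final totals.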

Day 4

Module 8. Using Pig and Hive for data analysis 

  • Pig program structure and execution process 
  • Joins & filtering using Pig 
  • Group & co-group 
  • Schema merging and redefining functions 
  • Pig functions 
  • Understanding Hive 
  • Using Hive command line interface 
  • Data types and file formats 
  • Basic DDL operations 
  • Schema design 
  • Hands-on examples 

Module 9. Introduction to HBase, Zookeeper & Sqoop 

  • HBase overview, architecture & installation 
  • HBase admin: test 
  • HBase data access 
  • Overview of Zookeeper 
  • Sqoop overview and installation 
  • Importing and exporting data in Sqoop 
  • Hands-on exercise 

Day 5 

Module 10. Introduction to Oozie, Flume and advanced Hadoop concepts 

  • Overview of Oozie and Flume 
  • Oozie features and challenges 
  • How does Flume work 
  • Connecting Flume with HDFS 
  • YARN 
  • HDFS Federation 
  • Authentication and high availability in Hadoop 

Module 11. Introduction to Data Science

  • Introduction: what is data science, getting started with R, exploratory data analysis, review of probability and probability distributions, Bayes' rule
  • Supervised learning: regression, polynomial regression, local regression, k-nearest neighbors
  • Unsupervised learning: kernel density estimation, k-means, Naive Bayes, data and data scraping
  • Classification, ranking, logistic regression
  • Ethics, time series, advanced regression

Eligibility: Students of the Computer Science (CS) and Information Technology (IT) engineering branches, M.Tech, MCA and BCA, from the 2nd year through the final year, can participate in this training program. Students from any other branch are also welcome to participate.

Certification Policy:

  • Certificate of Merit for all the workshop participants.
  • Certificate of Coordination for the coordinators of the campus workshops.

Duration: 5 days. The workshop runs for five consecutive days, with a 6-7 hour session each day.
