Big Data & Hadoop
Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become.
Hadoop is 100% open source, and pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and separate systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it scales out by adding more servers. With Hadoop, no data is too big. In today's hyper-connected world, where more and more data is created every day, Hadoop's advantages mean that businesses and organizations can now find value in data that was until recently considered useless. Students will work on a real-life Big Data analytics project and gain hands-on experience.
Topics to be covered in the workshop
Day 1
Module 1. What is Big Data & Why Hadoop?
- What is Big Data?
- Traditional data management systems and their limitations
- What is Hadoop?
- Why is Hadoop used?
- The Hadoop eco-system
- Big data/Hadoop use cases
Module 2. HDFS (Hadoop Distributed File System) and installing Hadoop on single node
- HDFS Architecture
- HDFS internals and use cases
- HDFS Daemons
- Files and blocks
- Namenode memory concerns
- Secondary namenode
- HDFS access options
- Installing and configuring Hadoop
- Hadoop daemons
- Basic Hadoop commands
- Hands-on exercise
Day 2
Module 3. Advanced HDFS concepts
- HDFS workshop
- HDFS API
- How to use configuration class
- Using HDFS in MapReduce and programmatically
- HDFS permission and security
- Additional HDFS tasks
- HDFS web-interface
- Hands-on exercise
Day 3
Module 4. Cloud computing overview and installing Hadoop on multiple nodes
- Cloud computing overview
- SaaS/PaaS/IaaS
- Characteristics of cloud computing
- Cluster configurations
- Configuring Masters and Slaves
Module 5. Introduction to MapReduce
- MapReduce basics
- Functional programming concepts
- List processing
- Mapping and reducing lists
- Putting them together in MapReduce
- Word Count example application
- Understanding the driver, mapper and reducer
- Closer look at MapReduce data flow
- Additional MapReduce functionality
- Fault tolerance
- Hands-on exercises
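As a taste of the Word Count example covered in this module, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It runs on a single machine and does not use Hadoop's own API; the function names (`mapper`, `shuffle`, `reducer`, `word_count`) and the sample lines are illustrative assumptions, not part of the workshop material.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key (the word)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reducer(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the values collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain map -> shuffle -> reduce, as a MapReduce job would conceptually do."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read blocks from HDFS.
    sample = ["big data big ideas", "hadoop stores big data"]
    print(word_count(sample))
```

In a real Hadoop job the mapper and reducer run in parallel on different nodes and the framework performs the shuffle; the sketch only mirrors the data flow.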
Module 6. MapReduce workshop
- Hands-on work on MapReduce
Module 7. Advanced MapReduce concepts
- Understand combiners & partitioners
- Understand input and output formats
- Distributed cache
- Understanding counters
- Chaining, listing and killing jobs
- Hands-on exercise
Day 4
Module 8. Using Pig and Hive for data analysis
- Pig program structure and execution process
- Joins & filtering using Pig
- Group & co-group
- Schema merging and redefining functions
- Pig functions
- Understanding Hive
- Using Hive command line interface
- Data types and file formats
- Basic DDL operations
- Schema design
- Hands-on examples
Module 9. Introduction to HBase, Zookeeper & Sqoop
- HBase overview, architecture & installation
- HBase admin: test
- HBase data access
- Overview of Zookeeper
- Sqoop overview and installation
- Importing and exporting data in Sqoop
- Hands-on exercise
Day 5
Module 10. Introduction to Oozie, Flume and advanced Hadoop concepts
- Overview of Oozie and Flume
- Oozie features and challenges
- How does Flume work
- Connecting Flume with HDFS
- YARN
- HDFS Federation
- Authentication and high availability in Hadoop
Module 11. Introduction to Data Science
- Introduction: what is data science, getting started with R, exploratory data analysis, review of probability and probability distributions, Bayes' rule
- Supervised learning: regression, polynomial regression, local regression, k-nearest neighbors
- Unsupervised learning, kernel density estimation, k-means, Naive Bayes, data and data scraping
- Classification, ranking, logistic regression
- Ethics, time series, advanced regression
Duration: The workshop runs for five consecutive days, with a 6-7 hour session each day.
Certification Policy:
- Certificate of Participation for all the workshop participants.
- At the end of the workshop, a small competition will be held among the participating students, and the winners will be awarded a 'Certificate of Excellence'.
- Certificate of Coordination for the coordinators of the campus workshops.
Eligibility: There are no prerequisites. Anyone interested can join this workshop.