Introduction To Data Science
- What is Data Science?
- Why Python for data science?
- Relevance in industry and need of the hour
- How leading companies are harnessing the power of Data Science with Python?
- Different phases of a typical Analytics/Data Science projects and role of python
- Anaconda vs. Python
Python Essentials (Core)
- Overview of Python- Starting with Python
- Introduction to installation of Python
- Introduction to Python Editors & IDE’s(Canopy, pycharm, Jupyter, Rodeo, Ipython etc…)
- Understand Jupyter notebook & Customize Settings
- Concept of Packages/Libraries – Important packages(NumPy, SciPy, scikit-learn, Pandas, Matplotlib, etc)
- Installing & loading Packages & Name Spaces
- Data Types & Data objects/structures (strings, Tuples, Lists, Dictionaries)
- List and Dictionary Comprehensions
- Variable & Value Labels – Date & Time Values
- Basic Operations – Mathematical – string – date
- Reading and writing data
- Simple plotting
- Control flow & conditional statements
- Debugging & Code profiling
- How to create class and modules and how to call them?
- Scientific distributions used in python for Data Science – Numpy, scify, pandas, scikitlearn, statmodels, nltk etc
Accessing/Importing And Exporting Data Using Python Modules
- Importing Data from various sources (Csv, txt, excel, access etc)
- Database Input (Connecting to database)
- Viewing Data objects – subsetting, methods
- Exporting Data to various formats
- Important python modules: Pandas, beautifulsoup
Data Manipulation – Cleansing – Munging Using Python Modules
- Cleansing Data with Python
- Data Manipulation steps(Sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, Data type conversions, renaming, formatting etc)
- Data manipulation tools(Operators, Functions, Packages, control structures, Loops, arrays etc)
- Python Built-in Functions (Text, numeric, date, utility functions)
- Python User Defined Functions
- Stripping out extraneous information
- Normalizing data
- Formatting data
- Important Python modules for data manipulation (Pandas, Numpy, re, math, string, datetime etc)
Data Analysis – Visualization Using Python
- Introduction exploratory data analysis
- Descriptive statistics, Frequency Tables and summarization
- Univariate Analysis (Distribution of data & Graphical Analysis)
- Bivariate Analysis(Cross Tabs, Distributions & Relationships, Graphical Analysis)
- Creating Graphs- Bar/pie/line chart/histogram/ boxplot/ scatter/ density etc)
- Important Packages for Exploratory Analysis(NumPy Arrays, Matplotlib, seaborn, Pandas and scipy.stats etc)
Basic Statistics & Implementation Of Stats Methods In Python
- Basic Statistics – Measures of Central Tendencies and Variance
- Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem
- Inferential Statistics -Sampling – Concept of Hypothesis Testing
- Statistical Methods – Z/t-tests (One sample, independent, paired), Anova, Correlation and Chi-square
- Important modules for statistical methods: Numpy, Scipy, Pandas
Python: Machine Learning -Predictive Modeling – Basics
- Introduction to Machine Learning & Predictive Modeling
- Types of Business problems – Mapping of Techniques – Regression vs. classification vs. segmentation vs. Forecasting
- Major Classes of Learning Algorithms -Supervised vs Unsupervised Learning
- Different Phases of Predictive Modeling (Data Pre-processing, Sampling, Model Building, Validation)
- Overfitting (Bias-Variance Trade off) & Performance Metrics
- Feature engineering & dimension reduction
- Concept of optimization & cost function
- Concept of gradient descent algorithm
- Concept of Cross validation(Bootstrapping, K-Fold validation etc)
- Model performance metrics (R-square, RMSE, MAPE, AUC, ROC curve, recall, precision, sensitivity, specificity, confusion metrics)
Machine Learning Algorithms & Applications – Implementation In Python
- Linear & Logistic Regression
- Segmentation – Cluster Analysis (K-Means)
- Decision Trees (CART/CD 5.0)
- Ensemble Learning (Random Forest, Bagging & boosting)
- Artificial Neural Networks(ANN)
- Support Vector Machines(SVM)
- Other Techniques (KNN, Naïve Bayes, PCA)
- Introduction to Text Mining using NLTK
- Introduction to Time Series Forecasting (Decomposition & ARIMA)
- Important python modules for Machine Learning (SciKit Learn, stats models, scipy, nltk etc)
- Fine tuning the models using Hyper parameters, grid search, piping etc.
Project – Consolidate Learnings
- Applying different algorithms to solve the business problems and bench mark the results
Introduction To Big Data
- Introduction and Relevance
- Uses of Big Data analytics in various industries like Telecom, E- commerce, Finance and Insurance etc.
- Problems with Traditional Large-Scale Systems
Hadoop(Big Data) Eco-System
- Motivation for Hadoop
- Different types of projects by Apache
- Role of projects in the Hadoop Ecosystem
- Key technology foundations required for Big Data
- Limitations and Solutions of existing Data Analytics Architecture
- Comparison of traditional data management systems with Big Data management systems
- Evaluate key framework requirements for Big Data analytics
- Hadoop Ecosystem & Hadoop 2.x core components
- Explain the relevance of real-time data
- Explain how to use Big Data and real-time data as a Business planning tool
Hadoop Cluster-Architecture-Configuration Files
- Hadoop Master-Slave Architecture
- The Hadoop Distributed File System – Concept of data storage
- Explain different types of cluster setups(Fully distributed/Pseudo etc)
- Hadoop cluster set up – Installation
- Hadoop 2.x Cluster Architecture
- A Typical enterprise cluster – Hadoop Cluster Modes
- Understanding cluster management tools like Cloudera manager/Apache ambari
Hadoop-HDFS & MapReduce (YARN)
- HDFS Overview & Data storage in HDFS
- Get the data into Hadoop from local machine(Data Loading Techniques) – vice versa
- Map Reduce Overview (Traditional way Vs. MapReduce way)
- Concept of Mapper & Reducer
- Understanding MapReduce program Framework
- Develop MapReduce Program using Java (Basic)
- Develop MapReduce program with streaming API) (Basic)
Data Integration Using Sqoop & Flume
- Integrating Hadoop into an Existing Enterprise
- Loading Data from an RDBMS into HDFS by Using Sqoop
- Managing Real-Time Data Using Flume
- Accessing HDFS from Legacy Systems
Data Analysis Using Pig
- Introduction to Data Analysis Tools
- Apache PIG – MapReduce Vs Pig, Pig Use Cases
- PIG’s Data Model
- PIG Streaming
- Pig Latin Program & Execution
- Pig Latin : Relational Operators, File Loaders, Group Operator, COGROUP Operator, Joins and COGROUP, Union, Diagnostic Operators, Pig UDF
- Writing JAVA UDF’s
- Embedded PIG in JAVA
- PIG Macros
- Parameter Substitution
- Use Pig to automate the design and implementation of MapReduce applications
- Use Pig to apply structure to unstructured Big Data
Data Analysis Using Hive
- Apache Hive – Hive Vs. PIG – Hive Use Cases
- Discuss the Hive data storage principle
- Explain the File formats and Records formats supported by the Hive environment
- Perform operations with data in Hive
- Hive QL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
- Hive Script, Hive UDF
- Hive Persistence formats
- Loading data in Hive – Methods
- Serialization & Deserialization
- Handling Text data using Hive
- Integrating external BI tools with Hadoop Hive
Data Analysis Using Impala
- Impala & Architecture
- How Impala executes Queries and its importance
- Hive vs. PIG vs. Impala
- Extending Impala with User Defined functions
Introduction To Other Ecosystem Tools
- NoSQL database – Hbase
- Introduction Oozie
Spark: Introduction
- Introduction to Apache Spark
- Streaming Data Vs. In Memory Data
- Map Reduce Vs. Spark
- Modes of Spark
- Spark Installation Demo
- Overview of Spark on a cluster
- Spark Standalone Cluster
Spark: Spark In Practice
- Invoking Spark Shell
- Creating the Spark Context
- Loading a File in Shell
- Performing Some Basic Operations on Files in Spark Shell
- Caching Overview
- Distributed Persistence
- Spark Streaming Overview(Example: Streaming Word Count)
Spark: Spark Meets Hive
- Analyze Hive and Spark SQL Architecture
- Analyze Spark SQL
- Context in Spark SQL
- Implement a sample example for Spark SQL
- Integrating hive and Spark SQL
- Support for JSON and Parquet File Formats Implement Data Visualization in Spark
- Loading of Data
- Hive Queries through Spark
- Performance Tuning Tips in Spark
- Shared Variables: Broadcast Variables & Accumulators
Spark Streaming
- Extract and analyze the data from twitter using Spark streaming
- Comparison of Spark and Storm – Overview
Spark GraphX
- Overview of GraphX module in spark
- Creating graphs with GraphX
Introduction To Machine Learning Using Spark
- Understand Machine learning framework
- Implement some of the ML algorithms using Spark MLLib
Project
- Consolidate all the learnings