Introduction to Data Science with Python
 What is analytics & Data Science?
 Common Terms in Analytics
 Analytics vs. Data warehousing, OLAP, MIS Reporting
 Relevance in industry and need of the hour
 Types of problems and business objectives in various industries
 How leading companies are harnessing the power of analytics?
 Critical success drivers
 Overview of analytics tools & their popularity
 Analytics Methodology & problem solving framework
 List of steps in Analytics projects
 Identify the most appropriate solution design for the given problem statement
 Project plan for Analytics project & key milestones based on effort estimates
 Build Resource plan for analytics project
Python Essentials
 Why Python for data science?
 Overview of Python – Starting with Python
 Introduction to installation of Python
 Introduction to Python Editors & IDEs (Canopy, PyCharm, Jupyter, Rodeo, IPython, etc.)
 Understand Jupyter notebook & Customize Settings
 Concept of Packages/Libraries – Important packages (NumPy, SciPy, scikit-learn, Pandas, Matplotlib, etc.)
 Installing & loading Packages & Namespaces
 Data Types & Data objects/structures (strings, Tuples, Lists, Dictionaries)
 List and Dictionary Comprehensions
 Variable & Value Labels – Date & Time Values
 Basic Operations – Mathematical – string – date
 Reading and writing data
 Simple plotting
 Control flow & conditional statements
 Debugging & Code profiling
 How to create classes and modules and how to call them
Scientific Distributions Used In Python For Data Science
 NumPy, pandas, scikit-learn, statsmodels, NLTK
Accessing/Importing And Exporting Data Using Python Modules
 Importing Data from various sources (CSV, TXT, Excel, Access, etc.)
 Database Input (Connecting to database)
 Viewing Data objects – subsetting Data, methods
 Exporting Data to various formats
 Important Python modules: Pandas, Beautiful Soup
Data Manipulation – Cleansing – Munging using python modules
 Cleansing Data with Python
 Data Manipulation steps (Sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, Data type conversions, renaming, formatting, etc.)
 Data manipulation tools (Operators, Functions, Packages, control structures, Loops, arrays, etc.)
 Python Built-in Functions (Text, numeric, date, utility functions)
 Python User Defined Functions
 Stripping out extraneous information
 Normalizing data
 Formatting data
 Important Python modules for data manipulation (Pandas, NumPy, re, math, string, datetime, etc.)
Data Analysis – Visualization Using Python
 Introduction to exploratory data analysis
 Descriptive statistics, Frequency Tables and summarization
 Univariate Analysis (Distribution of data & Graphical Analysis)
 Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)
 Creating Graphs (Bar/pie/line chart/histogram/boxplot/scatter/density, etc.)
 Important Packages for Exploratory Analysis (NumPy Arrays, Matplotlib, seaborn, Pandas, scipy.stats, etc.)
Introduction to Statistics
 Basic Statistics – Measures of Central Tendencies and Variance
 Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem
 Inferential Statistics Sampling – Concept of Hypothesis Testing
 Statistical Methods – Z/t-tests (One sample, independent, paired), Analysis of variance, Correlations and Chi-square
 Important modules for statistical methods: NumPy, SciPy, Pandas
Introduction to Predictive Modelling
 Concept of model in analytics and how it is used?
 Common terminology used in analytics & Modelling process
 Popular modelling algorithms
 Types of Business problems – Mapping of Techniques
 Different Phases of Predictive Modelling
Data Exploration For Modelling
 Need for structured exploratory data
 EDA framework for exploring the data and identifying any problems with the data (Data Audit Report)
 Identify missing data
 Identify outliers
 Visualize the data trends and patterns
Data Preparation
 Need of Data preparation
 Consolidation/Aggregation – Outlier treatment – Flat Liners – Missing values – Dummy creation – Variable Reduction
 Variable Reduction Techniques – Factor Analysis & PCA
Segmentation: Solving Segmentation Problems
 Introduction to Segmentation
 Types of Segmentation (Subjective Vs Objective, Heuristic Vs. Statistical)
 Heuristic Segmentation Techniques (Value Based, RFM Segmentation and Life Stage Segmentation)
 Behavioural Segmentation Techniques (K-Means Cluster Analysis)
 Cluster evaluation and profiling – Identify cluster characteristics
 Interpretation of results – Implementation on new data
Linear Regression: Solving Regression Problems
 Introduction – Applications
 Assumptions of Linear Regression
 Building Linear Regression Model
 Understanding standard metrics (Variable significance, R-square/Adjusted R-square, Global hypothesis, etc.)
 Assess the overall effectiveness of the model
 Validation of Models (Re-running vs. Scoring)
 Standard Business Outputs (Decile Analysis, Error distribution (histogram), Model equation, drivers etc.)
 Interpretation of Results – Business Validation – Implementation on new data
Logistic Regression : Solving Classification Problems
 Introduction – Applications
 Linear Regression Vs. Logistic Regression Vs. Generalized Linear Models
 Building Logistic Regression Model (Binary Logistic Model)
 Understanding standard model metrics (Concordance, Variable significance, Hosmer-Lemeshow Test, Gini, KS, Misclassification, ROC Curve, etc.)
 Validation of Logistic Regression Models (Re-running vs. Scoring)
 Standard Business Outputs (Decile Analysis, ROC Curve, Probability Cutoffs, Lift charts, Model equation, Drivers or variable importance, etc)
 Interpretation of Results – Business Validation – Implementation on new data
Time Series Forecasting : Solving Forecasting Problems
 Introduction – Applications
 Time Series Components (Trend, Seasonality, Cyclicity and Level) and Decomposition
 Classification of Techniques (Pattern-based – Pattern-less)
 Basic Techniques – Averages, Smoothing, etc.
 Advanced Techniques – AR Models, ARIMA, etc
 Understanding Forecasting Accuracy – MAPE, MAD, MSE, etc
Machine Learning : Predictive Modelling
 Introduction to Machine Learning & Predictive Modelling
 Types of Business problems – Mapping of Techniques – Regression vs. classification vs. segmentation vs. Forecasting
 Major Classes of Learning Algorithms – Supervised vs. Unsupervised Learning
 Different Phases of Predictive Modelling (Data Preprocessing, Sampling, Model Building, Validation)
 Overfitting (Bias-Variance Trade-off) & Performance Metrics
 Feature engineering & dimension reduction
 Concept of optimization & cost function
 Overview of gradient descent algorithm
 Overview of Cross-validation (Bootstrapping, K-Fold validation, etc.)
 Model performance metrics (R-square, Adjusted R-square, RMSE, MAPE, AUC, ROC curve, recall, precision, sensitivity, specificity, confusion matrix)
Data Science Unsupervised Learning : Segmentation
 What is segmentation & Role of ML in Segmentation?
 Concept of Distance and related math background
 K-Means Clustering
 Expectation Maximization
 Hierarchical Clustering
 Spectral Clustering and Density-Based Clustering (DBSCAN)
 Principal Component Analysis (PCA)
Data Science Supervised Learning : Decision Trees
 Decision Trees – Introduction – Applications
 Types of Decision Tree Algorithms
 Construction of Decision Trees through Simplified Examples; Choosing the “Best” attribute at each Non-Leaf node; Entropy; Information Gain, Gini Index, Chi-Square, Regression Trees
 Generalizing Decision Trees; Information Content and Gain Ratio; Dealing with Numerical Variables; other Measures of Randomness
 Pruning a Decision Tree; Cost as a consideration; Unwrapping Trees as Rules
 Decision Trees – Validation
 Overfitting – Best Practices to avoid
Supervised Learning : Ensemble Learning
 Concept of Ensembling
 Manual Ensembling Vs. Automated Ensembling
 Methods of Ensembling (Stacking, Mixture of Experts)
 Bagging (Logic, Practical Applications)
 Random forest (Logic, Practical Applications)
 Boosting (Logic, Practical Applications)
 AdaBoost
 Gradient Boosting Machines (GBM)
 XGBoost
Supervised Learning : Artificial Neural Network – ANN
 Motivation for Neural Networks and Its Applications
 Perceptron and Single Layer Neural Network, and Hand Calculations
 Learning in a Multi-Layered Neural Net: Backpropagation and Conjugate Gradient Techniques
 Neural Networks for Regression
 Neural Networks for Classification
 Interpretation of Outputs and Fine-tuning the models with hyperparameters
 Validating ANN models
Supervised Learning : Support Vector Machines
 Motivation for Support Vector Machine & Applications
 Support Vector Regression
 Support Vector Classifier (Linear & Non-Linear)
 Mathematical Intuition (Kernel Methods Revisited, Quadratic Optimization and Soft Constraints)
 Interpretation of Outputs and Fine-tuning the models with hyperparameters
 Validating SVM models
Supervised Learning : KNN
 What is KNN & Applications?
 KNN for missing value treatment
 KNN For solving regression problems
 KNN for solving classification problems
 Validating KNN model
 Model fine-tuning with hyperparameters
Supervised Learning : Naïve Bayes
 Concept of Conditional Probability
 Bayes Theorem and Its Applications
 Naïve Bayes for classification
 Applications of Naïve Bayes in Classifications
Text Mining And Analytics
 Taming big text, Unstructured vs. Semi-structured Data; Fundamentals of information retrieval, Properties of words; Creating Term-Document (TxD) Matrices; Similarity measures, Low-level processes (Sentence Splitting; Tokenization; Part-of-Speech Tagging; Stemming; Chunking)
 Finding patterns in text: text mining, text as a graph
 Natural Language processing (NLP)
 Text Analytics – Sentiment Analysis using Python
 Text Analytics – Word cloud analysis using Python
 Text Analytics – Segmentation using K-Means/Hierarchical Clustering
 Text Analytics – Classification (Spam/Not spam)
 Applications of Social Media Analytics
 Metrics (Measures, Actions) in social media analytics
 Examples & Actionable Insights using Social Media Analytics
 Important Python modules for Machine Learning (scikit-learn, statsmodels, SciPy, NLTK, etc.)
 Fine-tuning the models using hyperparameters, grid search, piping, etc.
OR
DATA SCIENCE WITH R COURSE CONTENT
 What is analytics & Data Science?
 Common Terms in Analytics
 Analytics vs. Data warehousing, OLAP, MIS Reporting
 Relevance in industry and need of the hour
 Types of problems and business objectives in various industries
 How leading companies are harnessing the power of analytics?
 Critical success drivers
 Overview of analytics tools & their popularity
 Analytics Methodology & problem solving framework
 List of steps in Analytics projects
 Identify the most appropriate solution design for the given problem statement
 Project plan for Analytics project & key milestones based on effort estimates
 Build Resource plan for analytics project
 Why R for data science?
Data Importing / Exporting
 Introduction R/RStudio – GUI
 Concept of Packages – Useful Packages (Base & Other packages)
 Data Structure & Data Types (Vectors, Matrices, factors, Data frames, and Lists)
 Importing Data from various sources (txt, dlm, Excel, sas7bdat, db, etc.)
 Database Input (Connecting to database)
 Exporting Data to various formats
 Viewing Data (Viewing partial data and full data)
 Variable & Value Labels – Date Values
Data Manipulation
 Data Manipulation steps
 Creating New Variables (calculations & Binning)
 Dummy variable creation
 Applying transformations
 Handling duplicates
 Handling missing values
 Sorting and Filtering
 Subsetting (Rows/Columns)
 Appending (Row appending/column appending)
 Merging/Joining (Left, right, inner, full, outer etc)
 Data type conversions
 Renaming
 Formatting
 Reshaping data
 Sampling
 Data manipulation tools
 Operators
 Functions
 Packages
 Control Structures (if, if else)
 Loops (Conditional, iterative loops, apply functions)
 Arrays
 R Built-in Functions (Text, Numeric, Date, utility)
 Numerical Functions
 Text Functions
 Date Functions
 Utility Functions
 R User Defined Functions
 R Packages for data manipulation (base, dplyr, plyr, data.table, reshape, car, sqldf, etc)
Data Analysis – Visualization
 Introduction to exploratory data analysis
 Descriptive statistics, Frequency Tables and summarization
 Univariate Analysis (Distribution of data & Graphical Analysis)
 Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)
 Creating Graphs (Bar/pie/line chart/histogram/boxplot/scatter/density, etc.)
 R Packages for Exploratory Data Analysis (dplyr, plyr, gmodels, car, vcd, Hmisc, psych, doBy, etc.)
 R Packages for Graphical Analysis (base, ggplot2, lattice, etc.)
Introduction To Statistics
 Basic Statistics – Measures of Central Tendencies and Variance
 Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem
 Inferential Statistics Sampling – Concept of Hypothesis Testing
 Statistical Methods – Z/t-tests (One sample, independent, paired), ANOVA, Correlations and Chi-square
Predictive Modelling
 Concept of model in analytics and how it is used?
 Common terminology used in analytics & modelling process
 Popular modelling algorithms
 Types of Business problems – Mapping of Techniques
 Different Phases of Predictive Modelling
Data Exploration For Modeling
Data Preparation
 Need of Data preparation
 Consolidation/Aggregation – Outlier treatment – Flat Liners – Missing values – Dummy creation – Variable Reduction
 Variable Reduction Techniques – Factor Analysis & PCA
Segmentation: Solving Segmentation Problems
 Introduction to Segmentation
 Types of Segmentation (Subjective Vs Objective, Heuristic Vs. Statistical)
 Heuristic Segmentation Techniques (Value Based, RFM Segmentation and Life Stage Segmentation)
 Behavioral Segmentation Techniques (K-Means Cluster Analysis)
 Cluster evaluation and profiling – Identify cluster characteristics
 Interpretation of results – Implementation on new data
Linear Regression: Solving Regression Problems
 Introduction – Applications
 Assumptions of Linear Regression
 Building Linear Regression Model
 Understanding standard metrics (Variable significance, R-square/Adjusted R-square, Global hypothesis, etc.)
 Assess the overall effectiveness of the model
 Validation of Models (Re-running vs. Scoring)
 Standard Business Outputs (Decile Analysis, Error distribution (histogram), Model equation, drivers etc.)
 Interpretation of Results – Business Validation – Implementation on new data
Logistic Regression: Solving Classification Problems
 Introduction – Applications
 Linear Regression Vs. Logistic Regression Vs. Generalized Linear Models
 Building Logistic Regression Model (Binary Logistic Model)
 Understanding standard model metrics (Concordance, Variable significance, Hosmer-Lemeshow Test, Gini, KS, Misclassification, ROC Curve, etc.)
 Validation of Logistic Regression Models (Re-running vs. Scoring)
 Standard Business Outputs (Decile Analysis, ROC Curve, Probability Cutoffs, Lift charts, Model equation, Drivers or variable importance, etc)
 Interpretation of Results – Business Validation – Implementation on new data
Time Series Forecasting: Solving Forecasting Problems
 Introduction – Applications
 Time Series Components (Trend, Seasonality, Cyclicity and Level) and Decomposition
 Classification of Techniques (Pattern-based – Pattern-less)
 Basic Techniques – Averages, Smoothing, etc.
 Advanced Techniques – AR Models, ARIMA, etc
 Understanding Forecasting Accuracy – MAPE, MAD, MSE, etc
Machine Learning Predictive Modeling – Basics
 Introduction to Machine Learning & Predictive Modeling
 Types of Business problems – Mapping of Techniques – Regression vs. classification vs. segmentation vs. Forecasting
 Major Classes of Learning Algorithms – Supervised vs. Unsupervised Learning
 Different Phases of Predictive Modeling (Data Preprocessing, Sampling, Model Building, Validation)
 Overfitting (Bias-Variance Trade-off) & Performance Metrics
 Feature engineering & dimension reduction
 Concept of optimization & cost function
 Overview of gradient descent algorithm
 Overview of Cross-validation (Bootstrapping, K-Fold validation, etc.)
 Model performance metrics (R-square, Adjusted R-square, RMSE, MAPE, AUC, ROC curve, recall, precision, sensitivity, specificity, confusion matrix)
Unsupervised Learning: Segmentation
 What is segmentation & Role of ML in Segmentation?
 Concept of Distance and related math background
 K-Means Clustering
 Expectation Maximization
 Hierarchical Clustering
 Spectral Clustering and Density-Based Clustering (DBSCAN)
 Principal Component Analysis (PCA)
Supervised Learning: Decision Trees
 Decision Trees – Introduction – Applications
 Types of Decision Tree Algorithms
 Construction of Decision Trees through Simplified Examples; Choosing the “Best” attribute at each Non-Leaf node; Entropy; Information Gain, Gini Index, Chi-Square, Regression Trees
 Generalizing Decision Trees; Information Content and Gain Ratio; Dealing with Numerical Variables; other Measures of Randomness
 Pruning a Decision Tree; Cost as a consideration; Unwrapping Trees as Rules
 Decision Trees – Validation
 Overfitting – Best Practices to avoid
Supervised Learning: Ensemble Learning
 Concept of Ensembling
 Manual Ensembling Vs. Automated Ensembling
 Methods of Ensembling (Stacking, Mixture of Experts)
 Bagging (Logic, Practical Applications)
 Random forest (Logic, Practical Applications)
 Boosting (Logic, Practical Applications)
 AdaBoost
 Gradient Boosting Machines (GBM)
 XGBoost
Supervised Learning: Artificial Neural Networks (ANN)
 Motivation for Neural Networks and Its Applications
 Perceptron and Single Layer Neural Network, and Hand Calculations
 Learning in a Multi-Layered Neural Net: Backpropagation and Conjugate Gradient Techniques
 Neural Networks for Regression
 Neural Networks for Classification
 Interpretation of Outputs and Fine-tuning the models with hyperparameters
 Validating ANN models
Supervised Learning: Support Vector Machines
 Motivation for Support Vector Machine & Applications
 Support Vector Regression
 Support Vector Classifier (Linear & Non-Linear)
 Mathematical Intuition (Kernel Methods Revisited, Quadratic Optimization and Soft Constraints)
 Interpretation of Outputs and Fine-tuning the models with hyperparameters
 Validating SVM models
Supervised Learning: KNN
 What is KNN & Applications?
 KNN for missing value treatment
 KNN For solving regression problems
 KNN for solving classification problems
 Validating KNN model
 Model fine-tuning with hyperparameters
Supervised Learning: Naïve Bayes
 Concept of Conditional Probability
 Bayes Theorem and Its Applications
 Naïve Bayes for classification
 Applications of Naïve Bayes in Classifications
Text Mining & Analytics
 Taming big text, Unstructured vs. Semi-structured Data; Fundamentals of information retrieval, Properties of words; Creating Term-Document (TxD) Matrices; Similarity measures, Low-level processes (Sentence Splitting; Tokenization; Part-of-Speech Tagging; Stemming; Chunking)
 Finding patterns in text: text mining, text as a graph
 Natural Language processing (NLP)
 Text Analytics – Sentiment Analysis using R
 Text Analytics – Word cloud analysis using R
 Text Analytics – Segmentation using K-Means/Hierarchical Clustering
 Text Analytics – Classification (Spam/Not spam)
 Applications of Social Media Analytics
 Metrics (Measures, Actions) in social media analytics
 Examples & Actionable Insights using Social Media Analytics
 Important R packages for Machine Learning (caret, H2O, randomForest, nnet, tm, etc.)
 Fine-tuning the models using hyperparameters, grid search, piping, etc.
Project
Case Studies
OR
DATA SCIENCE TRAINING WITH SAS COURSE CONTENT
Introduction To Analytics
 Analytics World
 Introduction to Analytics
 Concept of ETL
 SAS in advanced analytics
 Global Certification: Induction and walk through
 Getting Started
 Software installation
 Introduction to GUI
 Different components of the language
 All programming windows
 Concept of Libraries and Creating Libraries
 Variable Attributes – (Name, Type, Length, Format, Informat, Label)
 Importing Data and Entering data manually
 Understanding Datasets
 Descriptor Portion of a Dataset (Proc Contents)
 Data Portion of a Dataset
 Variable Names and Values
 Data Libraries
Base SAS – Accessing The Data
 Understanding Data Step Processing
 Data Step and Proc Step
 Data step execution
 Compilation and execution phase
 Input buffer and concept of PDV
 Importing Raw Data Files
 Column Input and List Input and Formatted methods
 Delimiters, Reading missing and non-standard values
 Reading one to many and many to one records
 Reading Hierarchical files
 Creating raw data files and put statement
 Formats / Informat
 Importing and Exporting Data (Fixed Format / Delimited)
 Proc Import / Delimited text files
 Proc Export / Exporting Data
 Datalines / Cards;
 Atypical importing cases (mixing different style of inputs)
 Reading Multiple Records per Observation
 Reading “Mixed Record Types”
 Subsetting from a Raw Data File
 Multiple Observations per Record
 Reading Hierarchical Files

 Concept of SAS library and SAS Catalog
 Variable Types in SAS
 Reading Data stored external to SAS
 Importing Data by using Proc Import
 Data Step SAS statements
 SAS Functions
 Appending and Merging using SAS
 SAS Procedures like proc means, proc Univariate, proc append, proc freq and proc export.
 SAS SQL
 SAS Macros
Hypothesis Testing and ANOVA
 One-Sample t-test for comparing means
 Two-Sample t-test for comparing means
 One Way ANOVA
 Assumptions of ANOVA Modeling
 N-way ANOVA
 ANOVA Post Hoc Studies
Measure Model Performance
 Apply the principles of honest assessment to model performance measurement
 Assess classifier performance using the confusion matrix
 Model selection and validation using training and validation data
 Create and interpret graphs (ROC, lift, and gains charts) for model comparison and selection
 Establish effective decision cutoff values for scoring
Data Understanding, Managing And Manipulation
 Understanding and Exploring Data
 Introduction to basic Procedures – Proc Contents, Proc Print
 Operators and Operands
 Conditional Statements (Where, If, If then Else, If then Do and select when)
 Difference between WHERE and IF statements and limitation of WHERE statements
 Labels, Commenting
 System Options (OBS, FIRSTOBS, NOOBS, etc.)
 Data Manipulation
 Proc Sort – with options / De-duping
 Accumulator variable and By-Group processing
 Explicit Output Statements
 Nesting Do loops
 Do While and Do Until Statement
 Array elements and Range
 Combining Datasets (Appending and Merging)
 Concatenation
 Interleaving
 Proc Append
 One To One Merging
 Match Merging
 IN = Controlling merge and Indicator
Data Mining With Proc SQL
 Introduction to Databases
 Introduction to Proc SQL
 Basics of General SQL language
 Creating table and Inserting Values
 Retrieve & Summarize data
 Group, Sort & Filter
 Using Joins (Full, Inner, Left, Right and Outer)
 Reporting and summary analysis
 Concept of Indexes and creating Indexes (simple and composite)
 Connecting SAS to external Databases
 Implicit and Explicit pass-through methods
Macros For Automation
 Macro Parameters and Variables
 Different types of Macro Creation
 Defining and calling a macro
 Using CALL SYMPUT and SYMGET
 Macro options (MPRINT, SYMBOLGEN, MLOGIC, MERROR, SERROR)
Fundamental Of Statistics
 Basic Statistics – Measures of Central Tendencies and Variance
 Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem
 Inferential Statistics Sampling – Concept of Hypothesis Testing
 Statistical Methods – Z/t-tests (One sample, independent, paired), ANOVA, Correlations and Chi-square
 Levels of Measurement and Variable types
 Descriptive Statistics and Picturing Distributions
 Confidence Interval for the Mean
Introduction To Predictive Modelling
 Introduction to Predictive Modeling
 Types of Business problems – Mapping of Techniques
 Different Phases of Predictive Modeling
Data Preparation
 Need of Data preparation
 Data Audit Report and Its importance
 Consolidation/Aggregation – Outlier treatment – Flat Liners – Missing values – Dummy creation – Variable Reduction
 Variable Reduction Techniques – Factor Analysis & PCA
Segmentation
 Introduction to Segmentation
 Types of Segmentation (Subjective Vs Objective, Heuristic Vs. Statistical)
 Heuristic Segmentation Techniques (Value Based, RFM Segmentation and Life Stage Segmentation)
 Behavioural Segmentation Techniques (K-Means Cluster Analysis)
 Cluster evaluation and profiling
 Interpretation of results – Implementation on new data
Linear Regression
 Introduction – Applications
 Assumptions of Linear Regression
 Building Linear Regression Model
 Understanding standard metrics (Variable significance, R-square/Adjusted R-square, Global hypothesis, etc.)
 Validation of Models (Re-running vs. Scoring)
 Standard Business Outputs (Decile Analysis, Error distribution (histogram), Model equation, drivers etc.)
 Interpretation of Results – Business Validation – Implementation on new data
Logistic Regression
 Introduction – Applications
 Linear Regression Vs. Logistic Regression Vs. Generalized Linear Models
 Building Logistic Regression Model
 Understanding standard model metrics (Concordance, Variable significance, Hosmer-Lemeshow Test, Gini, KS, Misclassification, etc.)
 Validation of Logistic Regression Models (Re-running vs. Scoring)
 Standard Business Outputs (Decile Analysis, ROC Curve, Probability Cutoffs, Lift charts, Model equation, Drivers, etc.)
 Interpretation of Results – Business Validation – Implementation on new data
Time Series Forecasting
 Introduction – Applications
 Time Series Components (Trend, Seasonality, Cyclicity and Level) and Decomposition
 Classification of Techniques (Pattern-based – Pattern-less)
 Basic Techniques – Averages, Smoothing, etc.
 Advanced Techniques – AR Models, ARIMA, etc
 Understanding Forecasting Accuracy – MAPE, MAD, MSE, etc
Introduction To Machine Learning
 Statistical learning vs. Machine learning
 Major Classes of Learning Algorithms – Supervised vs. Unsupervised Learning
 Concept of Overfitting and Underfitting (Bias-Variance Trade-off) & Performance Metrics
 Types of Cross-validation (Train & Test, Bootstrapping, K-Fold validation, etc.)
Regression & Classification Model Building
 Recursive Partitioning (Decision Trees)
 Ensemble Models (Random Forest, Bagging & Boosting)
 K-Nearest Neighbours
OR
ADVANCED BIG DATA SCIENCE COURSE CONTENT
Introduction To Data Science
 What is Data Science?
 Why Python for data science?
 Relevance in industry and need of the hour
 How leading companies are harnessing the power of Data Science with Python?
 Different phases of a typical Analytics/Data Science project and the role of Python
 Anaconda vs. Python
Python Essentials (Core)
 Overview of Python – Starting with Python
 Introduction to installation of Python
 Introduction to Python Editors & IDEs (Canopy, PyCharm, Jupyter, Rodeo, IPython, etc.)
 Understand Jupyter notebook & Customize Settings
 Concept of Packages/Libraries – Important packages (NumPy, SciPy, scikit-learn, Pandas, Matplotlib, etc.)
 Installing & loading Packages & Namespaces
 Data Types & Data objects/structures (strings, Tuples, Lists, Dictionaries)
 List and Dictionary Comprehensions
 Variable & Value Labels – Date & Time Values
 Basic Operations – Mathematical – string – date
 Reading and writing data
 Simple plotting
 Control flow & conditional statements
 Debugging & Code profiling
 How to create classes and modules and how to call them
 Scientific distributions used in Python for Data Science – NumPy, SciPy, pandas, scikit-learn, statsmodels, NLTK, etc.
Accessing/Importing And Exporting Data Using Python Modules
 Importing Data from various sources (CSV, TXT, Excel, Access, etc.)
 Database Input (Connecting to database)
 Viewing Data objects – subsetting, methods
 Exporting Data to various formats
 Important Python modules: Pandas, Beautiful Soup
Data Manipulation – Cleansing – Munging Using Python Modules
 Cleansing Data with Python
 Data Manipulation steps (Sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, Data type conversions, renaming, formatting, etc.)
 Data manipulation tools (Operators, Functions, Packages, control structures, Loops, arrays, etc.)
 Python Built-in Functions (Text, numeric, date, utility functions)
 Python User Defined Functions
 Stripping out extraneous information
 Normalizing data
 Formatting data
 Important Python modules for data manipulation (Pandas, NumPy, re, math, string, datetime, etc.)
Data Analysis – Visualization Using Python
 Introduction to exploratory data analysis
 Descriptive statistics, Frequency Tables and summarization
 Univariate Analysis (Distribution of data & Graphical Analysis)
 Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)
 Creating Graphs (Bar/pie/line chart/histogram/boxplot/scatter/density, etc.)
 Important Packages for Exploratory Analysis (NumPy Arrays, Matplotlib, seaborn, Pandas, scipy.stats, etc.)
Basic Statistics & Implementation Of Stats Methods In Python
 Basic Statistics – Measures of Central Tendencies and Variance
 Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem
 Inferential Statistics Sampling – Concept of Hypothesis Testing
 Statistical Methods – Z/t-tests (One sample, independent, paired), ANOVA, Correlation and Chi-square
 Important modules for statistical methods: NumPy, SciPy, Pandas
Python: Machine Learning Predictive Modeling – Basics
 Introduction to Machine Learning & Predictive Modeling
 Types of Business problems – Mapping of Techniques – Regression vs. classification vs. segmentation vs. Forecasting
 Major Classes of Learning Algorithms – Supervised vs. Unsupervised Learning
 Different Phases of Predictive Modeling (Data Preprocessing, Sampling, Model Building, Validation)
 Overfitting (Bias-Variance Trade-off) & Performance Metrics
 Feature engineering & dimension reduction
 Concept of optimization & cost function
 Concept of gradient descent algorithm
 Concept of Cross-validation (Bootstrapping, K-Fold validation, etc.)
 Model performance metrics (R-square, RMSE, MAPE, AUC, ROC curve, recall, precision, sensitivity, specificity, confusion matrix)
Machine Learning Algorithms & Applications – Implementation In Python
 Linear & Logistic Regression
 Segmentation – Cluster Analysis (K-Means)
 Decision Trees (CART/C5.0)
 Ensemble Learning (Random Forest, Bagging & boosting)
 Artificial Neural Networks (ANN)
 Support Vector Machines (SVM)
 Other Techniques (KNN, Naïve Bayes, PCA)
 Introduction to Text Mining using NLTK
 Introduction to Time Series Forecasting (Decomposition & ARIMA)
 Important Python modules for Machine Learning (scikit-learn, statsmodels, SciPy, NLTK, etc.)
 Fine-tuning the models using hyperparameters, grid search, piping, etc.
Project – Consolidate Learnings
 Applying different algorithms to solve business problems and benchmark the results
Introduction To Big Data
 Introduction and Relevance
 Uses of Big Data analytics in various industries like Telecom, E-commerce, Finance, Insurance, etc.
 Problems with Traditional Large-Scale Systems
Hadoop (Big Data) Ecosystem
 Motivation for Hadoop
 Different types of projects by Apache
 Role of projects in the Hadoop Ecosystem
 Key technology foundations required for Big Data
 Limitations and Solutions of existing Data Analytics Architecture
 Comparison of traditional data management systems with Big Data management systems
 Evaluate key framework requirements for Big Data analytics
 Hadoop Ecosystem & Hadoop 2.x core components
 Explain the relevance of real-time data
 Explain how to use Big Data and real-time data as a business planning tool
Hadoop Cluster – Architecture – Configuration Files
 Hadoop Master-Slave Architecture
 The Hadoop Distributed File System – Concept of data storage
 Explain different types of cluster setups (Fully distributed, Pseudo-distributed, etc.)
 Hadoop cluster set up – Installation
 Hadoop 2.x Cluster Architecture
 A typical enterprise cluster – Hadoop Cluster Modes
 Understanding cluster management tools like Cloudera Manager/Apache Ambari
Hadoop – HDFS & MapReduce (YARN)
 HDFS Overview & Data storage in HDFS
 Get data into Hadoop from the local machine (data loading techniques) and vice versa
 MapReduce Overview (traditional way vs. MapReduce way)
 Concept of Mapper & Reducer
 Understanding MapReduce program Framework
 Develop MapReduce Program using Java (Basic)
 Develop MapReduce program with Streaming API (Basic)
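The mapper and reducer concepts above can be sketched with a word count in plain Python. This is only a local simulation under stated assumptions: in a real Hadoop Streaming job the mapper and reducer would be separate scripts that read lines from standard input and write tab-separated key/value lines, and the framework would handle the shuffle and sort between them. The names `mapper`, `reducer`, and `run_local`, and the sample lines, are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts for each word; input must be sorted by key."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))  # stands in for Hadoop's shuffle/sort
    return dict(reducer(shuffled))


# Hypothetical input, for illustration only.
sample = ["big data with hadoop", "hadoop streaming with python", "big data"]
counts = run_local(sample)
print(counts)
```

Sorting before reducing matters here because `groupby` only merges adjacent equal keys, which mirrors the guarantee that a reducer receives its keys in sorted order.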
Data Integration Using Sqoop & Flume
 Integrating Hadoop into an Existing Enterprise
 Loading Data from an RDBMS into HDFS by Using Sqoop
 Managing Real-Time Data Using Flume
 Accessing HDFS from Legacy Systems
Data Analysis Using Pig
 Introduction to Data Analysis Tools
 Apache Pig – MapReduce vs. Pig, Pig Use Cases
 Pig’s Data Model
 Pig Streaming
 Pig Latin Program & Execution
 Pig Latin: Relational Operators, File Loaders, Group Operator, COGROUP Operator, Joins and COGROUP, Union, Diagnostic Operators, Pig UDF
 Writing Java UDFs
 Embedded Pig in Java
 Pig Macros
 Parameter Substitution
 Use Pig to automate the design and implementation of MapReduce applications
 Use Pig to apply structure to unstructured Big Data
Data Analysis Using Hive
 Apache Hive – Hive vs. Pig – Hive Use Cases
 Discuss the Hive data storage principle
 Explain the file formats and record formats supported by the Hive environment
 Perform operations with data in Hive
 HiveQL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
 Hive Script, Hive UDF
 Hive Persistence formats
 Loading data in Hive – Methods
 Serialization & Deserialization
 Handling Text data using Hive
 Integrating external BI tools with Hadoop Hive
Data Analysis Using Impala
 Impala & Architecture
 How Impala executes Queries and its importance
 Hive vs. Pig vs. Impala
 Extending Impala with User Defined functions
Introduction To Other Ecosystem Tools
 NoSQL database – HBase
 Introduction to Oozie
Spark: Introduction
 Introduction to Apache Spark
 Streaming Data vs. In-Memory Data
 MapReduce vs. Spark
 Modes of Spark
 Spark Installation Demo
 Overview of Spark on a cluster
 Spark Standalone Cluster
Spark: Spark In Practice
 Invoking Spark Shell
 Creating the Spark Context
 Loading a File in Shell
 Performing Some Basic Operations on Files in Spark Shell
 Caching Overview
 Distributed Persistence
 Spark Streaming Overview (Example: Streaming Word Count)
Spark: Spark Meets Hive
 Analyze Hive and Spark SQL Architecture
 Analyze Spark SQL
 Context in Spark SQL
 Implement a sample example for Spark SQL
 Integrating Hive and Spark SQL
 Support for JSON and Parquet File Formats
 Implement Data Visualization in Spark
 Loading of Data
 Hive Queries through Spark
 Performance Tuning Tips in Spark
 Shared Variables: Broadcast Variables & Accumulators
Spark Streaming
 Extract and analyze data from Twitter using Spark Streaming
 Comparison of Spark and Storm – Overview
Spark GraphX
 Overview of the GraphX module in Spark
 Creating graphs with GraphX
Introduction To Machine Learning Using Spark
 Understand Machine learning framework
 Implement some of the ML algorithms using Spark MLlib
Project
 Consolidate all the learnings
 Working on Big Data Project by integrating various key components
Projects:
Python Projects
Random password generator – Mini
CLI-based scientific calculator – Mini
Instagram bot – Mini
Expense tracker – Mini
Site connectivity checker – Mini
Lawn tennis match highlights (can be extended to any sport) – Major
NLP library – Major
Deep Learning Projects
Churn modelling using ANN – Mini
Image classification – Mini
Image classification using transfer learning – Major
Sentence classification using RNN, LSTM, GRU – Mini
Sentence classification using word embeddings – Major
Object detection using YOLO – Major
Machine Learning Projects
EDA on movies database – Mini
House price prediction using regression – Mini
Predict survival on the Titanic using classification – Mini
Image clustering – Mini
Document clustering – Mini
Twitter US Airline Sentiment – Major
Restaurant revenue prediction – Major
Disease prediction – Major
Note: The above projects may vary depending on the trainer.