ISI Workshop- Hand Note
ISI Workshop- 16th-20th , 2015. ISI campus, Kolkata
Class –Basic probability distribution- Prof. Arnab Chakrabarty
A random experiment follows the IID property; here IID means independent and identically distributed.
Example of IID: Suppose we circulate a questionnaire among the participants of the ISI workshop
regarding the performance of a particular teacher. In the first stage there will be no bias among the
participants in giving their rating of the performance (good, bad, very good or excellent).
In the second stage, suppose the same questionnaire is circulated only among the participants from a
particular community of ISI, and the teacher being rated is from that same community.
In this case there will be bias in the ratings. This is not IID, because the property of
independence is lost. In the first case the property of independence holds.
Sample space is the set of possible outcomes.
Random values - Random experiment - Distribution [1. Standard distribution, 2. Non-standard distribution]
Distributions are of 2 types [1. Discrete distribution, 2. Continuous distribution]
1. Discrete: The sample space is finite or countable. The distribution is described by a probability
mass function, and probabilities are combined by summation. To get the whole probability
we add the probabilities of all the outcomes. [whole probability is 1]
2. Continuous: It follows proportions, or in more advanced terms relative frequency, and is
described by a probability density function. Similarly, to get the total probability we
integrate the density over all the outcomes. [whole probability is 1]
Difference between Discrete distribution and Continuous distribution
A discrete distribution has the property of being countable (finite or countably infinite).
We can count the observations like 1, 2, 3, 4, ... The integer property holds in the
discrete case. A continuous distribution takes uncountably many values. We can't count
them one by one; rather we measure from one point to another, that is, interval-wise.
For example, what comes next after 1? In the continuous case we can't say 2, because
there is no next value. In the discrete case we can. In the continuous case the phrase
"tends to" (0, 1, 2, ...) is used instead. The spread of the normal distribution is from
-infinity to +infinity. To measure a certain area as a probability, we have to define from
which point to which point we are interested (that is, an interval). To find the area
under that part of the curve we use integration (a definite integral). The same rule
doesn't hold in the discrete case. For example, with the binomial distribution we are
interested in the successes and failures of events. Here we can easily count how many
are successes and how many are failures. To find the probability of such an event, we
sum (add) the probabilities of the individual outcomes. (Please expand on this.)
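The summation versus integration point above can be checked in R. This is only a small sketch: the binomial with 10 trials and success probability 0.3, and the interval (-1, 1) for the standard normal, are illustrative assumptions and not values from the class.

```r
# Discrete case: binomial with 10 trials and success probability 0.3
# (both numbers are illustrative assumptions).
k <- 0:10
pmf <- dbinom(k, size = 10, prob = 0.3)
total_discrete <- sum(pmf)        # adding all the probabilities gives 1
p_at_most_3 <- sum(pmf[k <= 3])   # P(X <= 3) obtained by summation

# Continuous case: standard normal density.
total_continuous <- integrate(dnorm, -Inf, Inf)$value  # area under the whole curve
p_interval <- integrate(dnorm, -1, 1)$value            # P(-1 < X < 1) by definite integration

print(c(total_discrete, p_at_most_3, total_continuous, p_interval))
```

In the discrete case a single point has positive probability, while in the continuous case only intervals do.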
The shape of a probability distribution can take many forms, like normal, binomial, Poisson, etc.
Here density of a continuous probability distribution = Relative frequency / Length of the interval (class width)
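The density formula can be verified against what hist() computes. A minimal sketch, assuming simulated normal data (the data, the seed and the number of breaks are assumptions made only for illustration):

```r
set.seed(1)
x <- rnorm(1000)                          # simulated data, illustrative only
h <- hist(x, breaks = 20, plot = FALSE)   # class intervals and counts, no plot drawn

rel_freq <- h$counts / sum(h$counts)      # relative frequency of each class
width <- diff(h$breaks)                   # length of each class interval
density_by_hand <- rel_freq / width       # density = relative frequency / length

# hist() stores the same quantity, and the bars have total area 1
print(all.equal(density_by_hand, h$density))
print(sum(density_by_hand * width))
```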
Probability distribution [1. Standard probability distribution, 2. Non-standard probability distribution]
Definition of Probability distribution:-
The probability distribution of a random variable/random experiment x is a statement that
tells us 1. the sample space of x, 2. the probability that x falls in any given subset of the sample space.
Standard distributions [for discrete data - 1. Binomial distribution, 2. Poisson distribution]
1. For visual display we make a bar chart to see the frequency distribution.
Standard distributions [for continuous data - 1. Normal distribution, 2. Log-normal distribution]
1. For visual display we make a histogram to see the frequency distribution.
If the data distribution belongs to a standard family, it is called a parametric distribution; otherwise it
is called a non-parametric distribution.
Advantages of standard parametric distribution
Book recommended by Prof. Arnab Chakarborty (AC): for probability distributions, follow the
book by Johnson and Kotz. You can find this book on the free web book site book.org or in a library.
Class –Inference statistics-AB
Here X is the random variable and a is a specific value in the support of the random variable.
The inference problem has two types [1. Estimation, 2. Hypothesis testing]
1. Estimation [a. Point estimation, b. Interval estimation]
a. Point estimation methods:
1. Method of moments
2. Method of maximum likelihood
3. Method of minimum chi-square
4. Bayes estimation method
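The first two point estimation methods can be compared on a small example. This is a sketch under stated assumptions: the exponential model, the true rate of 2, the sample size and the search interval are all chosen only for illustration and were not part of the class.

```r
set.seed(42)
x <- rexp(200, rate = 2)   # simulated sample; the true rate 2 is an assumption

# Method of moments: E[X] = 1/rate, so equate the sample mean to 1/rate
rate_mom <- 1 / mean(x)

# Maximum likelihood: maximise the log-likelihood numerically
loglik <- function(rate) sum(dexp(x, rate = rate, log = TRUE))
rate_mle <- optimize(loglik, interval = c(0.01, 20), maximum = TRUE)$maximum

print(c(moments = rate_mom, mle = rate_mle))
```

For the exponential distribution the two methods agree (both give 1/mean), so the two numbers differ only by the numerical tolerance of optimize(). For other families the two estimates can differ.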
Class –Inference statistics-SP
There are two types of sampling in research for collecting units from the population.
Here the population means the area/units you are focusing on.
The types of sampling for data collection are as follows:
1. Probability sampling
2. Non-probability sampling
Here I am mentioning some important sampling techniques that may be useful in your
Let us discuss probability sampling.
1. Simple random sampling: You will find the formal definition elsewhere. The basic point is that
every unit should have an equal chance of being selected at every draw. This is possible in two ways
(N = population size, n = sample size):
A. With replacement: number of possible samples = N^n
B. Without replacement: number of possible samples = NCn = N!/(n!(N-n)!)
Drawing a sample without replacement is more efficient than drawing with replacement.
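Both kinds of simple random sample can be drawn with sample() in R. A small sketch, assuming a hypothetical population of N = 10 units labelled 1 to 10 and a sample of size n = 3:

```r
N <- 10   # hypothetical population size
n <- 3    # hypothetical sample size
population <- 1:N

set.seed(7)
srswr  <- sample(population, n, replace = TRUE)    # with replacement: repeats allowed
srswor <- sample(population, n, replace = FALSE)   # without replacement: all units distinct

n_samples_wr  <- N^n            # number of possible ordered samples with replacement
n_samples_wor <- choose(N, n)   # number of possible samples without replacement (NCn)

print(srswr)
print(srswor)
print(c(n_samples_wr, n_samples_wor))
```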
2. Systematic sampling is basically of two types:
A. Linear systematic sampling (k = N/n is an integer)
B. Circular systematic sampling (k = N/n is not an integer); for further understanding pls
see the full material
Probability of selecting any one systematic sample = 1/k
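Linear systematic sampling can be sketched in a few lines. The sizes N = 100 and n = 10 are hypothetical, chosen so that k = N/n is an integer:

```r
N <- 100     # hypothetical population size
n <- 10      # hypothetical sample size
k <- N / n   # sampling interval; an integer here, so the linear scheme applies

set.seed(3)
start <- sample(1:k, 1)   # random start between 1 and k, each with probability 1/k
sys_sample <- seq(from = start, by = k, length.out = n)   # every k-th unit after the start

print(sys_sample)
```

There are only k possible samples, one for each random start, which is why each sample has probability 1/k.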
For the details, pls see the book recommended by Madam (SP)
(Mahalanobis book).
Statistical terms here: inclusion probability and fuzzy probability. For details go through the
Inclusion probability: the probability that a given unit (for example, the third unit) is selected in your sample is called its inclusion probability.
To estimate population quantities from the units sampled from your target population, you may use the
following estimators:
1. Hansen-Hurwitz estimator
2. Horvitz-Thompson estimator
3. Ratio and regression estimators
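The first two estimators of a population total can be written directly from their standard formulas. This is a sketch with made-up numbers: the four sampled values, the population size N = 20 and the equal selection probabilities are assumptions for illustration only.

```r
y <- c(12, 15, 9, 20)   # hypothetical values observed on the sampled units
N <- 20                 # hypothetical population size
n <- length(y)

# Horvitz-Thompson: sum of y_i / pi_i, where pi_i is the inclusion probability.
# Under simple random sampling without replacement, pi_i = n / N for every unit.
pi_i <- rep(n / N, n)
ht_total <- sum(y / pi_i)

# Hansen-Hurwitz: mean of y_i / p_i, where p_i is the selection probability per draw
# under with-replacement sampling. Equal probabilities p_i = 1 / N are assumed here.
p_i <- rep(1 / N, n)
hh_total <- mean(y / p_i)

print(c(horvitz_thompson = ht_total, hansen_hurwitz = hh_total))
```

With equal probabilities both estimators reduce to N times the sample mean. They differ once the units have unequal probabilities.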
To convert a non-linear statistic to a linear statistic, the Taylor series expansion method is used.
If the estimator is non-linear, use the bootstrap method to compute the estimate and its variance.
To judge the accuracy of a sampling estimator, cross-check it using a simulation technique.
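A ratio of two sample means is a common non-linear estimator, so it makes a convenient bootstrap example. This is a sketch on simulated data: the sample size, the relation y = 2x + noise and the number of bootstrap replicates B = 1000 are all assumptions.

```r
set.seed(123)
x <- runif(50, 10, 20)      # simulated auxiliary variable, illustrative only
y <- 2 * x + rnorm(50)      # simulated study variable; the true ratio is close to 2

ratio_hat <- mean(y) / mean(x)   # non-linear estimator: a ratio of two means

B <- 1000                        # number of bootstrap replicates (an assumption)
boot_ratios <- replicate(B, {
  idx <- sample(seq_along(x), replace = TRUE)   # resample units with replacement
  mean(y[idx]) / mean(x[idx])                   # recompute the ratio on the resample
})
boot_var <- var(boot_ratios)     # bootstrap estimate of the variance of the ratio

print(c(estimate = ratio_hat, bootstrap_variance = boot_var))
```

Because the data are simulated with a known ratio, the same code also serves as the simulation cross-check: the estimate can be compared with the value used to generate the data.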
Concept of P-values: the Professor told a story that I am elaborating here; you may apply it
A bright example of how to frame a valid question
Question: Is it too old? We can't validate this type of question without knowing the
background information about the kind of pet we have with us.
First of all we need to know what the pet is: a cat, a turtle or something else.
If it is a cat, then it is ok and the question (Is it too old?)
is valid. If it is a turtle, then the question is not valid, because a turtle has a long life. Before
proceeding, we should pay attention to the question we want to ask or answer.
R codes: the code I am noting here is a purely biased selection, meaning the parts I am not comfortable with.
There is the R console and the R editor file. In the R console we can't modify a command once it is
entered, but in the R editor file we can modify and change commands as we like.
It is better to write the commands in the R editor file before going to the workspace.
In the R editor file just select the command and press Ctrl+R to run it. The output appears in the
R console. Anything after # on a line is treated by R as a comment, not as a function or command.
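Some commands I am writing here:
pi - gives the value of pi
x <- seq(-6, 6, 0.01) - a grid of values from -6 to 6 in steps of 0.01
y1 <- dnorm(x, 0, 2) - normal density at x with mean 0 and standard deviation 2
plot(x, y1, type = "l") - plots the density; here "l" means line
help(package = base) - to know the contents of the base package of R
getwd() - to get the current working directory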
Important information: To install any package in R, open R with administrator rights.
This is a reliable way of installing packages in R. Just right-click
on the R icon on your PC; there you will find the option "Run as administrator", and click it.
Class –Simple and multiple regression-SMB
Book recommended (purely conceptual, with no mathematics, for regression analysis):
Statistics by Freedman, Pisani and Purves
Three terms are used frequently in regression:
a. Correlation, b. Association, c. Causation
The correlation coefficient is a meaningful measure only under the assumption of a linear
relationship between the two variables.
Correlation does not mean causation. Here causation means a cause-effect relationship; for
example, if you hit the iron (cause), it would be flexible (effect).
Correlation only measures linear association.
Zero correlation with a non-linear association is possible.
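The last point can be seen with a parabola. A tiny sketch, where the choice of y = x^2 on a grid symmetric around zero is an assumption made for illustration:

```r
x <- seq(-3, 3, by = 0.1)   # grid symmetric around zero
y <- x^2                    # y is completely determined by x, but not linearly

r <- cor(x, y)              # Pearson correlation coefficient
print(r)                    # essentially zero, up to rounding error
```

Here y depends perfectly on x, yet the correlation coefficient is zero because the relationship has no linear component.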
Here I am writing an example that was given by the professor.
Suppose the husband's IQ is 140 and from it we predict the wife's IQ as 120. We can't then go in
the reverse way and, starting from the wife's IQ of 120, predict the husband's IQ as 140. The reverse
prediction gives back the starting value only when there is perfect correlation between husband's
IQ and wife's IQ, i.e., (+-) 1. When there is no perfect correlation between the two variables
(dependent and independent variable), reverse prediction is not possible.
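The numbers in this example fit the usual regression prediction formula. The sketch below assumes that both IQs have mean 100 and the same standard deviation, and that the correlation is 0.5. These values are assumptions chosen to reproduce 140 -> 120; they were not stated in the class, and predict_partner is a hypothetical helper name.

```r
# Regression prediction when both variables share the same mean and standard deviation:
# predicted value = mean + r * (observed value - mean)
predict_partner <- function(iq, r, mean_iq = 100) {
  mean_iq + r * (iq - mean_iq)
}

wife_hat    <- predict_partner(140, r = 0.5)        # husband 140 -> predicted wife IQ
husband_hat <- predict_partner(wife_hat, r = 0.5)   # reverse step from that prediction

# With perfect correlation the reverse step returns the starting value
round_trip_perfect <- predict_partner(predict_partner(140, r = 1), r = 1)

print(c(wife_hat, husband_hat, round_trip_perfect))
```

Under these assumptions the forward prediction is 120, but going back from 120 gives 110 rather than 140. Only with r = 1 (or -1) does the round trip return 140.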
Table-1 Application of technique based on data properties
Response variable (Dependent variable)    Predictor variable (Independent variable)    Technique
Continuous                                Continuous                                   Regression
Continuous                                Categorical                                  ANOVA
Categorical                               Continuous                                   (classification)
Categorical, >2 response variable         Continuous                                   Poisson distribution
Discrete variable*                        Discrete variable                            Categorical analysis
* Categorical analysis is followed by a contingency table (for testing the hypothesis, the standard Chi-square test)
Definition of level of significance in another way
Level of significance means we are allowing Type-I error (not Type-II error, which is the most
dangerous) up to a certain limit, commonly 5 percent, 1 percent or 10 percent. A Type-I error means
rejecting a null hypothesis that is actually true. For example, take the 5% level of significance:
if we carried out 100 tests in which the null hypothesis is true, we are allowing
about 5 out of the 100 to reject it wrongly, that is, to be affected by Type-I error (not Type-II error).
In other words, it is the tolerance level of the researcher: how much Type-I error he is
willing to accept.
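The meaning of the 5% level can be checked by simulation. A minimal sketch, assuming a one-sample t-test on normal data where the null hypothesis (mean 0) is true; the sample size of 30 and the 5000 repetitions are arbitrary choices:

```r
set.seed(2015)
alpha <- 0.05     # level of significance
n_tests <- 5000   # number of repeated experiments (an assumption)

# Each experiment draws data for which the null hypothesis (mean = 0) is TRUE
p_values <- replicate(n_tests, t.test(rnorm(30), mu = 0)$p.value)

# Proportion of experiments that wrongly reject the true null: the Type-I error rate
type1_rate <- mean(p_values < alpha)
print(type1_rate)
```

The proportion should come out close to alpha, which is what "allowing 5 percent Type-I error" means.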
Class –Simple and multiple regression-SMB
Some conceptual points are discussed as follows.
We say association when we have categorical variables.
Similarly, when we have continuous variables we say relation between variables.
Odds: the ratio of the probability that an event occurs to the probability that it does not occur.
The odds ratio is the ratio of two such odds.
There are three ways to express the dependent variable for categorical data: 1. Absolute values,
2. Relative risk, 3. Odds ratio.
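These three measures can be computed from a 2 x 2 table. The counts below (30 events out of 100 in one group, 10 out of 100 in the other) are made up for illustration:

```r
# Hypothetical counts: rows are groups, columns are event / no event
events    <- c(exposed = 30, unexposed = 10)
no_events <- c(exposed = 70, unexposed = 90)

risk <- events / (events + no_events)   # absolute risk (probability) in each group
odds <- risk / (1 - risk)               # odds = P(event) / P(no event)

relative_risk <- unname(risk["exposed"] / risk["unexposed"])
odds_ratio    <- unname(odds["exposed"] / odds["unexposed"])

print(risk)
print(c(relative_risk = relative_risk, odds_ratio = odds_ratio))
```

The relative risk compares probabilities, while the odds ratio compares odds, so the two numbers differ unless the event is rare.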
For drawing inference in logistic regression, a formal test like the Pearson Chi-square test
is used to test the significance of the logistic regression.
In logistic regression, the response variable is 0/1 (binary, Yes or No), while the predictor variables are
continuous. The dependent variable modelled in logistic regression is the log odds instead of the simple odds.
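A logistic regression can be fitted in R with glm(). This sketch uses the built-in mtcars data (binary response am, continuous predictor wt) only as a convenient stand-in; that data set was not used in the class.

```r
# Binary response: am (0 = automatic, 1 = manual); continuous predictor: wt (weight)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

new_car <- data.frame(wt = 3)                                # a hypothetical car
log_odds <- predict(fit, newdata = new_car, type = "link")   # linear predictor = log odds
prob <- predict(fit, newdata = new_car, type = "response")   # fitted probability
odds <- prob / (1 - prob)

print(coef(fit))
print(c(log_odds = unname(log_odds), odds = unname(odds), prob = unname(prob)))
```

The model is linear on the log odds scale: taking log(odds) of the fitted probability returns the linear predictor.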
Time series, just two lines: the difference between the additive model and the multiplicative model.
When the fluctuation is not constant (it grows or shrinks with the level of the series), use the
multiplicative model; when the fluctuation is constant, use the additive model.
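The choice between the two models can be tried with decompose() in R. The built-in AirPassengers series is used here only as an assumed example of a series whose fluctuation grows with its level:

```r
# Size of the within-year fluctuation (max - min) for each year 1949..1960
years <- rep(1949:1960, each = 12)
yearly_range <- tapply(as.numeric(AirPassengers), years, function(v) diff(range(v)))
print(yearly_range)   # the fluctuation grows over time, so it is not constant

# Fluctuation not constant -> multiplicative model
mult_fit <- decompose(AirPassengers, type = "multiplicative")
# For comparison, the additive model assumes a constant seasonal swing
add_fit <- decompose(AirPassengers, type = "additive")

print(round(mult_fit$figure, 3))   # seasonal factors (ratios around 1)
print(round(add_fit$figure, 1))    # seasonal effects (differences around 0)
```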
Class –Multivariate method-SMB
We face classification problems in day-to-day life. For example: whether a
customer has the capacity to repay a loan or not, whether a person will get cured if he continues a
particular medicine, whether I should visit a particular doctor, and so on. These are examples of "yes"
or "no" questions. We call this type of classification problem a supervised classification
problem. This means we already know the possible classes, here two, i.e., yes or no. Besides
the yes/no questions, there are classification problems in real life that are very
difficult to classify. For example, after securing a good percentage in higher secondary (+2
science), should she/he go for an engineering course, medical course, basic science course,
CEPET course, B.Math course, B.Stat course, some type of animation course, and so on? Here
multiple classifications are possible in taking the decision. We call this type of classification
problem an unsupervised classification problem. Here we can't convert the problem into a 'yes' or 'no' question.
More generally, a problem is supervised when the classes are known in advance and labelled examples
are available, and unsupervised when the groups have to be discovered from the data.
For supervised classification problems, we generally deal with discriminant analysis.
The discriminant may be linear or quadratic.
For unsupervised problems, we deal with principal component analysis and cluster analysis.
For principal component analysis, a data set in R: USArrests
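Principal component analysis and a simple cluster analysis on USArrests can be sketched as below. Scaling the variables and choosing 3 clusters for k-means are assumptions made for illustration, not choices stated in the class.

```r
# Principal component analysis on the built-in USArrests data (50 states, 4 variables)
pca <- prcomp(USArrests, scale. = TRUE)   # scale. = TRUE standardises each variable first
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(round(pca$rotation, 3))    # loadings of each variable on each component
print(round(var_explained, 3))   # share of total variance explained by each component

# Cluster analysis: k-means on the standardised data (3 clusters is an arbitrary choice)
set.seed(10)
km <- kmeans(scale(USArrests), centers = 3, nstart = 20)
print(table(km$cluster))         # number of states in each cluster
```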
This material is completely written by Ramesh Chandra Das, Research Scholar at IIT Kharagpur.
If there is any error, pls let me know and I will try to rectify it to the best of my capacity.
He can be reached at the e-mail address rameshchandradas99@gmail.com