ISI Workshop- Hand Note

ISI Workshop- 16th-20th, 2015. ISI campus, Kolkata

Class –Basic probability distribution- Prof. Arnab Chakrabarty

A random experiment is assumed to follow the IID property. Here IID means independent and identically distributed.

Example of IID: Suppose we circulate a questionnaire among the participants of the ISI workshop regarding the performance of a particular teacher. In the first stage there will be no bias among the participants in giving their choice regarding performance (bad, good, very good or excellent). In the second stage, the same questionnaire is circulated only among participants from a particular community of ISI. Please note that here the teacher is from the same community as the participants who are rating the performance. In this case there will be bias in the choices, so this is not IID, because the independence property is lost. In the first case the independence property holds.

The sample space is the set of all possible outcomes.

Random values - Random experiment - Distribution [1. Standard distribution, 2. Non-standard distribution]

A distribution is of two types [1. Discrete distribution, 2. Continuous distribution]

1. Discrete: The sample space is finite or countable. The distribution is described by a probability mass function, and probabilities are combined by summation. To get the whole probability we add the probabilities of all the outcomes. [The whole probability is 1.]

2. Continuous: The distribution is described by a density, which follows the idea of proportion, or in more advanced terms relative frequency. Similarly, to get the total probability we integrate the density over all the outcomes. [The whole probability is 1.]
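The two "whole probability is 1" statements can be checked numerically. The sketch below assumes a Binomial(10, 0.3) distribution for the discrete case and a standard normal for the continuous case; both choices are illustrative and are not from the class.

```r
# Discrete case: sum the probability mass function over the whole
# sample space 0, 1, ..., n of a Binomial(n, p) variable.
n <- 10     # illustrative choice
p <- 0.3    # illustrative choice
total_discrete <- sum(dbinom(0:n, size = n, prob = p))

# Continuous case: integrate the standard normal density over the real line.
total_continuous <- integrate(dnorm, lower = -Inf, upper = Inf)$value

print(total_discrete)
print(total_continuous)
```

Both totals should come out equal to 1 up to numerical error.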

Difference between Discrete distribution and Continuous distribution

A discrete distribution has the property of being countable (finite or countably infinite). We can count the observations as 1, 2, 3, 4, and so on. The integer property holds in the case of a discrete distribution. A continuous distribution takes values in an uncountably infinite set. We can't count the values as in the discrete case; rather, we measure from one point to another, that is, interval-wise. For example, what comes next after 1? In the continuous case we can't say 2, because there are infinitely many values between 1 and 2, whereas in the discrete case the next value is well defined. In the continuous case the phrase "tends to" 0, 1, 2, ... is used. The spread of the normal distribution is from -infinity to +infinity. To measure a certain area as a probability, we have to specify from which point to which point we are interested (that is, an interval). To find the area under the curve over that interval we use integration (a definite integral). The same rule doesn't hold for a discrete distribution. For example, in the binomial distribution we are interested in the successes and failures of events. Here we can easily count how many events are successes and how many are failures. To find the probability of such an event, we sum the probabilities (that is, addition).…………Pls expand it or
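To make the contrast concrete, here is a small sketch: an interval probability for a continuous variable is an area found by integration, while a probability for a discrete variable is a sum of point probabilities. The distributions and intervals are illustrative assumptions.

```r
# Continuous: P(-1 < X < 1) for X ~ N(0, 1), as the area under the density.
area_by_integration <- integrate(dnorm, lower = -1, upper = 1)$value
area_by_cdf <- pnorm(1) - pnorm(-1)   # the same area from the distribution function

# Discrete: P(3 <= Y <= 5) for Y ~ Binomial(10, 0.5), by adding point probabilities.
prob_by_sum <- sum(dbinom(3:5, size = 10, prob = 0.5))
prob_by_cdf <- pbinom(5, size = 10, prob = 0.5) - pbinom(2, size = 10, prob = 0.5)

print(c(area_by_integration, area_by_cdf))
print(c(prob_by_sum, prob_by_cdf))
```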

The shape of a probability distribution takes many forms, such as normal, binomial, Poisson,

Here the density of a continuous probability distribution = relative frequency / length of the interval
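As a rough check of this formula, the sketch below compares the density that R's hist() reports with relative frequency divided by interval width, computed by hand. The simulated data and the class width of 0.5 are assumptions for illustration.

```r
set.seed(1)
x <- rnorm(1000)                      # simulated data, not from the workshop

# Histogram with fixed class intervals of width 0.5 (no plot drawn).
h <- hist(x, breaks = seq(-5, 5, by = 0.5), plot = FALSE)

rel_freq <- h$counts / sum(h$counts)  # relative frequency of each interval
width <- diff(h$breaks)               # length of each interval
density_by_hand <- rel_freq / width

# The hand-computed density should match what hist() stores.
print(all.equal(density_by_hand, h$density))
```

Because density times width gives back the relative frequency, the bar areas of such a histogram add up to 1.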

Probability distribution [1. Standard probability distribution, 2. Non-standard probability distribution]

Definition of probability distribution:

The probability distribution of a random variable (random experiment) x is a statement that tells us 1. the sample space of x, 2. the probability that x lies in any given subset of the sample space

Standard distributions [for discrete data: 1. Binomial distribution, 2. Poisson distribution]

1. For visual display we make a bar chart to see the frequency distribution

Standard distributions [for continuous data: 1. Normal distribution, 2. Log-normal distribution]

1. For visual display we make a histogram to see the frequency distribution.

If the distribution of the data belongs to a standard family, it is called a parametric distribution; otherwise it is called a non-parametric distribution.

Advantages of standard parametric distribution

Book recommended by Prof. Arnab Chakarborty (AC) for probability distributions: Johnson and Kotz. You can find this book on the free web book site book.org or library

Class –Inference statistics-AB

Here X is the random variable and a is a specific value in the support of the random variable

Inference problems are of two types [1. Estimation, 2. Hypothesis testing]

1. Estimation [a. Point estimation, b. Interval estimation]

A. Methods of point estimation

1. Method of moments

2. Method of maximum likelihood

3. Method of minimum chi-square

4. Bayes estimation method
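The first two methods in the list can give different answers for the same data. A minimal sketch, assuming a Uniform(0, theta) model with a hypothetical true value theta = 5 (this example was not used in the class):

```r
set.seed(42)
theta_true <- 5                       # hypothetical true parameter
x <- runif(200, min = 0, max = theta_true)

# Method of moments: E[X] = theta / 2, so set the sample mean equal to theta / 2.
theta_mom <- 2 * mean(x)

# Maximum likelihood: the likelihood theta^(-n) is largest at the smallest
# theta that is still consistent with the data, i.e. the sample maximum.
theta_mle <- max(x)

print(c(method_of_moments = theta_mom, maximum_likelihood = theta_mle))
```

The maximum likelihood estimate can never exceed the true theta here, while the method of moments estimate can fall on either side of it.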

Class –Inference statistics-SP

There are two types of sampling in research for collecting units from the population.

Here the population means the area or units you are focusing on.

The types of sampling for collecting data are as follows:

1. Probability sampling

2. Non-probability sampling

Here I am mentioning some important sampling techniques that may be useful in your

Let us discuss probability sampling.

1. Simple random sampling: You will find the definition elsewhere. The basic idea is that every unit should have an equal chance of being selected at every draw. This can be possible

A. With replacement: number of possible samples = N^n

B. Without replacement: number of possible samples = NCn

(N = population size, n = sample size.) Sampling without replacement is more efficient than sampling with replacement.
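A small sketch of both schemes using R's sample(), with an illustrative population of N = 10 units and sample size n = 3. Note that N^n counts ordered samples, while NCn counts unordered samples.

```r
N <- 10                 # illustrative population size
n <- 3                  # illustrative sample size
population <- 1:N

# Number of possible samples under each scheme.
n_samples_with <- N^n              # with replacement (ordered samples)
n_samples_without <- choose(N, n)  # without replacement (unordered samples)

set.seed(7)
srswr <- sample(population, size = n, replace = TRUE)    # units may repeat
srswor <- sample(population, size = n, replace = FALSE)  # all units distinct

print(c(n_samples_with, n_samples_without))
print(srswr)
print(srswor)
```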

2. Systematic sampling: it is basically of two types

A. Linear systematic sampling (k = N/n is an integer)

B. Circular systematic sampling (k = N/n is not an integer) ….. for further understanding please see the full material

Probability of selecting any one systematic sample = 1/k
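Linear systematic sampling can be sketched as follows, assuming N = 100 and n = 10 so that k = N/n = 10 is an integer. The variable names are illustrative.

```r
N <- 100                # illustrative population size
n <- 10                 # illustrative sample size
k <- N / n              # sampling interval; an integer in the linear case

set.seed(3)
start <- sample(1:k, size = 1)   # random start; each start has probability 1/k

# Take every k-th unit from the random start.
idx <- seq(from = start, by = k, length.out = n)
print(idx)
```

There are exactly k possible samples, one per starting point, which is where the probability 1/k comes from.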

To go through the details pls see the book as recommended by Madam (SP) :

Mahalanobis book)

Statistical terms here: inclusion probability and fuzzy probability. For details go through the

Inclusion probability: the probability that a given unit (say, the third unit) is selected in your sample is called its inclusion probability

For estimation from the sampling units collected from your target population, you may use the following estimators.

1. Hansen-Hurwitz estimator

2. Horvitz-Thompson estimator

3. Ratio and regression estimators
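As one illustration from this list, the Horvitz-Thompson estimator of a population total weights each sampled value by the inverse of its inclusion probability. The sketch below assumes simple random sampling without replacement, where every inclusion probability is n/N, and uses made-up population values.

```r
set.seed(11)
N <- 50
y_pop <- round(runif(N, 10, 100))    # hypothetical population values
n <- 10

s <- sample(1:N, size = n)           # simple random sample without replacement
pi_i <- rep(n / N, n)                # inclusion probability of each sampled unit

# Horvitz-Thompson estimate of the population total: sum of y_i / pi_i.
ht_total <- sum(y_pop[s] / pi_i)

print(c(estimate = ht_total, true_total = sum(y_pop)))
```

Under this design the estimate reduces to N times the sample mean; with unequal inclusion probabilities the weights would differ from unit to unit.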

• To convert non-linear statistics to linear statistics, the Taylor series expansion method is used

• If the estimator is non-linear, use the bootstrap method to compute the estimate and its variance

• To judge the accuracy of a sampling estimator, cross-check it using simulation.
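A minimal bootstrap sketch for a non-linear estimator, here the ratio of two sample means. The data are simulated and the number of resamples B = 1000 is an arbitrary choice.

```r
set.seed(123)
n <- 60
x <- runif(n, 20, 80)                 # hypothetical auxiliary variable
y <- 1.5 * x + rnorm(n, sd = 5)       # hypothetical study variable

ratio_hat <- mean(y) / mean(x)        # non-linear estimator (ratio of means)

B <- 1000                             # number of bootstrap resamples
boot_ratios <- replicate(B, {
  i <- sample.int(n, size = n, replace = TRUE)   # resample units with replacement
  mean(y[i]) / mean(x[i])
})

boot_var <- var(boot_ratios)          # bootstrap estimate of the variance
print(c(estimate = ratio_hat, bootstrap_variance = boot_var))
```

This resamples units as if they were drawn independently; a complex survey design would need a design-aware resampling scheme.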

Concept of P-values: the professor told a story that I am elaborating here- you may apply it

A bright example of how to make a valid question

Question: Is it too old? We can't validate this type of question without knowing the background information about the kind of pet we have with us.

First of all we need to know what kind of pet it is, such as a cat, a turtle or something else. If it is a cat, the question (Is it too old?) is valid. If it is a turtle, the question is not valid, because a turtle has a long life. Before proceeding, we should pay attention to the question we want to ask or answer.

R codes: the code I am writing here is a purely biased selection, meaning the commands I am not comfortable with.

R has the R-console and the R-editor file. In the R-console we can't modify a command once it is entered. But in the R-editor file we can modify and change commands as we like.

It is better to write commands in the R-editor file before going to the workspace.

In the R-editor file, select the command and press Ctrl+R to run it. The output appears in your workspace. Anything after # is treated by R as a comment and is not run as a function or command.

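Some commands that I am writing here:

• pi: gives the value of pi

• x <- seq(-6, 6, 0.01): a grid of values from -6 to 6 in steps of 0.01

• y1 <- dnorm(x, 0, 2): the normal density with mean 0 and standard deviation 2 at each x

• plot(x, y1, type = "l"): here "l" means line

• help(package = base): to see the base package of R

• getwd(): to get the current directory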
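Put together as a script in the R-editor file, these commands might look like the sketch below. The standard normal curve y is an assumption added so that there are two curves to compare; it is not in the notes.

```r
# Grid of x values from -6 to 6 in steps of 0.01.
x <- seq(-6, 6, 0.01)

y <- dnorm(x)            # standard normal density (assumed for comparison)
y1 <- dnorm(x, 0, 2)     # normal density with mean 0 and standard deviation 2

# Draw the first curve as a line, then overlay the second as a dashed line.
plot(x, y, type = "l")
lines(x, y1, lty = 2)

print(pi)                # the value of pi
print(getwd())           # the current working directory
```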

Important information: To install any package in R, open the R software with administrator rights. This is a reliable way to install packages in R. Right click on the R icon on your PC, where you will find the option "Run as administrator", and click it.

Class –Simple and multiple regression-SMB

Book recommended: Statistics by Freedman, Pisani and Purves. It is purely conceptual, with no mathematics, for regression analysis.

There are three terms that are used frequently in regression:

a. Correlation, b. Association, c. Causation

• The correlation coefficient is a meaningful summary only when the relationship between the two variables is linear.

• Correlation does not mean causation. Here causation means a cause-effect relationship; for example, if you heat iron (cause), it becomes flexible (effect).

• Correlation only measures linear association.

• Zero correlation with non-linear association is possible.
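The last point can be seen in a tiny example: y is completely determined by x, yet the correlation coefficient is zero because the relationship is not linear. The data are made up for illustration.

```r
x <- -5:5        # values symmetric around zero
y <- x^2         # y depends on x exactly, but not linearly

r <- cor(x, y)   # Pearson correlation coefficient
print(r)
```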

Here is an example given by the professor.

Suppose that for a husband with IQ 140 we predict the wife's IQ to be 120. We can't reverse this and, for a wife with IQ 120, predict the husband's IQ to be 140. The reverse prediction is possible only when there is perfect correlation between husband IQ and wife IQ, i.e., (+-) 1. When the correlation between the two variables (dependent and independent variable) is not perfect, reverse prediction is not possible.
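The asymmetry can be illustrated with two separate regressions on simulated data. All numbers below (means, standard deviations, the 0.5 slope) are assumptions chosen for illustration, not values from the class.

```r
set.seed(2015)
n <- 500
husband <- rnorm(n, mean = 100, sd = 15)                  # simulated IQs
wife <- 100 + 0.5 * (husband - 100) + rnorm(n, sd = 13)   # imperfectly correlated

r <- cor(husband, wife)

fit_wife <- lm(wife ~ husband)       # predicts wife IQ from husband IQ
fit_husband <- lm(husband ~ wife)    # a different line: husband IQ from wife IQ

# Forward prediction for a husband with IQ 140.
wife_pred <- predict(fit_wife, newdata = data.frame(husband = 140))

# Feeding that prediction into the reverse regression does not return 140.
husband_back <- predict(fit_husband, newdata = data.frame(wife = wife_pred))

print(c(r = r, wife_pred = unname(wife_pred), husband_back = unname(husband_back)))
```

The product of the two slopes equals r squared, so the two lines coincide only when the correlation is +1 or -1.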

Table-1 Application of technique based on data properties

Response variable (Dependent variable)    Predictor variable (Independent variable)    Technique
Continuous                                Continuous                                   Regression
Continuous                                Categorical                                  ANOVA
Categorical                               Continuous                                   (classification)
Categorical, >2 response variable         Continuous                                   Poisson distribution
*Discrete variable                        Discrete variable                            Categorical analysis

* Categorical analysis followed by contingency table (for testing the hypothesis, standard Chi-

Definition of level of significance in another way

The level of significance is the probability of a Type-I error (rejecting a null hypothesis that is actually true) that we are willing to allow, commonly 5 percent, 1 percent or 10 percent. It refers to Type-I error, not Type-II error. For example, at the 5% level of significance, if we carried out the test on 100 independent samples for which the null hypothesis is true, we would expect to wrongly reject it in about 5 of them. In other words, it is the researcher's tolerance level: how much Type-I error he is willing to accept.
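A simulation can show what the level of significance controls. In this sketch the null hypothesis (mean = 0) is true in every repetition, so every rejection is a Type-I error. The sample size of 30 and the 2000 repetitions are arbitrary choices.

```r
set.seed(99)
alpha <- 0.05      # level of significance
reps <- 2000       # number of simulated studies

# Each repetition draws a sample for which the null hypothesis is true
# and records the p-value of a one-sample t-test.
p_values <- replicate(reps, t.test(rnorm(30, mean = 0), mu = 0)$p.value)

# Proportion of repetitions in which the true null hypothesis is rejected.
type1_rate <- mean(p_values < alpha)
print(type1_rate)
```

The rejection rate should come out close to alpha.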

Class –Simple and multiple regression-SMB

Some conceptual points are discussed as follows.

• We say association when we have categorical variables.

• Similarly, when we have continuous variables we say relation between variables.

Odds: the ratio of the probability that an event occurs to the probability that it does not occur. The odds ratio is the ratio of two such odds.

There are three ways to summarize the dependent variable for categorical data: 1. Absolute values, 2. Relative risk, 3. Odds ratio.
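A worked sketch of these quantities for a made-up 2x2 situation: 30 events among 100 exposed people and 10 events among 100 unexposed people.

```r
# Hypothetical counts, for illustration only.
events_exposed <- 30;   n_exposed <- 100
events_unexposed <- 10; n_unexposed <- 100

# Absolute values: the risk (probability of the event) in each group.
risk_exposed <- events_exposed / n_exposed
risk_unexposed <- events_unexposed / n_unexposed

# Relative risk: ratio of the two probabilities.
relative_risk <- risk_exposed / risk_unexposed

# Odds: probability the event occurs divided by probability it does not occur.
odds_exposed <- risk_exposed / (1 - risk_exposed)
odds_unexposed <- risk_unexposed / (1 - risk_unexposed)

# Odds ratio: ratio of the two odds.
odds_ratio <- odds_exposed / odds_unexposed

print(c(relative_risk = relative_risk, odds_ratio = odds_ratio))
```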

For drawing inference in logistic regression, a formal test such as the Pearson chi-square test is used to test significance.

In logistic regression, the response variable is 0 or 1 (binary, Yes or No), while the predictor variables are continuous. The dependent variable of logistic regression is the log odds instead of the simple odds
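A minimal logistic regression sketch with glm() on simulated data. The intercept -0.5 and slope 1.5 are assumed values used only to generate the data.

```r
set.seed(5)
n <- 500
x <- rnorm(n)                          # continuous predictor
p <- plogis(-0.5 + 1.5 * x)            # assumed true probabilities
y <- rbinom(n, size = 1, prob = p)     # binary response (0 or 1)

# Logistic regression: the linear model is fitted on the log-odds scale.
fit <- glm(y ~ x, family = binomial)

log_odds <- predict(fit, type = "link")      # fitted log odds
prob <- predict(fit, type = "response")      # fitted probabilities

print(coef(fit))
```

The fitted log odds and fitted probabilities are linked by prob = exp(log_odds) / (1 + exp(log_odds)), which is what plogis() computes.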

Time series: just two lines on the difference between the additive model and the multiplicative model.

When the fluctuation is not constant, use the multiplicative model; when the fluctuation is constant, use the additive model in time series.
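Both models can be tried with decompose() on the AirPassengers data set that ships with R. The choice of this data set is an assumption for illustration; its seasonal swings grow with the level of the series.

```r
# Additive model: series = trend + seasonal + random.
add <- decompose(AirPassengers, type = "additive")

# Multiplicative model: series = trend * seasonal * random.
mult <- decompose(AirPassengers, type = "multiplicative")

# Seasonal effects: centred on 0 for additive, centred on 1 for multiplicative.
print(round(add$figure, 1))
print(round(mult$figure, 3))
```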

Class –Multivariate method-SMB

In day-to-day life we generally face the problem of classification. For example: whether a customer has the capacity to repay a loan, whether a person will get cured if he continues a particular medicine, whether I should visit a particular doctor, and so on. These are examples of "yes" or "no" questions. We call this type of problem a supervised classification problem, because the possible classes (here yes or no) are already known in advance. Besides the yes/no questions, there are classification problems in real life that are very difficult to classify. For example, after securing a good percentage in higher secondary (+2 science), should she/he go for an engineering course, a medical course, a basic science course, a CEPET course, a B.Math course, a B.Stat course, a certain type of animation course, and so on. Here many groupings are possible and they are not fixed in advance. We call this type of problem an unsupervised classification problem. Here we can’t convert into ‘yes’ or ‘No’

• For supervised classification problems, we generally deal with discriminant analysis.

The discriminant may be linear or quadratic.

• For unsupervised problems, we deal with principal component analysis, cluster analysis

• For principal component analysis, a data set in R: USArrests
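A short sketch of both unsupervised methods on USArrests. Scaling the variables and asking for 3 clusters are assumptions made for illustration.

```r
# Principal component analysis on standardized variables.
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Cluster analysis: k-means on the standardized data.
set.seed(10)
km <- kmeans(scale(USArrests), centers = 3, nstart = 20)   # 3 clusters is arbitrary
print(table(km$cluster))
```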

This material is written entirely by Ramesh Chandra Das, Research Scholar at IIT Kharagpur.

If there is any error, please let me know and I will try to rectify it to the best of my capacity.

He can be reached at the e-mail address rameshchandradas99@gmail.com
