Univariate Analysis
Lecturer Note
VARIABLE ANALYSIS
v What is Univariate Analysis?
Univariate analysis is the simplest form of data analysis. “Uni” means “one”,
so the data has only one variable. It doesn’t deal with causes or
relationships (unlike regression); its main purpose is to describe: it takes
data, summarizes it, and finds patterns in it.
Univariate data consists of only one variable. Because the information deals
with a single quantity that changes, its analysis does not look at causes or
relationships; the aim is to describe the data and find patterns within it.
Height is an example of univariate data.
Table 1
Suppose the heights of seven students in a class are recorded (Table 1). There
is only one variable, height, and no cause or relationship is being examined.
Patterns in this type of data can be described using measures of central
tendency (mean, median and mode), measures of dispersion or spread (range,
minimum, maximum, quartiles, variance and standard deviation), and frequency
distribution tables, histograms, pie charts, frequency polygons and bar charts.
What is a variable in Univariate Analysis?
A variable in univariate analysis is the single characteristic or quantity
that is recorded for each observation. You can think of it as a “category”
that your data falls into. For example, the analysis might look at a variable
of “age”, or it might look at “sales”, “height” or “weight”. However, it
doesn’t look at more than one variable at a time; otherwise it becomes
bivariate analysis (or, with three or more variables, multivariate analysis).
The following frequency distribution table (Table 2)
shows one variable (left column) and the count in the right column.
Table 2 Weekly sales
| Sales (weekly) | Amount (Rs) |
| Monday | 25000 |
| Tuesday | 22000 |
| Wednesday | 23000 |
| Thursday | 22000 |
| Friday | 23000 |
| Saturday | 21000 |
| Sunday | 26000 |
| Total | 162000 |
You could have more than one variable in the above table. For example, you
could add the variable “Location” or “Age” or something else, and make a
separate column for it. In that case you would have bivariate data, because
you would then have two variables.
Univariate Descriptive Statistics
Some ways you can describe patterns found in univariate
data include central tendency (mean, mode and median) and dispersion: range,
variance, maximum, minimum, quartiles (including the interquartile range), and
standard deviation.
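As a rough illustration, the measures listed above can be computed for the weekly sales amounts in Table 2 with Python's standard `statistics` module. This is only a sketch: the choice of sample (rather than population) variance and of the module's default quartile method are assumptions, since the notes do not say which definitions they intend.

```python
import statistics

# Weekly sales amounts (Rs) from Table 2, Monday through Sunday.
sales = [25000, 22000, 23000, 22000, 23000, 21000, 26000]

# Central tendency
mean = statistics.mean(sales)
median = statistics.median(sales)
modes = statistics.multimode(sales)  # every value tied for most frequent

# Dispersion
minimum, maximum = min(sales), max(sales)
value_range = maximum - minimum
q1, q2, q3 = statistics.quantiles(sales, n=4)  # quartile cut points (default method)
iqr = q3 - q1                                  # interquartile range
variance = statistics.variance(sales)          # sample variance (n - 1 denominator)
std_dev = statistics.stdev(sales)              # sample standard deviation

print(f"mean={mean:.2f}, median={median}, modes={modes}")
print(f"min={minimum}, max={maximum}, range={value_range}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"variance={variance:.2f}, std dev={std_dev:.2f}")
```

Note that this data has two modes (22000 and 23000), which is why `multimode` is used instead of `mode`.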
You have several options for describing univariate data. The data above can
be represented in the following ways:
Frequency Distribution Tables.
Bar Charts.
Histograms.
Frequency Polygons.
Pie Charts.
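The first two options can be sketched in a few lines of Python. The example below builds a frequency distribution of the sales amounts from Table 2 and prints a crude text bar chart from it; the `#` bars are just an illustrative stand-in for a real chart.

```python
from collections import Counter

# Weekly sales amounts (Rs) from Table 2, Monday through Sunday.
sales = [25000, 22000, 23000, 22000, 23000, 21000, 26000]

# Frequency distribution: how many days recorded each amount.
frequency = Counter(sales)

print("Amount (Rs) | Days | Bar")
for amount in sorted(frequency):
    count = frequency[amount]
    print(f"{amount:>11} | {count:>4} | {'#' * count}")
```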
v What is bivariate analysis?
Bivariate analysis is a form of statistical analysis in which two variables
are observed. One variable is dependent while the other is independent; they
are usually denoted by X and Y. The analysis looks at whether the two
variables change together and to what extent. Besides bivariate analysis,
there are two other types of statistical analysis: univariate (for one
variable) and multivariate (for multiple variables).
Correlation and Regression
Covariance
Definition of Bivariate Analysis
Bivariate analysis is the analysis of any concurrent relation between two
variables or attributes. It explores whether the two variables are related
and how strong that relationship is, and it looks for differences between the
two variables and the possible causes of those differences. Common tools
include percentage tables and scatter plots.
This type of data involves two different variables. Its analysis deals with
causes and relationships, and it is done to find the relationship between the
two variables. An example of bivariate data is temperature and ice cream
sales in the summer season.
Table 3
Suppose temperature and ice cream sales are the two variables of a bivariate
data set (Table 3). The table shows that the two are positively related: as
the temperature increases, sales also increase. Thus bivariate data analysis
involves comparisons, relationships, causes and explanations. The variables
are often plotted on the X and Y axes of a graph to make the data easier to
understand; one of them is independent while the other is dependent.
The relationship is best shown with a scatter plot.
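The strength of such a relationship can be quantified with the covariance and the correlation coefficient. The sketch below uses made-up temperature and sales figures (they are hypothetical and are not the values of Table 3) and computes the sample covariance and Pearson correlation by hand.

```python
import statistics

# Hypothetical data, invented for illustration only (not taken from Table 3).
temperature_c = [20, 25, 30, 35, 40]              # independent variable X
ice_cream_sales = [2000, 2500, 3200, 3800, 4500]  # dependent variable Y


def sample_covariance(xs, ys):
    """Sample covariance of two equally long sequences (n - 1 denominator)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cross_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross_products / (len(xs) - 1)


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


covariance = sample_covariance(temperature_c, ice_cream_sales)
correlation = pearson_r(temperature_c, ice_cream_sales)
print(f"covariance={covariance:.1f}, correlation={correlation:.4f}")
```

A correlation close to +1 indicates that sales rise almost linearly with temperature in this invented sample; it does not by itself show that temperature causes the sales.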
v What is multivariate analysis?
When the data involves three or more variables, it is categorized as
multivariate. For example, suppose an advertiser wants to compare the
popularity of four advertisements on a website. Their click rates could be
measured for both men and women, and the relationships between the variables
can then be examined.
It is similar to bivariate analysis but involves more variables, and it may
contain more than one dependent variable. The way to analyse this data
depends on the goals to be achieved. Some of the techniques are regression
analysis, path analysis, factor analysis and multivariate analysis of
variance (MANOVA).
Multivariate analysis is used to study more complex sets of data than
univariate analysis methods can handle. This type of analysis is almost
always performed with software (e.g. SPSS, SAS, R or Gretl), as working with
even the smallest of data sets can be overwhelming by hand.
Multivariate analysis can reduce the likelihood of Type I errors compared
with running a separate univariate test on each variable. Sometimes, however,
univariate analysis is preferred because the results of multivariate
techniques can be difficult to interpret. For example, group differences on a
linear combination of dependent variables in MANOVA can be unclear. In
addition, multivariate analysis is usually unsuitable for small sets of data.
There are more than 20 different ways to perform
multivariate analysis. Which one you choose depends upon the type of data you
have and what your goals are. For example, if you have a single data set you
have several choices:
Additive trees, multidimensional scaling and cluster analysis are appropriate
when the rows and columns in your data table represent the same units and the
measure is either a similarity or a distance.
Principal component analysis (PCA) decomposes a data
table with correlated measures into a new set of uncorrelated measures.
Correspondence analysis is similar to PCA. However, it
applies to contingency tables.
Although there are fairly clear boundaries with one data set (for example, if
you have a single data set in a contingency table, your options are limited
to correspondence analysis), in most cases you’ll be able to choose from
several methods. A few examples of multivariate techniques are as follows:
Additive Tree.
Canonical Correlation Analysis.
Cluster Analysis.
Correspondence Analysis / Multiple Correspondence Analysis.
Factor Analysis.
Independent Component Analysis.
MANOVA.
Multidimensional Scaling.
Multiple Regression Analysis.
Partial Least Squares Regression.
Principal Component Analysis / Regression / PARAFAC.
Redundancy Analysis.
DATA STRUCTURE
v What is cross sectional data?
Cross sectional data is the data gathered in a cross sectional study. It is
collected by observing many subjects (such as firms, countries, regions or
individuals) at the same point in time, and it is analyzed by comparing the
differences among the subjects.
Basically, cross sectional data is data collected from all the participants
at the same time. Time is not considered a study variable in cross sectional
research. In practice, though, the participants in a cross sectional study do
not all give their information at exactly the same moment.
Instead, the data is collected from the participants within a short time
frame, known as the field period. Differences in timing within this period
only add variance to the results; they do not bias them.
Cross sectional data example
Take an example. Suppose you want to measure current blood pressure levels in
a population. 1000 people are selected randomly from that population (this
sample is also called a cross section of the population). Their blood
pressure is then measured, and their height, weight and other health factors
are also noted.
This cross sectional data provides you with a snapshot of that population. It
only shows the current distribution of blood pressure levels. On the basis of
just one cross sectional sample, you can’t judge whether blood pressure in
the population is rising or how fast, but it does give you an idea of the
current situation.
Another example of cross sectional data is a study of the ice cream flavours
offered at a particular store and how people respond to those flavours. You
can also obtain cross sectional data from a list of grades scored by a class
of students on a particular test.
Data collected at a particular coffee shop on last month's sales revenue,
sales volume, expenses and number of customers is also cross-sectional data.
If you expand your data collection to record daily sales revenue and expenses
over a span of a few months, you will then have a time series for costs and
sales.
v What is a Time Series data?
A time series is a sequence of numerical data points in
successive order. In investing, a time series tracks the movement of the chosen
data points, such as a security’s price, over a specified period of time with
data points recorded at regular intervals. There is no minimum or maximum
amount of time that must be included, allowing the data to be gathered in a way
that provides the information being sought by the investor or analyst examining
the activity.
[Important: Time
series analysis can be useful to see how a given asset, security, or economic
variable changes over time.]
- Understanding Time Series
A time series can be taken on any variable that changes
over time. This can be tracked over the short term, such as the price of a
security on the hour over the course of a business day, or the long term, such
as the price of a security at close on the last day of every month over the
course of five years.
- Time Series Analysis
Time series analysis can be useful to see how a given
asset, security, or economic variable changes over time. It can also be used to
examine how the changes associated with the chosen data point compare to shifts
in other variables over the same time period.
For example, suppose you wanted to analyze a time
series of daily closing stock prices for a given stock over a period of one
year. You would obtain a list of all the closing prices for the stock from each
day for the past year and list them in chronological order. This would be a
one-year daily closing price time series for the stock.
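The step of listing prices in chronological order can be sketched as follows. The dates and closing prices are hypothetical and cover only four days rather than a year; the point is just that a time series is the set of observations sorted by time, after which changes between successive observations can be computed.

```python
from datetime import date

# Hypothetical closing prices, deliberately stored out of order.
raw_closes = {
    date(2024, 1, 4): 101.5,
    date(2024, 1, 2): 100.0,
    date(2024, 1, 5): 103.0,
    date(2024, 1, 3): 102.0,
}

# The time series: (date, price) pairs in chronological order.
series = sorted(raw_closes.items())

# Change in closing price between each pair of successive observations.
changes = [
    (later_day, later_price - earlier_price)
    for (_, earlier_price), (later_day, later_price) in zip(series, series[1:])
]

for day, price in series:
    print(day.isoformat(), price)
print("changes:", [(day.isoformat(), delta) for day, delta in changes])
```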
Delving a bit deeper, you might analyze time series
data with technical analysis tools to know whether the stock's time series
shows any seasonality. This will help to determine if the stock goes through
peaks and troughs at regular times each year. Analysis in this area would
require taking the observed prices and correlating them to a chosen season.
This can include traditional calendar seasons, such as summer and winter, or retail
seasons, such as holiday seasons.
Alternatively, you can record a stock's share price
changes as it relates to an economic variable, such as the unemployment rate.
By correlating the data points with information relating to the selected
economic variable, you can observe patterns in situations exhibiting dependency
between the data points and the chosen variable.
- Time Series Forecasting
Time series forecasting uses information regarding
historical values and associated patterns to predict future activity. Most
often, this relates to trend analysis, cyclical fluctuation analysis, and
issues of seasonality. As with all forecasting methods, success is not
guaranteed.
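One of the simplest ways to turn historical values into a forecast is a moving average: the next value is predicted as the mean of the most recent observations. The sketch below is only an illustration of that idea, not a method prescribed by these notes; the monthly figures and the window of 3 are assumptions, and the approach ignores trend and seasonality.

```python
def moving_average_forecast(values, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    recent = values[-window:]
    return sum(recent) / window


# Hypothetical monthly sales figures, oldest first.
monthly_sales = [120, 130, 125, 140, 150, 145]

forecast = moving_average_forecast(monthly_sales, window=3)
print(f"forecast for next month: {forecast}")
```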
Time Series Analysis is used for many applications such
as:
Economic Forecasting
Sales Forecasting
Budgetary Analysis
Stock Market Analysis
Yield Projections
Process and Quality Control
Inventory Studies
Workload Projections
Utility Studies
Census Analysis
v Cross Sectional Data vs. Time Series Data
Data comes in various sizes and shapes and measures many things at different
times. Both time-series data and cross-sectional data are of particular
interest to financial analysts.
Different methods are used to analyze different types of data. It is
therefore crucial to be able to identify both time series and cross sectional
data sets. Let’s discuss each in turn and then look at the difference between
them.
- Cross sectional data
These are observations that come from different groups or individuals at a
single point in time. The underlying population should have members with
similar characteristics. For example, suppose you want to know how many
companies are spending money on research and development.
Some companies spend little on research and development and some spend a
lot, so a sample that mixes companies from very different groups gives
figures that are hard to compare. Instead, you can select companies belonging
to a similar group and then do a cross sectional analysis on them. Let’s now
talk about time-series data.
- Time-series data
These are observations collected at equally spaced time intervals. For
example, you can consider the daily closing price of a particular stock
recorded over the past four weeks. Note that choosing a time period that is
too short or too long can lead to time bias.
Other examples of time series data are the weekly sales of ice cream at a
shop during a holiday period, or staff numbers at a college recorded monthly
in order to assess staff turnover rates. Such series can be used to project
data patterns into the near future.
To put it simply: when data is collected for the same variable over time,
this type of data is called time-series data. The data might be collected
over months or years, but virtually any time interval can be used.
- Uses of Cross sectional data
Cross-sectional data is analysed with statistical and econometric techniques.
Primarily it is used for cross-sectional regression, which is regression
analysis applied to this kind of data. For instance, each individual’s
consumption expenditure in a specific month can be regressed on several
explanatory factors.
These factors can be income, wealth and various demographic characteristics.
The aim is to judge how differences in those characteristics lead to
differences in the behaviour of the consumers.
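A minimal sketch of a cross-sectional regression is given below. To keep it short it uses a single explanatory variable (income) rather than the several factors mentioned above, fits the line by ordinary least squares, and relies on invented figures for five individuals observed in the same month; all names and numbers are hypothetical.

```python
from statistics import mean

# Hypothetical cross section: five individuals observed in the same month.
income = [20000, 30000, 40000, 50000, 60000]       # explanatory variable
expenditure = [15000, 21000, 26000, 33000, 38000]  # dependent variable


def ols_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = mean(xs), mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = ols_fit(income, expenditure)
print(f"expenditure = {intercept:.0f} + {slope:.2f} * income")
```

In this invented sample the slope says that an individual with Rs 1 more income spends about Rs 0.58 more, which is a comparison across individuals at one point in time, not a change over time.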
- Some practical examples of cross-sectional data
  - Cross-sectional datasets are used mostly in finance, economics and various
    areas of the social sciences.
  - In applied microeconomics, cross-sectional data is used to study labour
    markets, public finance, industrial organization theory, and health
    finance.
  - Political researchers use cross-sectional data to analyse demographics and
    electoral engagement.
  - Cross-sectional data also has a role in comparing the financial statements
    of two or more companies, a job carried out by financial analysts. In a
    cross-sectional analysis, the comparison is made at the same point in
    time, whereas in time series analysis the financial statements of one
    company are compared across several time periods.
  - In retail, cross-sectional data plays a significant role. It can be used
    to study the spending patterns of males and females of any age group.
  - In business, cross-sectional data can be used to study how people of
    different socio-economic status in a specific geographical area respond
    to a single change.
  - In medicine and healthcare, cross-sectional data can be used to analyze
    how many children aged 4 to 14 are prone to calcium deficiency.
  - Cross-sectional data allows a large amount of information to be collected,
    which helps in quick decision making.
The concept behind Rolling Cross-Section
In a rolling cross-section, both the presence of an individual in the sample
and the time at which that individual is included are determined randomly.
Each individual is chosen at random from the existing population. After
selection, a random date is allotted to each individual. The individual is
interviewed on that random date and thereby becomes part of the survey.
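The two random steps can be sketched as follows. The population list, the sample size of 10 and the 30-day field period are all hypothetical, and the fixed seed is there only to make the sketch reproducible.

```python
import random
from datetime import date, timedelta

rng = random.Random(42)  # fixed seed so repeated runs give the same draw

# Hypothetical sampling frame and field period.
population = [f"person_{i}" for i in range(1, 101)]
field_start = date(2024, 3, 1)
field_days = 30

# Step 1: choose the individuals at random from the population.
respondents = rng.sample(population, k=10)

# Step 2: allot each chosen individual a random interview date in the field period.
interview_date = {
    person: field_start + timedelta(days=rng.randrange(field_days))
    for person in respondents
}

for person, day in sorted(interview_date.items(), key=lambda item: item[1]):
    print(day.isoformat(), person)
```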
v What is Panel data?
Panel data, also known as longitudinal data or
cross-sectional time series data in some special cases, is data that is derived
from a (usually small) number of observations over time on a (usually large)
number of cross-sectional units like individuals, households, firms, or
governments.
In the disciplines of econometrics and statistics, panel data refers to
multi-dimensional data that generally involves measurements over some period
of time. As such, panel data consists of a researcher's observations of
numerous phenomena that were collected over several time periods for the same
group of units or entities. For example, a panel data set may be one that
follows a given sample of individuals over time and records observations or
information on each individual in the sample.
Basic Examples of Panel Data Sets
The following are very basic examples of two panel data
sets for two to three individuals over the course of several years in which the
data collected or observed includes income, age, and sex:
Panel Data Set A
| Person | Year | Income (Rs) | Age | Sex |
| 1 | 2016 | 20000 | 23 | F |
| 1 | 2017 | 25000 | 24 | F |
| 1 | 2018 | 27500 | 25 | F |
| 2 | 2016 | 35000 | 27 | M |
| 2 | 2017 | 42500 | 28 | M |
| 2 | 2018 | 50000 | 29 | M |
Panel Data Set B
| Person | Year | Income (Rs) | Age | Sex |
| 1 | 2016 | 20000 | 23 | F |
| 1 | 2017 | 25000 | 24 | F |
| 2 | 2016 | 27500 | 25 | M |
| 2 | 2017 | 35000 | 27 | M |
| 2 | 2018 | 42500 | 28 | M |
| 3 | 2016 | 50000 | 29 | F |
Both Panel Data Set A and Panel Data Set B above show the data collected (the
characteristics of income, age, and sex) over the course of several years for
different people. Panel Data Set A shows the data collected for two people
(person 1 and person 2) over the course of three years (2016, 2017, and
2018). This example data set would be considered a balanced panel because
each person is observed for the defined characteristics of income, age, and
sex in each year of the study. Panel Data Set B, on the other hand, would be
considered an unbalanced panel, as data does not exist for each person in
each year. Person 2 is observed in all three years, but person 1 is observed
only in 2016 and 2017, and person 3 only in 2016.
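Whether a panel is balanced can be checked mechanically: every person must appear in every year that occurs in the data. The sketch below applies that check to the (person, year) pairs of Panel Data Sets A and B; the function names are made up for this illustration.

```python
# (person, year) pairs observed in each panel, copied from the two tables above.
panel_a = [(1, 2016), (1, 2017), (1, 2018), (2, 2016), (2, 2017), (2, 2018)]
panel_b = [(1, 2016), (1, 2017), (2, 2016), (2, 2017), (2, 2018), (3, 2016)]


def missing_observations(observations):
    """Return the (person, year) combinations that are absent from the panel."""
    persons = {person for person, _ in observations}
    years = {year for _, year in observations}
    full_grid = {(person, year) for person in persons for year in years}
    return full_grid - set(observations)


def is_balanced(observations):
    """A panel is balanced when every person is observed in every year."""
    return not missing_observations(observations)


print("A balanced:", is_balanced(panel_a))
print("B balanced:", is_balanced(panel_b))
print("B missing:", sorted(missing_observations(panel_b)))
```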
- Analysis of Panel Data in Economic Research
There are two distinct sets of information that can be derived from
cross-sectional time series data. The cross-sectional component of the data
set reflects the differences observed between the individual subjects or
entities, whereas the time series component reflects the differences observed
for one subject over time. For instance, researchers could focus on the
differences in data between each person in a panel study and/or the changes
in observed phenomena for one person over the course of the study (e.g., the
changes in income over time of person 1 in Panel Data Set A above).
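Both components can be read off Panel Data Set A with a few lines of Python. In this sketch the cross-sectional component is summarised by each person's average income, and the time series component by person 1's year-over-year income change; these particular summaries are an illustrative choice, not the only way to separate the two.

```python
from statistics import mean

# (person, year, income in Rs) rows from Panel Data Set A.
panel_a = [
    (1, 2016, 20000), (1, 2017, 25000), (1, 2018, 27500),
    (2, 2016, 35000), (2, 2017, 42500), (2, 2018, 50000),
]

# Group the (year, income) observations by person.
income_by_person = {}
for person, year, income in panel_a:
    income_by_person.setdefault(person, []).append((year, income))

# Cross-sectional component: differences between people (average income each).
mean_income = {
    person: mean(income for _, income in observations)
    for person, observations in income_by_person.items()
}

# Time series component: changes for one person (person 1) over time.
person_1 = sorted(income_by_person[1])
income_change = {
    later_year: later_income - earlier_income
    for (_, earlier_income), (later_year, later_income) in zip(person_1, person_1[1:])
}

print("average income by person:", mean_income)
print("person 1 year-over-year change:", income_change)
```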
It is panel data regression methods that permit economists to use these
various sets of information provided by panel data. As such, analysis of
panel data can become extremely complex. But this flexibility is precisely
the advantage of panel data sets for economic research as opposed to
conventional cross-sectional or time series data. Panel data gives
researchers a large number of unique data points, which increases the
researcher's degrees of freedom to explore explanatory variables and
relationships.
- Example for a balanced panel:
The CSO household survey in India has the same sample size of 22,500
households each quarter. Each household has to record its consumption
expenditures for 5 quarters, so each quarter 22,500 / 5 = 4,500 households
enter the survey and 4,500 leave it. This is a balanced panel.
v What is Pooled Data?
In a pooled cross section, we take random samples of different units in
different time periods, i.e. each sample we take is populated by different
individuals. This is often used to see the impact of a policy or programme.
For example, we take household income data on households X, Y and Z in 2000,
and then we take the same income data on households G, F and A in 2005.
Although we are interested in the same data, we are taking different samples
(using different households) in different time periods.
Pooling data refers to combining two or more independent data sets of the
same type.
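The household example can be sketched as follows. The household labels and years come from the example above, but the income figures are invented; each pooled row is tagged with its survey year so that the two independent samples can still be told apart after they are combined.

```python
from statistics import mean

# Two independent random samples; the incomes (Rs) are hypothetical.
sample_2000 = {"X": 30000, "Y": 42000, "Z": 36000}
sample_2005 = {"G": 41000, "F": 39000, "A": 46000}

# Pool the samples into one data set, keeping the year as a column.
pooled = [
    {"household": household, "year": year, "income": income}
    for year, sample in ((2000, sample_2000), (2005, sample_2005))
    for household, income in sample.items()
]

# Only overall comparisons are possible, e.g. average income by year.
mean_by_year = {
    year: mean(row["income"] for row in pooled if row["year"] == year)
    for year in (2000, 2005)
}
change_in_mean = mean_by_year[2005] - mean_by_year[2000]

print(mean_by_year, "change:", change_in_mean)
```

Because no household appears in both years, the data cannot show how any single household's income changed, which is what distinguishes a pooled cross section from a panel.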
- Pooled time series
We observe, e.g., the return series of several sectors, which are assumed to
be independent of each other, together with explanatory variables. The number
of sectors, N, is usually small. Observations are viewed as repeated measures
at each point in time, so parameters can be estimated with higher precision
due to the increased sample size.
- Pooled cross sections
This type of data mostly arises in surveys, where people are asked about,
e.g., their attitudes to political parties. The survey is repeated T times,
every week before an election. T is usually small. So we have several cross
sections, but the persons asked are chosen randomly, and hardly any person in
one cross section is a member of another. The cross sections are independent.
Only overall questions can be answered, such as the attitudes among men or
among women; no individual (even anonymous) paths can be identified.