Univariate Analysis

 

Lecturer Note

VARIABLE ANALYSIS

❖ What is Univariate Analysis?

Univariate analysis is the simplest form of data analysis. “Uni” means “one”, so the data has only one variable. It does not deal with causes or relationships (unlike regression); its main purpose is to describe: it takes data, summarizes it, and finds patterns in it.

Univariate data consists of only one variable. Its analysis is therefore the simplest form of analysis, since the information deals with only one quantity that changes. An example of univariate data is height.

Table 1

Suppose that the heights of seven students in a class are recorded (Table 1). There is only one variable, height, and the data does not deal with any cause or relationship. Patterns in this type of data can be described using measures of central tendency (mean, median and mode), measures of dispersion or spread (range, minimum, maximum, quartiles, variance and standard deviation), and frequency distribution tables, histograms, pie charts, frequency polygons and bar charts.

What is a variable in Univariate Analysis?

A variable in univariate analysis is just a condition or subset that your data falls into. You can think of it as a “category.” For example, the analysis might look at a variable such as “age”, “sales”, “height” or “weight”. However, it doesn’t look at more than one variable at a time; otherwise it becomes bivariate analysis (or, with three or more variables, multivariate analysis).

The following table (Table 2) records one variable, weekly sales: the left column lists the day and the right column gives the amount.

Table 2 Weekly sales

Sales (weekly)    Amount (Rs)
Monday            25000
Tuesday           22000
Wednesday         23000
Thursday          22000
Friday            23000
Saturday          21000
Sunday            26000
Total             162000

You could have more than one variable in the above table. For example, you could add the variable “Location” or “Age” and make a separate column for it. In that case you would have bivariate data, because you would then have two variables.

Univariate Descriptive Statistics

Some ways you can describe patterns found in univariate data include central tendency (mean, mode and median) and dispersion: range, variance, maximum, minimum, quartiles (including the interquartile range), and standard deviation.
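As a minimal sketch, the measures listed above can be computed for the weekly sales amounts in Table 2 with Python's standard `statistics` module. The variable names are illustrative, and the quartiles use the module's default ("exclusive") method, so other software may report slightly different quartile values.

```python
import statistics

# Weekly sales amounts (Rs) from Table 2, Monday to Sunday.
sales = [25000, 22000, 23000, 22000, 23000, 21000, 26000]

# Central tendency
mean_sales = statistics.mean(sales)
median_sales = statistics.median(sales)
modes = statistics.multimode(sales)  # every value tied for most frequent

# Dispersion
minimum, maximum = min(sales), max(sales)
value_range = maximum - minimum
variance = statistics.variance(sales)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(sales)      # sample standard deviation
q1, q2, q3 = statistics.quantiles(sales, n=4)
iqr = q3 - q1                          # interquartile range

print(f"mean={mean_sales:.2f} median={median_sales} modes={modes}")
print(f"min={minimum} max={maximum} range={value_range}")
print(f"variance={variance:.2f} std_dev={std_dev:.2f}")
print(f"quartiles={q1}, {q2}, {q3} iqr={iqr}")
```

Two amounts (22000 and 23000) each occur twice, which is why `multimode` is used here rather than a function that returns a single mode.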

You have several options for presenting univariate data. The data above can be represented in the following ways:

Frequency Distribution Tables.

Bar Charts.

Histograms.

Frequency Polygons.

Pie Charts.
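The first two options can be sketched in a few lines of Python. This is only an illustration using the amounts from Table 2: it counts how many days recorded each amount (a frequency distribution) and prints a rough text bar chart, where a plotting library would normally be used.

```python
from collections import Counter

# Weekly sales amounts (Rs) from Table 2, Monday to Sunday.
sales = [25000, 22000, 23000, 22000, 23000, 21000, 26000]

# Frequency distribution: how many days recorded each amount.
frequency = Counter(sales)

# Text bar chart: one '#' per day, listed from the lowest amount upwards.
for amount in sorted(frequency):
    count = frequency[amount]
    print(f"{amount:>6} | {'#' * count} ({count})")
```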

❖ What is bivariate analysis?

Bivariate analysis is a type of statistical analysis in which two variables are observed. One variable is dependent while the other is independent. These variables are usually denoted by X and Y. Here we analyse how the two variables change together, and to what extent. Apart from bivariate analysis, there are two other types of statistical analysis: univariate (for one variable) and multivariate (for multiple variables).

Definition of Bivariate Analysis

Bivariate analysis is the analysis of any concurrent relation between two variables or attributes. It explores the relationship between the two variables and the strength of that relationship, to find out whether there are any discrepancies between them and what causes those differences. Examples of bivariate tools include the percentage table and the scatter plot.

This type of data involves two different variables. Its analysis deals with causes and relationships, and is done to find out the relationship between the two variables. An example of bivariate data is temperature and ice cream sales in the summer season.

Table 3

Suppose temperature and ice cream sales are the two variables of a bivariate data set (Table 3). The table shows that the two are positively related: as the temperature increases, sales also increase. Bivariate data analysis thus involves comparisons, relationships, causes and explanations. The variables are often plotted on the X and Y axes of a graph for a better understanding of the data, with one variable independent and the other dependent.

This relationship is better represented using a scatter plot.
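The strength of such a relationship is usually summarized with the covariance and the correlation coefficient. The sketch below computes both in plain Python. The temperature and sales figures are invented for illustration, since the values of Table 3 are not reproduced in these notes, and the function names are likewise only illustrative.

```python
import math

# HYPOTHETICAL data: these values are invented, not taken from Table 3.
temperature_c = [20, 25, 30, 35, 40]
ice_cream_sales = [2000, 2500, 3100, 3400, 4100]


def mean(values):
    return sum(values) / len(values)


def sample_covariance(x, y):
    """Sample covariance of two equally long lists (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_correlation(x, y):
    """Covariance scaled by both standard deviations; lies between -1 and 1."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )


cov = sample_covariance(temperature_c, ice_cream_sales)
r = pearson_correlation(temperature_c, ice_cream_sales)
print(f"covariance={cov:.1f} correlation={r:.3f}")
```

A correlation close to +1 indicates that sales rise with temperature in this sample. On its own it does not establish that one variable causes the other.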

❖ What is multivariate analysis?

When the data involves three or more variables, it is categorized as multivariate. For example, suppose an advertiser wants to compare the popularity of four advertisements on a website. Their click rates could be measured for both men and women, and the relationships between the variables could then be examined.

It is similar to bivariate analysis but contains more than one dependent variable. The way to analyse this data depends on the goals to be achieved. Some of the techniques are regression analysis, path analysis, factor analysis and multivariate analysis of variance (MANOVA).

Multivariate analysis is used to study more complex sets of data than univariate analysis methods can handle. This type of analysis is almost always performed with software (e.g. SPSS, SAS, R or Gretl), as working by hand with even the smallest data sets can be overwhelming.

Multivariate analysis can reduce the likelihood of the Type I errors that arise when many separate univariate tests are run. However, univariate analysis is sometimes preferred, because the results of multivariate techniques can be difficult to interpret. For example, group differences on a linear combination of dependent variables in MANOVA can be unclear. In addition, multivariate analysis is usually unsuitable for small data sets.

There are more than 20 different ways to perform multivariate analysis. Which one you choose depends upon the type of data you have and what your goals are. For example, if you have a single data set you have several choices:

Additive trees, multidimensional scaling and cluster analysis are appropriate when the rows and columns in your data table represent the same units and the measure is either a similarity or a distance.

Principal component analysis (PCA) decomposes a data table with correlated measures into a new set of uncorrelated measures.

Correspondence analysis is similar to PCA. However, it applies to contingency tables.
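To make the PCA description concrete, here is a minimal sketch for the special case of two correlated measures, where the principal axes can be found with a single rotation angle instead of a general eigenvalue routine. The data values are hypothetical, the analysis works on the covariance matrix without rescaling the variables, and real analyses would normally rely on a statistics package.

```python
import math

# HYPOTHETICAL, correlated measurements (e.g. height in cm and weight in kg).
x = [150, 160, 165, 170, 180]
y = [50, 56, 61, 64, 72]


def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Sample covariance (divides by n - 1)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


# Center each measure on its mean.
xc = [a - mean(x) for a in x]
yc = [b - mean(y) for b in y]

# Entries of the 2x2 covariance matrix [[a, b], [b, c]].
a, b, c = cov(xc, xc), cov(xc, yc), cov(yc, yc)

# For a symmetric 2x2 matrix, rotating by theta diagonalizes it.
theta = 0.5 * math.atan2(2 * b, a - c)
cos_t, sin_t = math.cos(theta), math.sin(theta)

# Scores on the two principal components (the new, uncorrelated measures).
pc1 = [cos_t * u + sin_t * v for u, v in zip(xc, yc)]
pc2 = [-sin_t * u + cos_t * v for u, v in zip(xc, yc)]

print(f"covariance before: {b:.2f}, after: {cov(pc1, pc2):.6f}")
print(f"variance of PC1: {cov(pc1, pc1):.2f}, of PC2: {cov(pc2, pc2):.2f}")
```

The two original measures are strongly correlated, while the two component scores have (numerically) zero covariance, and the first component carries most of the total variance.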

Although there are fairly clear boundaries with one data set (for example, if you have a single data set in a contingency table your options are limited to correspondence analysis), in most cases you’ll be able to choose from several methods. A few examples of multivariate techniques are as follows:


Additive Tree.

Canonical Correlation Analysis.

Cluster Analysis.

Correspondence Analysis / Multiple Correspondence Analysis.

Factor Analysis.

Independent Component Analysis.

MANOVA.

Multidimensional Scaling.

Multiple Regression Analysis.

Partial Least Square Regression.

Principal Component Analysis / Regression / PARAFAC.

Redundancy Analysis.


DATA STRUCTURE

❖ What is cross sectional data?

Cross sectional data is the data produced by a cross sectional study. It is collected by observing various subjects (such as firms, countries, regions or individuals) at the same point in time, and is analyzed by comparing the differences among the subjects.

Basically, cross sectional data is data collected from all the participants at the same time. Time is not considered a study variable in cross sectional research, even though in practice not all participants provide their information at exactly the same moment.

Cross sectional data is collected from the participants within a short time frame, known as the field period. Differences in the exact time of response only add variance to the results; they do not bias them.

Cross sectional data example

Take an example. Suppose you want to measure current blood pressure levels in a population. 1000 people are selected randomly from that population (this is called a cross section of the population). Their blood pressure is then measured, and their height, weight and other health factors are also noted.

This cross sectional data provides a snapshot of that population. It shows only the current distribution of blood pressure levels. On the basis of just one cross sectional sample, you can’t judge whether blood pressure is rising quickly or slowly, but it does give you an idea of the current situation.

Another example is a cross sectional study of the ice cream flavours offered at a particular store and how people respond to those flavours. You can also obtain cross sectional data from a list of grades scored by a class of students on a particular test.

Data collected on last month’s sales revenue, sales volume, expenses and number of customers at a particular coffee shop is also cross-sectional data. If you expand your data collection to record daily sales revenue and expenses over a few months, you will then have a time series for costs and sales.

❖ What is Time Series data?

A time series is a sequence of numerical data points in successive order. In investing, a time series tracks the movement of the chosen data points, such as a security’s price, over a specified period of time with data points recorded at regular intervals. There is no minimum or maximum amount of time that must be included, allowing the data to be gathered in a way that provides the information being sought by the investor or analyst examining the activity.

 [Important: Time series analysis can be useful to see how a given asset, security, or economic variable changes over time.]

 

  • Understanding Time Series

A time series can be taken on any variable that changes over time. This can be tracked over the short term, such as the price of a security on the hour over the course of a business day, or the long term, such as the price of a security at close on the last day of every month over the course of five years.

  • Time Series Analysis

Time series analysis can be useful to see how a given asset, security, or economic variable changes over time. It can also be used to examine how the changes associated with the chosen data point compare to shifts in other variables over the same time period.

For example, suppose you wanted to analyze a time series of daily closing stock prices for a given stock over a period of one year. You would obtain a list of all the closing prices for the stock from each day for the past year and list them in chronological order. This would be a one-year daily closing price time series for the stock.
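The construction described above (collect the closing prices, then list them in chronological order) can be sketched as follows. The dates and prices are hypothetical and the series is kept to four days for brevity.

```python
from datetime import date

# HYPOTHETICAL closing prices, deliberately listed out of order.
observations = [
    (date(2024, 1, 4), 103.0),
    (date(2024, 1, 2), 100.0),
    (date(2024, 1, 5), 101.5),
    (date(2024, 1, 3), 102.0),
]

# A time series is ordered in time; tuples sort by their first item (the date).
series = sorted(observations)

# Day-over-day change in the closing price.
changes = [
    (later_day, later_price - earlier_price)
    for (_, earlier_price), (later_day, later_price) in zip(series, series[1:])
]

for day, change in changes:
    print(f"{day.isoformat()}: {change:+.2f}")
```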

Delving a bit deeper, you might analyze time series data with technical analysis tools to know whether the stock's time series shows any seasonality. This will help to determine if the stock goes through peaks and troughs at regular times each year. Analysis in this area would require taking the observed prices and correlating them to a chosen season. This can include traditional calendar seasons, such as summer and winter, or retail seasons, such as holiday seasons.

Alternatively, you can record a stock's share price changes as they relate to an economic variable, such as the unemployment rate. By correlating the data points with information on the selected economic variable, you can observe patterns that suggest a dependency between the data points and the chosen variable.

 

  • Time Series Forecasting

Time series forecasting uses information regarding historical values and associated patterns to predict future activity. Most often, this relates to trend analysis, cyclical fluctuation analysis, and issues of seasonality. As with all forecasting methods, success is not guaranteed.
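One of the simplest ways to use historical values for prediction is a moving average, sketched below. It is shown only as an illustration: the function name and the monthly sales figures are hypothetical, and the method deliberately ignores trend and seasonality, which more elaborate forecasting techniques try to model.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or len(history) < window:
        raise ValueError("window must be positive and no longer than the history")
    recent = history[-window:]
    return sum(recent) / window


# HYPOTHETICAL monthly sales figures, oldest first.
monthly_sales = [120, 130, 125, 140, 150, 145]

forecast = moving_average_forecast(monthly_sales, window=3)
print(f"forecast for next month: {forecast:.1f}")
```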

Time Series Analysis is used for many applications such as:


Economic Forecasting

Sales Forecasting

Budgetary Analysis

Stock Market Analysis

Yield Projections

Process and Quality Control

Inventory Studies

Workload Projections

Utility Studies

Census Analysis


❖ Cross Sectional Data vs. Time Series Data

Data comes in various sizes and shapes, and measures many things at different times. Both time-series data and cross-sectional data are of particular interest to financial analysts.

Different methods are used to analyze different types of data, so it is crucial to be able to identify both time series and cross sectional data sets. Let’s discuss each in turn and look at the differences between them.

 

 

  • Cross sectional data

These are observations that come from different groups or individuals at a single point in time. The underlying population should have members with similar characteristics. For example, suppose you want to know how much companies are spending on research and development.

Some companies spend little and some spend a lot on research and development. Because the companies belong to different groups, the data will vary widely. Instead, you can select companies belonging to a similar group and then carry out a cross sectional analysis on them.

  • Time-series data

These are observations collected at equally spaced time intervals. For example, consider the daily closing price of a particular stock recorded over the past four weeks. Note that a period that is too short or too long can lead to time bias.

Other examples of time series data include the weekly sales of ice cream at a shop during a holiday period, or the number of staff at a college recorded monthly to assess staff turnover rates. Such series can be used to project data patterns into the near future.

To put it simply, when data is collected for the same variable over time, this type of data is called time-series data. The data might be collected over months or years, but virtually any time interval can be used.

  • Uses of Cross sectional data

Cross-sectional data is used in econometrics and statistics, primarily for cross-sectional regression, a kind of regression analysis for this type of data. For instance, each individual’s consumption expenditure in a specific month can be regressed on different characteristics.

These characteristics can include income, wealth and various demographic characteristics. The aim is to judge how differences in those characteristics lead to differences in consumer behaviour.
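A cross-sectional regression with a single explanatory variable can be sketched with ordinary least squares in plain Python. The income and expenditure figures are hypothetical (five individuals observed in the same month), and the helper name `fit_line` is only illustrative.

```python
# HYPOTHETICAL cross section: five individuals observed in the same month.
income = [20000, 30000, 40000, 50000, 60000]
expenditure = [15000, 21000, 26000, 33000, 38000]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(income, expenditure)
print(f"expenditure = {intercept:.0f} + {slope:.2f} * income")
```

In this made-up sample the slope says that an individual with Rs 1 more income spends about Rs 0.58 more in the month. With several characteristics (wealth, demographics) the same idea extends to multiple regression.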

  • Some practical examples of cross-sectional data

✓ Cross-sectional datasets are used mostly in finance, economics and various areas of the social sciences.

✓ In applied microeconomics, cross-sectional data is used to study labour markets, public finance, industrial organization theory, and health economics.

✓ Political researchers use cross-sectional data to analyze demography and electoral campaigns.

✓ Cross-sectional data also has a role in comparing the financial statements of two or more companies, a job carried out by financial analysts. In a cross-sectional analysis, the comparison is made at the same point in time, whereas in time series analysis the financial statements of one company are compared across several time periods.

✓ In retail, cross-sectional data plays a significant role. It can be used to study the spending patterns of males and females in any age group.

✓ In business, cross-sectional data can be used to study how people of different socio-economic status within a specific geographical area respond to a single change.

✓ In medicine and healthcare, cross-sectional data can be used to analyze how many children aged 4 to 14 are prone to calcium deficiency.

✓ Cross-sectional data allows a large amount of information to be collected, which helps in making quick decisions.

The concept behind Rolling Cross-Section

In a rolling cross-section, both the inclusion of an individual in the sample and the time at which that individual is interviewed are determined randomly.

Each individual is chosen at random from the population. After selection, each individual is allotted a random date. The interview is conducted on that date, and the individual thereby becomes part of the survey.
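The two random steps of a rolling cross-section can be sketched as below. This is an illustrative sketch only: the population, the field period and the function name are hypothetical, and a real survey design would add details such as daily quotas and non-response handling.

```python
import random
from datetime import date, timedelta


def rolling_cross_section(population, sample_size, start, end, seed=None):
    """Randomly pick respondents, then give each a random interview date."""
    rng = random.Random(seed)
    # Step 1: random selection from the population, without replacement.
    respondents = rng.sample(population, sample_size)
    # Step 2: each respondent is allotted a random date in the field period.
    n_days = (end - start).days + 1
    return {
        person: start + timedelta(days=rng.randrange(n_days))
        for person in respondents
    }


# HYPOTHETICAL population of 100 people and a 30-day field period.
population = [f"person_{i}" for i in range(1, 101)]
schedule = rolling_cross_section(
    population, sample_size=10,
    start=date(2024, 3, 1), end=date(2024, 3, 30), seed=42,
)

for person, interview_date in sorted(schedule.items(), key=lambda item: item[1]):
    print(interview_date.isoformat(), person)
```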

❖ What is Panel data?

Panel data, also known as longitudinal data or cross-sectional time series data in some special cases, is data that is derived from a (usually small) number of observations over time on a (usually large) number of cross-sectional units like individuals, households, firms, or governments.

In the disciplines of econometrics and statistics, panel data refers to multi-dimensional data that generally involves measurements over some period of time. As such, panel data consists of researcher's observations of numerous phenomena that were collected over several time periods for the same group of units or entities. For example, a panel data set may be one that follows a given sample of individuals over time and records observations or information on each individual in the sample.

Basic Examples of Panel Data Sets

The following are very basic examples of two panel data sets for two to three individuals over the course of several years in which the data collected or observed includes income, age, and sex:

Panel Data Set A

Person    Year    Income (Rs)    Age    Sex
1         2016    20000          23     F
1         2017    25000          24     F
1         2018    27500          25     F
2         2016    35000          27     M
2         2017    42500          28     M
2         2018    50000          29     M

 

Panel Data Set B

Person    Year    Income (Rs)    Age    Sex
1         2016    20000          23     F
1         2017    25000          24     F
2         2016    27500          25     M
2         2017    35000          27     M
2         2018    42500          28     M
3         2016    50000          29     F

 

Both Panel Data Set A and Panel Data Set B above show the data collected (the characteristics of income, age, and sex) over the course of several years for different people. Panel Data Set A shows the data collected for two people (person 1 and person 2) over the course of three years (2016, 2017, and 2018). This example data set would be considered a balanced panel because each person is observed for the defined characteristics of income, age, and sex in each year of the study. Panel Data Set B, on the other hand, would be considered an unbalanced panel, as data does not exist for each person in each year. Person 1 is observed in 2016 and 2017, person 2 in all three years, and person 3 only in 2016.
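The balanced/unbalanced distinction can be checked mechanically. In the sketch below, the (person, year) pairs are copied from Panel Data Set A and Panel Data Set B, while the function name `is_balanced` is just an illustrative choice: a panel is treated as balanced when every person appears in every year that occurs in the data set.

```python
# (person, year) pairs taken from Panel Data Set A and Panel Data Set B.
panel_a = [(1, 2016), (1, 2017), (1, 2018), (2, 2016), (2, 2017), (2, 2018)]
panel_b = [(1, 2016), (1, 2017), (2, 2016), (2, 2017), (2, 2018), (3, 2016)]


def is_balanced(observations):
    """True if every person is observed in every year present in the data."""
    observed = set(observations)
    persons = {person for person, _ in observed}
    years = {year for _, year in observed}
    return all((person, year) in observed for person in persons for year in years)


print("Panel Data Set A balanced:", is_balanced(panel_a))
print("Panel Data Set B balanced:", is_balanced(panel_b))
```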

  • Analysis of Panel Data in Economic Research

There are two distinct sets of information that can be derived from cross-sectional time series data. The cross-sectional component of the data set reflects the differences observed between the individual subjects or entities, whereas the time series component reflects the differences observed for one subject over time. For instance, researchers could focus on the differences in data between each person in a panel study and/or the changes in observed phenomena for one person over the course of the study (e.g., the changes in income over time of person 1 in Panel Data Set A above).
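As a small illustration of these two components, the sketch below uses the incomes from Panel Data Set A. The dictionary layout and variable names are only one possible choice.

```python
# Incomes (Rs) from Panel Data Set A, keyed by person and then by year.
income_a = {
    1: {2016: 20000, 2017: 25000, 2018: 27500},
    2: {2016: 35000, 2017: 42500, 2018: 50000},
}

# Cross-sectional component: compare different persons within one year.
gap_2016 = income_a[2][2016] - income_a[1][2016]

# Time series component: follow one person from year to year.
years = sorted(income_a[1])
changes_person_1 = [
    income_a[1][later] - income_a[1][earlier]
    for earlier, later in zip(years, years[1:])
]

print("Income gap between person 2 and person 1 in 2016:", gap_2016)
print("Year-over-year income changes for person 1:", changes_person_1)
```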

Panel data regression methods permit economists to use the various sets of information provided by panel data. As a result, analysis of panel data can become extremely complex. But this flexibility is precisely the advantage of panel data sets for economic research as opposed to conventional cross-sectional or time series data. Panel data gives researchers a large number of unique data points, which increases the researcher's degrees of freedom to explore explanatory variables and relationships.

 

  • Example for a balanced panel:

The CSO in India is a household survey with the same sample size of 22,500 each quarter. Each household has to record its consumption expenditures for 5 quarters, so each quarter 4,500 households (22,500 / 5) leave the survey and 4,500 new ones enter. This is a balanced panel.

❖ What is Pooled Data?

In a pooled cross section, we take random samples of different units in different time periods, i.e. each sample is populated by different individuals. This is often used to see the impact of a policy or programme. For example, we take household income data on households X, Y and Z in 2000, and then take the same income data on households G, F and A in 2005. Although we are interested in the same data, we are taking different samples (using different households) in different time periods.
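The household example can be sketched as follows. The household labels follow the text, but the income figures are invented for illustration. Each sample is tagged with its year and the two are stacked into one pooled data set, which allows the average income to be compared across the two periods even though no household appears twice.

```python
# HYPOTHETICAL incomes (Rs): different households are sampled in each year.
sample_2000 = {"X": 30000, "Y": 42000, "Z": 36000}
sample_2005 = {"G": 41000, "F": 39000, "A": 46000}

# Pool the two independent cross sections, keeping the year as a variable.
pooled = [
    {"household": name, "year": year, "income": income}
    for year, sample in ((2000, sample_2000), (2005, sample_2005))
    for name, income in sample.items()
]


def mean_income(rows, year):
    incomes = [row["income"] for row in rows if row["year"] == year]
    return sum(incomes) / len(incomes)


change = mean_income(pooled, 2005) - mean_income(pooled, 2000)
print(f"Change in mean household income, 2000 to 2005: {change:.0f}")
```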

Pooled data combines two or more independent data sets of the same type.

  • Pooled time series

We observe, for example, the return series of several sectors, which are assumed to be independent of each other, together with explanatory variables. The number of sectors, N, is usually small. Observations are viewed as repeated measures at each point in time, so parameters can be estimated with higher precision due to the increased sample size.

  • Pooled cross sections

This type of data mostly arises in surveys, where people are asked about, for example, their attitudes to political parties. The survey is repeated T times, for instance every week before an election. T is usually small. So we have several cross sections, but the persons asked are chosen randomly, and hardly any person in one cross section is a member of another. The cross sections are independent. Only aggregate questions can be answered, such as the attitudes among men or women; no individual (even anonymous) paths can be identified.

