Regression is used to analyze the impact of certain variables (i.e., the independent variables) on some other variable (i.e., the dependent variable). Simple linear regression is used when there is one dependent and one independent variable. Multiple linear regression (MLR) is used when there is more than one independent variable. OLS, logistic, and probit regression can be used when the dependent variable is binary. Random-effects and fixed-effects regression are used when analyzing panel data.

Image: “science statistics” by anacrus. License: CC0 1.0


Introduction

There are various types of regression used in different scenarios and conditions. These scenarios, and the appropriate regression for each, are described below.

When There Is 1 Dependent Variable and 1 Independent Variable

When there is 1 dependent variable and 1 independent variable, we use simple linear regression. The SPSS output of a simple linear regression is shown below.

Suppose age is the independent variable and earnings are the dependent variable.

Model Summary

Model R R Square Adjusted R Square Std. Error of the Estimate
1 .374a .140 .109 1303.23184
  1. Predictors: (Constant), Age

ANOVA a

Model         Sum of Squares  df  Mean Square  F      Sig.
1 Regression  7730417         1   7730416.552  4.552  .042b
  Residual    47555570        28  1698413.218
  Total       55285987        29
  1. Dependent Variable: Earnings
  2. Predictors: (Constant), Age

Coefficients a

Model         Unstandardized Coefficients   Standardized Coefficients  t      Sig.
              B          Std. Error         Beta
1 (Constant)  -956.437   1480.319                                      -.646  .523
  Age         57.724     27.057             .374                       2.133  .042
  1. Dependent Variable: Earnings

The R-squared of .140 indicates that 14.0 % of the variation in the dependent variable of this study, i.e., ‘earnings’, is explained by the independent variable of this model (.374 is R, the correlation, not R-squared). The coefficient of age is 57.724. This implies that a 1-year rise in age leads to a $57.724 rise in earnings, all other factors remaining constant.
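A regression like this can also be reproduced outside SPSS. Below is a minimal sketch in Python using only NumPy; the age and earnings values are hypothetical illustrations, not the sample behind the output above.

```python
import numpy as np

# Hypothetical sample of 10 people (illustrative values, not the study's data)
age = np.array([22, 25, 28, 30, 33, 35, 38, 40, 45, 50], dtype=float)
earnings = np.array([800, 900, 1100, 1300, 1400, 1600,
                     1900, 2000, 2400, 2800], dtype=float)

# OLS slope and intercept in closed form: b = cov(age, earnings) / var(age)
b = np.cov(age, earnings, ddof=1)[0, 1] / np.var(age, ddof=1)
a = earnings.mean() - b * age.mean()

# R-squared: the share of the variation in earnings explained by age
predicted = a + b * age
r_squared = 1 - ((earnings - predicted) ** 2).sum() / \
                ((earnings - earnings.mean()) ** 2).sum()
```

The slope `b` is interpreted exactly as the coefficient in the SPSS table: the dollar change in earnings per additional year of age.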

When There Is 1 Dependent Variable and Multiple Independent Variables

When there is 1 dependent variable and multiple independent variables, we use multiple linear regression. The SPSS output of a multiple linear regression is shown below.
Now suppose that ‘skill level’ and ‘experience’ also affect earnings, i.e., they are independent variables along with ‘age’.

Model Summary

Model R R Square Adjusted R Square Std. Error of the Estimate
1 .230a .053 -.105 1597.82217
  1. Predictors: (Constant), Skill, Experience, Age

ANOVA a

Model         Sum of Squares  df  Mean Square  F     Sig.
1 Regression  2569739.329     3   856579.776   .336  .800b
  Residual    45954642.489    18  2553035.694
  Total       48524381.818    21
  1. Dependent Variable: Earnings
  2. Predictors: (Constant), Skill, Experience, Age

Coefficients a

Model          Unstandardized Coefficients   Standardized Coefficients  t      Sig.
               B          Std. Error         Beta
1 (Constant)   -591.081   8790.423                                      -.067  .947
  Age          48.324     172.219            .187                       .281   .782
  Experience   4.792      15.184             .073                       .316   .756
  Skill        62.696     1649.962           .025                       .038   .970
  1. Dependent Variable: Earnings

The R-squared of .053 indicates that only 5.3 % of the variation in the dependent variable of this study, i.e., ‘earnings’, is explained by the three independent variables of this model (.230 is R, not R-squared). The coefficient of age is 48.324. This implies that a 1-year rise in age leads to a $48.324 rise in earnings, all other factors remaining constant.

The coefficient of experience is 4.792. This implies that a 1-year rise in experience leads to a $4.792 rise in earnings, all other factors remaining constant. The coefficient of skill is 62.696. This implies that a 1-unit rise in skill score leads to a $62.696 rise in earnings, all other factors remaining constant.
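With several regressors, least squares estimates all coefficients jointly, which is what gives each one its "all other factors remaining constant" interpretation. A minimal NumPy sketch on hypothetical data (the true coefficients below are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30

# Hypothetical regressors (illustrative values only)
age = rng.uniform(20, 60, n)
experience = rng.uniform(0, 30, n)
skill = rng.uniform(1, 5, n)

# Assumed data-generating process: earnings depend on all three plus noise
earnings = 50 * age + 5 * experience + 60 * skill + rng.normal(0, 200, n)

# Design matrix with an intercept column; lstsq solves for all
# coefficients at once, so each is a partial effect holding the others fixed
X = np.column_stack([np.ones(n), age, experience, skill])
beta, *_ = np.linalg.lstsq(X, earnings, rcond=None)
intercept, b_age, b_experience, b_skill = beta
```

Each estimated coefficient is read exactly like the B column of the SPSS Coefficients table.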

When the Dependent Variable Is a Binary Variable

When the dependent variable is a binary variable then we can use three types of regression:

  1. OLS,
  2. logistic regression and
  3. probit regression.

Suppose the binary dependent variable is ‘default’, which takes a value of 1 if the person defaults and 0 if the person does not default. Many factors affect whether a person will default on a loan, e.g., income, family size, integrity level, etc. The three SPSS regressions for this example are shown below. They have different outputs and interpretations.

OLS

Model Summary

Model R R Square Adjusted R Square Std. Error of the Estimate
1 .592a .351 .276 .43279
  1. Predictors: (Constant), Integrity level, Family size, Income

ANOVA a

Model         Sum of Squares  df  Mean Square  F      Sig.
1 Regression  2.630           3   .877         4.680  .010b
  Residual    4.870           26  .187
  Total       7.500           29
  1. Dependent Variable: Default
  2. Predictors: (Constant), Integrity level, Family size, Income

Coefficients a

Model              Unstandardized Coefficients   Standardized Coefficients  t       Sig.
                   B       Std. Error            Beta
1 (Constant)       1.205   .218                                             5.520   .000
  Income           -.002   .002                  -.388                      -1.333  .194
  Family size      -.083   .080                  -.187                      -1.033  .311
  Integrity level  -.059   .156                  -.118                      -.381   .706
  1. Dependent Variable: Default

A 1-unit rise in ‘income’ leads to a 0.002-unit fall in the probability that the person will default, all other factors remaining constant.

A 1-unit rise in ‘family size’ leads to a 0.083-unit fall in the probability that the person will default, all other factors remaining constant. A 1-unit rise in ‘integrity level’ leads to a 0.059-unit fall in the probability that the person will default, all other factors remaining constant.
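OLS on a 0/1 outcome like this is known as a linear probability model: the coefficients are direct changes in the default probability. A sketch with made-up data (the income variable and data-generating rule below are illustrative assumptions, not the study's sample):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical data: default is more likely at lower incomes
income = rng.uniform(10, 100, n)
default = (income + rng.normal(0, 20, n) < 40).astype(float)

# OLS on the binary outcome: the slope is the change in the
# default *probability* per unit of income
X = np.column_stack([np.ones(n), income])
beta, *_ = np.linalg.lstsq(X, default, rcond=None)
fitted = X @ beta

# Caveat of the linear probability model: fitted values are not forced
# to stay inside [0, 1], which is one motivation for logit and probit.
```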

Logistic regression

Model Fitting Information

Model           Model Fitting Criteria   Likelihood Ratio Tests
                -2 Log Likelihood        Chi-Square  df  Sig.
Intercept Only  35.233
Final           3.688                    31.545      7   .000

Pseudo R-Square

Cox and Snell  .651
Nagelkerke     .867
McFadden       .758

Likelihood Ratio Tests

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

Effect       -2 Log Likelihood of Reduced Model   Chi-Square  df  Sig.
Intercept    3.688a                               0.000       0
Income       3.688                                .000        2   1.000
Family size  9.923                                6.235       3   .101
Integrity    4.193                                .505        2   .777
  1. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

Parameter Estimates

Default a           B        Std. Error  Wald  df  Sig.   Exp(B)        95 % Confidence Interval for Exp(B)
                                                                       Lower Bound   Upper Bound
Intercept           1.099    1.155       .905  1   .341
[Income=100.00]     .000     3287.265    .000  1   1.000  1.000        0.000         .b
[Income=200.00]     -.189    9421.005    .000  1   1.000  .828         0.000         .b
[Income=300.00]     0c                         0
[Family size=.00]   -.292    5117.214    .000  1   1.000  .747         0.000         .b
[Family size=1.00]  17.911   3287.265    .000  1   .996   60085080.228 0.000         .b
[Family size=2.00]  -.292    6133.191    .000  1   1.000  .747         0.000         .b
[Family size=3.00]  0c                         0
[Integrity=1.00]    -19.010  0.000             1          5.548E-09    5.548E-09     5.548E-09
[Integrity=2.00]    17.104   8968.264    .000  1   .998   26809918.315 0.000         .b
[Integrity=4.00]    0c                         0
  1. The reference category is: 1.00.
  2. Floating point overflow occurred while computing this statistic. Its value is therefore set to system missing.
  3. This parameter is set to zero because it is redundant.

In logistic regression, each coefficient B in the Parameter Estimates table is the change in the log of the odds of default for that category relative to the reference category, all other factors remaining constant. For example, the coefficient for [Family size=1.00] is 17.911, so a family size of 1 (relative to the redundant reference category, family size 3) raises the log odds of default by 17.911. Note, however, that the standard errors in this output are enormous and the significance values are close to 1, so none of these individual effects is statistically significant.
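The "log odds" interpretation comes from the model itself: the log of the odds of default is a linear function of the regressors. A minimal sketch fitting a logistic regression by Newton-Raphson on hypothetical data (a single continuous income variable, unlike the categorical factors in the output above):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
income = rng.uniform(10, 100, n)

# Assumed true model: log odds of default fall by 0.05 per unit of income
p_true = 1 / (1 + np.exp(-(2.0 - 0.05 * income)))
default = rng.binomial(1, p_true)

X = np.column_stack([np.ones(n), income])
beta = np.zeros(2)
for _ in range(25):  # Newton-Raphson on the logistic log-likelihood
    mu = 1 / (1 + np.exp(-X @ beta))                 # fitted probabilities
    gradient = X.T @ (default - mu)                  # score vector
    hessian = X.T @ (X * (mu * (1 - mu))[:, None])   # observed information
    beta = beta + np.linalg.solve(hessian, gradient)

# beta[1] is the change in the log odds of default per unit rise in income
```

Exponentiating a coefficient gives the odds ratio, which is how SPSS's Exp(B) column is produced.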

The assumptions of the logistic model are as follows. The logistic model assumes that the data are case-specific; that is, each independent variable has a single value for each case. The logistic model also assumes that the dependent variable cannot be perfectly predicted from the independent variables for any case. As with other types of regression, there is no requirement that the independent variables be statistically independent of each other (unlike, for example, in a naive Bayes classifier).

However, collinearity is assumed to be relatively low, as it becomes difficult to distinguish the impact of several variables if they are highly correlated. If the logistic model is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other “irrelevant” alternatives.

Probit regression

Probit Analysis

Parameter Estimates

Parameter                  Estimate  Std. Error  Z       Sig.   95 % Confidence Interval
                                                                Lower Bound   Upper Bound
PROBIT a  Family size      -.051     .077        -.656   .512   -.202         .101
          Integrity level  -.347     .157        -2.207  .027   -.655         -.039
          Intercept        -2.059    .207        -9.961  .000   -2.266        -1.852
  1. PROBIT model: PROBIT(p) = Intercept + BX

Covariances and Correlations of Parameter Estimates

                         Family size  Integrity level
PROBIT  Family size      .006         -.482
        Integrity level  -.006        .025
  1. Covariances (below) and Correlations (above)

Chi-Square Tests

                                     Chi-Square  df a  Sig.
PROBIT Pearson Goodness-of-Fit Test  28.625      27    .379
  1. Statistics based on individual cases differ from statistics based on aggregated cases.

Cell Counts and Residuals

Number    Family size  Integrity level  Number of Subjects  Observed Responses  Expected Responses  Residual  Probability
PROBIT 1  1.000        1.000            100                 1                   .701                .299      .007
2         1.000        1.000            100                 1                   .701                .299      .007
3         0.000        1.000            300                 1                   2.420               -1.420    .008
4         0.000        1.000            100                 1                   .807                .193      .008
5         0.000        1.000            100                 1                   .807                .193      .008
6         1.000        1.000            100                 0                   .701                -.701     .007
7         1.000        1.000            100                 0                   .701                -.701     .007
8         0.000        1.000            100                 1                   .807                .193      .008
9         2.000        1.000            100                 1                   .608                .392      .006
10        2.000        1.000            100                 1                   .608                .392      .006
11        2.000        1.000            100                 1                   .608                .392      .006
12        3.000        1.000            100                 1                   .527                .473      .005
13        3.000        1.000            100                 1                   .527                .473      .005
14        3.000        1.000            100                 1                   .527                .473      .005
15        3.000        1.000            100                 1                   .527                .473      .005
16        3.000        1.000            100                 1                   .527                .473      .005
17        3.000        2.000            200                 0                   .367                -.367     .002
18        3.000        2.000            200                 0                   .367                -.367     .002
19        3.000        2.000            200                 0                   .367                -.367     .002
20        3.000        2.000            200                 0                   .367                -.367     .002
21        3.000        2.000            200                 0                   .367                -.367     .002
22        1.000        2.000            200                 0                   .505                -.505     .003
23        3.000        2.000            200                 0                   .367                -.367     .002
24        3.000        2.000            200                 0                   .367                -.367     .002
25        3.000        2.000            200                 0                   .367                -.367     .002
26        3.000        2.000            300                 0                   .551                -.551     .002
27        3.000        4.000            300                 0                   .048                -.048     .000
28        3.000        4.000            300                 0                   .048                -.048     .000
29        3.000        4.000            300                 0                   .048                -.048     .000
30        3.000        4.000            300                 1                   .048                .952      .000
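The probabilities in the table above follow directly from the model footnote PROBIT(p) = Intercept + BX, i.e., p = Φ(Intercept + BX), where Φ is the standard normal CDF. A quick check using the estimated coefficients reproduces the Probability column (e.g., .007 for family size 1 and integrity level 1):

```python
from scipy.stats import norm

# Parameter estimates from the probit output above
intercept = -2.059
b_family = -0.051
b_integrity = -0.347

def default_probability(family_size, integrity_level):
    # p = Phi(Intercept + BX): standard normal CDF of the linear index
    return norm.cdf(intercept + b_family * family_size
                    + b_integrity * integrity_level)

print(round(default_probability(1.0, 1.0), 3))  # prints 0.007
```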

When the Data Is Panel Data

When the data on which the regression is to be applied are panel data (i.e., data on multiple cross-sectional units over a period of time), we can use two types of regression:

  1. random effect model and
  2. fixed effect model.

Suppose we have data on the education level and GDP of various countries over a 5-year period. These are panel data. To analyze the impact of education level on GDP, we would run a random-effects regression and a fixed-effects regression in Stata.

Random effect model

Syntax: xtreg gdp education level, re

Random-effects GLS regression          Number of obs      = 25
Group variable: country                Number of groups   = 5
R-sq: within  = 0.8848                 Obs per group: min = 5
      between = 1.0000                                avg = 5
      overall = 0.9948                                max = 5
                                       Wald chi2(1)       = 4438.08
Correlation (u_i, X) = 0 (assumed)     Prob > chi2        = 0.0000

gdp              Coef.      Std. Err.  z       P>|z|   [95 % Conf. Interval]
education level  127.8054   1.918454   66.62   0.000   124.0453    131.5655
_cons            -14111.01  1208.535   -11.68  0.000   -16479.7    -11742.33
sigma_u          0
sigma_e          3043.5487
rho              0          (fraction of variance due to u_i)

A 1-unit rise in education level leads to a 127.8054-unit rise in GDP, all other factors remaining constant.
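In this particular output, sigma_u and rho are both 0, meaning the estimated country-level random effect contributes nothing to the variance; in that case the random-effects GLS estimator collapses to pooled OLS on the stacked data. A sketch with hypothetical panel values (illustrative only, roughly matching the scale of the output):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25  # 5 countries x 5 years, stacked into one sample

# Hypothetical data (assumed values, chosen to mimic the output's scale)
education_level = rng.uniform(50, 150, n)
gdp = 128 * education_level - 14000 + rng.normal(0, 3000, n)

# With rho = 0, the random-effects GLS fit equals pooled OLS
X = np.column_stack([np.ones(n), education_level])
beta, *_ = np.linalg.lstsq(X, gdp, rcond=None)
# beta[1] estimates the rise in GDP per unit rise in education level
```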

Fixed effect model

Syntax: xtreg gdp education level, fe

Fixed-effects (within) regression      Number of obs      = 25
Group variable: country                Number of groups   = 5
R-sq: within  = 0.8848                 Obs per group: min = 5
      between = 1.0000                                avg = 5
      overall = 0.9948                                max = 5
                                       F(1,19)            = 145.9
Correlation (u_i, Xb) = -0.9803        Prob > F           = 0.0000

gdp              Coef.      Std. Err.  t       P>|t|   [95 % Conf. Interval]
education level  129.1728   10.69394   12.08   0.000   106.7902    151.5555
_cons            -14876.79  6019.463   -2.47   0.023   -27475.67   -2277.909
sigma_u          449.94136
sigma_e          3043.5487
rho              .0213876   (fraction of variance due to u_i)

  1. F test that all u_i = 0: F(4, 19) = 0.0 ; Prob > F = 1.0000

A 1-unit rise in education level leads to a 129.1728-unit rise in GDP, all other factors remaining constant.
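The fixed-effects ("within") estimator can be reproduced by demeaning each variable within its country and running OLS on the demeaned data, which removes the time-invariant country effect u_i. A sketch with hypothetical panel data (the country intercepts below are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
countries = np.repeat(list("ABCDE"), 5)  # 5 countries x 5 years

df = pd.DataFrame({"country": countries,
                   "education_level": rng.uniform(50, 150, 25)})

# Hypothetical country-specific intercepts (the fixed effects u_i)
fixed_effect = df["country"].map({"A": 0, "B": 2000, "C": 4000,
                                  "D": 6000, "E": 8000})
df["gdp"] = 129 * df["education_level"] + fixed_effect + rng.normal(0, 1000, 25)

# Within transformation: subtract each country's mean from its own
# observations, wiping out the time-invariant u_i
within = df.groupby("country")[["gdp", "education_level"]] \
           .transform(lambda s: s - s.mean())
x = within["education_level"].to_numpy()
y = within["gdp"].to_numpy()
beta_fe = (x @ y) / (x @ x)  # no intercept needed: demeaned data have mean 0
```

Because the country effects are differenced away, `beta_fe` recovers the within-country effect of education on GDP even though the levels differ sharply across countries.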
