Name: Devesan Govindasamy
Student id : D20124946
Course: TU059 M.Sc Data Science(Full time)

Section 1 – Research Questions

Question 1: Can we predict the global score of the student using their Quantitative Reasoning and Critical Reading?

Question 2: Can we predict the global score of the student using their Quantitative Reasoning and Critical Reading with School Nature as dummy variable?

Section 2 – Dataset

Output:

school_type analysis

School Nature analysis

Gender wise analysis

Revenue wise analysis

Observations:

  • School type analysis: The category ‘Not apply’ contains only 5 records, so we can treat these records as outliers and ignore them.
  • School Nature analysis: It shows the classification of the 12,411 records by the nature of the school the students attended before engineering. There is only a slight difference between the counts for public and private institutes.
  • Gender wise analysis: It shows the classification of male and female students pursuing engineering. There are considerably fewer female students than male students.

Section 3 - Results

Section 3.1 - Statistical evidence

Section 3.2 – Model 1

Research Question : Can we predict the global score of the student using their Quantitative Reasoning and Critical Reading?

Null Hypothesis: The values of g_sc cannot be predicted with the predictor variables (cr_pro, qr_pro) using multiple linear regression.

Alternate Hypothesis: The values of g_sc can be predicted with the predictor variables (cr_pro, qr_pro) using multiple linear regression.

Descriptive Inspection:
> model1<-lm(s_perform$g_sc~s_perform$qr_pro+s_perform$cr_pro)

> stargazer(model1, type="text") 

=================================================
                            Dependent variable:     
                    -----------------------------
                                g_sc             
-------------------------------------------------
qr_pro                        0.380***           
                                (0.006)           
                                                    
cr_pro                        0.480***           
                                (0.005)           
                                                    
Constant                     103.484***          
                                (0.398)           
                                                    
-------------------------------------------------
Observations                   12,406            
R2                              0.712            
Adjusted R2                     0.712            
Residual Std. Error      12.401 (df = 12403)     
F Statistic         15,341.490*** (df = 2; 12403)
=================================================

> lm.beta(model1)

Call:
lm(formula = s_perform$g_sc ~ s_perform$qr_pro + s_perform$cr_pro)

Standardized Coefficients::
        (Intercept) s_perform$qr_pro s_perform$cr_pro 
        0.0000000        0.3725515        0.5740398

> vifmodel
s_perform$qr_pro s_perform$cr_pro 
        1.481366         1.481366 
> #Tolerance
> 1/vifmodel
s_perform$qr_pro s_perform$cr_pro 
        0.6750525        0.6750525
                                
                            
Result:

  • The overall model is statistically significant (F(2, 12403) = 15,341.49, p < 0.01), so we reject the null hypothesis and accept the alternate hypothesis that the values of g_sc can be predicted with the predictor variables (cr_pro, qr_pro) using multiple linear regression.
  • Both coefficients are positive and significant: higher marks in qr_pro and cr_pro are associated with higher marks in g_sc.
  • The F statistic tests whether the model as a whole is statistically significant.
  • The adjusted R2 is 0.712, which means around 71.2% of the variance in g_sc can be explained by cr_pro and qr_pro.
  • Our linear regression model is: Predicted g_sc = 103.484 + 0.380*qr_pro + 0.480*cr_pro
  • Using the standardised coefficients (all variables in standard deviation units), the model equation becomes: Predicted g_sc = 0 + 0.373*qr_pro + 0.574*cr_pro, so cr_pro is the stronger of the two predictors.
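
As a worked illustration of the unstandardised equation above, the following sketch computes a predicted global score from the coefficients reported in the stargazer table for model 1. The helper name predict_g_sc_m1 and the example scores (qr_pro = 50, cr_pro = 60) are assumptions chosen for illustration; they are not taken from the dataset.

```r
# Coefficients copied from the stargazer output for model 1
b0   <- 103.484  # Constant
b_qr <- 0.380    # qr_pro
b_cr <- 0.480    # cr_pro

# Hypothetical helper: predicted g_sc for given qr_pro and cr_pro scores
predict_g_sc_m1 <- function(qr_pro, cr_pro) {
  b0 + b_qr * qr_pro + b_cr * cr_pro
}

# Illustrative (made-up) student with qr_pro = 50 and cr_pro = 60
example_pred <- predict_g_sc_m1(50, 60)
print(example_pred)
```

For these illustrative inputs the prediction is 103.484 + 19 + 28.8 = 151.284.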


Assumptions:

Output:
Cooks plot

Leverage

Residual Fit

Standard Residual Fit

QQ Error distribution

Histogram error distribution

Result:

  • All Cook's distance values are below 1, so no case has undue influence on the model.
  • The residuals show no pattern and are evenly spread, so the assumption of homoscedasticity is met.
  • The minimum and maximum standardised residuals are within the acceptable range (-3.29, +3.29), so there are no extreme outliers. The red lines are slightly distorted, but this is not a serious problem.
  • The errors are approximately normally distributed.
  • Both VIF (1.48) and tolerance (0.675) were within the acceptable limits (VIF < 2.5 and tolerance > 0.4), so we have no issues with multicollinearity.
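
The checks listed above can be sketched numerically as well as read from the plots. The example below is only a sketch on simulated data, because the s_perform data frame is not reproduced in this report; the data frame sim and all the numbers used to generate it are assumptions. It uses base R only and computes VIF by hand, which for two predictors reduces to 1 / (1 - r^2).

```r
# Simulated stand-in for s_perform (assumption: made-up data for illustration)
set.seed(1)
n <- 200
qr_pro <- rnorm(n, mean = 70, sd = 20)
cr_pro <- 0.5 * qr_pro + rnorm(n, mean = 30, sd = 15)
g_sc   <- 100 + 0.4 * qr_pro + 0.5 * cr_pro + rnorm(n, sd = 12)
sim <- data.frame(g_sc, qr_pro, cr_pro)

fit <- lm(g_sc ~ qr_pro + cr_pro, data = sim)

# Influence: largest Cook's distance (values above 1 are usually flagged)
max_cook <- max(cooks.distance(fit))

# Outliers: standardised residuals outside +/-1.96 and +/-3.29
std_res <- rstandard(fit)
share_beyond_1_96 <- mean(abs(std_res) > 1.96)
n_beyond_3_29 <- sum(abs(std_res) > 3.29)

# Collinearity: with two predictors, VIF = 1 / (1 - r^2); tolerance = 1 / VIF
r2_predictors <- cor(sim$qr_pro, sim$cr_pro)^2
vif_manual <- 1 / (1 - r2_predictors)
tolerance <- 1 / vif_manual

print(c(max_cook = max_cook, share_beyond_1_96 = share_beyond_1_96,
        n_beyond_3_29 = n_beyond_3_29, vif = vif_manual, tolerance = tolerance))
```

Replacing sim with the real data frame would give the figures that correspond to the plots; the thresholds in the comments are the ones quoted in this report.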

Reporting Multiple linear Regression(Model 1):

Multiple regression analysis was conducted to predict the global scores (g_sc). Scores of Quantitative Reasoning and Critical Reading were used as predictor variables. Examination of the histogram, normal P-P plot of standardised residuals and the scatterplot of the dependent variable and standardised residuals showed that some outliers existed. However, examination of the standardised residuals showed that none could be considered to have undue influence (95% within limits of -1.96 to +1.96 and none with Cook’s distance > 1, as outlined in Field (2013)). Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5) as outlined in Tarling (2008). The scatterplot of standardised residuals showed that the data met the assumptions of homogeneity of variance and linearity. The data also meet the assumption of non-zero variances of the predictors.

Section 3.3 – Model 2

Research Question : Can we predict the global score of the student using their Quantitative Reasoning and Critical Reading with School Nature as dummy variable?

 

Differential effect

  • We use the variable school_nat as a predictor. Since it is categorical, it enters the model as a dummy variable, which lets us examine the differential effect on the global scores of students who studied in Private and Public schools.
  • Here, 0 is the reference category (Private) and 1 is the category of interest (Public).
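
The dummy coding described above can be sketched in base R as follows. The four values are made up for illustration; the level name "PUBLIC" appears in the model output, while "PRIVATE" is an assumption about how the other level is spelled in the data.

```r
# Hypothetical values; "PUBLIC" is taken from the model output, "PRIVATE" is assumed
school_nat <- factor(c("PRIVATE", "PUBLIC", "PUBLIC", "PRIVATE"))

# Make Private the reference category explicitly
school_nat <- relevel(school_nat, ref = "PRIVATE")

# model.matrix shows the 0/1 column that lm() builds internally for a factor
dummy <- model.matrix(~ school_nat)
print(dummy)
```

The column named school_natPUBLIC is 1 for Public rows and 0 for Private rows, which matches the coefficient name in the regression output.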

Null Hypothesis: The values of g_sc cannot be predicted with the predictor variables (cr_pro, qr_pro, school_nat) using multiple linear regression.

Alternate Hypothesis: The values of g_sc can be predicted with the predictor variables (cr_pro, qr_pro, school_nat) using multiple linear regression.

Descriptive Inspection:

> model2<-lm(s_perform$g_sc~s_perform$qr_pro+s_perform$cr_pro+s_perform$school_nat)

> stargazer(model2, type="text") #Tidy output of all the required stats

=================================================
                            Dependent variable:     
                    -----------------------------
                                g_sc             
-------------------------------------------------
qr_pro                        0.368***           
                                (0.006)           
                                                    
cr_pro                        0.471***           
                                (0.005)           
                                                    
school_natPUBLIC              -5.134***          
                                (0.222)           
                                                    
Constant                     107.329***          
                                (0.424)           
                                                    
-------------------------------------------------
Observations                   12,406            
R2                              0.724            
Adjusted R2                     0.724            
Residual Std. Error      12.141 (df = 12402)     
F Statistic         10,847.740*** (df = 3; 12402)
=================================================
Note:                 *p< 0.1; **p< 0.05; ***p< 0.01

> lm.beta(model2)

Call:
lm(formula = s_perform$g_sc ~ s_perform$qr_pro + s_perform$cr_pro + 
    s_perform$school_nat)

Standardized Coefficients::
                (Intercept)           s_perform$qr_pro           s_perform$cr_pro s_perform$school_natPUBLIC 
                    0.0000000                  0.3614803                  0.5634178                 -0.1109039

            
Result:

  • The overall model is statistically significant (F(3, 12402) = 10,847.74, p < 0.01), so we reject the null hypothesis and accept the alternate hypothesis that the values of g_sc can be predicted with the predictor variables (cr_pro, qr_pro, school_nat) using multiple linear regression.
  • Higher marks in qr_pro and cr_pro are associated with higher marks in g_sc.
  • The F statistic tests whether the model as a whole is statistically significant.
  • Our linear regression model is: Predicted g_sc = 107.329 + 0.368*qr_pro + 0.471*cr_pro - 5.134*school_natPUBLIC
  • Holding qr_pro and cr_pro constant, students from Public schools are predicted to score 5.134 points lower in g_sc than students from Private schools.
  • Using the standardised coefficients, the model equation becomes: Predicted g_sc = 0 + 0.361*qr_pro + 0.563*cr_pro - 0.11*school_natPUBLIC
  • When we evaluate the standardised equation for Public and Private schools, with qr_pro and cr_pro both set to 1:
    Public = 0 + 0.361 + 0.563 - 0.11 = 0.814
    Private = 0 + 0.361 + 0.563 = 0.924
  • We can see a small difference (0.11) between the Public and Private groups.
  • The coefficients of qr_pro and cr_pro are slightly smaller than in model 1.
  • The adjusted R2 is 0.724, which means around 72.4% of the variance in g_sc can be explained by cr_pro and qr_pro when the dummy variable school_nat is added. This is an increase compared to model 1.
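
To make the differential effect concrete, the sketch below evaluates the unstandardised model 2 equation for two otherwise identical students, one from a Private and one from a Public school. The coefficients are copied from the stargazer output for model 2; the helper name predict_g_sc_m2 and the example scores (qr_pro = 50, cr_pro = 60) are assumptions for illustration only.

```r
# Coefficients copied from the stargazer output for model 2
b0       <- 107.329  # Constant
b_qr     <- 0.368    # qr_pro
b_cr     <- 0.471    # cr_pro
b_public <- -5.134   # school_natPUBLIC

# Hypothetical helper: is_public is 1 for Public, 0 for Private (reference)
predict_g_sc_m2 <- function(qr_pro, cr_pro, is_public) {
  b0 + b_qr * qr_pro + b_cr * cr_pro + b_public * is_public
}

# Same illustrative scores, different school nature
private_pred <- predict_g_sc_m2(50, 60, is_public = 0)
public_pred  <- predict_g_sc_m2(50, 60, is_public = 1)
print(c(private = private_pred, public = public_pred,
        gap = public_pred - private_pred))
```

Whatever scores are plugged in, the gap between the two predictions is the dummy coefficient, -5.134, because the model has no interaction terms.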


Assumptions:

Output:
Cooks plot

Leverage

Residual Fit

Standard Residual Fit

QQ Error distribution

Histogram error distribution
Result:

  • All Cook's distance values are below 1, so no case has undue influence on the model.
  • The residuals show no pattern and are evenly spread, so the assumption of homoscedasticity is met.
  • The minimum and maximum standardised residuals are within the acceptable range (-3.29, +3.29), so there are no extreme outliers. The red lines are slightly distorted, but this is not a serious problem.
  • The errors are approximately normally distributed.
  • Both VIF and tolerance were within the acceptable limits (VIF < 2.5 and tolerance > 0.4), so we have no issues with multicollinearity.
  • A differential effect in the global scores can be seen between students who studied in Public and Private schools.

Reporting Multiple linear Regression(Model 2):

Multiple regression analysis was conducted to predict the global scores (g_sc). Scores of Quantitative Reasoning and Critical Reading were used as predictor variables. In order to include School Nature in the regression model, it was recoded as the dummy variable school_nat (0 for Private, 1 for Public). Examination of the histogram, normal P-P plot of standardised residuals and the scatterplot of the dependent variable and standardised residuals showed that some outliers existed. However, examination of the standardised residuals showed that none could be considered to have undue influence (95% within limits of -1.96 to +1.96 and none with Cook’s distance > 1, as outlined in Field (2013)). Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5) as outlined in Tarling (2008). The scatterplot of standardised residuals showed that the data met the assumptions of homogeneity of variance and linearity. The data also meet the assumption of non-zero variances of the predictors.

Section 3.4 – Model comparison

Descriptive Inspection:

> stargazer(model1, model2, type="text") 

===============================================================================
                                        Dependent variable:                    
                    -----------------------------------------------------------
                                                g_sc                            
                                    (1)                           (2)             
-------------------------------------------------------------------------------
qr_pro                        0.380***                      0.368***           
                                (0.006)                       (0.006)           
                                                                                
cr_pro                        0.480***                      0.471***           
                                (0.005)                       (0.005)           
                                                                                
school_natPUBLIC                                            -5.134***          
                                                                (0.222)           
                                                                                
Constant                     103.484***                    107.329***          
                                (0.398)                       (0.424)           
                                                                                
-------------------------------------------------------------------------------
Observations                   12,406                        12,406            
R2                              0.712                         0.724            
Adjusted R2                     0.712                         0.724            
Residual Std. Error      12.401 (df = 12403)           12.141 (df = 12402)     
F Statistic         15,341.490*** (df = 2; 12403) 10,847.740*** (df = 3; 12402)
===============================================================================
Note:                                               *p< 0.1; **p < 0.05; ***p < 0.01
                                
                            
Result:

  • Both models are statistically significant.
  • Model 1 adjusted R2 is 0.712 and model 2 adjusted R2 is 0.724.
  • The adjusted R2 increased by 0.012 (about 1.2 percentage points) after adding the dummy variable.
  • Adding the school_nat variable lowered the F statistic from 15,341.49 to 10,847.74, but both remain highly significant. The higher adjusted R2 and the lower residual standard error (12.141 against 12.401) indicate that model 2 fits slightly better.
  • Unstandardised coefficients of the independent variables: qr_pro (Model 1 = 0.380, Model 2 = 0.368), cr_pro (Model 1 = 0.480, Model 2 = 0.471), school_nat (Model 2 = -5.134).
  • The coefficients of qr_pro and cr_pro have decreased slightly in model 2 compared to model 1.
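
One way to judge whether the gain from model 1 to model 2 is more than noise is an F test on the change in R2 between the nested models. The sketch below computes it from the rounded values in the comparison table, so the result is approximate; the variable names are assumptions for illustration. As a cross-check, when a single predictor is added the F change equals the squared t value of that predictor.

```r
# Values copied from the stargazer comparison table (R2 is rounded to 3 decimals)
r2_m1 <- 0.712
r2_m2 <- 0.724
df_resid_m2 <- 12402  # residual degrees of freedom of model 2
k_added <- 1          # one extra predictor: school_natPUBLIC

# F statistic for the change in R2 between nested models
f_change <- ((r2_m2 - r2_m1) / k_added) / ((1 - r2_m2) / df_resid_m2)

# Cross-check: squared t value of the added predictor (coefficient / std. error)
t_public <- -5.134 / 0.222

print(c(f_change = f_change, t_squared = t_public^2))
```

The two figures agree only approximately because the inputs are rounded. With the fitted objects available, anova(model1, model2) would give the exact test.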

Section 4 – Discussion/Conclusion