Regression Review Class

Problem 1

load("prob1.Rdata")

(1.1)

model1=lm(density~distance, data=prob1)
model1

## 
## Call:
## lm(formula = density ~ distance, data = prob1)
## 
## Coefficients:
## (Intercept)     distance  
##    1.211973     0.003761

model1s=summary(model1)
model1s

## 
## Call:
## lm(formula = density ~ distance, data = prob1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.212753 -0.047247 -0.009136  0.062975  0.169705 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.2119730  0.0324376  37.363  < 2e-16 ***
## distance    0.0037609  0.0007954   4.728 7.54e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09813 on 25 degrees of freedom
## Multiple R-squared:  0.4721, Adjusted R-squared:  0.4509 
## F-statistic: 22.35 on 1 and 25 DF,  p-value: 7.54e-05

The p-value of the F-test and the p-value of the t-test for the distance coefficient are both equal to \(7.5\cdot 10^{-5}\) (with a single predictor the two tests are equivalent), so the association between distance and density is highly statistically significant.
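The equivalence of the two tests in simple linear regression comes from the identity F = t². With the printed values, 4.728² ≈ 22.35, which matches the reported F-statistic. A minimal sketch of the same check in R follows; it uses the built-in `cars` data set as a stand-in (an assumption for illustration, since `prob1.Rdata` is not reproduced here), and the object names are hypothetical.

```r
# Stand-in data: built-in `cars`, NOT the prob1 data from the exercise.
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# t statistic of the slope and overall F statistic of the model
t_slope <- s$coefficients["speed", "t value"]
f_stat  <- unname(s$fstatistic["value"])

# With one predictor the squared t statistic equals the F statistic
all.equal(t_slope^2, f_stat)
```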

(1.3)

# New observation at distance = 18; predict() only needs the predictor column
p1new=data.frame(distance=18)
prediction=predict(model1, newdata=p1new, interval="prediction")
prediction

##        fit      lwr      upr
## 1 1.279669 1.072366 1.486971
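As a check on the point prediction, the fitted value can be reproduced by hand from the coefficients printed in (1.1):

\[
\hat{y}(18) = 1.2119730 + 0.0037609\cdot 18 = 1.2119730 + 0.0676962 \approx 1.2797
\]

The limits `lwr` and `upr` are symmetric around this value, with half-width \(1.486971 - 1.279669 \approx 0.2073\). Assuming the default 95% level of `predict()`, they have the usual form for a new observation at \(x_0\):

\[
\hat{y}(x_0) \pm t_{0.975,\,n-2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}
\]

where \(s = 0.09813\) is the residual standard error and \(n-2 = 25\).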

(1.4)

rjack=rstudent(model1)
summary(rjack)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.45033 -0.51048 -0.09590 -0.01419  0.64705  1.84606

lev=hatvalues(model1)
summary(lev)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03787 0.05459 0.05775 0.07407 0.09482 0.14935

cook=cooks.distance(model1)
summary(cook)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0001985 0.0033584 0.0119726 0.0346568 0.0436789 0.1607892

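The critical values:

for the jackknife residuals, k=1, n=27, α=0.05: critical value = 3.50

for the leverage: critical value = 0.35 (approximately)

for Cook's distance: d·(n-k-1)=17, so d=17/25=0.68

None of the observed maxima (|jackknife residual| 2.45, leverage 0.149, Cook's distance 0.161) exceeds its critical value, so no observation is flagged as an outlier, a high-leverage point, or an influential point.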
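Comparing every observation against such cutoffs can be sketched as below. The helper `flag_diagnostics` and its argument names are hypothetical, and the built-in `cars` data set is only a stand-in for `prob1`; the cutoffs 3.50, 0.35 and 0.68 were read off for n=27, k=1, so for `cars` (n=50) they are purely illustrative.

```r
# Hypothetical helper: tabulate the three diagnostics and flag values
# above user-supplied cutoffs.
flag_diagnostics <- function(model, t_crit, lev_crit, cook_crit) {
  d <- data.frame(
    rstudent = rstudent(model),        # jackknife (studentized) residuals
    leverage = hatvalues(model),       # diagonal of the hat matrix
    cook     = cooks.distance(model)   # Cook's distance
  )
  d$outlier     <- abs(d$rstudent) > t_crit
  d$high_lev    <- d$leverage > lev_crit
  d$influential <- d$cook > cook_crit
  d
}

# Stand-in model; cutoffs copied from the text for illustration only.
fit <- lm(dist ~ speed, data = cars)
diag_tab <- flag_diagnostics(fit, t_crit = 3.50, lev_crit = 0.35, cook_crit = 0.68)

# Rows that exceed at least one cutoff
diag_tab[diag_tab$outlier | diag_tab$high_lev | diag_tab$influential, ]
```

One sanity check that needs no table: the mean leverage always equals (k+1)/n, which for the exercise is 2/27 ≈ 0.07407, exactly the mean printed by `summary(lev)`.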

(1.5)

shapiro.test(rjack)

## 
##  Shapiro-Wilk normality test
## 
## data:  rjack
## W = 0.95466, p-value = 0.2775

Problem 2

load("prob2.Rdata")

(2.1)

model21=lm(undcount~perc_min+crimrate+poverty+diffeng+hsgrad+housing, data=prob2)
summary(model21)

## 
## Call:
## lm(formula = undcount ~ perc_min + crimrate + poverty + diffeng + 
##     hsgrad + housing, data = prob2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1511 -1.0921  0.0798  0.9336  4.3403 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.299641   1.334062   0.225 0.823061    
## perc_min     0.084726   0.023189   3.654 0.000551 ***
## crimrate     0.021489   0.013692   1.570 0.121876    
## poverty     -0.021048   0.084728  -0.248 0.804675    
## diffeng      0.180053   0.101960   1.766 0.082583 .  
## hsgrad      -0.040023   0.041607  -0.962 0.340018    
## housing     -0.006199   0.025658  -0.242 0.809936    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.604 on 59 degrees of freedom
## Multiple R-squared:  0.6176, Adjusted R-squared:  0.5787 
## F-statistic: 15.88 on 6 and 59 DF,  p-value: 9.211e-11

(2.2)

model22=step(model21,direction="backward")

## Start:  AIC=68.94
## undcount ~ perc_min + crimrate + poverty + diffeng + hsgrad + 
##     housing
## 
##            Df Sum of Sq    RSS    AIC
## - housing   1     0.150 151.87 67.003
## - poverty   1     0.159 151.88 67.007
## - hsgrad    1     2.379 154.10 67.965
## <none>                  151.72 68.938
## - crimrate  1     6.335 158.06 69.638
## - diffeng   1     8.019 159.74 70.337
## - perc_min  1    34.331 186.05 80.401
## 
## Step:  AIC=67
## undcount ~ perc_min + crimrate + poverty + diffeng + hsgrad
## 
##            Df Sum of Sq    RSS    AIC
## - poverty   1     0.181 152.05 65.082
## - hsgrad    1     2.874 154.75 66.240
## <none>                  151.87 67.003
## - crimrate  1     6.968 158.84 67.964
## - diffeng   1     7.907 159.78 68.353
## - perc_min  1    37.794 189.66 79.670
## 
## Step:  AIC=65.08
## undcount ~ perc_min + crimrate + diffeng + hsgrad
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  152.05 65.082
## - hsgrad    1     5.775 157.83 65.542
## - crimrate  1     6.851 158.90 65.990
## - diffeng   1     7.829 159.88 66.396
## - perc_min  1    42.390 194.44 79.312

model22

## 
## Call:
## lm(formula = undcount ~ perc_min + crimrate + diffeng + hsgrad, 
##     data = prob2)
## 
## Coefficients:
## (Intercept)     perc_min     crimrate      diffeng       hsgrad  
##     0.35633      0.08376      0.01974      0.17405     -0.04883
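The AIC values in the trace can be reproduced by hand. For a linear model, `step()` uses the criterion \(n\ln(RSS/n) + 2\,q\), where \(q\) is the number of estimated coefficients including the intercept. For the starting model (\(q=7\)) and for the selected model (\(q=5\)):

\[
66\,\ln\!\left(\frac{151.72}{66}\right) + 2\cdot 7 \approx 54.94 + 14 = 68.94,
\qquad
66\,\ln\!\left(\frac{152.05}{66}\right) + 2\cdot 5 \approx 55.08 + 10 = 65.08
\]

Both agree with the printed `Start: AIC=68.94` and `Step: AIC=65.08`. The procedure stops at the last table because removing any remaining term would raise the AIC above 65.08.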

(2.3)

model23=lm(undcount~perc_min+poverty+hsgrad, data=prob2)
anova(model23,model21)

## Analysis of Variance Table
## 
## Model 1: undcount ~ perc_min + poverty + hsgrad
## Model 2: undcount ~ perc_min + crimrate + poverty + diffeng + hsgrad + 
##     housing
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)  
## 1     62 170.78                            
## 2     59 151.72  3    19.063 2.471 0.0706 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
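The F value in this table is the partial F statistic for adding crimrate, diffeng and housing to the small model, and it can be checked from the two residual sums of squares:

\[
F = \frac{\bigl(RSS_{\text{small}} - RSS_{\text{full}}\bigr)/3}{RSS_{\text{full}}/59}
  = \frac{(170.78 - 151.72)/3}{151.72/59}
  \approx \frac{6.354}{2.572} \approx 2.47
\]

With 3 and 59 degrees of freedom this gives the printed p-value of 0.0706, so at the 5% level the joint contribution of the three extra variables is not significant, although it is at the 10% level.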

(2.4)

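Mallows' \(C_p\): \(C_p=\frac{SSE(p)}{MSE(k)}-[n-2(p+1)]\), where \(p\) is the number of predictors in the small model and \(k\) the number of predictors in the full model.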

a21=anova(model21)
a21

## Analysis of Variance Table
## 
## Response: undcount
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## perc_min   1 195.816 195.816 76.1471 3.275e-12 ***
## crimrate   1  29.171  29.171 11.3438  0.001338 ** 
## poverty    1   5.428   5.428  2.1108  0.151559    
## diffeng    1  11.619  11.619  4.5184  0.037729 *  
## hsgrad     1   2.874   2.874  1.1177  0.294732    
## housing    1   0.150   0.150  0.0584  0.809936    
## Residuals 59 151.722   2.572                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a23=anova(model23)
a23

## Analysis of Variance Table
## 
## Response: undcount
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## perc_min   1 195.816 195.816 71.0873 7.085e-12 ***
## poverty    1  12.136  12.136  4.4056   0.03990 *  
## hsgrad     1  18.044  18.044  6.5506   0.01294 *  
## Residuals 62 170.784   2.755                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We obtain: \(SSE(p)=170.8,\ MSE(k)=2.572,\ n=66,\ k=6,\ p=3\), so \(C_p = 170.8/2.572 - [66-2\cdot 4] \approx 8.4\). This is well above \(p+1=4\), so the small model is inferior to the full model.
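The arithmetic can be sketched in R by plugging in the values printed in the two ANOVA tables. The function name `mallows_cp` and its arguments are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper implementing Cp = SSE(p)/MSE(k) - [n - 2(p + 1)]
mallows_cp <- function(sse_p, mse_k, n, p) {
  sse_p / mse_k - (n - 2 * (p + 1))
}

# Values taken from the ANOVA tables of model23 (small) and model21 (full)
cp <- mallows_cp(sse_p = 170.784, mse_k = 2.572, n = 66, p = 3)
round(cp, 2)  # roughly 8.4

# With the fitted objects at hand, the same inputs would be (assuming
# model21 and model23 exist in the session):
#   sse_p = deviance(model23)
#   mse_k = deviance(model21) / df.residual(model21)
```

A model with little bias is expected to have \(C_p\) close to \(p+1\), which is why a value near 8.4 against a benchmark of 4 counts against the small model.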

Problem 3

load("prob3.Rdata")
prob3$elevc=prob3$elev-mean(prob3$elev)

(3.1)

model3=lm(damage~elevc+region+elevc*region, data=prob3)
model3

## 
## Call:
## lm(formula = damage ~ elevc + region + elevc * region, data = prob3)
## 
## Coefficients:
##       (Intercept)              elevc        regionNorth  elevc:regionNorth  
##          37.87043           -0.01721            5.38905            0.10839
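In an R formula `a*b` already expands to `a + b + a:b`, so the main effects written out in the call above are redundant but harmless. A small sketch with the built-in `mtcars` data (a stand-in chosen for illustration, not the `prob3` data; object names are hypothetical) shows that both spellings give the same fit:

```r
# Stand-in data: built-in `mtcars`, with `am` treated as a two-level factor
f_long  <- lm(mpg ~ wt + factor(am) + wt * factor(am), data = mtcars)
f_short <- lm(mpg ~ wt * factor(am), data = mtcars)

# Same terms, same coefficients
all.equal(coef(f_long), coef(f_short))
```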

summary(model3)

## 
## Call:
## lm(formula = damage ~ elevc + region + elevc * region, data = prob3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.781 -11.612   0.308  11.035  26.219 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       37.87043    6.31507   5.997 1.24e-07 ***
## elevc             -0.01721    0.01928  -0.893    0.375    
## regionNorth        5.38905    6.61930   0.814    0.419    
## elevc:regionNorth  0.10839    0.02333   4.646 1.90e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.61 on 60 degrees of freedom
## Multiple R-squared:  0.4556, Adjusted R-squared:  0.4284 
## F-statistic: 16.74 on 3 and 60 DF,  p-value: 5.132e-08
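The coefficients are easier to read as two separate regression lines, one per region. Assuming `region` has two levels, with North as the non-reference level (as the name `regionNorth` suggests), the fitted lines are:

\[
\text{reference region:}\quad \widehat{\text{damage}} = 37.870 - 0.0172\cdot\text{elevc}
\]

\[
\text{North:}\quad \widehat{\text{damage}} = (37.870 + 5.389) + (-0.0172 + 0.1084)\cdot\text{elevc} = 43.259 + 0.0912\cdot\text{elevc}
\]

Because `elevc` is centered, the two intercepts are the predicted damage at the mean elevation, and their difference (5.389) is not significant (p = 0.419). The slopes, however, differ significantly (interaction p = \(1.9\cdot 10^{-5}\)): damage shows no significant trend with elevation in the reference region (p = 0.375) but increases with elevation in the North.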