Instruction

Read each instruction carefully and think about how to develop R code to answer each question.

Question 1

Consider the AirPassengers dataset and explore the following data mining procedure for time series data


00: background

Time series in R are encoded as the ‘ts’ class (an extension of ‘vector’) and handled through many libraries:

  • forecast: a unified library to manage and forecast time series
  • ggplot2: an extension that helps with customizing autoplot() and autolayer()
  • fpp2: a library of data and methods related to Forecasting: Principles and Practice by R.J. Hyndman and G. Athanasopoulos, one of the best online textbooks about forecasting in R.

01: ts class

Basic information about data of the ‘ts’ class can be derived using tsp(), which reveals the start, end, and frequency of the time series

tsp(AirPassengers)
## [1] 1949.000 1960.917   12.000

The command combines the results of start(), end(), and frequency(). Notice that the time points of ‘ts’ data are encoded as fractional years, spaced at intervals of deltat()

start(AirPassengers)
## [1] 1949    1
end(AirPassengers)
## [1] 1960   12
frequency(AirPassengers)
## [1] 12
deltat(AirPassengers)
## [1] 0.08333333
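The same accessors apply to any ts object. A minimal sketch with a hypothetical two-year monthly series (names and values are illustrative only):

```r
## hypothetical two-year monthly series starting January 2020
x <- ts(1:24, start = c(2020, 1), frequency = 12)
tsp(x)      ## start, end, frequency: 2020.000 2021.917 12.000
deltat(x)   ## 0.08333333 (= 1/frequency)
```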

NOTE: A full description of a ‘ts’ object can be obtained with:

head(time(AirPassengers))
## [1] 1949.000 1949.083 1949.167 1949.250 1949.333 1949.417
head(cycle(AirPassengers))
## [1] 1 2 3 4 5 6

02: Slice

A ‘ts’ object can be sliced with window()

air.train <-  window(AirPassengers,start=1949.0,end=1958.917)
air.test  <-  window(AirPassengers,start=1959.0,end=1959.917)
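window() also accepts start/end as c(year, period) pairs, which avoids the fractional-year arithmetic. A sketch of the intended 1949-1958 / 1959 split (AirPassengers starts in 1949):

```r
## same train/test split, written with c(year, month) pairs
air.train <- window(AirPassengers, start = c(1949, 1), end = c(1958, 12))
air.test  <- window(AirPassengers, start = c(1959, 1), end = c(1959, 12))
length(air.test)   ## 12 monthly observations
```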

03: visual

plot the decomposition of the data, separated into the training set (1949-1958) and testing set (1959)

Plot

A ‘ts’ object can be plotted with forecast::tsdisplay(), where the arg plot.type selects the type of the lower-right panel. Alternatively, forecast::autoplot() ignores the plot type altogether.

forecast::tsdisplay(air.train)

forecast::tsdisplay(air.train, plot.type = "histogram")  

forecast::autoplot(air.train)

Decompose

Decomposition can be done using decompose() with type “additive” or “multiplicative”. The components can be extracted by forecast::trendcycle(), forecast::seasonal(), and forecast::remainder(); forecast::seasadj() returns the seasonally adjusted series.

forecast::autoplot(decompose(AirPassengers,type="additive"))

air.decom <- decompose(air.train,type="multiplicative")

details of the decomposition can be extracted by

  • trend and cycle: forecast::trendcycle()
  • seasonally adjusted series: forecast::seasadj()
  • remainder: forecast::remainder()
head(forecast::trendcycle(air.decom),36)
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1949       NA       NA       NA       NA       NA       NA 126.7917 127.2500
## 1950 131.2500 133.0833 134.9167 136.4167 137.4167 138.7500 140.9167 143.1667
## 1951 157.1250 159.5417 161.8333 164.1250 166.6667 169.0833 171.2500 173.5833
##           Sep      Oct      Nov      Dec
## 1949 127.9583 128.5833 129.0000 129.7500
## 1950 145.7083 148.4167 151.5417 154.7083
## 1951 175.4583 176.8333 178.0417 180.1667
#forecast::seasadj(air.decom)
#forecast::remainder(air.decom)
Note that plot() still works but is less popular; it is equivalent to stats::ts.plot(). forecast::autoplot() and forecast::autolayer() are preferred.
plot(AirPassengers)

ts.plot(air.train)

04: smoothing

develop the following smoothing forecasting models and compute their accuracy

Outlier

The forecast package can identify outliers using Friedman’s super smoother via the command forecast::tsoutliers()

### SOLUTION TO QUESTION 1B ### 

require(forecast)

tsoutliers(air.train)  
## $index
## [1]  79  91 103 104 115 116
## 
## $replacements
## [1] 334.0010 392.7796 435.2084 441.3101 443.7989 445.3073
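The idea behind tsoutliers() can be sketched with base R: fit Friedman's super smoother (stats::supsmu) and look for observations with extreme residuals. This is only a rough approximation; forecast's actual rule also handles seasonality and uses quantile-based limits:

```r
## approximate outlier screening via Friedman's super smoother
tt  <- as.numeric(time(AirPassengers))
fit <- supsmu(tt, as.numeric(AirPassengers))      # smooth trend through the series
res <- as.numeric(AirPassengers) - fit$y          # residuals from the smoother
head(order(-abs(res)))                            # indices of the largest residuals
```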

Smoothing Group

The forecast package provides several exponential smoothing functions:

  • forecast::ses() Simple exponential smoothing model
  • forecast::holt() Holt’s linear trend model
  • forecast::hw() Holt-Winters’ model

They are special cases of ets(), a general exponential smoothing state space model (ETS).

air.ses   <- ses(air.train,h=12)

air.holt  <- holt(air.train,h=12)

air.hw    <- hw(air.train,h=12)

autoplot(air.ses)

The forecast package also provides accuracy computation

##-- accuracy    
accuracy(air.test,air.hw$mean)  
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -61.87019 72.27274 61.87019 -16.24225 16.24225 0.6658303  2.835781
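What accuracy() computes can be reproduced by hand. A self-contained sketch using the base-R stats::HoltWinters() as the smoother (so the numbers differ from hw() above):

```r
## hold out 1959 and score a base-R Holt-Winters fit by hand
train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1), end = c(1959, 12))
fit   <- HoltWinters(train)                 # additive seasonal by default
fc    <- predict(fit, n.ahead = 12)         # 12-month-ahead forecasts
e     <- as.numeric(test) - as.numeric(fc)  # forecast errors
c(ME = mean(e), RMSE = sqrt(mean(e^2)), MAE = mean(abs(e)))
```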

ARIMA Group

Before fitting an ARIMA model, you should check the ACF and PACF with forecast::Acf() and forecast::Pacf(), respectively

forecast::Acf(air.train)  

forecast::Acf(diff(air.train)) 

forecast::Pacf(air.train)

forecast::Pacf(diff(air.train))

air.arima <- Arima(air.train,c(0,1,1),seasonal = list(order=c(0,1,1),period=12))


##-- auto optimize  
air.arima <- auto.arima(air.train) 

autoplot(forecast(air.arima,h=12))

checkresiduals(air.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,0)(0,1,0)[12]
## Q* = 32.841, df = 23, p-value = 0.08388
## 
## Model df: 1.   Total lags used: 24
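checkresiduals() wraps a Ljung-Box test on the residuals; the base-R equivalent is Box.test(). A sketch with stats::arima() and the classic airline-model orders:

```r
## Ljung-Box test by hand with base R
fit <- arima(AirPassengers, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
Box.test(residuals(fit), lag = 24, type = "Ljung-Box")
```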
NOTE forecast::Arima() masks stats::arima(). Like arima(), Arima() returns a fitted model rather than forecasts; this is why Arima() has no arg ‘h’, and why ts.plot() or autoplot() must be applied to forecast() of the fitted model.
##-- possible forecasing model
arima(air.train,c(0,1,1),seasonal = list(order=c(0,1,1),period=12))
## 
## Call:
## arima(x = air.train, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), 
##     period = 12))
## 
## Coefficients:
##           ma1     sma1
##       -0.2278  -0.0459
## s.e.   0.0993   0.0933
## 
## sigma^2 estimated as 102.8:  log likelihood = -399.7,  aic = 805.41
plot(air.arima)

Other Group

  • forecast::naive() naive model (last constant value)
  • forecast::meanf() mean forecast model
  • forecast::rwf() random walk model
  • forecast::thetaf() Theta forecast method
plot(naive(air.train,h=12))

plot(meanf(air.train,h=12))

plot(rwf(air.train,h=12))

plot(thetaf(air.train,h=12))

plot(stlf(air.train,h=12))
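These are all simple rules; for example, naive() just repeats the last observed value. A base-R sketch of the same idea:

```r
## naive forecast by hand: repeat the last observation h times
h  <- 12
fc <- rep(tail(as.numeric(AirPassengers), 1), h)
fc[1:3]   ## 432 432 432 (the December 1960 value)
```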

Autolayer

Many forecasts can be visually combined using autolayer() and autoplot()

require(ggplot2)
air.naive <- naive(air.train,h=12)

hSize <- 12
autoplot(air.train) + 
  autolayer(forecast(air.naive,h=hSize),series="naive",PI=F) +  
  autolayer(forecast(air.holt ,h=hSize),series="Holt",PI=F)  +
  autolayer(forecast(air.arima,h=hSize),series="ARIMA",PI=F) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(limits = c(100,550),breaks=seq(100,500,50), expand = c(0, 0)) +
  xlab("Year") + ylab("Passengers") + theme_bw()  

05: API

develop a stand-alone DOS/GUI application for exponential smoothing forecasting that takes any csv file as an input (hint: this requires a separate R script and a basic Batch file)

idea

Here are the steps we take for a stand-alone API

  • use the fgui and tcltk packages to build the GUI in fGUI.R
  • embed the code in a batch file
    • set the library path
    • execute the script with Rscript

export

  • export some data for API
  ## -------------------------------------------------------------------------
  ## export airPassenger Data 
  ## -------------------------------------------------------------------------

write.csv(AirPassengers,file = "air.csv")

fgui

The R code in fGUI.R consists of four parts

  • set the R library path when running in a DOS environment
  ## set path for r library and apply
  cmdArgs = commandArgs(trailingOnly = TRUE)
  libPath <- cmdArgs[1]
  .libPaths(c(libPath)) 
  • call library
## load library (require .libPaths)

require(tcltk)
require(fgui) ## simple GUI package (built on tcltk)
require(forecast)
  • forecast function
## create sub-function
expoSeries  <-  function(period=12,opt="additive",alpha=0.5,
                         beta=0.05,csvFilename=NULL,predict=12)  {
  data <- read.csv( guiGetValue("csvFilename") ) [[2]]
  data.ts <- ts(data,start=1949,frequency=period)
  data.ht <- hw(data.ts,alpha=alpha,beta=beta,seasonal=opt,h=predict)
  windows()
  plot(data.ht)
}
  • structure and run GUI
## run main GUI
gui(expoSeries,
     title="GUI: Exponential Smoothing Series",
     argSlider=list(alpha=c(0,1,0.1),
                    beta=c(0,.1,0.001)),
     argFilename = list(csvFilename=NULL),
     argOption = list(opt=c("additive","multiplicative"))
)

DOS

  • set path to bin location, e.g. R\R-3.6.3\bin\x64
  • create batch script
@echo off

:: SET VARIABLE ::
Set lib="C:/Users/Oran/Documents/R/win-library/3.6"


:: RUN R CMD ::
Rscript fGUI.R %lib% 

pause

Alternatively, you can use the ‘shinyapps.io’ service with Shiny using the following source code src

Question 2

Consider the self-reported weight/height and the actual measurements of the same quantities for 200 participants


01: import

import the data, check for ‘NA’ values, and separate participants into groups based on the availability of data

  ### SOLUTION TO QUESTION 2A ### 

  fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/selfReport.csv"

  self <- read.csv(url(fileURL),header = T)
  head(self) ## View(self)
##   sex weight height repwt repht
## 1   M     77    182    77   180
## 2   F     58    161    51   159
## 3   F     53    161    54   158
## 4   M     68    177    70   175
## 5   F     59    157    59   155
## 6   M     76    170    76   165
  • check for ‘NA’
  naSet<- as.data.frame(which(is.na(self),arr.ind = T))
  naSmy<- xtabs(~col+row,data=naSet)
  
  noRepWt<- as.numeric(names(naSmy[1,naSmy[1,]==0]))
  noRepHt<- as.numeric(names(naSmy[2,naSmy[2,]==0]))  
  noBoth <- as.numeric(names(naSmy[1,naSmy[1,]==1 & naSmy[2,]==1]))
  
  noNA   <- setdiff(1:nrow(self),c(noRepWt,noRepHt,noBoth))
  self.noNA <- self[noNA,]
  • Alternatively, complete.cases() can be applied to eliminate all rows containing ‘NA’
 self.noNA <- self[complete.cases(self),]
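A quick way to see where the NA values sit is colSums(is.na(...)). A sketch on a hypothetical toy data frame (columns a and b are illustrative only):

```r
## per-column NA counts and incomplete-row count on toy data
d <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
colSums(is.na(d))        ## a = 1, b = 2
sum(!complete.cases(d))  ## 2 rows contain at least one NA
```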

02: separate

  • for participants who report both weight and height, separate data into testing data (15%) and training data (85%)
  ### SOLUTION TO QUESTION 2B ### 

  set.seed(17)

  testingSet<- sample(noNA,size=round(length(noNA)*.15))
  self.test <- self[testingSet,]
  self.train<- self[setdiff(noNA,testingSet),] ### why not "self[-testingSet,]"? 
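The comment above asks why not self[-testingSet,]: negative indexing keeps every other row, including the NA rows that were deliberately excluded from noNA. A toy sketch of the difference:

```r
## negative indexing quietly re-admits the NA rows
x    <- c(10, NA, 30, 40)
keep <- which(!is.na(x))     # complete rows: 1 3 4
test <- keep[1]              # pretend row 1 is the test set
x[setdiff(keep, test)]       ## 30 40    (NA row stays out)
x[-test]                     ## NA 30 40 (NA row sneaks back in)
```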

03: compare

  • For the training set, check the difference between reported and actual values
  ### SOLUTION TO QUESTION 2C ### 

  attach(self.train)
  self.train$gapHt <- height - repht
  self.train$gapWt <- weight - repwt  
  detach(self.train)

Do these commands work?

  xtabs(~sex+gapHt,data=self.train) ## do you notice any error?  how to correct it?
##    gapHt
## sex -106 -6 -3 -2 -1  0  1  2  3  4  5  6  7  8
##   F    1  1  0  2  1  8  8 21 25 14  2  2  0  0
##   M    0  0  1  3  5 14  7  8 20  6  2  1  1  1
  xtabs(~sex+gapWt,data=self.train) ## do you notice any error?  how to correct it?
##    gapWt
## sex -9 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  7 110
##   F  0  0  2  0  0  6 12 29 15  7  4  6  2  1   1
##   M  1  1  3  4  1 10 15 10 10  7  5  1  1  0   0
  head(self.train) ## fix(self.train)
##   sex weight height repwt repht gapHt gapWt
## 1   M     77    182    77   180     2     0
## 2   F     58    161    51   159     2     7
## 3   F     53    161    54   158     3    -1
## 5   F     59    157    59   155     2     0
## 7   M     76    167    77   165     2    -1
## 8   M     69    186    73   180     6    -4
  temp <- xtabs(~gapHt+gapWt,data=self.train)
  
  temp.mat <- as.matrix(temp)
  filled.contour(z=temp.mat,
                 x=as.numeric(rownames(temp.mat)),
                 y=as.numeric(colnames(temp.mat)),
                 nlevels=10)

04: predict

For training set, build a linear regression model to predict actual weight with actual height

  ### SOLUTION TO QUESTION 2D ### 

  weight.lm      <- lm(weight~height+sex,data=self.train)
  weight.predict <- predict(weight.lm,newdata=self.test)
  
  #str(weight.lm)
  
  coefficients(weight.lm)
## (Intercept)      height        sexM 
## 123.0152610  -0.3965508  22.7796728
  ## residuals(weight.lm)
  ## fitted.values(weight.lm)

lm() can be customized to a user’s needs. Here are some interesting variations

  • weight~. all variables (no interaction)
  • weight~.-1 all variables (no interaction) and no intercept
  • weight~height*sex main effects plus interaction between height and sex
  • weight~height:sex only interaction between height and sex
  • weight~I(sqrt(height)) apply sqrt() function to height
  ##-- advance construction of lm
  
  ##-- combination of variable, i.e., lm(weight~height+sex+height:sex,data=self.train)
  # lm(weight~height*sex,data=self.train)

  ##-- interaction term   
  # lm(weight~height:sex,data=self.train)  
  
  ##-- no constant term
  # lm(weight~height:sex -1,data=self.train)  
  
  ##-- apply function 
  # lm(weight~I(sqrt(height)),data=self.train)
  
  ##-- all main term
  # lm(weight~.,data=self.train)
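How a formula expands into design-matrix columns can be inspected with model.matrix(). A sketch on hypothetical toy data (h and s stand in for height and sex):

```r
## formula expansion on toy data
d <- data.frame(h = c(160, 170, 180), s = factor(c("F", "M", "F")))
colnames(model.matrix(~ h * s, d))   ## "(Intercept)" "h" "sM" "h:sM"
colnames(model.matrix(~ h : s, d))   ## "(Intercept)" "h:sF" "h:sM"
```

Note how the interaction-only formula gives the factor full dummy coding because its main effect is absent.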

05: visualize

standard visualization of an lm() fit is built into plot()

  oldpar <- par(no.readonly = TRUE)
  par(mfrow=c(2,2))
  plot(weight.lm)

  par(oldpar)

lm() objects have several built-in summaries

  anova(weight.lm)
## Analysis of Variance Table
## 
## Response: weight
##            Df  Sum Sq Mean Sq F value Pr(>F)    
## height      1   247.8   247.8  1.7252  0.191    
## sex         1 13510.9 13510.9 94.0617 <2e-16 ***
## Residuals 151 21689.5   143.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  confint(weight.lm)
##                  2.5 %      97.5 %
## (Intercept) 93.6005950 152.4299271
## height      -0.5758924  -0.2172091
## sexM        18.1389697  27.4203760
  forecast::accuracy(weight.predict,self.test$weight)  
##                ME    RMSE     MAE      MPE     MAPE
## Test set 2.973163 13.1365 9.31695 1.648525 12.46379

06: Assumption

  • L: Linear relationship using cor or visual inspection
  • I: Independence of errors (at least no AR(1)) using car::durbinWatsonTest(); independence of predictors (no multicollinearity) using ols_vif_tol() (VIF<5.0)
  • N: Normality using ols_test_normality()
  • E: weak Exogeneity (\(x_j\) is not a random variable). In Engineering, this assumption is usually ignored.
  • S: homoScedasticity (single variance) using ols_plot_resid_fit()

The main package for checking these assumptions is olsrr

  require(olsrr)

  # ols_plot_resid_hist(weight.lm)
  # ols_plot_resid_fit(weight.lm) + theme_bw()
  # ols_plot_cooksd_chart(weight.lm)  ## check for major influence points
  # ols_plot_added_variable(weight.lm)
  # ols_plot_response(weight.lm)
  
  olsrr::ols_vif_tol(weight.lm)
##   Variables Tolerance      VIF
## 1    height 0.6836651 1.462705
## 2      sexM 0.6836651 1.462705
  ols_test_normality(weight.lm)
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9308         0.0000 
## Kolmogorov-Smirnov        0.0716         0.4083 
## Cramer-von Mises         12.5463         0.0000 
## Anderson-Darling          1.3591         0.0016 
## -----------------------------------------------
  ols_test_correlation(weight.lm)
## [1] 0.9613993
  ols_test_outlier(weight.lm)
##    studentized_residual unadjusted_p_val bonferroni_p_val
## 12             14.54216     1.899326e-30     2.924962e-28

Question 3

Apply your knowledge of logistic regression to the following dataset


01: import

  • observe pattern
  ### SOLUTION TO QUESTION 3A ### 

  fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/height.csv"

  height <- read.csv(url(fileURL),header=T)
  height$Gender <- factor(height$Gender,levels = c(1,2)
                         ,labels = c("F","M"))  
  xtabs(~Gender+Height,data=height)
##       Height
## Gender 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 79
##      F  2  3  1  2  4  3  6 12  6  5  3  1  2  0  0  0  0  0  0
##      M  0  0  0  0  0  1  0  1  1  4  5  5  8  3 13  5  2  1  1

02: construct

  height.glm <- glm(Gender~Height,data=height,family = "binomial")
  
  
  prob <- data.frame(Gender=factor(rep("M",21)),Height=60:80)
  prob$prob<- predict(height.glm,newdata= prob,type = "response")
  plot(prob$Height,prob$prob,col="black")

  • What is the cut-off threshold?
  • Can we apply other methods?
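For a single-predictor logistic model, the 50% cut-off is where the linear predictor is zero, i.e. height = -intercept/slope. A sketch on simulated data (the group means 64 and 72 are hypothetical, not the workshop data):

```r
## 50% cut-off of a univariate logistic regression
set.seed(1)
ht  <- c(rnorm(50, 64, 3), rnorm(50, 72, 3))   # simulated heights
sex <- factor(rep(c("F", "M"), each = 50))
m   <- glm(sex ~ ht, family = "binomial")
cutoff <- -coef(m)[1] / coef(m)[2]             # roughly midway between 64 and 72
```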

99: EXTRA

  • Consider the previous dataset Davis (self.train) and construct a model to predict the sex of participants from the other factors.
  ### SOLUTION TO QUESTION 3B ### 

  fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/selfReport.csv"

  self <- read.csv(file=fileURL,header = T)
  self$sex   <- factor(self$sex)
  self$gapHt <- self$height - self$repht
  self$gapWt <- self$weight - self$repwt 
  

  noNA <- setdiff(1:200,unique(which(is.na(self),arr.ind = T)[,1]))
  
  set.seed(17)
  testingSet<- sample(noNA,size=round(length(noNA)*.15))  
  self.test <- self[testingSet,]
  self.train<- self[setdiff(noNA,testingSet),]  ### why not "self[-testingSet,]"? 
  
  sex.log <- glm(sex~gapWt+gapHt,data=self.train,family="binomial")
  sex.best<- step(glm(sex~.,data=self.train,family="binomial"))
  
  anova(sex.log)
  anova(sex.best)
  
  self.test$probSex<- predict(sex.log,newdata=self.test,type = "response")
  self.test$bestSex<- predict(sex.best,newdata=self.test,type = "response")

Question 4

Consider the award dataset of different student groups with different Mathematics scores and follow the data mining steps

01: import

  • import and analyse the data
  ### SOLUTION TO QUESTION 4A ### 
  fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/poisson_sim.csv"

  require(data.table)
  award.DT <- as.data.table(read.csv(url(fileURL),header=T))
  award.DT[,prog:= factor(prog, levels=1:3, labels=c("General", "Academic", "Vocational"))]
  award.DT[,num_awards:=factor(num_awards)]
    
  require(ggplot2)
  require(dplyr)
  award.DT %>% ggplot( aes(x=num_awards,y=math,fill=prog) ) + 
    geom_violin() + facet_grid(~prog) 

  award.DT[,.(min=min(math),avg=mean(math),max=max(math),sd=sd(math),.N),by=.(prog,num_awards)] -> temp.DT


  
  xtabs(avg~prog+num_awards,temp.DT)
##             num_awards
## prog                0        1        2        3        4        5        6
##   General    49.44444 52.33333  0.00000  0.00000  0.00000  0.00000  0.00000
##   Academic   52.47917 57.78125 64.36364 60.88889 66.00000 66.00000 69.00000
##   Vocational 45.62500 45.62500 65.50000  0.00000  0.00000  0.00000  0.00000
  xtabs(N~prog+num_awards,temp.DT)
##             num_awards
## prog          0  1  2  3  4  5  6
##   General    36  9  0  0  0  0  0
##   Academic   48 32 11  9  2  2  1
##   Vocational 40  8  2  0  0  0  0

02: glm()

  • use Poisson Regression to predict response
  ## CAUTION: as.numeric() on a factor returns the level index (here count + 1);
  ## as.numeric(as.character(num_awards)) would recover the original counts
  award.DT[,count_awards:=as.numeric(num_awards)]
  award.pois <- glm(count_awards ~ prog + math, family="poisson", data=award.DT)
  summary(award.pois)
## 
## Call:
## glm(formula = count_awards ~ prog + math, family = "poisson", 
##     data = award.DT)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.233114   0.362234  -3.404 0.000664 ***
## progAcademic    0.315241   0.159874   1.972 0.048631 *  
## progVocational  0.127720   0.187238   0.682 0.495161    
## math            0.027879   0.006517   4.278 1.89e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 105.241  on 199  degrees of freedom
## Residual deviance:  68.215  on 196  degrees of freedom
## AIC: 538.66
## 
## Number of Fisher Scoring iterations: 4
  head(predict(award.pois,type="response"))
##         1         2         3         4         5         6 
## 1.0383551 0.9138565 1.1289346 1.0677106 1.0098068 0.9396922
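Poisson regression models log(E[y]) as a linear function of the predictors, so type="response" is just exp() of the link-scale prediction. A sketch on simulated data (the coefficients are arbitrary):

```r
## response = exp(link) for a Poisson GLM
set.seed(1)
d   <- data.frame(x = runif(100))
d$y <- rpois(100, exp(0.5 + d$x))            # simulated counts
m   <- glm(y ~ x, family = "poisson", data = d)
all.equal(as.numeric(predict(m, type = "response")),
          exp(as.numeric(predict(m, type = "link"))))   ## TRUE
```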
  award.grid <- expand.grid( math=seq(32,70,2)
                             ,prog=factor(1:3, labels=c("General", "Academic", "Vocational"))
                             ,num_awards=NA)
  
  award.grid$pois_awards <- predict(award.pois,newdata = award.grid,type="response")

  xtabs(pois_awards~prog+math,award.grid)
##             math
## prog                32        34        36        38        40        42
##   General    0.7110651 0.7518386 0.7949501 0.8405337 0.8887311 0.9396922
##   Academic   0.9745785 1.0304622 1.0895504 1.1520268 1.2180857 1.2879324
##   Vocational 0.8079366 0.8542649 0.9032496 0.9550433 1.0098068 1.0677106
##             math
## prog                44        46        48        50        52        54
##   General    0.9935755 1.0505485 1.1107885 1.1744827 1.2418292 1.3130374
##   Academic   1.3617843 1.4398710 1.5224352 1.6097338 1.7020383 1.7996356
##   Vocational 1.1289346 1.1936693 1.2621160 1.3344876 1.4110090 1.4919183
##             math
## prog                56        58        60        62        64        66
##   General    1.3883289 1.4679376 1.5521113 1.6411115 1.7352152 1.8347149
##   Academic   1.9028292 2.0119402 2.1273077 2.2492906 2.3782681 2.5146414
##   Vocational 1.5774670 1.6679212 1.7635622 1.8646874 1.9716112 2.0846662
##             math
## prog                68        70
##   General    1.9399201 2.0511579
##   Academic   2.6588345 2.8112959
##   Vocational 2.2042039 2.3305961
  ##-- 
  # plot(jitter(award.DT$math),jitter(award.DT$count_awards),col=award.DT$prog,pch=16,cex=0.5,xlab="Math",ylab="# awards")   

What is your conclusion?

03: compare

  • compare the Poisson regression with a simple linear model and calculate the AIC of each model
  ### SOLUTION TO QUESTION 4C ### 

  award.lm   <- lm(count_awards ~ prog + math, data=award.DT)
  summary(award.lm)
## 
## Call:
## lm(formula = count_awards ~ prog + math, data = award.DT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7311 -0.5618 -0.1537  0.2851  4.4126 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.195504   0.411417  -2.906  0.00408 ** 
## progAcademic    0.478613   0.168956   2.833  0.00510 ** 
## progVocational  0.212506   0.187433   1.134  0.25828    
## math            0.047889   0.007773   6.161 4.03e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9019 on 196 degrees of freedom
## Multiple R-squared:  0.2773, Adjusted R-squared:  0.2662 
## F-statistic: 25.07 on 3 and 196 DF,  p-value: 9.016e-14
  award.grid$lm_awards <- predict(award.lm, newdata=award.grid,type="response")
  xtabs(lm_awards~prog+math,award.grid)  
##             math
## prog                32        34        36        38        40        42
##   General    0.3369374 0.4327150 0.5284926 0.6242702 0.7200478 0.8158254
##   Academic   0.8155504 0.9113280 1.0071056 1.1028832 1.1986608 1.2944384
##   Vocational 0.5494435 0.6452211 0.7409987 0.8367763 0.9325539 1.0283315
##             math
## prog                44        46        48        50        52        54
##   General    0.9116030 1.0073806 1.1031582 1.1989358 1.2947134 1.3904910
##   Academic   1.3902160 1.4859936 1.5817712 1.6775488 1.7733264 1.8691039
##   Vocational 1.1241091 1.2198867 1.3156643 1.4114419 1.5072195 1.6029971
##             math
## prog                56        58        60        62        64        66
##   General    1.4862686 1.5820462 1.6778238 1.7736014 1.8693790 1.9651566
##   Academic   1.9648815 2.0606591 2.1564367 2.2522143 2.3479919 2.4437695
##   Vocational 1.6987747 1.7945523 1.8903299 1.9861075 2.0818851 2.1776627
##             math
## prog                68        70
##   General    2.0609342 2.1567118
##   Academic   2.5395471 2.6353247
##   Vocational 2.2734403 2.3692179
  xtabs(pois_awards~prog+math,award.grid)  
##             math
## prog                32        34        36        38        40        42
##   General    0.7110651 0.7518386 0.7949501 0.8405337 0.8887311 0.9396922
##   Academic   0.9745785 1.0304622 1.0895504 1.1520268 1.2180857 1.2879324
##   Vocational 0.8079366 0.8542649 0.9032496 0.9550433 1.0098068 1.0677106
##             math
## prog                44        46        48        50        52        54
##   General    0.9935755 1.0505485 1.1107885 1.1744827 1.2418292 1.3130374
##   Academic   1.3617843 1.4398710 1.5224352 1.6097338 1.7020383 1.7996356
##   Vocational 1.1289346 1.1936693 1.2621160 1.3344876 1.4110090 1.4919183
##             math
## prog                56        58        60        62        64        66
##   General    1.3883289 1.4679376 1.5521113 1.6411115 1.7352152 1.8347149
##   Academic   1.9028292 2.0119402 2.1273077 2.2492906 2.3782681 2.5146414
##   Vocational 1.5774670 1.6679212 1.7635622 1.8646874 1.9716112 2.0846662
##             math
## prog                68        70
##   General    1.9399201 2.0511579
##   Academic   2.6588345 2.8112959
##   Vocational 2.2042039 2.3305961
  AIC(award.pois)
## [1] 538.6557
  AIC(award.lm)
## [1] 532.2495
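AIC can be reproduced by hand as -2 log-likelihood + 2 (number of estimated parameters). A sketch on the built-in cars data (for lm(), the parameter count includes sigma):

```r
## AIC = -2*logLik + 2*df
m  <- lm(dist ~ speed, data = cars)
ll <- logLik(m)
c(by_hand = -2 * as.numeric(ll) + 2 * attr(ll, "df"),
  AIC     = AIC(m))   # the two numbers agree
```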

Question 5

Consider the dataset Professor Salaries and follow these instructions.


01: import

import data from an MS Excel file with package openxlsx (or the older xlsx package)

  ### SOLUTION TO QUESTION 5A ### 
 
  require(openxlsx)
  require(dplyr)     ## glimpse() below comes from dplyr
  fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/Salaries.xlsx" 

  ##-- import sheet by name
  prof <- read.xlsx(xlsxFile=fileURL,sheet = "Salaries") 
  
  ##-- import sheet by index
  prof <- read.xlsx(xlsxFile=fileURL, sheet=1)   
  glimpse(prof) 
    
  ##-- import specific row/col of sheet index
  wb       <- loadWorkbook("Salaries.xlsx") ##-- where is your file?  
  sheets   <- getSheets(wb)
  sheet    <- sheets[[1]] ##-- access to first sheet
  nLastRow <- sheet$getLastRowNum()
  prof     <- as.data.frame(readRows(sheet,startRow=2,endRow=nLastRow,startColumn = 1)) 
  colnames(prof) <- readRows(sheet,startRow=1,endRow=1,startColumn = 1)                          
  glimpse(prof)  ##-- this leads to ERROR; can you explain WHY? 
  
  
  require(dplyr)
  require(lattice)
  ##-- convert data into appropriate type
  prof <- mutate(prof,rank=factor(rank,labels=c("AsstProf","AssocProf","Prof")),
                 discipline=factor(discipline),
                 yrs.since.phd=as.numeric(yrs.since.phd) ,
                 yrs.service=as.numeric(yrs.service) ,
                 sex=factor(sex,labels=c("M","F")),
                 salary=as.numeric(salary) ) 
  
  ##-- play with data
  alys01 <- filter(prof,discipline=="B") %>% group_by(rank,sex) %>% summarise(avg=mean(salary),n=length(salary))
  xtabs(avg~sex+rank,data=alys01)  
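The group_by/summarise pipeline above can also be written with base R's aggregate(). A sketch on a hypothetical toy data frame with the same column names:

```r
## group means with base R instead of group_by/summarise
d <- data.frame(rank   = c("Prof", "Prof", "AsstProf"),
                sex    = c("F", "M", "F"),
                salary = c(120000, 130000, 80000))
aggregate(salary ~ sex + rank, data = d, FUN = mean)
```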

02: observe

observe the pattern of **salary** and its relationship with each column

  ### SOLUTION TO QUESTION 5B ### 

  glimpse(prof) ## data contain quantitative (number) // qualitative (description)
  
  ##-- analysis of qualitative data (description)
  bwplot(salary~rank|discipline:sex,data=prof) ##-- is there any diff in terms of rank, discipline, and sex 
  anova(lm(salary~rank*sex*discipline,data=prof))

  ##-- analysis of quantitative (number) data
  cor(prof[,c(3,4,6)])
  anova(lm(salary~yrs.since.phd+yrs.service,data=prof))
  pairs(prof[,c(3,4,6)],col=prof$discipline)
  
  ##-- analysis of both  
  xyplot(salary~yrs.since.phd|(discipline:rank),data=prof)
  xyplot(salary~yrs.service|(discipline:rank),data=prof)  
  prof.lm  <- step( lm(salary~.,data=prof)) ## find the best model
  anova(prof.lm)

03: hist()

visualize the histogram of salary based on the main factors

  ### SOLUTION TO QUESTION 5C ### 

  prof.hist <- hist(prof$salary,breaks=20,plot = F)
  prof.B    <- filter(prof,discipline=="B")
  prof.Asst <- filter(prof,rank=="AsstProf")
  prof.nFull<- filter(prof,rank!="Prof")
  
  hist(prof$salary,breaks=prof.hist$breaks,col="blue")
  hist(prof.B$salary,breaks=prof.hist$breaks,col="red",add=T)
  rug(prof$salary,col="blue")
  rug(prof.B$salary,col="red")  
  
  hist(prof$salary,breaks=prof.hist$breaks,col="blue")
  hist(prof.nFull$salary,breaks=prof.hist$breaks,col="red",add=T)
  hist(prof.Asst$salary,breaks=prof.hist$breaks,col="orange",add=T)  
  rug(prof$salary,col="blue")
  rug(prof.nFull$salary,col="red")
  rug(prof.Asst$salary,col="orange")      

04: OLR

predict salary with an ordinary linear regression model

  ### SOLUTION TO QUESTION 5D ### 

  prof.lm <- lm(salary~.,data=prof)
  summary(prof.lm)
  confint(prof.lm)

05: DT

use a decision tree (ctree in the party package or rpart in the rpart package) to analyse the data

  ### SOLUTION TO QUESTION 5E ### 

  require(party)
  prof.ctree <- ctree(salary~.,data=prof)
  plot(prof.ctree)
  summary(prof.ctree)


Copyright 2019   Oran Kittithreerapronchai.   All Rights Reserved.   Last modified: 2023-46-07,