Instruction

Read each instruction carefully and think about how to develop R code to answer each question.

Question 1

Consider the AirPassengers dataset and explore the following data mining procedure for time series data


00: background

Time series in R are encoded as the ‘ts’ class (an extension of ‘vector’) and handled through many libraries:

  • forecast: a unified library to manage and forecast time series
  • ggplot2: an extension that helps with customizing autoplot() and autolayer()
  • fpp2: a library of data and methods related to Forecasting: Principles and Practice by R.J. Hyndman and G. Athanasopoulos, one of the best online textbooks about forecasting in R.

01: ts class

Basic information about data of the ‘ts’ class can be derived using tsp(), which reveals the start, end, and frequency of the time series

tsp(AirPassengers)
## [1] 1949.000 1960.917   12.000

The command combines the results of start(), end(), and frequency(). Notice that the time points of ‘ts’ data are encoded as fractional years, spaced at intervals of deltat()

start(AirPassengers)
## [1] 1949    1
end(AirPassengers)
## [1] 1960   12
frequency(AirPassengers)
## [1] 12
deltat(AirPassengers)
## [1] 0.08333333
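The same accessors apply to any ts object. A minimal sketch with a hypothetical two-year monthly series (names and values are illustrative only):

```r
## hypothetical two-year monthly series starting January 2020
x <- ts(1:24, start = c(2020, 1), frequency = 12)
tsp(x)      ## start, end, frequency: 2020.000 2021.917 12.000
deltat(x)   ## 0.08333333 (= 1/frequency)
```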

NOTE: A full description of a ‘ts’ object can be obtained with:

head(time(AirPassengers))
## [1] 1949.000 1949.083 1949.167 1949.250 1949.333 1949.417
head(cycle(AirPassengers))
## [1] 1 2 3 4 5 6

02: Slice

A ‘ts’ object can be sliced with window()

air.train <-  window(AirPassengers,start=1949.0,end=1958.917)
air.test  <-  window(AirPassengers,start=1959.0,end=1959.917)
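window() also accepts start/end as c(year, period) pairs, which avoids the fractional-year arithmetic. A sketch of the intended 1949-1958 / 1959 split (AirPassengers starts in 1949):

```r
## same train/test split, written with c(year, month) pairs
air.train <- window(AirPassengers, start = c(1949, 1), end = c(1958, 12))
air.test  <- window(AirPassengers, start = c(1959, 1), end = c(1959, 12))
length(air.test)   ## 12 monthly observations
```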

03: visual

plot the decomposition of the data, separated into the training set (1949-1958) and testing set (1959)

Plot

A ‘ts’ object can be plotted with forecast::tsdisplay(), where the arg plot.type selects the type of the lower-right panel. Alternatively, forecast::autoplot() ignores the plot type altogether.

forecast::tsdisplay(air.train)

forecast::tsdisplay(air.train, plot.type = "histogram")  

forecast::autoplot(air.train)

Decompose

Decomposition can be done using decompose() with type “additive” or “multiplicative”. The components can be extracted by forecast::trendcycle(), forecast::seasonal(), and forecast::remainder(); forecast::seasadj() returns the seasonally adjusted series.

forecast::autoplot(decompose(AirPassengers,type="additive"))

air.decom <- decompose(air.train,type="multiplicative")

details of the decomposition can be extracted by

  • trend and cycle: forecast::trendcycle()
  • seasonally adjusted series: forecast::seasadj()
  • remainder: forecast::remainder()
head(forecast::trendcycle(air.decom),36)
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1949       NA       NA       NA       NA       NA       NA 126.7917 127.2500
## 1950 131.2500 133.0833 134.9167 136.4167 137.4167 138.7500 140.9167 143.1667
## 1951 157.1250 159.5417 161.8333 164.1250 166.6667 169.0833 171.2500 173.5833
##           Sep      Oct      Nov      Dec
## 1949 127.9583 128.5833 129.0000 129.7500
## 1950 145.7083 148.4167 151.5417 154.7083
## 1951 175.4583 176.8333 178.0417 180.1667
#forecast::seasadj(air.decom)
#forecast::remainder(air.decom)
Note that plot() still works but is less popular; it is equivalent to stats::ts.plot(). forecast::autoplot() and forecast::autolayer() are preferred.
plot(AirPassengers)

ts.plot(air.train)

04: smoothing

develop the following smoothing forecasting models and compute their accuracy

Outlier

The forecast package can identify outliers using Friedman’s super smoother via the command forecast::tsoutliers()

### SOLUTION TO QUESTION 1B ### 

require(forecast)

tsoutliers(air.train)  
## $index
## [1]  79  91 103 104 115 116
## 
## $replacements
## [1] 334.0010 392.7796 435.2084 441.3101 443.7989 445.3073
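The idea behind tsoutliers() can be sketched with base R: fit Friedman's super smoother (stats::supsmu) and look for observations with extreme residuals. This is only a rough approximation; forecast's actual rule also handles seasonality and uses quantile-based limits:

```r
## approximate outlier screening via Friedman's super smoother
tt  <- as.numeric(time(AirPassengers))
fit <- supsmu(tt, as.numeric(AirPassengers))      # smooth trend through the series
res <- as.numeric(AirPassengers) - fit$y          # residuals from the smoother
head(order(-abs(res)))                            # indices of the largest residuals
```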

Smoothing Group

The forecast package provides several exponential smoothing functions:

  • forecast::ses() Simple exponential smoothing model
  • forecast::holt() Holt’s linear trend model
  • forecast::hw() Holt-Winters’ model

They are special cases of ets(), a general exponential smoothing state space model (ETS).

air.ses   <- ses(air.train,h=12)

air.holt  <- holt(air.train,h=12)

air.hw    <- hw(air.train,h=12)

autoplot(air.ses)

The forecast package also provides accuracy computation

##-- accuracy    
accuracy(air.test,air.hw$mean)  
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -61.87019 72.27274 61.87019 -16.24225 16.24225 0.6658303  2.835781
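What accuracy() computes can be reproduced by hand. A self-contained sketch using the base-R stats::HoltWinters() as the smoother (so the numbers differ from hw() above):

```r
## hold out 1959 and score a base-R Holt-Winters fit by hand
train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1), end = c(1959, 12))
fit   <- HoltWinters(train)                 # additive seasonal by default
fc    <- predict(fit, n.ahead = 12)         # 12-month-ahead forecasts
e     <- as.numeric(test) - as.numeric(fc)  # forecast errors
c(ME = mean(e), RMSE = sqrt(mean(e^2)), MAE = mean(abs(e)))
```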

ARIMA Group

Before fitting an ARIMA model, you should check the ACF and PACF with forecast::Acf() and forecast::Pacf(), respectively

forecast::Acf(air.train)  

forecast::Acf(diff(air.train)) 

forecast::Pacf(air.train)

forecast::Pacf(diff(air.train))

air.arima <- Arima(air.train,c(0,1,1),seasonal = list(order=c(0,1,1),period=12))


##-- auto optimize  
air.arima <- auto.arima(air.train) 

autoplot(forecast(air.arima,h=12))

checkresiduals(air.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,0)(0,1,0)[12]
## Q* = 32.841, df = 23, p-value = 0.08388
## 
## Model df: 1.   Total lags used: 24
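checkresiduals() wraps a Ljung-Box test on the residuals; the base-R equivalent is Box.test(). A sketch with stats::arima() and the classic airline-model orders:

```r
## Ljung-Box test by hand with base R
fit <- arima(AirPassengers, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
Box.test(residuals(fit), lag = 24, type = "Ljung-Box")
```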
NOTE forecast::Arima() masks stats::arima(). Like arima(), Arima() returns a fitted model rather than forecasts; this is why Arima() has no arg ‘h’, and why ts.plot() or autoplot() must be applied to forecast() of the fitted model.
##-- possible forecasing model
arima(air.train,c(0,1,1),seasonal = list(order=c(0,1,1),period=12))
## 
## Call:
## arima(x = air.train, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), 
##     period = 12))
## 
## Coefficients:
##           ma1     sma1
##       -0.2278  -0.0459
## s.e.   0.0993   0.0933
## 
## sigma^2 estimated as 102.8:  log likelihood = -399.7,  aic = 805.41
plot(air.arima)

Other Group

  • forecast::naive() naive model (last constant value)
  • forecast::meanf() mean forecast model
  • forecast::rwf() random walk model
  • forecast::thetaf() Theta forecast method
plot(naive(air.train,h=12))

plot(meanf(air.train,h=12))

plot(rwf(air.train,h=12))

plot(thetaf(air.train,h=12))

plot(stlf(air.train,h=12))
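These are all simple rules; for example, naive() just repeats the last observed value. A base-R sketch of the same idea:

```r
## naive forecast by hand: repeat the last observation h times
h  <- 12
fc <- rep(tail(as.numeric(AirPassengers), 1), h)
fc[1:3]   ## 432 432 432 (the December 1960 value)
```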

Autolayer

Many forecasts can be visually combined using autolayer() and autoplot()

require(ggplot2)
air.naive <- naive(air.train,h=12)

hSize <- 12
autoplot(air.train) + 
  autolayer(forecast(air.naive,h=hSize),series="naive",PI=F) +  
  autolayer(forecast(air.holt ,h=hSize),series="Holt",PI=F)  +
  autolayer(forecast(air.arima,h=hSize),series="ARIMA",PI=F) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(limits = c(100,550),breaks=seq(100,500,50), expand = c(0, 0)) +
  xlab("Year") + ylab("Passengers") + theme_bw()  

05: API

develop a stand-alone DOS/GUI application for exponential smoothing forecasting that takes any csv file as an input (hint: this requires a separate R script and a basic Batch file)

idea

Here are the steps we take for a stand-alone API

  • use the fgui and tcltk packages to build the GUI in fGUI.R
  • embed the code in a batch file
    • set the library path
    • execute the script with Rscript

export

  • export some data for API
  ## -------------------------------------------------------------------------
  ## export airPassenger Data 
  ## -------------------------------------------------------------------------

write.csv(AirPassengers,file = "air.csv")

fgui

The R code in fGUI.R consists of four parts

  • set the R library path when running in a DOS environment
  ## set path for r library and apply
  cmdArgs = commandArgs(trailingOnly = TRUE)
  libPath <- cmdArgs[1]
  .libPaths(c(libPath)) 
  • call library
## load library (require .libPaths)

require(tcltk)
require(fgui) ## simple GUI package (built on tcltk)
require(forecast)
  • forecast function
## create sub-function
expoSeries  <-  function(period=12,opt="additive",alpha=0.5,
                         beta=0.05,csvFilename=NULL,predict=12)  {
  data <- read.csv( guiGetValue("csvFilename") ) [[2]]
  data.ts <- ts(data,start=1949,frequency=period)
  data.ht <- hw(data.ts,alpha=alpha,beta=beta,seasonal=opt,h=predict)
  windows()
  plot(data.ht)
}
  • structure and run GUI
## run main GUI
gui(expoSeries,
     title="GUI: Exponential Smoothing Series",
     argSlider=list(alpha=c(0,1,0.1),
                    beta=c(0,.1,0.001)),
     argFilename = list(csvFilename=NULL),
     argOption = list(opt=c("additive","multiplicative"))
)

DOS

  • set path to bin location, e.g. R\R-3.6.3\bin\x64
  • create batch script
@echo off

:: SET VARIABLE ::
Set lib="C:/Users/Oran/Documents/R/win-library/3.6"


:: RUN R CMD ::
Rscript fGUI.R %lib% 

pause

Alternatively, you can use the ‘shinyapps.io’ service with Shiny using the following source code src

Question 2

Consider the self-reported weight/height and the actual measurements of the same quantities for 200 participants


01: import

import the data, check for ‘NA’ values, and separate participants into groups based on the availability of data

  ### SOLUTION TO QUESTION 2A ### 

  fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/selfReport.csv"

  self <- read.csv(url(fileURL),header = T)
  head(self) ## View(self)
##   sex weight height repwt repht
## 1   M     77    182    77   180
## 2   F     58    161    51   159
## 3   F     53    161    54   158
## 4   M     68    177    70   175
## 5   F     59    157    59   155
## 6   M     76    170    76   165
  • check for ‘NA’
  naSet<- as.data.frame(which(is.na(self),arr.ind = T))
  naSmy<- xtabs(~col+row,data=naSet)
  
  noRepWt<- as.numeric(names(naSmy[1,naSmy[1,]==0]))
  noRepHt<- as.numeric(names(naSmy[2,naSmy[2,]==0]))  
  noBoth <- as.numeric(names(naSmy[1,naSmy[1,]==1 & naSmy[2,]==1]))
  
  noNA   <- setdiff(1:nrow(self),c(noRepWt,noRepHt,noBoth))
  self.noNA <- self[noNA,]
  • Alternatively, complete.cases() can be applied to eliminate all rows containing ‘NA’
 self.noNA <- self[complete.cases(self),]
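A quick way to see where the NA values sit is colSums(is.na(...)). A sketch on a hypothetical toy data frame (columns a and b are illustrative only):

```r
## per-column NA counts and incomplete-row count on toy data
d <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
colSums(is.na(d))        ## a = 1, b = 2
sum(!complete.cases(d))  ## 2 rows contain at least one NA
```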

02: separate

  • for participants who report both weight and height, separate data into testing data (15%) and training data (85%)
  ### SOLUTION TO QUESTION 2B ### 

  set.seed(17)

  testingSet<- sample(noNA,size=round(length(noNA)*.15))
  self.test <- self[testingSet,]
  self.train<- self[setdiff(noNA,testingSet),] ### why not "self[-testingSet,]"? 
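The comment above asks why not self[-testingSet,]: negative indexing keeps every other row, including the NA rows that were deliberately excluded from noNA. A toy sketch of the difference:

```r
## negative indexing quietly re-admits the NA rows
x    <- c(10, NA, 30, 40)
keep <- which(!is.na(x))     # complete rows: 1 3 4
test <- keep[1]              # pretend row 1 is the test set
x[setdiff(keep, test)]       ## 30 40    (NA row stays out)
x[-test]                     ## NA 30 40 (NA row sneaks back in)
```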

03: compare

  • For the training set, check the difference between reported and actual values
  ### SOLUTION TO QUESTION 2C ### 

  attach(self.train)
  self.train$gapHt <- height - repht
  self.train$gapWt <- weight - repwt  
  detach(self.train)

Do these commands work?

  xtabs(~sex+gapHt,data=self.train) ## do you notice any error?  how to correct it?
##    gapHt
## sex -106 -6 -3 -2 -1  0  1  2  3  4  5  6  7  8
##   F    1  1  0  2  1  8  8 21 25 14  2  2  0  0
##   M    0  0  1  3  5 14  7  8 20  6  2  1  1  1
  xtabs(~sex+gapWt,data=self.train) ## do you notice any error?  how to correct it?
##    gapWt
## sex -9 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  7 110
##   F  0  0  2  0  0  6 12 29 15  7  4  6  2  1   1
##   M  1  1  3  4  1 10 15 10 10  7  5  1  1  0   0
  head(self.train) ## fix(self.train)
##   sex weight height repwt repht gapHt gapWt
## 1   M     77    182    77   180     2     0
## 2   F     58    161    51   159     2     7
## 3   F     53    161    54   158     3    -1
## 5   F     59    157    59   155     2     0
## 7   M     76    167    77   165     2    -1
## 8   M     69    186    73   180     6    -4
  temp <- xtabs(~gapHt+gapWt,data=self.train)
  
  temp.mat <- as.matrix(temp)
  filled.contour(z=temp.mat,
                 x=as.numeric(rownames(temp.mat)),
                 y=as.numeric(colnames(temp.mat)),
                 nlevels=10)

04: predict

For training set, build a linear regression model to predict actual weight with actual height

  ### SOLUTION TO QUESTION 2D ### 

  weight.lm      <- lm(weight~height+sex,data=self.train)
  weight.predict <- predict(weight.lm,newdata=self.test)
  
  #str(weight.lm)
  
  coefficients(weight.lm)
## (Intercept)      height        sexM 
## 123.0152610  -0.3965508  22.7796728
  ## residuals(weight.lm)
  ## fitted.values(weight.lm)

lm() can be customized to a user’s needs. Here are some interesting variations

  • weight~. all variables (no interaction)
  • weight~.-1 all variables (no interaction) and no intercept
  • weight~height*sex main effects plus interaction between height and sex
  • weight~height:sex only interaction between height and sex
  • weight~I(sqrt(height)) apply sqrt() function to height
  ##-- advance construction of lm
  
  ##-- combination of variable, i.e., lm(weight~height+sex+height:sex,data=self.train)
  # lm(weight~height*sex,data=self.train)

  ##-- interaction term   
  # lm(weight~height:sex,data=self.train)  
  
  ##-- no constant term
  # lm(weight~height:sex -1,data=self.train)  
  
  ##-- apply function 
  # lm(weight~I(sqrt(height)),data=self.train)
  
  ##-- all main term
  # lm(weight~.,data=self.train)
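How a formula expands into design-matrix columns can be inspected with model.matrix(). A sketch on hypothetical toy data (h and s stand in for height and sex):

```r
## formula expansion on toy data
d <- data.frame(h = c(160, 170, 180), s = factor(c("F", "M", "F")))
colnames(model.matrix(~ h * s, d))   ## "(Intercept)" "h" "sM" "h:sM"
colnames(model.matrix(~ h : s, d))   ## "(Intercept)" "h:sF" "h:sM"
```

Note how the interaction-only formula gives the factor full dummy coding because its main effect is absent.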

05: visualize

standard visualization of an lm() fit is built into plot()

  oldpar <- par(no.readonly = TRUE)
  par(mfrow=c(2,2))
  plot(weight.lm)

  par(oldpar)

lm() objects have several built-in summaries

  anova(weight.lm)
## Analysis of Variance Table
## 
## Response: weight
##            Df  Sum Sq Mean Sq F value Pr(>F)    
## height      1   247.8   247.8  1.7252  0.191    
## sex         1 13510.9 13510.9 94.0617 <2e-16 ***
## Residuals 151 21689.5   143.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  confint(weight.lm)
##                  2.5 %      97.5 %
## (Intercept) 93.6005950 152.4299271
## height      -0.5758924  -0.2172091
## sexM        18.1389697  27.4203760
  forecast::accuracy(weight.predict,self.test$weight)  
##                ME    RMSE     MAE      MPE     MAPE
## Test set 2.973163 13.1365 9.31695 1.648525 12.46379

06: Assumption

  • L: Linear relationship using cor or visual inspection
  • I: Independence of errors (at least no AR(1)) using car::durbinWatsonTest(); independence of predictors (no multicollinearity) using ols_vif_tol() (VIF<5.0)
  • N: Normality using ols_test_normality()
  • E: weak Exogeneity (\(x_j\) is not a random variable). In Engineering, this assumption is usually ignored.
  • S: homoScedasticity (single variance) using ols_plot_resid_fit()

The main package for checking these assumptions is olsrr

  require(olsrr)

  # ols_plot_resid_hist(weight.lm)
  # ols_plot_resid_fit(weight.lm) + theme_bw()
  # ols_plot_cooksd_chart(weight.lm)  ## check for major influence points
  # ols_plot_added_variable(weight.lm)
  # ols_plot_response(weight.lm)
  
  olsrr::ols_vif_tol(weight.lm)
##   Variables Tolerance      VIF
## 1    height 0.6836651 1.462705
## 2      sexM 0.6836651 1.462705
  ols_test_normality(weight.lm)
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9308         0.0000 
## Kolmogorov-Smirnov        0.0716         0.4083 
## Cramer-von Mises         12.5463         0.0000 
## Anderson-Darling          1.3591         0.0016 
## -----------------------------------------------
  ols_test_correlation(weight.lm)
## [1] 0.9613993
  ols_test_outlier(weight.lm)
##    studentized_residual unadjusted_p_val bonferroni_p_val
## 12             14.54216     1.899326e-30     2.924962e-28

Question 3

Apply your knowledge of logistic regression to the following dataset


01: import

  • observe pattern
  ### SOLUTION TO QUESTION 3A ### 

  fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/height.csv"

  height <- read.csv(url(fileURL),header=T)
  height$Gender <- factor(height$Gender,levels = c(1,2)
                         ,labels = c("F","M"))  
  xtabs(~Gender+Height,data=height)
##       Height
## Gender 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 79
##      F  2  3  1  2  4  3  6 12  6  5  3  1  2  0  0  0  0  0  0
##      M  0  0  0  0  0  1  0  1  1  4  5  5  8  3 13  5  2  1  1

02: construct

  height.glm <- glm(Gender~Height,data=height,family = "binomial")
  
  
  prob <- data.frame(Gender=factor(rep("M",21)),Height=60:80)
  prob$prob<- predict(height.glm,newdata= prob,type = "response")
  plot(prob$Height,prob$prob,col="black")

  • What is the cut-off threshold?
  • Can we apply other methods?
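For a single-predictor logistic model, the 50% cut-off is where the linear predictor is zero, i.e. height = -intercept/slope. A sketch on simulated data (the group means 64 and 72 are hypothetical, not the workshop data):

```r
## 50% cut-off of a univariate logistic regression
set.seed(1)
ht  <- c(rnorm(50, 64, 3), rnorm(50, 72, 3))   # simulated heights
sex <- factor(rep(c("F", "M"), each = 50))
m   <- glm(sex ~ ht, family = "binomial")
cutoff <- -coef(m)[1] / coef(m)[2]             # roughly midway between 64 and 72
```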

99: EXTRA

  • Consider the previous dataset Davis (self.train) and construct a model to predict the sex of participants from the other factors.
  ### SOLUTION TO QUESTION 3B ### 

  fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/selfReport.csv"

  self <- read.csv(file=fileURL,header = T)
  self$sex   <- factor(self$sex)
  self$gapHt <- self$height - self$repht
  self$gapWt <- self$weight - self$repwt 
  

  noNA <- setdiff(1:200,unique(which(is.na(self),arr.ind = T)[,1]))
  
  set.seed(17)
  testingSet<- sample(noNA,size=round(length(noNA)*.15))  
  self.test <- self[testingSet,]
  self.train<- self[setdiff(noNA,testingSet),]  ### why not "self[-testingSet,]"? 
  
  sex.log <- glm(sex~gapWt+gapHt,data=self.train,family="binomial")
  sex.best<- step(glm(sex~.,data=self.train,family="binomial"))
  
  anova(sex.log)
  anova(sex.best)
  
  self.test$probSex<- predict(sex.log,newdata=self.test,type = "response")
  self.test$bestSex<- predict(sex.best,newdata=self.test,type = "response")

Question 4

Consider the award dataset of different student groups with different Mathematics scores and follow the data mining steps

01: import

  • import and analyse the data
  ### SOLUTION TO QUESTION 4A ### 
  fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/poisson_sim.csv"

  require(data.table)
  award.DT <- as.data.table(read.csv(url(fileURL),header=T))
  award.DT[,prog:= factor(prog, levels=1:3, labels=c("General", "Academic", "Vocational"))]
  award.DT[,num_awards:=factor(num_awards)]
    
  require(ggplot2)
  require(dplyr)
  award.DT %>% ggplot( aes(x=num_awards,y=math,fill=prog) ) + 
    geom_violin() + facet_grid(~prog) 

  award.DT[,.(min=min(math),avg=mean(math),max=max(math),sd=sd(math),.N),by=.(prog,num_awards)] -> temp.DT


  
  xtabs(avg~prog+num_awards,temp.DT)
##             num_awards
## prog                0        1        2        3        4        5        6
##   General    49.44444 52.33333  0.00000  0.00000  0.00000  0.00000  0.00000
##   Academic   52.47917 57.78125 64.36364 60.88889 66.00000 66.00000 69.00000
##   Vocational 45.62500 45.62500 65.50000  0.00000  0.00000  0.00000  0.00000
  xtabs(N~prog+num_awards,temp.DT)
##             num_awards
## prog          0  1  2  3  4  5  6
##   General    36  9  0  0  0  0  0
##   Academic   48 32 11  9  2  2  1
##   Vocational 40  8  2  0  0  0  0

02: glm()

  • use Poisson Regression to predict response
  ## CAUTION: as.numeric() on a factor returns the level index (here count + 1);
  ## as.numeric(as.character(num_awards)) would recover the original counts
  award.DT[,count_awards:=as.numeric(num_awards)]
  award.pois <- glm(count_awards ~ prog + math, family="poisson", data=award.DT)
  summary(award.pois)
## 
## Call:
## glm(formula = count_awards ~ prog + math, family = "poisson", 
##     data = award.DT)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.233114   0.362234  -3.404 0.000664 ***
## progAcademic    0.315241   0.159874   1.972 0.048631 *  
## progVocational  0.127720   0.187238   0.682 0.495161    
## math            0.027879   0.006517   4.278 1.89e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 105.241  on 199  degrees of freedom
## Residual deviance:  68.215  on 196  degrees of freedom
## AIC: 538.66
## 
## Number of Fisher Scoring iterations: 4
  head(predict(award.pois,type="response"))
##         1         2         3         4         5         6 
## 1.0383551 0.9138565 1.1289346 1.0677106 1.0098068 0.9396922
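Poisson regression models log(E[y]) as a linear function of the predictors, so type="response" is just exp() of the link-scale prediction. A sketch on simulated data (the coefficients are arbitrary):

```r
## response = exp(link) for a Poisson GLM
set.seed(1)
d   <- data.frame(x = runif(100))
d$y <- rpois(100, exp(0.5 + d$x))            # simulated counts
m   <- glm(y ~ x, family = "poisson", data = d)
all.equal(as.numeric(predict(m, type = "response")),
          exp(as.numeric(predict(m, type = "link"))))   ## TRUE
```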
  award.grid <- expand.grid( math=seq(32,70,2)
                             ,prog=factor(1:3, labels=c("General", "Academic", "Vocational"))
                             ,num_awards=NA)
  
  award.grid$pois_awards <- predict(award.pois,newdata = award.grid,type="response")

  xtabs(pois_awards~prog+math,award.grid)
##             math
## prog                32        34        36        38        40        42
##   General    0.7110651 0.7518386 0.7949501 0.8405337 0.8887311 0.9396922
##   Academic   0.9745785 1.0304622 1.0895504 1.1520268 1.2180857 1.2879324
##   Vocational 0.8079366 0.8542649 0.9032496 0.9550433 1.0098068 1.0677106
##             math
## prog                44        46        48        50        52        54
##   General    0.9935755 1.0505485 1.1107885 1.1744827 1.2418292 1.3130374
##   Academic   1.3617843 1.4398710 1.5224352 1.6097338 1.7020383 1.7996356
##   Vocational 1.1289346 1.1936693 1.2621160 1.3344876 1.4110090 1.4919183
##             math
## prog                56        58        60        62        64        66
##   General    1.3883289 1.4679376 1.5521113 1.6411115 1.7352152 1.8347149
##   Academic   1.9028292 2.0119402 2.1273077 2.2492906 2.3782681 2.5146414
##   Vocational 1.5774670 1.6679212 1.7635622 1.8646874 1.9716112 2.0846662
##             math
## prog                68        70
##   General    1.9399201 2.0511579
##   Academic   2.6588345 2.8112959
##   Vocational 2.2042039 2.3305961
  ##-- 
  # plot(jitter(award.DT$math),jitter(award.DT$count_awards),col=award.DT$prog,pch=16,cex=0.5,xlab="Math",ylab="# awards")   

What is your conclusion?

03: compare

  • compare the Poisson regression with a simple linear model and calculate the AIC of each model
  ### SOLUTION TO QUESTION 4C ### 

  award.lm   <- lm(count_awards ~ prog + math, data=award.DT)
  summary(award.lm)
## 
## Call:
## lm(formula = count_awards ~ prog + math, data = award.DT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7311 -0.5618 -0.1537  0.2851  4.4126 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.195504   0.411417  -2.906  0.00408 ** 
## progAcademic    0.478613   0.168956   2.833  0.00510 ** 
## progVocational  0.212506   0.187433   1.134  0.25828    
## math            0.047889   0.007773   6.161 4.03e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9019 on 196 degrees of freedom
## Multiple R-squared:  0.2773, Adjusted R-squared:  0.2662 
## F-statistic: 25.07 on 3 and 196 DF,  p-value: 9.016e-14
  award.grid$lm_awards <- predict(award.lm, newdata=award.grid,type="response")
  xtabs(lm_awards~prog+math,award.grid)  
##             math
## prog                32        34        36        38        40        42
##   General    0.3369374 0.4327150 0.5284926 0.6242702 0.7200478 0.8158254
##   Academic   0.8155504 0.9113280 1.0071056 1.1028832 1.1986608 1.2944384
##   Vocational 0.5494435 0.6452211 0.7409987 0.8367763 0.9325539 1.0283315
##             math
## prog                44        46        48        50        52        54
##   General    0.9116030 1.0073806 1.1031582 1.1989358 1.2947134 1.3904910
##   Academic   1.3902160 1.4859936 1.5817712 1.6775488 1.7733264 1.8691039
##   Vocational 1.1241091 1.2198867 1.3156643 1.4114419 1.5072195 1.6029971
##             math
## prog                56        58        60        62        64        66
##   General    1.4862686 1.5820462 1.6778238 1.7736014 1.8693790 1.9651566
##   Academic   1.9648815 2.0606591 2.1564367 2.2522143 2.3479919 2.4437695
##   Vocational 1.6987747 1.7945523 1.8903299 1.9861075 2.0818851 2.1776627
##             math
## prog                68        70
##   General    2.0609342 2.1567118
##   Academic   2.5395471 2.6353247
##   Vocational 2.2734403 2.3692179
  xtabs(pois_awards~prog+math,award.grid)  
##             math
## prog                32        34        36        38        40        42
##   General    0.7110651 0.7518386 0.7949501 0.8405337 0.8887311 0.9396922
##   Academic   0.9745785 1.0304622 1.0895504 1.1520268 1.2180857 1.2879324
##   Vocational 0.8079366 0.8542649 0.9032496 0.9550433 1.0098068 1.0677106
##             math
## prog                44        46        48        50        52        54
##   General    0.9935755 1.0505485 1.1107885 1.1744827 1.2418292 1.3130374
##   Academic   1.3617843 1.4398710 1.5224352 1.6097338 1.7020383 1.7996356
##   Vocational 1.1289346 1.1936693 1.2621160 1.3344876 1.4110090 1.4919183
##             math
## prog                56        58        60        62        64        66
##   General    1.3883289 1.4679376 1.5521113 1.6411115 1.7352152 1.8347149
##   Academic   1.9028292 2.0119402 2.1273077 2.2492906 2.3782681 2.5146414
##   Vocational 1.5774670 1.6679212 1.7635622 1.8646874 1.9716112 2.0846662
##             math
## prog                68        70
##   General    1.9399201 2.0511579
##   Academic   2.6588345 2.8112959
##   Vocational 2.2042039 2.3305961
  AIC(award.pois)
## [1] 538.6557
  AIC(award.lm)
## [1] 532.2495
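AIC can be reproduced by hand as -2 log-likelihood + 2 (number of estimated parameters). A sketch on the built-in cars data (for lm(), the parameter count includes sigma):

```r
## AIC = -2*logLik + 2*df
m  <- lm(dist ~ speed, data = cars)
ll <- logLik(m)
c(by_hand = -2 * as.numeric(ll) + 2 * attr(ll, "df"),
  AIC     = AIC(m))   # the two numbers agree
```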

Question 5

Consider the dataset Professor Salaries and follow these instructions.


01: import

import data from an MS Excel file with package openxlsx (or the older xlsx package)

  ### SOLUTION TO QUESTION 5A ### 
 
  require(openxlsx)
  require(dplyr)     ## glimpse() below comes from dplyr
  fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/Salaries.xlsx" 

  ##-- import sheet by name
  prof <- read.xlsx(xlsxFile=fileURL,sheet = "Salaries") 
  
  ##-- import sheet by index
  prof <- read.xlsx(xlsxFile=fileURL, sheet=1)   
  glimpse(prof) 
    
  ##-- import specific row/col of sheet index
  wb       <- loadWorkbook("Salaries.xlsx") ##-- where is your file?  
  sheets   <- getSheets(wb)
  sheet    <- sheets[[1]] ##-- access to first sheet
  nLastRow <- sheet$getLastRowNum()
  prof     <- as.data.frame(readRows(sheet,startRow=2,endRow=nLastRow,startColumn = 1)) 
  colnames(prof) <- readRows(sheet,startRow=1,endRow=1,startColumn = 1)                          
  glimpse(prof)  ##-- this leads to ERROR; can you explain WHY? 
  
  
  require(dplyr)
  require(lattice)
  ##-- convert data into appropriate type
  prof <- mutate(prof,rank=factor(rank,labels=c("AsstProf","AssocProf","Prof")),
                 discipline=factor(discipline),
                 yrs.since.phd=as.numeric(yrs.since.phd) ,
                 yrs.service=as.numeric(yrs.service) ,
                 sex=factor(sex,labels=c("M","F")),
                 salary=as.numeric(salary) ) 
  
  ##-- play with data
  alys01 <- filter(prof,discipline=="B") %>% group_by(rank,sex) %>% summarise(avg=mean(salary),n=length(salary))
  xtabs(avg~sex+rank,data=alys01)  
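The group_by/summarise pipeline above can also be written with base R's aggregate(). A sketch on a hypothetical toy data frame with the same column names:

```r
## group means with base R instead of group_by/summarise
d <- data.frame(rank   = c("Prof", "Prof", "AsstProf"),
                sex    = c("F", "M", "F"),
                salary = c(120000, 130000, 80000))
aggregate(salary ~ sex + rank, data = d, FUN = mean)
```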

02: observe

observe the pattern of **salary** and its relationship with each column

  ### SOLUTION TO QUESTION 5B ### 

  glimpse(prof) ## data contain quantitative (number) // qualitative (description)
  
  ##-- analysis of qualitative data (description)
  bwplot(salary~rank|discipline:sex,data=prof) ##-- is there any diff in terms of rank, discipline, and sex 
  anova(lm(salary~rank*sex*discipline,data=prof))

  ##-- analysis of quantitative (number) data
  cor(prof[,c(3,4,6)])
  anova(lm(salary~yrs.since.phd+yrs.service,data=prof))
  pairs(prof[,c(3,4,6)],col=prof$discipline)
  
  ##-- analysis of both  
  xyplot(salary~yrs.since.phd|(discipline:rank),data=prof)
  xyplot(salary~yrs.service|(discipline:rank),data=prof)  
  prof.lm  <- step( lm(salary~.,data=prof)) ## find the best model
  anova(prof.lm)

03: hist()

visualize the histogram of salary based on the main factors

  ### SOLUTION TO QUESTION 5C ### 

  prof.hist <- hist(prof$salary,breaks=20,plot = F)
  prof.B    <- filter(prof,discipline=="B")
  prof.Asst <- filter(prof,rank=="AsstProf")
  prof.nFull<- filter(prof,rank!="Prof")
  
  hist(prof$salary,breaks=prof.hist$breaks,col="blue")
  hist(prof.B$salary,breaks=prof.hist$breaks,col="red",add=T)
  rug(prof$salary,col="blue")
  rug(prof.B$salary,col="red")  
  
  hist(prof$salary,breaks=prof.hist$breaks,col="blue")
  hist(prof.nFull$salary,breaks=prof.hist$breaks,col="red",add=T)
  hist(prof.Asst$salary,breaks=prof.hist$breaks,col="orange",add=T)  
  rug(prof$salary,col="blue")
  rug(prof.nFull$salary,col="red")
  rug(prof.Asst$salary,col="orange")      

04: OLR

predict salary with an ordinary linear regression model

  ### SOLUTION TO QUESTION 5D ### 

  prof.lm <- lm(salary~.,data=prof)
  summary(prof.lm)
  confint(prof.lm)

05: DT

use a decision tree (ctree in the party package or rpart in the rpart package) to analyse the data

  ### SOLUTION TO QUESTION 5E ### 

  require(party)
  prof.ctree <- ctree(salary~.,data=prof)
  plot(prof.ctree)
  summary(prof.ctree)


Copyright 2019   Oran Kittithreerapronchai.   All Rights Reserved.   Last modified: 2023-46-07,