Read the instructions carefully and think about how to develop R code to answer each question. Consider the AirPassengers dataset and explore the following data mining procedures for time series data.
Time series in R are encoded as the ‘ts’ class (an extension of ‘vector’) and supported by many libraries, e.g., autoplot() and autolayer(). Basic information about data in the ‘ts’ class can be derived using tsp(), which reveals the start, end, and frequency of the time series.
tsp(AirPassengers)
## [1] 1949.000 1960.917 12.000
The command reveals the start(), end(), and frequency() of the time series. Notice that ‘ts’ time points are converted into fractional numbers, with an interval of deltat().
start(AirPassengers)
## [1] 1949 1
end(AirPassengers)
## [1] 1960 12
frequency(AirPassengers)
## [1] 12
deltat(AirPassengers)
## [1] 0.08333333
NOTE: A full description of the time index of a ‘ts’ object can be obtained with time() and cycle():
head(time(AirPassengers))
## [1] 1949.000 1949.083 1949.167 1949.250 1949.333 1949.417
head(cycle(AirPassengers))
## [1] 1 2 3 4 5 6
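The same attributes can be set when constructing a ‘ts’ object by hand; a minimal sketch with hypothetical quarterly values:

```r
## construct a quarterly series starting in 2020 Q1 (hypothetical data)
x <- ts(1:8, start = c(2020, 1), frequency = 4)
tsp(x)      ## 2020.00 2021.75 4.00
deltat(x)   ## 0.25 (= 1/frequency)
```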
A ‘ts’ data slice can be taken with window():
air.train <- window(AirPassengers,start=1949.0,end=1958.917) ## AirPassengers starts in 1949
air.test <- window(AirPassengers,start=1959.0,end=1959.917)
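Instead of fractional years, window() also accepts the c(year, period) form, which avoids computing fractions by hand; a minimal sketch:

```r
## slice AirPassengers using c(year, period) indices
air.train2 <- window(AirPassengers, start = c(1949, 1), end = c(1958, 12))
air.test2  <- window(AirPassengers, start = c(1959, 1), end = c(1959, 12))
length(air.train2)  ## 120 monthly observations (1949-1958)
length(air.test2)   ## 12 monthly observations (1959)
```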
A ‘ts’ plot can be drawn with forecast::tsdisplay(); the type of the lower-right panel is selected with the plot.type argument. Some users ignore the panel type and simply use forecast::autoplot().
forecast::tsdisplay(air.train)
forecast::tsdisplay(air.train, plot.type = "histogram")
forecast::autoplot(air.train)
Decomposition can be done using decompose() with type “additive” or “multiplicative”. The trend-cycle, seasonally adjusted series, and remainder can be extracted with forecast::trendcycle(), forecast::seasadj(), and forecast::remainder(), respectively.
forecast::autoplot(decompose(AirPassengers,type="additive"))
air.decom <- decompose(air.train,type="multiplicative")
Details of the decomposition can be extracted with forecast::trendcycle(), forecast::seasadj(), and forecast::remainder():
head(forecast::trendcycle(air.decom),36)
## Jan Feb Mar Apr May Jun Jul Aug
## 1949 NA NA NA NA NA NA 126.7917 127.2500
## 1950 131.2500 133.0833 134.9167 136.4167 137.4167 138.7500 140.9167 143.1667
## 1951 157.1250 159.5417 161.8333 164.1250 166.6667 169.0833 171.2500 173.5833
## Sep Oct Nov Dec
## 1949 127.9583 128.5833 129.0000 129.7500
## 1950 145.7083 148.4167 151.5417 154.7083
## 1951 175.4583 176.8333 178.0417 180.1667
#forecast::seasadj(air.decom)
#forecast::remainder(air.decom)
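As a sanity check on the decomposition above, a multiplicative decomposition satisfies x = trend × seasonal × random, so the components should multiply back to the original series (up to floating point); a minimal sketch:

```r
## verify that the multiplicative components reconstruct the original series
air.decom <- decompose(AirPassengers, type = "multiplicative")
recon <- air.decom$trend * air.decom$seasonal * air.decom$random
max(abs(recon - AirPassengers), na.rm = TRUE)  ## ~0 (NA at the trend edges)
```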
plot() still works, but it is not popular; it is equivalent to stats::ts.plot(). forecast::autoplot() and forecast::autolayer() are more popular.
plot(AirPassengers)
ts.plot(air.train)
The forecast package can identify outliers using Friedman’s super smoother via forecast::tsoutliers().
### SOLUTION TO QUESTION 1B ###
require(forecast)
tsoutliers(air.train)
## $index
## [1] 79 91 103 104 115 116
##
## $replacements
## [1] 334.0010 392.7796 435.2084 441.3101 443.7989 445.3073
The forecast package embeds many exponential smoothing functions:
forecast::ses() — simple exponential smoothing model
forecast::holt() — Holt’s linear trend model
forecast::hw() — Holt-Winters’ model
They are all special cases of ets(), a general exponential smoothing state space model (ETS).
air.ses <- ses(air.train,h=12)
air.holt <- holt(air.train,h=12)
air.hw <- hw(air.train,h=12)
autoplot(air.ses)
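The ses()/holt()/hw() fits above are special cases of ets(), which selects an exponential smoothing state space model automatically by information criterion; a sketch (the selected model string depends on the data, so none is asserted here):

```r
library(forecast)
## let ets() pick the exponential smoothing model automatically
air.ets <- ets(window(AirPassengers, end = c(1958, 12)))
air.ets$method            ## selected model label, data-dependent
fc <- forecast(air.ets, h = 12)
length(fc$mean)           ## 12 point forecasts
```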
The forecast package also provides accuracy computation:
##-- accuracy
accuracy(air.test,air.hw$mean)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -61.87019 72.27274 61.87019 -16.24225 16.24225 0.6658303 2.835781
Before fitting an ARIMA model, you should check the ACF and PACF with forecast::Acf() and forecast::Pacf(), respectively.
forecast::Acf(air.train)
forecast::Acf(diff(air.train))
forecast::Pacf(air.train)
forecast::Pacf(diff(air.train))
air.arima <- Arima(air.train,c(0,1,1),seasonal = list(order=c(0,1,1),period=12))
##-- auto optimize
air.arima <- auto.arima(air.train)
autoplot(forecast(air.arima,h=12))
checkresiduals(air.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,0)(0,1,0)[12]
## Q* = 32.841, df = 23, p-value = 0.08388
##
## Model df: 1. Total lags used: 24
forecast::Arima() masks stats::arima(). This is the reason why we cannot use the argument ‘h’ in Arima(), and why we must call forecast() on the fitted model before plotting with ts.plot() or autoplot().
##-- possible forecasting model
arima(air.train,c(0,1,1),seasonal = list(order=c(0,1,1),period=12))
##
## Call:
## arima(x = air.train, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1),
## period = 12))
##
## Coefficients:
## ma1 sma1
## -0.2278 -0.0459
## s.e. 0.0993 0.0933
##
## sigma^2 estimated as 102.8: log likelihood = -399.7, aic = 805.41
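Since Arima() has no ‘h’ argument, the fitted model is passed to forecast() to produce predictions; a minimal sketch:

```r
library(forecast)
## fit a seasonal ARIMA, then wrap the fit in forecast() for h-step predictions
fit <- Arima(window(AirPassengers, end = c(1958, 12)),
             order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fc <- forecast(fit, h = 12)
length(fc$mean)   ## 12 monthly point forecasts
```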
plot(air.arima)
plot(naive(air.train,h=12))
plot(meanf(air.train,h=12))
plot(rwf(air.train,h=12))
plot(thetaf(air.train,h=12))
plot(stlf(air.train,h=12))
Many forecasts can be visually combined using autolayer() and autoplot():
require(ggplot2)
air.naive <- naive(air.train,h=12)
hSize <- 12
autoplot(air.train) +
autolayer(forecast(air.naive,h=hSize),series="naive",PI=F) +
autolayer(forecast(air.holt ,h=hSize),series="Holt",PI=F) +
autolayer(forecast(air.arima,h=hSize),series="ARIMA",PI=F) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(limits = c(100,550),breaks=seq(100,500,50), expand = c(0, 0)) +
xlab("Year") + ylab("Passengers") + theme_bw()
Here are the steps we take, using the fgui and tcltk packages, to build a stand-alone application (fGUI.R).
## -------------------------------------------------------------------------
## export airPassenger Data
## -------------------------------------------------------------------------
write.csv(AirPassengers,file = "air.csv")
The R code in fGUI.R consists of four parts:
## set path for r library and apply
cmdArgs = commandArgs(trailingOnly = TRUE)
libPath <- cmdArgs[1]
.libPaths(c(libPath))
## load library (require .libPaths)
require(tcltk)
require(fgui) ## simple GUI wrapper package built on tcltk
require(forecast)
## create sub-function
expoSeries <- function(period=12,opt="additive",alpha=0.5,
beta=0.05,csvFilename=NULL,predict=12) {
data <- read.csv( guiGetValue("csvFilename") ) [[2]]
data.ts <- ts(data,start=1949,frequency=period)
data.ht <- hw(data.ts,alpha=alpha,beta=beta,seasonal=opt,h=predict) ## use the slider values, not hard-coded constants
windows()
plot(data.ht)
}
## run main GUI
gui(expoSeries,
title="GUI: Exponential Smoothing Series",
argSlider=list(alpha=c(0,1,0.1),
beta=c(0,.1,0.001)),
argFilename = list(csvFilename=NULL),
argOption = list(opt=c("additive","multiplicative"))
)
The batch file below calls Rscript, assumed to be on the PATH (e.g., in R\R-3.6.3\bin\x64):
@echo off
:: SET VARIABLE ::
Set lib="C:/Users/Oran/Documents/R/win-library/3.6"
:: RUN R CMD ::
Rscript testScript.R %lib%
pause
Alternatively, you can use the ‘shinyapps.io’ service with Shiny using the following source code src
Consider the self-reported weight/height and the actual measurements of 200 participants.
### SOLUTION TO QUESTION 2A ###
fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/selfReport.csv"
self <- read.csv(url(fileURL),header = T)
head(self) ## View(self)
## sex weight height repwt repht
## 1 M 77 182 77 180
## 2 F 58 161 51 159
## 3 F 53 161 54 158
## 4 M 68 177 70 175
## 5 F 59 157 59 155
## 6 M 76 170 76 165
naSet<- as.data.frame(which(is.na(self),arr.ind = T))
naSmy<- xtabs(~col+row,data=naSet)
noRepWt<- as.numeric(names(naSmy[1,naSmy[1,]==0]))
noRepHt<- as.numeric(names(naSmy[2,naSmy[2,]==0]))
noBoth <- as.numeric(names(naSmy[1,naSmy[1,]==1 & naSmy[2,]==1]))
noNA <- setdiff(1:nrow(self),c(noRepWt,noRepHt,noBoth))
self.noNA <- self[noNA,]
complete.cases() can be applied to eliminate all rows containing ‘NA’:
self.noNA <- self[complete.cases(self),]
### SOLUTION TO QUESTION 2B ###
set.seed(17)
testingSet<- sample(noNA,size=round(length(noNA)*.15))
self.test <- self[testingSet,]
self.train<- self[setdiff(noNA,testingSet),] ### why not "self[-testingSet,]"?
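A toy illustration (hypothetical data frame) of why negative indexing alone is not enough: self[-testingSet,] removes only the sampled rows, so rows containing NA would remain in the training set.

```r
## hypothetical 4-row frame with one NA row
df   <- data.frame(x = c(1, NA, 3, 4))
ok   <- which(complete.cases(df))            ## rows without NA: 1 3 4
test <- ok[1]                                ## pretend row 1 is the test set
nrow(df[setdiff(ok, test), , drop = FALSE])  ## 2 -- NA row excluded
nrow(df[-test, , drop = FALSE])              ## 3 -- NA row kept
```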
### SOLUTION TO QUESTION 2C ###
attach(self.train)
self.train$gapHt <- height - repht
self.train$gapWt <- weight - repwt
detach(self.train)
Do these commands work?
xtabs(~sex+gapHt,data=self.train) ## do you notice any error? how to correct it?
## gapHt
## sex -106 -6 -3 -2 -1 0 1 2 3 4 5 6 7 8
## F 1 1 0 2 1 8 8 21 25 14 2 2 0 0
## M 0 0 1 3 5 14 7 8 20 6 2 1 1 1
xtabs(~sex+gapWt,data=self.train) ## do you notice any error? how to correct it?
## gapWt
## sex -9 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 7 110
## F 0 0 2 0 0 6 12 29 15 7 4 6 2 1 1
## M 1 1 3 4 1 10 15 10 10 7 5 1 1 0 0
head(self.train) ## fix(self.train)
## sex weight height repwt repht gapHt gapWt
## 1 M 77 182 77 180 2 0
## 2 F 58 161 51 159 2 7
## 3 F 53 161 54 158 3 -1
## 5 F 59 157 59 155 2 0
## 7 M 76 167 77 165 2 -1
## 8 M 69 186 73 180 6 -4
temp <- xtabs(~gapHt+gapWt,data=self.train)
temp.mat <- as.matrix(temp)
filled.contour(z=temp.mat,
x=as.numeric(rownames(temp.mat)),
y=as.numeric(colnames(temp.mat)),
nlevels=10)
For the training set, build a linear regression model to predict actual weight from actual height.
### SOLUTION TO QUESTION 2D ###
weight.lm <- lm(weight~height+sex,data=self.train)
weight.predict <- predict(weight.lm,newdata=self.test)
#str(weight.lm)
coefficients(weight.lm)
## (Intercept) height sexM
## 123.0152610 -0.3965508 22.7796728
## residuals(weight.lm)
## fitted.values(weight.lm)
lm() can be customized to what a user wants. Here are some interesting variations:
weight~. : all variables (no interaction)
weight~.-1 : all variables (no interaction) and no intercept
weight~height*sex : main effects plus the interaction between height and sex
weight~height:sex : only the interaction between height and sex
weight~I(sqrt(height)) : apply the sqrt() function to height
##-- advance construction of lm
##-- combination of variable, i.e., lm(weight~height+sex+height:sex,data=self.train)
# lm(weight~height*sex,data=self.train)
##-- interaction term
# lm(weight~height:sex,data=self.train)
##-- no constant term
# lm(weight~height:sex -1,data=self.train)
##-- apply function
# lm(weight~I(sqrt(height)),data=self.train)
##-- all main term
# lm(weight~.,data=self.train)
The standard visualization of an lm() fit is built into plot():
oldpar <- par(no.readonly = TRUE) ## save only settable parameters to avoid warnings on restore
par(mfrow=c(2,2))
plot(weight.lm)
par(oldpar)
lm() objects have several built-in summaries:
anova(weight.lm)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## height 1 247.8 247.8 1.7252 0.191
## sex 1 13510.9 13510.9 94.0617 <2e-16 ***
## Residuals 151 21689.5 143.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(weight.lm)
## 2.5 % 97.5 %
## (Intercept) 93.6005950 152.4299271
## height -0.5758924 -0.2172091
## sexM 18.1389697 27.4203760
forecast::accuracy(weight.predict,self.test$weight)
## ME RMSE MAE MPE MAPE
## Test set 2.973163 13.1365 9.31695 1.648525 12.46379
Check the regression assumptions: linearity with cor() or visual inspection; independence of residuals with car::durbinWatsonTest(); multicollinearity with ols_vif_tol() (VIF < 5.0); normality with ols_test_normality(); and homoscedasticity with ols_plot_resid_fit(). The main package to verify these assumptions is olsrr.
require(olsrr)
# ols_plot_resid_hist(weight.lm)
# ols_plot_resid_fit(weight.lm) + theme_bw()
# ols_plot_cooksd_chart(weight.lm) ## check for major influence points
# ols_plot_added_variable(weight.lm)
# ols_plot_response(weight.lm)
olsrr::ols_vif_tol(weight.lm)
## Variables Tolerance VIF
## 1 height 0.6836651 1.462705
## 2 sexM 0.6836651 1.462705
ols_test_normality(weight.lm)
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9308 0.0000
## Kolmogorov-Smirnov 0.0716 0.4083
## Cramer-von Mises 12.5463 0.0000
## Anderson-Darling 1.3591 0.0016
## -----------------------------------------------
ols_test_correlation(weight.lm)
## [1] 0.9613993
ols_test_outlier(weight.lm)
## studentized_residual unadjusted_p_val bonferroni_p_val
## 12 14.54216 1.899326e-30 2.924962e-28
Apply your knowledge of logistic regression to the following dataset.
### SOLUTION TO QUESTION 3A ###
fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/height.csv"
height <- read.csv(url(fileURL),header=T)
height$Gender <- factor(height$Gender,levels = c(1,2)
,labels = c("F","M"))
xtabs(~Gender+Height,data=height)
## Height
## Gender 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 79
## F 2 3 1 2 4 3 6 12 6 5 3 1 2 0 0 0 0 0 0
## M 0 0 0 0 0 1 0 1 1 4 5 5 8 3 13 5 2 1 1
height.glm <- glm(Gender~Height,data=height,family = "binomial")
prob <- data.frame(Gender=factor(rep("M",21)),Height=60:80)
prob$prob<- predict(height.glm,newdata= prob,type = "response")
plot(prob$Height,prob$prob,col="black")
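To turn predicted probabilities into class labels, a common sketch is a 0.5 cutoff; shown here on synthetic data (hypothetical means and spreads) so the block is self-contained:

```r
## hypothetical sample: heights drawn from two normal distributions
set.seed(1)
h <- data.frame(Gender = factor(rep(c("F", "M"), each = 30)),
                Height = c(rnorm(30, 64, 2.5), rnorm(30, 70, 2.5)))
h.glm <- glm(Gender ~ Height, data = h, family = "binomial")
## type = "response" gives P(second factor level), here P("M")
pred <- ifelse(predict(h.glm, type = "response") > 0.5, "M", "F")
table(actual = h$Gender, predicted = pred)   ## confusion matrix
```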
Using the Davis dataset or self.train, construct a model to predict the sex of participants using the other factors.
### SOLUTION TO QUESTION 3B ###
fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/selfReport.csv"
self <- read.csv(file=fileURL,header = T)
self$sex <- factor(self$sex)
self$gapHt <- self$height - self$repht
self$gapWt <- self$weight - self$repwt
noNA <- setdiff(1:200,unique(which(is.na(self),arr.ind = T)[,1]))
set.seed(17)
testingSet<- sample(noNA,size=round(length(noNA)*.15))
self.test <- self[testingSet,]
self.train<- self[setdiff(noNA,testingSet),] ### why not "self[-testingSet,]"?
sex.log <- glm(sex~gapWt+gapHt,data=self.train,family="binomial")
sex.best<- step(glm(sex~.,data=self.train,family="binomial"))
anova(sex.log)
anova(sex.best)
self.test$probSex<- predict(sex.log,newdata=self.test,type = "response")
self.test$bestSex<- predict(sex.best,newdata=self.test,type = "response")
Consider the award dataset of different student groups with different Mathematics scores and follow the data mining steps.
### SOLUTION TO QUESTION 4A ###
fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/poisson_sim.csv"
require(data.table)
award.DT <- as.data.table(read.csv(url(fileURL),header=T))
award.DT[,prog:= factor(prog, levels=1:3, labels=c("General", "Academic", "Vocational"))]
award.DT[,num_awards:=factor(num_awards)]
require(ggplot2)
require(dplyr)
award.DT %>% ggplot( aes(x=num_awards,y=math,fill=prog) ) +
geom_violin() + facet_grid(~prog)
award.DT[,.(min=min(math),avg=mean(math),max=max(math),sd=sd(math),.N),by=.(prog,num_awards)] -> temp.DT
xtabs(avg~prog+num_awards,temp.DT)
## num_awards
## prog 0 1 2 3 4 5 6
## General 49.44444 52.33333 0.00000 0.00000 0.00000 0.00000 0.00000
## Academic 52.47917 57.78125 64.36364 60.88889 66.00000 66.00000 69.00000
## Vocational 45.62500 45.62500 65.50000 0.00000 0.00000 0.00000 0.00000
xtabs(N~prog+num_awards,temp.DT)
## num_awards
## prog 0 1 2 3 4 5 6
## General 36 9 0 0 0 0 0
## Academic 48 32 11 9 2 2 1
## Vocational 40 8 2 0 0 0 0
award.DT[,count_awards:=as.numeric(num_awards)] ## CAUTION: as.numeric() on a factor returns level indices (1..7), not the original counts (0..6); use as.numeric(as.character(num_awards)) for the true counts
award.pois <- glm(count_awards ~ prog + math, family="poisson", data=award.DT)
summary(award.pois)
##
## Call:
## glm(formula = count_awards ~ prog + math, family = "poisson",
## data = award.DT)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.233114 0.362234 -3.404 0.000664 ***
## progAcademic 0.315241 0.159874 1.972 0.048631 *
## progVocational 0.127720 0.187238 0.682 0.495161
## math 0.027879 0.006517 4.278 1.89e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 105.241 on 199 degrees of freedom
## Residual deviance: 68.215 on 196 degrees of freedom
## AIC: 538.66
##
## Number of Fisher Scoring iterations: 4
head(predict(award.pois,type="response"))
## 1 2 3 4 5 6
## 1.0383551 0.9138565 1.1289346 1.0677106 1.0098068 0.9396922
award.grid <- expand.grid( math=seq(32,70,2)
,prog=factor(1:3, labels=c("General", "Academic", "Vocational"))
,num_awards=NA)
award.grid$pois_awards <- predict(award.pois,newdata = award.grid,type="response")
xtabs(pois_awards~prog+math,award.grid)
## math
## prog 32 34 36 38 40 42
## General 0.7110651 0.7518386 0.7949501 0.8405337 0.8887311 0.9396922
## Academic 0.9745785 1.0304622 1.0895504 1.1520268 1.2180857 1.2879324
## Vocational 0.8079366 0.8542649 0.9032496 0.9550433 1.0098068 1.0677106
## math
## prog 44 46 48 50 52 54
## General 0.9935755 1.0505485 1.1107885 1.1744827 1.2418292 1.3130374
## Academic 1.3617843 1.4398710 1.5224352 1.6097338 1.7020383 1.7996356
## Vocational 1.1289346 1.1936693 1.2621160 1.3344876 1.4110090 1.4919183
## math
## prog 56 58 60 62 64 66
## General 1.3883289 1.4679376 1.5521113 1.6411115 1.7352152 1.8347149
## Academic 1.9028292 2.0119402 2.1273077 2.2492906 2.3782681 2.5146414
## Vocational 1.5774670 1.6679212 1.7635622 1.8646874 1.9716112 2.0846662
## math
## prog 68 70
## General 1.9399201 2.0511579
## Academic 2.6588345 2.8112959
## Vocational 2.2042039 2.3305961
##--
# plot(jitter(award.DT$math),jitter(award.DT$count_awards),col=award.DT$prog,pch=16,cex=0.5,xlab="Math",ylab="# awards")
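For interpreting Poisson coefficients, exponentiating them gives multiplicative rate ratios; a sketch on synthetic data (the intercept 0.2 and slope 0.5 are hypothetical), kept self-contained:

```r
## synthetic Poisson data with known coefficients (hypothetical values)
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rpois(200, lambda = exp(0.2 + 0.5 * d$x))
d.pois <- glm(y ~ x, data = d, family = "poisson")
## exp(coef) = expected-count multiplier per unit increase in x
exp(coef(d.pois))
```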
What is your conclusion?
### SOLUTION TO QUESTION 4C ###
award.lm <- lm(count_awards ~ prog + math, data=award.DT)
summary(award.lm)
##
## Call:
## lm(formula = count_awards ~ prog + math, data = award.DT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7311 -0.5618 -0.1537 0.2851 4.4126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.195504 0.411417 -2.906 0.00408 **
## progAcademic 0.478613 0.168956 2.833 0.00510 **
## progVocational 0.212506 0.187433 1.134 0.25828
## math 0.047889 0.007773 6.161 4.03e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9019 on 196 degrees of freedom
## Multiple R-squared: 0.2773, Adjusted R-squared: 0.2662
## F-statistic: 25.07 on 3 and 196 DF, p-value: 9.016e-14
award.grid$lm_awards <- predict(award.lm, newdata=award.grid,type="response")
xtabs(lm_awards~prog+math,award.grid)
## math
## prog 32 34 36 38 40 42
## General 0.3369374 0.4327150 0.5284926 0.6242702 0.7200478 0.8158254
## Academic 0.8155504 0.9113280 1.0071056 1.1028832 1.1986608 1.2944384
## Vocational 0.5494435 0.6452211 0.7409987 0.8367763 0.9325539 1.0283315
## math
## prog 44 46 48 50 52 54
## General 0.9116030 1.0073806 1.1031582 1.1989358 1.2947134 1.3904910
## Academic 1.3902160 1.4859936 1.5817712 1.6775488 1.7733264 1.8691039
## Vocational 1.1241091 1.2198867 1.3156643 1.4114419 1.5072195 1.6029971
## math
## prog 56 58 60 62 64 66
## General 1.4862686 1.5820462 1.6778238 1.7736014 1.8693790 1.9651566
## Academic 1.9648815 2.0606591 2.1564367 2.2522143 2.3479919 2.4437695
## Vocational 1.6987747 1.7945523 1.8903299 1.9861075 2.0818851 2.1776627
## math
## prog 68 70
## General 2.0609342 2.1567118
## Academic 2.5395471 2.6353247
## Vocational 2.2734403 2.3692179
xtabs(pois_awards~prog+math,award.grid)
## math
## prog 32 34 36 38 40 42
## General 0.7110651 0.7518386 0.7949501 0.8405337 0.8887311 0.9396922
## Academic 0.9745785 1.0304622 1.0895504 1.1520268 1.2180857 1.2879324
## Vocational 0.8079366 0.8542649 0.9032496 0.9550433 1.0098068 1.0677106
## math
## prog 44 46 48 50 52 54
## General 0.9935755 1.0505485 1.1107885 1.1744827 1.2418292 1.3130374
## Academic 1.3617843 1.4398710 1.5224352 1.6097338 1.7020383 1.7996356
## Vocational 1.1289346 1.1936693 1.2621160 1.3344876 1.4110090 1.4919183
## math
## prog 56 58 60 62 64 66
## General 1.3883289 1.4679376 1.5521113 1.6411115 1.7352152 1.8347149
## Academic 1.9028292 2.0119402 2.1273077 2.2492906 2.3782681 2.5146414
## Vocational 1.5774670 1.6679212 1.7635622 1.8646874 1.9716112 2.0846662
## math
## prog 68 70
## General 1.9399201 2.0511579
## Academic 2.6588345 2.8112959
## Vocational 2.2042039 2.3305961
AIC(award.pois)
## [1] 538.6557
AIC(award.lm)
## [1] 532.2495
Consider the Professor Salaries dataset (provided as an xlsx file) and follow these instructions.
### SOLUTION TO QUESTION 5A ###
require(openxlsx)
require(dplyr) ## for glimpse()
fileURL <- "http://ie.eng.chula.ac.th/~oran/classes/compMeth_resources/workshop/scripts/data/Salaries.xlsx"
##-- import sheet by name
prof <- read.xlsx(xlsxFile=fileURL,sheet = "Salaries")
##-- import sheet by index
prof <- read.xlsx(xlsxFile=fileURL, sheet=1)
glimpse(prof)
##-- import specific row/col of sheet index
wb <- loadWorkbook("Salaries.xlsx") ##-- where is your file?
sheets <- getSheets(wb)
sheet <- sheets[[1]] ##-- access to first sheet
nLastRow <- sheet$getLastRowNum()
prof <- as.data.frame(readRows(sheet,startRow=2,endRow=nLastRow,startColumn = 1))
colnames(prof) <- readRows(sheet,startRow=1,endRow=1,startColumn = 1)
glimpse(prof) ##-- this leads to ERROR; can you explain WHY?
require(dplyr)
require(lattice)
##-- convert data into appropiate type
prof <- mutate(prof,rank=factor(rank,levels=c("AsstProf","AssocProf","Prof")), ## use levels= to order existing values; labels= would relabel the alphabetical order
discipline=factor(discipline),
yrs.since.phd=as.numeric(yrs.since.phd) ,
yrs.service=as.numeric(yrs.service) ,
sex=factor(sex,labels=c("M","F")),
salary=as.numeric(salary) )
##-- play with data
alys01 <- filter(prof,discipline=="B") %>% group_by(rank,sex) %>% summarise(avg=mean(salary),n=length(salary))
xtabs(avg~sex+rank,data=alys01)
Explore salary and the relationship of each column.
### SOLUTION TO QUESTION 5B ###
glimpse(prof) ## data contain quantitative (number) // qualitative (description)
##-- analysis of qualitative data (description)
bwplot(salary~rank|discipline:sex,data=prof) ##-- is there any diff in terms of rank, discipline, and sex
anova(lm(salary~rank*sex*discipline,data=prof))
##-- analysis of quantitative (number) data
cor(prof[,c(3,4,6)])
anova(lm(salary~yrs.since.phd+yrs.service,data=prof))
pairs(prof[,c(3,4,6)],col=prof$discipline)
##-- analysis of both
xyplot(salary~yrs.since.phd|(discipline:rank),data=prof)
xyplot(salary~yrs.service|(discipline:rank),data=prof)
prof.lm <- step( lm(salary~.,data=prof)) ## find the best model
anova(prof.lm)
Compare salary distributions of subgroups using hist().
### SOLUTION TO QUESTION 5C ###
prof.hist <- hist(prof$salary,breaks=20,plot = F) ## hist() has no 'levels' argument; use breaks
prof.B <- filter(prof,discipline=="B")
prof.Asst <- filter(prof,rank=="AsstProf")
prof.nFull<- filter(prof,rank!="Prof")
hist(prof$salary,breaks=prof.hist$breaks,col="blue")
hist(prof.B$salary,breaks=prof.hist$breaks,col="red",add=T)
rug(prof$salary,col="blue")
rug(prof.B$salary,col="red")
hist(prof$salary,breaks=prof.hist$breaks,col="blue")
hist(prof.nFull$salary,breaks=prof.hist$breaks,col="red",add=T)
hist(prof.Asst$salary,breaks=prof.hist$breaks,col="orange",add=T)
rug(prof$salary,col="blue")
rug(prof.nFull$salary,col="red")
rug(prof.Asst$salary,col="orange")
### SOLUTION TO QUESTION 5D ###
require(party)
prof.ctree <- ctree(salary~.,data=prof)
plot(prof.ctree)
summary(prof.ctree)
Use a decision tree (ctree in the party package or rpart in the rpart package) to analyze the data.
### SOLUTION TO QUESTION 5E ###
require(party)
prof.ctree <- ctree(salary~.,data=prof)
plot(prof.ctree)
summary(prof.ctree)
Copyright 2019 Oran Kittithreerapronchai. All Rights Reserved.
Last modified: 2023-46-07,