Read the instructions carefully and think about how to develop R code to answer each question.
Explore the iris dataset (one of the most famous datasets in data mining) and learn the basic commands of data.frame.
A data.frame object has mixed properties of matrix() and list(); hence, we can access the object using both methods, in four ways.
##-- data type
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##-- access as list
iris$Sepal.Length[1:10]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris[[1]][1:10]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
##-- access as matrix
iris[1:10,1]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
##-- access as matrix with name
iris[1:10,"Sepal.Length"]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
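The four access styles above can be checked against each other. This is a minimal sketch that assumes only base R and the built-in iris data; the variable names (byDollar, byIndex, byMatrix, byName) are illustrative and not part of the workshop.

```r
##-- four ways to reach the same column (variable names are illustrative)
byDollar <- iris$Sepal.Length[1:10]        ##-- list style, by name
byIndex  <- iris[[1]][1:10]                ##-- list style, by position
byMatrix <- iris[1:10, 1]                  ##-- matrix style, by position
byName   <- iris[1:10, "Sepal.Length"]     ##-- matrix style, by name

##-- all four return the same numeric vector
allSame <- identical(byDollar, byIndex) &&
           identical(byIndex, byMatrix) &&
           identical(byMatrix, byName)
print(allSame)
```

Note that iris[1:10, 1, drop=FALSE] would keep the result as a data.frame instead of a vector.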
note get() returns the object whose name is stored in a string
df.name <- "iris"
get(df.name)[1:3,]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
A common task with a data.frame is to sample and review only a portion of the data.
Here are two tips:
##-- show top/bottom 6
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
set.seed(13)
nSize <- 5
smp.idx <- sample(1:nrow(iris),nSize)
iris[smp.idx,]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3 4.7 3.2 1.3 0.2 setosa
## 101 6.3 3.3 6.0 2.5 virginica
## 74 6.1 2.8 4.7 1.2 versicolor
## 6 5.4 3.9 1.7 0.4 setosa
## 132 7.9 3.8 6.4 2.0 virginica
note What does this line do?
##-- This code is NOT executed
head(iris[-smp.idx,])
View(), edit(), and fix().
### SOLUTION TO QUESTION 2A ###
temp <- iris
##-- view only; changes are not allowed
View(temp)
##-- view and allow change; ???
fix(temp)
##-- view and allow change; ???
edit(temp)
note View(), edit(), and fix()
data.frame may be useful.
##-- dimension
nDim <- dim(iris)
nCol <- ncol(iris)
nRow <- nrow(iris)
##-- column name
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
id.DF <- data.frame(id=1:nRow)
##-- join data.frame
head(cbind.data.frame(id.DF,iris))
## id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 1 5.1 3.5 1.4 0.2 setosa
## 2 2 4.9 3.0 1.4 0.2 setosa
## 3 3 4.7 3.2 1.3 0.2 setosa
## 4 4 4.6 3.1 1.5 0.2 setosa
## 5 5 5.0 3.6 1.4 0.2 setosa
## 6 6 5.4 3.9 1.7 0.4 setosa
##-- check duplication
any(duplicated(iris))
## [1] TRUE
nrow(unique(iris))
## [1] 149
##-- find duplication index
which(duplicated(iris)==TRUE)
## [1] 143
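duplicated() marks only the later copy of a repeated row, which is why a single index is reported. To see every member of the duplicated group, one option is to scan from both ends. This is a sketch in base R; the names isDup, dupIdx and dupRows are illustrative.

```r
##-- flag every member of a duplicated group, not only the later copies
isDup   <- duplicated(iris) | duplicated(iris, fromLast = TRUE)
dupIdx  <- which(isDup)
dupRows <- iris[isDup, ]
print(dupRows)
```

In iris this shows the pair of identical rows (row 143 together with its earlier twin), which is useful before deciding whether to drop one with unique().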
note summarytools package
require(summarytools)
print(summarytools::dfSummary(iris),method = "render")
No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
---|---|---|---|---|---|---|
1 | Sepal.Length [numeric] | | 35 distinct values | | 150 (100.0%) | 0 (0.0%) |
2 | Sepal.Width [numeric] | | 23 distinct values | | 150 (100.0%) | 0 (0.0%) |
3 | Petal.Length [numeric] | | 43 distinct values | | 150 (100.0%) | 0 (0.0%) |
4 | Petal.Width [numeric] | | 22 distinct values | | 150 (100.0%) | 0 (0.0%) |
5 | Species [factor] | | | | 150 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.3.0)
2023-09-13
note Other commands in this package include summarytools::freq(), summarytools::ctable(), and summarytools::descr(). When should you apply these commands?
For iris, explore the data by its classification with R packages. This question is separated into three approaches with identical results (the background and personal experience of users play an important role in which approach one selects):
* base is a simple/naive way to explore; it is not good for large and complex datasets
* data.table is an extension of the base structure that uses all cores in your machine (it is a little complex)
* dplyr is an extension of SQL with a combination of data.frame and tibble. It is a part of tidyverse, a set of packages for data science in R.
base
### SOLUTION TO QUESTION 1Aa ###
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
##-- for numeric data
iris.numer <- iris[,1:4] ##-- only numeric data can be used
apply(iris.numer,2,mean) ##-- find mean
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
symnum(cor(iris.numer)) ##-- show the correlation matrix as symbols
## S.L S.W P.L P.W
## Sepal.Length 1
## Sepal.Width 1
## Petal.Length + . 1
## Petal.Width + . B 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
require(moments)
##-- skewness is the 3rd standardized moment; it measures the asymmetry of the data
apply(iris.numer,2,skewness)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.3117531 0.3157671 -0.2721277 -0.1019342
##-- kurtosis is the 4th standardized moment; it measures tail heaviness (about 3 for normally distributed data)
apply(iris.numer,2,kurtosis)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2.426432 3.180976 1.604464 1.663933
##-- function version of IQR (base R also provides IQR())
findIQR <- function(x){ quantile(x,probs=0.75) - quantile(x,probs=0.25)}
##-- apply the function to each column
apply(iris.numer,2,findIQR)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1.3 0.5 3.5 1.5
##-- inline version of IQR
apply(iris.numer,2,function(o){ quantile(o,probs=0.75) - quantile(o,probs=0.25)})
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1.3 0.5 3.5 1.5
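The hand-written interquartile range above can be cross-checked against the built-in stats::IQR(), which uses the same default quantile definition. This is a small sketch; manualIQR and builtinIQR are illustrative names, and unname() is added only so that the two results carry the same names.

```r
iris.numer <- iris[, 1:4]
##-- hand-written IQR, as in the workshop, without the "75%" name
findIQR <- function(x){ unname(quantile(x, probs = 0.75) - quantile(x, probs = 0.25)) }
manualIQR  <- apply(iris.numer, 2, findIQR)
##-- built-in IQR from the stats package
builtinIQR <- apply(iris.numer, 2, IQR)
print(all.equal(manualIQR, builtinIQR))
```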
##-- find (Coefficient of variation) CV or relative sd
apply(iris.numer,2,function(o){ sd(o)/mean(o) } )
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.1417113 0.1425642 0.4697441 0.6355511
##-- find how often the mode (most frequent value) occurs ##-- why use max and table?
apply(iris.numer,2,function(o){ max(table(o))} )
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 10 26 13 29
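max(table(o)) returns the count of the most frequent value, not the value itself. To recover the mode, the names of the frequency table can be used. This is a sketch with a hypothetical helper findMode(); when several values tie for the highest count, it returns all of them.

```r
##-- return the most frequent value(s); ties give more than one value
findMode <- function(x){
  freq <- table(x)
  as.numeric(names(freq)[freq == max(freq)])
}
modeList <- lapply(iris[, 1:4], findMode)
print(modeList)
```

lapply() is used instead of apply() here because the number of modes can differ between columns, so the result is kept as a list.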
base package
isSelect <- which(iris$Species == "versicolor" &
iris$Petal.Width > 1.0 &
iris$Petal.Width < 1.5
)
head(iris[isSelect,])
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 51 7.0 3.2 4.7 1.4 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
### SOLUTION TO QUESTION 1B ###
simCol <- grep("Sepal",names(iris))
head(iris[,simCol])
## Sepal.Length Sepal.Width
## 1 5.1 3.5
## 2 4.9 3.0
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5.0 3.6
## 6 5.4 3.9
aggregate(Sepal.Width~Species,data=iris,mean)
## Species Sepal.Width
## 1 setosa 3.428
## 2 versicolor 2.770
## 3 virginica 2.974
aggregate(Sepal.Length~Species,data=iris,mean)
## Species Sepal.Length
## 1 setosa 5.006
## 2 versicolor 5.936
## 3 virginica 6.588
base package
isSelect <- grepl('color',iris$Species)
head(iris[isSelect,])
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
nrow(iris[isSelect,])
## [1] 50
base package: order data by Petal.Width (ascending) and Sepal.Width (descending)
iris[order(iris$Petal.Width,-iris$Sepal.Width),]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 33 5.2 4.1 1.5 0.1 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 1 5.1 3.5 1.4 0.2 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 68 5.8 2.7 4.1 1.0 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 51 7.0 3.2 4.7 1.4 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 135 6.1 2.6 5.6 1.4 virginica
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 134 6.3 2.8 5.1 1.5 virginica
## 73 6.3 2.5 4.9 1.5 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 120 6.0 2.2 5.0 1.5 virginica
## 86 6.0 3.4 4.5 1.6 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 130 7.2 3.0 5.8 1.6 virginica
## 84 6.0 2.7 5.1 1.6 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 107 4.9 2.5 4.5 1.7 virginica
## 71 5.9 3.2 4.8 1.8 versicolor
## 126 7.2 3.2 6.0 1.8 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 150 5.9 3.0 5.1 1.8 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 101 6.3 3.3 6.0 2.5 virginica
## 145 6.7 3.3 5.7 2.5 virginica
type | width of petal | length of petal |
---|---|---|
low | \([0.00,0.75)\) | \([0.0,2.5)\) |
medium | \([0.75,1.75)\) | \([2.5,5.0)\) |
high | \([1.75,\infty)\) | \([5.0,\infty)\) |
##-- This code is NOT executed
iris.DF <- iris
iris.DF$tWidth <- ifelse(iris.DF$Petal.Width<0.75,"low",
ifelse(iris.DF$Petal.Width<1.75,"mid","high"))
iris.DF$tLength <- ifelse(iris.DF$Petal.Length<2.50,"low",
ifelse(iris.DF$Petal.Length<5.00,"mid","high"))
ftable(tWidth+tLength~Species,data=iris.DF)
data.table
data.table is a compact and quick package for transforming data in R based on the following structure.
data.table
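The structure referred to above did not survive in this copy. As an assumption based on the data.table documentation, the general form is DT[i, j, by]: take rows i, compute j, grouped by by. A minimal sketch follows; the column name meanPL and the filter threshold are illustrative only.

```r
require(data.table)
iris.DT <- as.data.table(iris)
##-- DT[i, j, by]: i = filter rows, j = compute columns, by = group
res <- iris.DT[Petal.Width > 1.0,                   ##-- i: keep rows
               .(meanPL = mean(Petal.Length), .N),  ##-- j: summarise
               by = Species]                        ##-- by: per group
print(res)
```

Species without any row passing the filter in i simply do not appear in the result.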
### SOLUTION TO QUESTION 1Ab ###
require(data.table)
iris.DT <- as.data.table(iris)
head(iris.DT)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.2 setosa
## 2: 4.9 3.0 1.4 0.2 setosa
## 3: 4.7 3.2 1.3 0.2 setosa
## 4: 4.6 3.1 1.5 0.2 setosa
## 5: 5.0 3.6 1.4 0.2 setosa
## 6: 5.4 3.9 1.7 0.4 setosa
iris.num.DT <- as.data.table(iris[,1:4])
iris.num.DT[,lapply(.SD,mean)]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1: 5.843333 3.057333 3.758 1.199333
iris.num.DT[,lapply(.SD,quantile,prob=0.5)]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1: 5.8 3 4.35 1.3
iris.num.DT[,lapply(.SD,sd)]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1: 0.8280661 0.4358663 1.765298 0.7622377
require(moments)
iris.num.DT[,lapply(.SD,skewness)]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1: 0.3117531 0.3157671 -0.2721277 -0.1019342
iris.num.DT[,lapply(.SD,kurtosis)]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1: 2.426432 3.180976 1.604464 1.663933
iris.num.DT[,lapply(.SD,function(o){ quantile(o,prob=0.75) - quantile(o,prob=0.25)})]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1: 1.3 0.5 3.5 1.5
data.table package
head(iris.DT[Species=="versicolor" & between(Petal.Width,1.0,1.5)])
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 7.0 3.2 4.7 1.4 versicolor
## 2: 6.4 3.2 4.5 1.5 versicolor
## 3: 6.9 3.1 4.9 1.5 versicolor
## 4: 5.5 2.3 4.0 1.3 versicolor
## 5: 6.5 2.8 4.6 1.5 versicolor
## 6: 5.7 2.8 4.5 1.3 versicolor
### SOLUTION TO QUESTION 1B ###
simCol <- names(iris.DT)[which(names(iris.DT) %ilike% 'Sepal')]
head(iris.DT[,..simCol],3)
## Sepal.Length Sepal.Width
## 1: 5.1 3.5
## 2: 4.9 3.0
## 3: 4.7 3.2
head(iris.DT[,.SD,.SDcols=simCol],3)
## Sepal.Length Sepal.Width
## 1: 5.1 3.5
## 2: 4.9 3.0
## 3: 4.7 3.2
head(iris.DT[,mean(Petal.Length),by=Species])
## Species V1
## 1: setosa 1.462
## 2: versicolor 4.260
## 3: virginica 5.552
data.table
iris.DT[Species%like% 'color',]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 7.0 3.2 4.7 1.4 versicolor
## 2: 6.4 3.2 4.5 1.5 versicolor
## 3: 6.9 3.1 4.9 1.5 versicolor
## 4: 5.5 2.3 4.0 1.3 versicolor
## 5: 6.5 2.8 4.6 1.5 versicolor
## 6: 5.7 2.8 4.5 1.3 versicolor
## 7: 6.3 3.3 4.7 1.6 versicolor
## 8: 4.9 2.4 3.3 1.0 versicolor
## 9: 6.6 2.9 4.6 1.3 versicolor
## 10: 5.2 2.7 3.9 1.4 versicolor
## 11: 5.0 2.0 3.5 1.0 versicolor
## 12: 5.9 3.0 4.2 1.5 versicolor
## 13: 6.0 2.2 4.0 1.0 versicolor
## 14: 6.1 2.9 4.7 1.4 versicolor
## 15: 5.6 2.9 3.6 1.3 versicolor
## 16: 6.7 3.1 4.4 1.4 versicolor
## 17: 5.6 3.0 4.5 1.5 versicolor
## 18: 5.8 2.7 4.1 1.0 versicolor
## 19: 6.2 2.2 4.5 1.5 versicolor
## 20: 5.6 2.5 3.9 1.1 versicolor
## 21: 5.9 3.2 4.8 1.8 versicolor
## 22: 6.1 2.8 4.0 1.3 versicolor
## 23: 6.3 2.5 4.9 1.5 versicolor
## 24: 6.1 2.8 4.7 1.2 versicolor
## 25: 6.4 2.9 4.3 1.3 versicolor
## 26: 6.6 3.0 4.4 1.4 versicolor
## 27: 6.8 2.8 4.8 1.4 versicolor
## 28: 6.7 3.0 5.0 1.7 versicolor
## 29: 6.0 2.9 4.5 1.5 versicolor
## 30: 5.7 2.6 3.5 1.0 versicolor
## 31: 5.5 2.4 3.8 1.1 versicolor
## 32: 5.5 2.4 3.7 1.0 versicolor
## 33: 5.8 2.7 3.9 1.2 versicolor
## 34: 6.0 2.7 5.1 1.6 versicolor
## 35: 5.4 3.0 4.5 1.5 versicolor
## 36: 6.0 3.4 4.5 1.6 versicolor
## 37: 6.7 3.1 4.7 1.5 versicolor
## 38: 6.3 2.3 4.4 1.3 versicolor
## 39: 5.6 3.0 4.1 1.3 versicolor
## 40: 5.5 2.5 4.0 1.3 versicolor
## 41: 5.5 2.6 4.4 1.2 versicolor
## 42: 6.1 3.0 4.6 1.4 versicolor
## 43: 5.8 2.6 4.0 1.2 versicolor
## 44: 5.0 2.3 3.3 1.0 versicolor
## 45: 5.6 2.7 4.2 1.3 versicolor
## 46: 5.7 3.0 4.2 1.2 versicolor
## 47: 5.7 2.9 4.2 1.3 versicolor
## 48: 6.2 2.9 4.3 1.3 versicolor
## 49: 5.1 2.5 3.0 1.1 versicolor
## 50: 5.7 2.8 4.1 1.3 versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
iris.DT[Species%like% 'color',.N]
## [1] 50
note data.table::uniqueN() can be used to count the number of unique rows
uniqueN(iris.DT)
## [1] 149
data.table package: order data by Petal.Width (ascending) and Sepal.Width (descending)
iris.DT[order(Petal.Width,-Sepal.Width)]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.2 4.1 1.5 0.1 setosa
## 2: 4.9 3.6 1.4 0.1 setosa
## 3: 4.9 3.1 1.5 0.1 setosa
## 4: 4.8 3.0 1.4 0.1 setosa
## 5: 4.3 3.0 1.1 0.1 setosa
## ---
## 146: 6.7 3.1 5.6 2.4 virginica
## 147: 5.8 2.8 5.1 2.4 virginica
## 148: 7.2 3.6 6.1 2.5 virginica
## 149: 6.3 3.3 6.0 2.5 virginica
## 150: 6.7 3.3 5.7 2.5 virginica
type | width of petal | length of petal |
---|---|---|
low | \([0.00,0.75)\) | \([0.0,2.5)\) |
medium | \([0.75,1.75)\) | \([2.5,5.0)\) |
high | \([1.75,\infty)\) | \([5.0,\infty)\) |
wRange <- c(0.00,0.75,1.75,10.0)
wLabel <- c("low","mid","high")
lRange <- c(0.00,2.50,5.00,15.0)
lLabel <- c("low","mid","high")
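##-- note: cut() uses right-closed intervals (a,b] by default; add right=FALSE for the [a,b) intervals in the table
iris.DT[,tWidth :=cut(Petal.Width ,wRange,wLabel)]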
iris.DT[,tLength:=cut(Petal.Length,lRange,lLabel)]
iris.DT[,.N,by=.(tWidth,tLength,Species)]
## tWidth tLength Species N
## 1: low low setosa 50
## 2: mid mid versicolor 48
## 3: high mid versicolor 1
## 4: mid high versicolor 1
## 5: high high virginica 38
## 6: mid mid virginica 2
## 7: high mid virginica 7
## 8: mid high virginica 3
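The table defines left-closed intervals such as [2.5, 5.0), while cut() builds right-closed intervals (a, b] unless right = FALSE is given. The difference only matters for values that sit exactly on a boundary. This is a small sketch on made-up values (x is illustrative, not taken from iris):

```r
lRange <- c(0.00, 2.50, 5.00, 15.0)
lLabel <- c("low", "mid", "high")
x <- c(2.4, 2.5, 4.9, 5.0, 5.1)                       ##-- illustrative values
rightClosed <- cut(x, lRange, lLabel)                 ##-- default: (a,b]
leftClosed  <- cut(x, lRange, lLabel, right = FALSE)  ##-- [a,b), as in the table
print(data.frame(x, rightClosed, leftClosed))
```

Only the boundary values 2.5 and 5.0 change class between the two versions.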
dplyr
dplyr is a part of tidyverse for data transformation. It is based on functions of the SQL language, which consist of:
dplyr | SQL | description |
---|---|---|
‘select()’ | SELECT | picks columns |
‘filter()’ | WHERE | picks cases based on their values |
‘group_by()’ | GROUP BY | groups data |
‘summarise()’ | - | reduces columns into a summary |
‘arrange()’ | ORDER BY | orders rows |
‘inner_join()’, ‘left_join()’, ... | JOIN | joins data |
‘mutate()’ | COLUMN ALIAS | adds new columns |
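To make the mapping concrete, the verbs can be chained so that the pipeline reads like one SQL statement. This is a sketch only; the SQL text in the comments and the column name meanSL are illustrative assumptions, not part of the workshop.

```r
require(dplyr)
##-- SQL equivalent (illustrative):
##--   SELECT Species, AVG(Sepal.Length) AS meanSL
##--   FROM iris WHERE Petal.Width > 1.0
##--   GROUP BY Species ORDER BY meanSL DESC
res <- iris %>%
  filter(Petal.Width > 1.0) %>%              ##-- WHERE
  group_by(Species) %>%                      ##-- GROUP BY
  summarise(meanSL = mean(Sepal.Length)) %>% ##-- aggregate
  arrange(desc(meanSL))                      ##-- ORDER BY ... DESC
print(res)
```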
dplyr
### SOLUTION TO QUESTION 1Ac ###
require(dplyr)
iris.numer <- iris[,1:4]
summarise_all(iris.numer,.funs=mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.843333 3.057333 3.758 1.199333
require(moments)
summarise_all(iris.numer,.funs=skewness)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 0.3117531 0.3157671 -0.2721277 -0.1019342
summarise_all(iris.numer,.funs=function(o){
quantile(o,prob=0.75) - quantile(o,prob=0.25)
} )
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1.3 0.5 3.5 1.5
note glimpse() is the alternative version of str()
require(dplyr)
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
dplyr package
require(dplyr)
head(filter(iris,between(Petal.Width,1.0,1.5) & Species == "versicolor"))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 7.0 3.2 4.7 1.4 versicolor
## 2 6.4 3.2 4.5 1.5 versicolor
## 3 6.9 3.1 4.9 1.5 versicolor
## 4 5.5 2.3 4.0 1.3 versicolor
## 5 6.5 2.8 4.6 1.5 versicolor
## 6 5.7 2.8 4.5 1.3 versicolor
##-- chain version
iris %>% filter(between(Petal.Width,1.0,1.5) & Species == "versicolor") -> iris.filter
head(iris.filter)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 7.0 3.2 4.7 1.4 versicolor
## 2 6.4 3.2 4.5 1.5 versicolor
## 3 6.9 3.1 4.9 1.5 versicolor
## 4 5.5 2.3 4.0 1.3 versicolor
## 5 6.5 2.8 4.6 1.5 versicolor
## 6 5.7 2.8 4.5 1.3 versicolor
dplyr package
##-- This code is NOT executed
select(iris, contains("Sepal"),"Species") -> iris.dp
summarise_all(group_by(iris.dp,Species),mean)
dplyr package
head(filter(iris,grepl("color",Species) ))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 7.0 3.2 4.7 1.4 versicolor
## 2 6.4 3.2 4.5 1.5 versicolor
## 3 6.9 3.1 4.9 1.5 versicolor
## 4 5.5 2.3 4.0 1.3 versicolor
## 5 6.5 2.8 4.6 1.5 versicolor
## 6 5.7 2.8 4.5 1.3 versicolor
nrow(filter(iris,grepl("color",Species) ))
## [1] 50
dplyr package
##-- This code is NOT executed
head(arrange(iris,Petal.Width,-Sepal.Width))
type | width of petal | length of petal |
---|---|---|
low | \([0.00,0.75)\) | \([0.0,2.5)\) |
medium | \([0.75,1.75)\) | \([2.5,5.0)\) |
high | \([1.75,\infty)\) | \([5.0,\infty)\) |
### SOLUTION TO QUESTION 1F ###
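##-- thresholds from the table (kept for reference; case_when() below hard-codes them)
wRange <- c(0.75,1.75)
lRange <- c(2.50,5.00)
##-- note: the strict '>' below puts 1.75 and 5.00 in "mid", while the table puts them in "high"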
iris.dply <- iris
iris.dply <- mutate(iris.dply,tWidth=case_when(
Petal.Width < 0.75 ~ "low",
Petal.Width > 1.75 ~ "high",
TRUE ~ "mid"
) )
iris.dply <- mutate(iris.dply,tLength=case_when(
Petal.Length < 2.50 ~ "low",
Petal.Length > 5.00 ~ "high",
TRUE ~ "mid"
) )
glimpse(iris.dply)
## Rows: 150
## Columns: 7
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
## $ tWidth <chr> "low", "low", "low", "low", "low", "low", "low", "low", "…
## $ tLength <chr> "low", "low", "low", "low", "low", "low", "low", "low", "…
ftable(tWidth+tLength~Species,data=iris.dply)
## tWidth high low mid
## tLength high low mid high low mid high low mid
## Species
## setosa 0 0 0 0 50 0 0 0 0
## versicolor 0 0 1 0 0 0 1 0 48
## virginica 38 0 7 0 0 0 3 0 2
iris in the following steps:
* Petal.Length into equal groups (i.e., ‘PL.H’, ‘PL.M’, ‘PL.L’)
* Sepal.Length into equal groups (i.e., ‘SL.H’, ‘SL.M’, ‘SL.L’)
* Petal.Width of groups
* Sepal.Width of groups
The result should be similar to this table
hint There are two possible packages for this task: reshape2 and data.table
* reshape2::melt() and reshape2::dcast()
* data.table::melt() and data.table::dcast()
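One possible reading of the steps above, sketched with data.table. Two things here are assumptions, since the original wording is incomplete: "equal groups" is taken to mean three equal-width bins from cut(), and the summary of Petal.Width and Sepal.Width per group is taken to be the mean.

```r
require(data.table)
iris.DT <- as.data.table(iris)
##-- assumption: three equal-width groups per variable
iris.DT[, PL := cut(Petal.Length, 3, labels = c("PL.L", "PL.M", "PL.H"))]
iris.DT[, SL := cut(Sepal.Length, 3, labels = c("SL.L", "SL.M", "SL.H"))]
##-- melt: wide -> long (one row per group pair, variable and value)
long <- melt(iris.DT, id.vars = c("PL", "SL"),
             measure.vars = c("Petal.Width", "Sepal.Width"))
##-- dcast: long -> wide, aggregating with the mean (assumption)
wide <- dcast(long, PL + SL ~ variable, value.var = "value", fun.aggregate = mean)
print(wide)
```

Quantile-based breaks would be the alternative if "equal groups" means groups with equal numbers of observations.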
iris dataset using the standard base package and then the lattice package for its classification. Please think about its outliers, as you need the observations for the next question:
NOTE
* text()
* identify() is a powerful command for semi-manual labeling
NOTE
* the tick marks under the box are point values drawn using rug()
* combined panels can be achieved using par(mfrow=c(2,2))
boxplot is a powerful plot to check distribution and outliers.
oldPar <- par(no.readonly=TRUE) ##-- save only the settings that can be restored
boxplot(iris,col="gray")
par(mfcol=c(2,2))
boxplot(Sepal.Length~Species,data=iris,col="gray")
boxplot(Sepal.Width~Species,data=iris,col="gray")
boxplot(Petal.Length~Species,data=iris,col="gray")
boxplot(Petal.Width~Species,data=iris,col="gray")
par(oldPar)
stem(iris[,2])
##
## The decimal point is 1 digit(s) to the left of the |
##
## 20 | 0
## 21 |
## 22 | 000
## 23 | 0000
## 24 | 000
## 25 | 00000000
## 26 | 00000
## 27 | 000000000
## 28 | 00000000000000
## 29 | 0000000000
## 30 | 00000000000000000000000000
## 31 | 00000000000
## 32 | 0000000000000
## 33 | 000000
## 34 | 000000000000
## 35 | 000000
## 36 | 0000
## 37 | 000
## 38 | 000000
## 39 | 00
## 40 | 0
## 41 | 0
## 42 | 0
## 43 |
## 44 | 0
sunflowerplot(iris[,1],iris[,2])
pairs() and hist()
pairs() and hist() are basic tools for understanding distribution and relationships.
### SOLUTION TO QUESTION 2B ###
iris.jitter <- apply(iris[,1:4],2,function(o){jitter(o)})
pairs(iris.jitter,col=iris$Species)
hist(iris$Sepal.Length,n=10,col="grey",freq = F)
rug(jitter(iris$Sepal.Length))
points(density(iris$Sepal.Length),col="red",type="l")
bwplot() in the lattice package to visualize its classification and compare with boxplot()
### SOLUTION TO QUESTION 2C ###
boxplot(Sepal.Length~Species,data=iris,col="grey" )
boxplot(iris[[2]]~iris$Species,col="grey" ) ##-- alternative version using list
boxplot() to plot Species overlapped
iris.seto <- iris[1:50,]
iris.virg <- iris[which(iris$Species=="virginica"),]
iris.vers <- iris[51:100,]
boxplot(iris.vers[1:4],pch=16,cex=0.5,col="blue")
boxplot(iris.seto[1:4],pch=16,cex=0.5,col="orange",add=T)
boxplot(iris.virg[1:4],pch=16,cex=0.5,col="#F0FF00AA",add=T)
The lattice package provides a scatter plot using xyplot()
require(lattice)
xyplot(iris[[2]]~iris[[1]]|iris[[5]])
bwplot(Sepal.Length~factor(ceiling(Sepal.Width)) |Species,data=iris)
boxplot() by saving it in another variable. For example,
tempBox <- boxplot(iris[[1]]~iris[[5]],plot=F)
str(tempBox)
## List of 6
## $ stats: num [1:5, 1:3] 4.3 4.8 5 5.2 5.8 4.9 5.6 5.9 6.3 7 ...
## $ n : num [1:3] 50 50 50
## $ conf : num [1:2, 1:3] 4.91 5.09 5.74 6.06 6.34 ...
## $ out : num 4.9
## $ group: num 3
## $ names: chr [1:3] "setosa" "versicolor" "virginica"
which(). For example,
which(iris[[1]]==tempBox$out & iris[[5]] == tempBox$names[tempBox$group])
## [1] 107
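The out values come from the rule boxplot() applies by default: a point is flagged when it lies more than 1.5 times the interquartile range beyond the hinges. The same rule is available directly through boxplot.stats(). This is a sketch for one species; virgSL and bs are illustrative names.

```r
##-- boxplot() flags points beyond 1.5 x IQR from the hinges (coef = 1.5)
virgSL <- iris$Sepal.Length[iris$Species == "virginica"]
bs <- boxplot.stats(virgSL, coef = 1.5)
print(bs$stats)   ##-- lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)     ##-- values outside the whiskers
```

Raising coef makes the rule more tolerant, which is one way to tune how many observations are treated as outliers.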
tempBox <- boxplot(iris[[1]]~iris[[5]],plot=F)
ol1 <- which(iris[[1]]==tempBox$out & iris[[5]] == tempBox$names[tempBox$group])
tempBox <- boxplot(iris[[2]]~iris[[5]],plot=F)
ol2 <- which(iris[[2]]==tempBox$out & iris[[5]] == tempBox$names[tempBox$group])
outlier <- union(ol1,ol2)
outlier
## [1] 107 42
Note Other set operations are:
* ‘union()’
* ‘intersect()’
* ‘setdiff()’
iris.boxList <- data.frame()
for(i in 1:4){
## i <- 1
tempBox <- boxplot(iris[[i]]~iris[[5]],plot=F)
tempDF <- as.data.frame(tempBox[c("out", "group")])
tempDF$colName <- i
tempDF$species <- tempBox$names[tempDF$group]
iris.boxList <- rbind(iris.boxList,tempDF)
}
iris.boxList
## out group colName species
## 1 4.9 3 1 virginica
## 2 2.3 1 2 setosa
## 3 1.0 1 3 setosa
## 4 3.0 2 3 versicolor
## 5 0.5 1 4 setosa
## 6 0.6 1 4 setosa
iris.boxList$which <- NA_integer_ ##-- create an empty column to fill in the loop
for(i in 1:nrow(iris.boxList) ){
## i <- 1
colIdx <- iris.boxList$colName[i]
species <- iris.boxList$species[i]
value <- iris.boxList$out[i]
resRow <- which( iris[[colIdx]] == value & iris[[5]]==species)
iris.boxList$which[i] <- resRow
}
iris.boxList
## out group colName species which
## 1 4.9 3 1 virginica 107
## 2 2.3 1 2 setosa 42
## 3 1.0 1 3 setosa 23
## 4 3.0 2 3 versicolor 99
## 5 0.5 1 4 setosa 24
## 6 0.6 1 4 setosa 44
iris is a cleaned dataset, so we only have to worry about outliers. We can manually identify outliers using identify() and a scatter plot.
##-- This code is NOT executed
xAxis <- iris[,1]
yAxis <- iris[,2]
plot(xAxis,yAxis)
identify(xAxis,yAxis)
iris.boxList
## out group colName species which
## 1 4.9 3 1 virginica 107
## 2 2.3 1 2 setosa 42
## 3 1.0 1 3 setosa 23
## 4 3.0 2 3 versicolor 99
## 5 0.5 1 4 setosa 24
## 6 0.6 1 4 setosa 44
outlier.Idx <- unique(iris.boxList$which)
iris.cln <- iris[-outlier.Idx,]
lof() in the Rlof package, which uses the concept of local density (each point is compared with its nearest neighbours) to identify outliers.
##-- This code is NOT executed
require(Rlof)
lof.dist <- lof(iris[,1:4],k=8)
isGood <- which(lof.dist < 1.2)
iris.lof <- iris[isGood,]
pairs(iris.lof[,1:4],col=iris.lof$Species)
iris dataset using the ggplot2 package
The ggplot2 package is a part of tidyverse that plots and visualizes data.table and data.frame objects. ggplot2 is based on the grammar of graphics, separating a plot into components: a data set, a coordinate system, and geoms (visual marks that represent data points).
* geom_point()
* geom_boxplot()
* aes(x=,y=,fill=)
note the code is available in the next tab
require(ggplot2)
ggplot(iris, aes(Sepal.Width,fill=Species)) + geom_histogram(bins=25)
ggplot(iris) + geom_dotplot(aes(x=Sepal.Width,fill=Species))
ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,color=Species)) + geom_point(position="jitter") + theme_classic()
ggplot() with additional packages can produce many forms of visualization. Here are some capabilities that we may use for the team project.
require(ggplot2)
require(data.table)
iris.DT <- as.data.table(iris)
iris.lng <- melt(iris.DT,id.var="Species")
ggplot(iris.lng, aes(x=variable,y=value,fill=Species)) + geom_violin() -> gg
gg + facet_grid(cols=vars(Species) ) + xlab("dimension") + ylab("Unit (cm)")
ggplot(iris.lng, aes(value,color=variable))+ geom_density() + facet_grid(cols=vars(Species) )
ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,color=Species)) + geom_density_2d()
##-- separate data filter and query (%>% needs dplyr or magrittr loaded)
iris.DT[Species %like% "color"] %>%
ggplot(aes(x=Sepal.Width,y=Sepal.Length,color=Species)) + geom_point()
note
* %>% = piping (passing data) command in the dplyr package
* melt = command to make a long table
* dcast = command to make a wide table
##-- This code is NOT executed
##-- If an error occurs, check the location of your file
exam.df <- as.data.frame(read.csv(file="examMarked.csv"))
colnames(exam.df) <- c("id",paste( "Q",1:30,sep=""))
stu1 <- exam.df[1,2:31]
stu2 <- exam.df[2,2:31]
##-- example of data
rbind(stu1,stu2)[,1:10]
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
## 1 1 1 1 1 0 1 1 1 1 1
## 2 1 1 1 0 0 1 1 0 1 1
checkFun <- function(stu1,stu2){
case00 <- length(which(stu1 == 0 & stu2 == 0))
case01 <- length(which(stu1 == 0 & stu2 == 1))
case10 <- length(which(stu1 == 1 & stu2 == 0))
case11 <- length(which(stu1 == 1 & stu2 == 1))
return( list(case00=case00,
case10=case10,
case01=case01,
case11=case11) )
}
## test your code
checkFun(stu1,stu2)
## $case00
## [1] 8
##
## $case10
## [1] 5
##
## $case01
## [1] 5
##
## $case11
## [1] 12
##-- This code is NOT executed
pair <- combn(10,2)
nPair<- ncol(pair)
pairResult <- data.frame(stu1ID=pair[1,],stu2ID=pair[2,],
case00=rep(NA,nPair),case10=rep(NA,nPair),
case01=rep(NA,nPair),case11=rep(NA,nPair))
#nPair <- nrow(pairResult)
for( i in 1:nPair){
## i <- 1 ## debug
stu1ID <- pairResult$stu1ID[i]
stu2ID <- pairResult$stu2ID[i]
exam1 <- exam.df[stu1ID,2:31]
exam2 <- exam.df[stu2ID,2:31]
compResult <- checkFun(exam1,exam2)
pairResult$case00[i] <- compResult$case00
pairResult$case10[i] <- compResult$case10
pairResult$case01[i] <- compResult$case01
pairResult$case11[i] <- compResult$case11
}
##-- smc must be computed per pair (row) before it can be used for ordering
pairResult$smc <- (pairResult$case00 + pairResult$case11)/rowSums(pairResult[,3:6])
ord <- order(pairResult$smc,decreasing = T)
pairResult[ord,]
cor.test(as.numeric(exam.df[1,2:31]),as.numeric(exam.df[2,2:31]))
##
## Pearson's product-moment correlation
##
## data: as.numeric(exam.df[1, 2:31]) and as.numeric(exam.df[2, 2:31])
## t = 1.7951, df = 28, p-value = 0.08343
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04410735 0.61083640
## sample estimates:
## cor
## 0.321267
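Assuming smc in the code above stands for the simple matching coefficient, it is the share of questions on which two students agree, (case00 + case11) / (case00 + case01 + case10 + case11). The sketch below uses the first ten answers of the two students shown earlier; smcFun is a hypothetical helper name.

```r
##-- simple matching coefficient for two 0/1 answer vectors
smcFun <- function(a, b){
  stopifnot(length(a) == length(b))
  mean(a == b)     ##-- (case00 + case11) / number of questions
}
##-- first ten answers of student 1 and student 2 from the example above
ansA <- c(1, 1, 1, 1, 0, 1, 1, 1, 1, 1)
ansB <- c(1, 1, 1, 0, 0, 1, 1, 0, 1, 1)
smcAB <- smcFun(ansA, ansB)
print(smcAB)
```

With the full counts reported by checkFun() (8, 5, 5 and 12), the same formula gives (8 + 12) / 30. Unlike the Pearson correlation, this measure counts shared wrong answers as agreement.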
USArrests in package datasets by state with the thermal map
The maps package has a built-in map-drawing function for visualization, map(). The level of detail of a map may vary by country. Here is a state map of the US.
require(maps)
map('state',col=c("red","blue","green"),fill=T)
USArrests data.
data <- as.data.frame(USArrests)
head(data)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
hist(data$Murder,col="grey")
rug(jitter(data$Murder))
box()
heat.colors() scheme as color (red = hot; yellow = warm)
intQuantile <- c(0.1,0.25,0.5,0.75,0.9)
nRange <- length(intQuantile)
colorRange <- sort(heat.colors(nRange+1),decreasing = T)
pie(rep(1,nRange),col=colorRange)
valMurder <- quantile(data$Murder,intQuantile)
myCol <- as.character(cut(data$Murder,breaks = c(0,valMurder,20),labels=colorRange))
map('state',col=myCol,fill=T)
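One caveat, stated as an assumption about the maps package rather than something the workshop covers: the polygons drawn by map('state') are not in the same order as the rows of USArrests (some states are split into several polygons, and Alaska and Hawaii are absent), so colours are safer when matched by name. This is a sketch with a hypothetical helper matchStateColor(); murderCol is an illustrative five-class colouring, not the quantile scheme above.

```r
##-- hypothetical helper: look up one colour per polygon name
matchStateColor <- function(polyNames, stateNames, stateCol){
  baseName <- sub(":.*$", "", polyNames)      ##-- "new york:main" -> "new york"
  stateCol[match(baseName, tolower(stateNames))]
}
##-- small check without drawing anything
demoCol <- matchStateColor(c("alabama", "new york:main", "district of columbia"),
                           c("Alabama", "New York"), c("red", "blue"))
print(demoCol)

##-- with the maps package installed, the same helper can drive map()
if (requireNamespace("maps", quietly = TRUE)) {
  polyNames <- maps::map("state", plot = FALSE, namesonly = TRUE)
  murderCol <- as.character(cut(USArrests$Murder, breaks = 5,
                                labels = rev(heat.colors(5))))
  maps::map("state", fill = TRUE,
            col = matchStateColor(polyNames, rownames(USArrests), murderCol))
}
```

Regions without a matching row (for example the District of Columbia) get NA and are left unfilled.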
### 0D: label ##### preparation
##-- prepare legend
legendText <- c(paste(c(">",valMurder[nRange]),collapse=""))
for(j in (nRange-1):1){
legendText <- c(legendText,paste(c(valMurder[j],"-",valMurder[j+1]),collapse=""))
}
legendText <- c(legendText,paste(c("<",valMurder[1]),collapse=""))
legendText
## [1] ">13.32" "11.25-13.32" "7.25-11.25" "4.075-7.25" "2.56-4.075"
## [6] "<2.56"
map('state',col=myCol,fill=T)
##-- legendText runs from high to low, so reverse colorRange (low to high) to match
legend("bottomright",legend=legendText,pch=15,col=rev(colorRange),ncol=1,cex=1.0,pt.cex=3.5
,y.intersp=0.5,bty="n")
plotThermalMap(type=1,quantLv=c(0.1,0.25,0.5,0.75,0.9))
##-- This code is NOT executed and is incomplete
plotThermalMap <- function(type=1,quantLv=c(0.1,0.25,0.5,0.75,0.9)){
data <- as.data.frame(USArrests)
##-- This part is intentionally left out --##
return(0)
}
plotThermalMap(4)
Rscript <fileName>.R in DOS)
##-- This code is NOT executed
typeName <- colnames(USArrests)
nType <- length(typeName)
for(i in 1:nType){
## i <- 1 ##-- for debug
fileName <- paste( c(typeName[i],".png"),collapse="")
png(fileName,width = 600,height = 600)
plot.new()
##-- function from the previous part
plotThermalMap(i)
dev.off()
}
note This last question overlaps with the question in the next workshop.
sample.int() or sample(). It is very important to call set.seed() first so that the sample is reproducible.
##-- using `sample()`
set.seed(17)
smpl.idx <- sample(1:nrow(iris),size=30)
##-- using `sample.int()`
set.seed(17)
smpl.idx <- sample.int(nrow(iris),size=30)
iris.test <- iris[smpl.idx,]
iris.train<- iris[-smpl.idx,]
head(iris.test)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 108 7.3 2.9 6.3 1.8 virginica
## 42 4.5 2.3 1.3 0.3 setosa
## 129 6.4 2.8 5.6 2.1 virginica
## 6 5.4 3.9 1.7 0.4 setosa
## 133 6.4 2.8 5.6 2.2 virginica
## 110 7.2 3.6 6.1 2.5 virginica
NOTE Is the sample balanced in terms of Species? If not, is there any solution?
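One possible answer to the balance question is stratified sampling: draw the same number of rows from each species instead of sampling all rows at once. This is a sketch in base R; nPerClass, strat.idx and the .bal suffix are illustrative names.

```r
##-- stratified sample: draw the same number of rows from each species
set.seed(17)
nPerClass <- 10
idxBySpecies <- split(seq_len(nrow(iris)), iris$Species)
strat.idx <- unlist(lapply(idxBySpecies, function(idx){ sample(idx, nPerClass) }),
                    use.names = FALSE)
iris.test.bal  <- iris[strat.idx, ]
iris.train.bal <- iris[-strat.idx, ]
print(table(iris.test.bal$Species))
```

The test set keeps the same total size (30 rows), but each class now contributes exactly ten observations.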
class::knn()
require(class)
species.knn <- knn(train=iris.train[,-5],test=iris.test[,-5],
cl=iris.train[,5],k=3)
ftable(iris.test[,5],species.knn)
## species.knn setosa versicolor virginica
##
## setosa 11 0 0
## versicolor 0 9 1
## virginica 0 1 8
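A confusion table like the one above can be summarised into accuracy and per-class recall. The sketch below re-enters the printed counts by hand so that it runs without the class package; conf, accuracy and recall are illustrative names.

```r
##-- confusion matrix copied from the knn output above (rows = actual, cols = predicted)
conf <- matrix(c(11, 0, 0,
                  0, 9, 1,
                  0, 1, 8), nrow = 3, byrow = TRUE,
               dimnames = list(actual    = c("setosa", "versicolor", "virginica"),
                               predicted = c("setosa", "versicolor", "virginica")))
accuracy <- sum(diag(conf)) / sum(conf)   ##-- share of correct predictions
recall   <- diag(conf) / rowSums(conf)    ##-- correct share within each actual class
print(accuracy)
print(recall)
```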
stats::glm()
##-- casting factor into number (1-3)
iris.train$specIdx <- as.numeric(iris.train$Species)
iris.test$specIdx <- as.numeric(iris.test$Species)
species.glm <- glm(specIdx~. -Species,data=iris.train,family ="poisson")
iris.glm <- round(predict(species.glm,newdata = iris.test,type = "response"))
ftable(iris.test[,5],iris.glm)
## iris.glm 1 2 3 4
##
## setosa 11 0 0 0
## versicolor 0 10 0 0
## virginica 0 0 8 1
rpart::rpart()
require(rpart)
iris.rpart <- rpart(Species~.,data=iris.train)
require(rpart.plot)
prp(iris.rpart)
species.rpart <- apply(predict(iris.rpart,newdata = iris.test),1,which.max)
ftable(iris.test[,5],species.rpart)
## species.rpart 1 2 3
##
## setosa 11 0 0
## versicolor 0 10 0
## virginica 0 0 9
NOTE The details of how to select a suitable model will be discussed in the next workshop.
Copyright 2019 Oran Kittithreerapronchai. All Rights Reserved. Last modified: 2023-31-13,