Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing people regularly quantify is how much of a particular activity they do, but they rarely quantify how well they do it. In this report, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available at http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The goal of this report is to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set.
The data for this report come in two parts. The training data are available at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv and the test data at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv.
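If the files are not already on disk, they can be downloaded first. The following is a minimal sketch; the local file names are our assumption, chosen to match the read.csv calls below.
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# Download each file only if it is not already present
if (!file.exists("./pml-training.csv")) download.file(trainURL, "./pml-training.csv")
if (!file.exists("./pml-testing.csv")) download.file(testURL, "./pml-testing.csv")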
train <- read.csv("./pml-training.csv")
test <- read.csv("./pml-testing.csv")
We then partition the training data into two sets: one for fitting the model and one for cross-validation.
library(caret)
set.seed(1114)  # for reproducibility
# Use 70% of the training file for model fitting, 30% for cross-validation
inTrain <- createDataPartition(train$classe, p = 0.7, list = FALSE)
training <- train[inTrain, ]
CVtesting <- train[-inTrain, ]
To preprocess the data, we first remove predictors with near-zero variance, identified on the training set and then dropped from the training, cross-validation, and test sets.
# Identify near-zero-variance predictors on the training set only
zerovar <- nearZeroVar(training)
# Drop those columns from all three data sets
training <- training[, -zerovar]
CVtesting <- CVtesting[, -zerovar]
test <- test[, -zerovar]
The next step is to clean and impute the data, which involves two filtering steps: (1) remove columns/predictors with a large proportion of NAs (a sketch of this step follows below), and (2) remove the non-numeric columns/predictors (except “classe”), since variables such as “X”, “user_name”, “new_window”, “num_window”, and the timestamps are record-keeping fields rather than plausible predictors.
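A minimal sketch of the first step, assuming a cutoff of 95% missing values (the exact threshold is our choice for illustration):
# Fraction of NAs per column, measured on the training set
naFrac <- colMeans(is.na(training))
keep <- naFrac < 0.95
# Apply the same column filter to all three data sets
training <- training[, keep]
CVtesting <- CVtesting[, keep]
test <- test[, keep]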
So we keep only the columns whose class is “numeric” and then impute the missing values in all three data sets with k-nearest-neighbor imputation:
library(RANN)  # nearest-neighbor search backend used by knnImpute
# Indices of the columns whose class is "numeric"
predTruth <- which(sapply(training, class) == "numeric")
# Learn the imputation (with its centering and scaling) on the training set only
datapro <- preProcess(training[, predTruth], method = "knnImpute")
# Apply the same imputation to all three data sets; re-attach the outcome
protraining <- cbind(training$classe, predict(datapro, training[, predTruth]))
proCVtesting <- cbind(CVtesting$classe, predict(datapro, CVtesting[, predTruth]))
protesting <- predict(datapro, test[, predTruth])
For the preprocessed training and cross-validation data, we then restore the name of the first column, the outcome:
names(protraining)[1] <- "classe"
names(proCVtesting)[1] <- "classe"
We will use the random forest method for the modeling. To reduce the computing time, we fix the tuning parameter “mtry” at its optimized value of 32 rather than searching over it during fitting; a quick way to sanity-check this value is sketched below.
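One way to check that 32 is a sensible value is tuneRF from the randomForest package, which starts from the default mtry and doubles or halves it while the out-of-bag error keeps improving. A minimal sketch using the preprocessed protraining data from above (the reported errors will vary with the seed):
library(randomForest)
set.seed(1114)
# Column 1 is the outcome "classe"; the rest are the numeric predictors
tuneRF(protraining[, -1], protraining$classe, ntreeTry = 100, stepFactor = 2, improve = 0.01)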
library(randomForest)
# Fit a 500-tree random forest with the fixed mtry value
modelFit <- randomForest(classe ~ ., data = protraining, ntree = 500, mtry = 32)
# In-sample (resubstitution) predictions on the training set
predtraining <- predict(modelFit, protraining)
confusionMatrix(predtraining, protraining$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 4464    0    0    0    0
##          B    0 3038    0    0    0
##          C    0    0 2738    0    0
##          D    0    0    0 2573    0
##          E    0    0    0    0 2886
##
## Overall Statistics
##
##                Accuracy : 1
##                  95% CI : (1, 1)
##     No Information Rate : 0.284
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 1
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             1.000    1.000    1.000    1.000    1.000
## Specificity             1.000    1.000    1.000    1.000    1.000
## Pos Pred Value          1.000    1.000    1.000    1.000    1.000
## Neg Pred Value          1.000    1.000    1.000    1.000    1.000
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.194    0.174    0.164    0.184
## Detection Prevalence    0.284    0.194    0.174    0.164    0.184
## Balanced Accuracy       1.000    1.000    1.000    1.000    1.000
The confusion matrix of the final model on the training data shows 100% accuracy and a 0% error rate. Since these are in-sample figures, we expect somewhat lower accuracy and a higher out-of-sample error rate on the cross-validation and final test sets.
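In fact, the fitted forest already provides such an estimate without touching the held-out data: each tree is evaluated on the training samples it did not see, yielding the out-of-bag (OOB) error. A quick check (the exact value depends on the random seed):
# Printing the model reports the OOB estimate of the error rate
print(modelFit)
# OOB error after all 500 trees (the "OOB" column of the error-rate matrix)
tail(modelFit$err.rate[, "OOB"], 1)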
We test the model on the cross-validation data before applying it to the testing data, to estimate the accuracy of the model on unseen samples.
predCV <- predict(modelFit, proCVtesting)
confusionMatrix(predCV, proCVtesting$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1110    6    1    1    0
##          B    3  749    5    1    0
##          C    2    3  671    8    0
##          D    0    1    5  632    2
##          E    1    0    2    1  719
##
## Overall Statistics
##
##                Accuracy : 0.989
##                  95% CI : (0.986, 0.992)
##     No Information Rate : 0.284
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.986
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.995    0.987    0.981    0.983    0.997
## Specificity             0.997    0.997    0.996    0.998    0.999
## Pos Pred Value          0.993    0.988    0.981    0.987    0.994
## Neg Pred Value          0.998    0.997    0.996    0.997    0.999
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.283    0.191    0.171    0.161    0.183
## Detection Prevalence    0.285    0.193    0.174    0.163    0.184
## Balanced Accuracy       0.996    0.992    0.988    0.990    0.998
The confusion matrix shows a high accuracy of about 99% on the cross-validation set, i.e., an estimated out-of-sample error rate of about 1%.
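The ~1% figure follows directly from the reported accuracy, since the estimated out-of-sample error rate is simply one minus the cross-validation accuracy:
# Estimated out-of-sample error = 1 - cross-validation accuracy
cm <- confusionMatrix(predCV, proCVtesting$classe)
1 - unname(cm$overall["Accuracy"])  # about 0.011, given the 0.989 accuracy above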
Finally, we apply the model to the testing set.
predtesting <- predict(modelFit, protesting)
predtesting
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B
## Levels: A B C D E
Using the selected model to predict the test set, we obtain 100% accuracy on the 20 test samples.
The model developed in this report therefore gives good predictions of the activity class of weight-lifting exercises.