In this article I want to show how Apache Spark can be used to classify human activity based on smartphone sensor data. We will build and train two simple multiclass classifiers, one using decision trees and one using random forests.
We will use the Human Activity Recognition Using Smartphones Data Set provided by the UC Irvine Machine Learning Repository. The dataset is described as follows:
The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data.
The dataset contains 10299 instances (7352 train and 2947 test samples) distributed among six classes: WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING. Each record has 561 features.
Before we train our classifiers, we need to prepare the training and test sets. We need four files from the dataset:
- X_train.txt – train data, each row is a separate instance
- y_train.txt – train data labels
- X_test.txt – test data, each row is a separate instance
- y_test.txt – test data labels
Data and corresponding labels are in separate files, so we have to join them. To do so, I used two Apache Spark functions: zipWithIndex and then join. Next, I convert the data to LabeledPoint, which is simply a feature vector with an associated label. The label numbers must be adjusted to match LabeledPoint requirements: according to the documentation, class labels for multiclass classification should start from 0, while the dataset numbers the activities from 1. Additionally, I decided to extract a validation set from the test data, which will be used for preliminary model evaluation.
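The preparation steps above can be sketched in PySpark as follows. This is a minimal illustration, not the code from the repository: the function names `parse_features`, `to_zero_based` and `load_labeled_points` are hypothetical, and the paths are assumed to point at the four files listed earlier.

```python
def parse_features(line):
    """Turn one whitespace-separated row of X_*.txt into a list of floats."""
    return [float(value) for value in line.split()]


def to_zero_based(label_line):
    """Shift a 1-based activity label (1..6) to the 0-based index MLlib expects."""
    return float(int(label_line.strip()) - 1)


def load_labeled_points(sc, features_path, labels_path):
    """Join feature rows with their labels by line position and build LabeledPoints."""
    # Imported here so the two helpers above stay usable without Spark installed.
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    # zipWithIndex yields (element, index); swap to (index, element)
    # so that the line index becomes the join key.
    features = (sc.textFile(features_path)
                .zipWithIndex()
                .map(lambda pair: (pair[1], parse_features(pair[0]))))
    labels = (sc.textFile(labels_path)
              .zipWithIndex()
              .map(lambda pair: (pair[1], to_zero_based(pair[0]))))

    # join produces (index, (features, label)); the index is dropped afterwards.
    return features.join(labels).map(
        lambda kv: LabeledPoint(kv[1][1], Vectors.dense(kv[1][0]))
    )
```

Under the same assumptions, `load_labeled_points(sc, "X_test.txt", "y_test.txt").randomSplit([0.5, 0.5], seed=42)` would split the test data into a validation part and a final test part. The 50/50 ratio and the seed are illustrative, not values taken from the article.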
Model construction and evaluation
Both decision trees and random forests have tunable parameters, and we will use the validation set to select their values. During preliminary evaluation, several versions of each classifier are built and evaluated using metrics calculated for each class: precision, recall, true positive rate, false positive rate and F-measure. The model with the highest score (the highest sum of metrics over all classes) is selected as the final model, which is then evaluated against the test set. If you are interested in detailed information about decision trees, random forests and their parameters, please refer to the Apache Spark documentation: decision trees, random forests.
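To make the selection rule concrete, here is a small plain-Python sketch that computes the per-class metrics from a confusion matrix and picks the candidate with the highest total. The helper names are hypothetical. The article does not say how the false positive rate enters the sum, so this sketch subtracts it (lower is better); that choice is an assumption. The true positive rate equals recall by definition, so it is counted twice here, mirroring the list of metrics above.

```python
def per_class_metrics(confusion, cls):
    """Metrics for one class; confusion[i][j] counts actual class i predicted as j."""
    n = len(confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[row][cls] for row in range(n)) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "true_positive_rate": recall,  # identical to recall by definition
        "false_positive_rate": fpr,
        "f_measure": f_measure,
    }


def model_score(confusion):
    """Sum the metrics over all classes (assumption: FPR is subtracted)."""
    total = 0.0
    for cls in range(len(confusion)):
        m = per_class_metrics(confusion, cls)
        total += (m["precision"] + m["recall"] + m["true_positive_rate"]
                  + m["f_measure"] - m["false_positive_rate"])
    return total


def select_best(candidates):
    """candidates maps a parameter description to its validation confusion matrix."""
    return max(candidates, key=lambda name: model_score(candidates[name]))
```

For example, `select_best({"maxBins=25": matrix_a, "maxBins=50": matrix_b})` returns the key of the higher-scoring candidate, where the two matrices are validation-set confusion matrices such as those returned by `MulticlassMetrics.confusionMatrix()` in Spark.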
Final model parameters: impurity: gini, maxBins: 25.
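As a rough PySpark sketch, a tree with these parameters could be trained and evaluated per class as shown below. The function names are hypothetical, `maxDepth` is left at the library default because the article does not state it, and treating all features as continuous is an assumption expressed by the empty `categoricalFeaturesInfo`.

```python
NUM_CLASSES = 6  # WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING

# impurity and maxBins come from the article; everything else is an assumption.
DECISION_TREE_PARAMS = {
    "numClasses": NUM_CLASSES,
    "categoricalFeaturesInfo": {},  # empty: every feature is treated as continuous
    "impurity": "gini",
    "maxBins": 25,
}


def train_decision_tree(train):
    """Train a decision tree on an RDD of LabeledPoint."""
    from pyspark.mllib.tree import DecisionTree  # lazy import: requires Spark

    return DecisionTree.trainClassifier(train, **DECISION_TREE_PARAMS)


def per_class_report(model, data):
    """Return {class index: (precision, recall, FPR, F-measure)} for a trained model."""
    from pyspark.mllib.evaluation import MulticlassMetrics

    # In PySpark, predict is applied to an RDD of feature vectors
    # and the result is zipped back with the true labels.
    predictions = model.predict(data.map(lambda lp: lp.features))
    metrics = MulticlassMetrics(predictions.zip(data.map(lambda lp: lp.label)))
    return {
        c: (metrics.precision(float(c)), metrics.recall(float(c)),
            metrics.falsePositiveRate(float(c)), metrics.fMeasure(float(c)))
        for c in range(NUM_CLASSES)
    }
```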
Decision trees – confusion matrix
| Class / Metric | Precision | Recall | True positive rate | False positive rate | F-measure |
Final model parameters: impurity: gini, maxBins: 200, numTrees: 100.
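A comparable PySpark sketch for the random forest, together with a helper that extracts the confusion matrix, might look like this. The function names and the `seed` are hypothetical, and `maxDepth` and `featureSubsetStrategy` stay at the library defaults because the article does not mention them.

```python
# impurity, maxBins and numTrees come from the article; the rest is an assumption.
RANDOM_FOREST_PARAMS = {
    "numClasses": 6,
    "categoricalFeaturesInfo": {},  # empty: every feature is treated as continuous
    "numTrees": 100,
    "impurity": "gini",
    "maxBins": 200,
}


def train_random_forest(train, seed=42):
    """Train a random forest on an RDD of LabeledPoint; the seed is illustrative."""
    from pyspark.mllib.tree import RandomForest  # lazy import: requires Spark

    return RandomForest.trainClassifier(train, seed=seed, **RANDOM_FOREST_PARAMS)


def confusion_matrix(model, data):
    """Return the confusion matrix (rows: actual, columns: predicted) as a NumPy array."""
    from pyspark.mllib.evaluation import MulticlassMetrics

    predictions = model.predict(data.map(lambda lp: lp.features))
    metrics = MulticlassMetrics(predictions.zip(data.map(lambda lp: lp.label)))
    return metrics.confusionMatrix().toArray()
```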
Random forests – confusion matrix
| Class / Metric | Precision | Recall | True positive rate | False positive rate | F-measure |
Comparing the results for each class, random forests achieved better precision, F-measure and true positive rate. This is not a surprise, since a random forest combines many trees and is more powerful than a single decision tree. Overall, both classifiers achieved decent results and were perfect for the LAYING class.
Full source code is available on our GitHub: https://github.com/Semantive/apache-spark-examples