Room Occupancy Prediction
1 Project Brief
Sensors play a critical role in generating signals that provide us with information about local systems and, in a collective sense, about the world. There is a large amount of data that can be gathered from our surroundings and used to analyze nearly anything, from the flow of people to the quality of public water and air to the stability of a bridge under load.
The recent proliferation of the Internet of Things (IoT) and the growing ubiquity of smart devices give people an unprecedented view of the environment, its constituent elements, and their behavior.
In this project, our goal is to use physical sensors to collect environmental data (temperature, humidity, pressure, and light) and to predict room occupancy using all of these variables as predictors.
For context on the measurements the device returns, here are details for the sensor modules:
+Temperature (port A) - DHT12, -20 ~ 60 °C ± 0.2 °C
+Humidity (port A) - DHT12, 20 ~ 95 %RH ± 1 %RH
+Pressure (port A) - BMP280, 300 ~ 1100 hPa ± 1 hPa
+Light (port B) - cadmium sulfide photoresistor, dimensionless on a 12-bit (0-4095) scale; photoconductive, non-linear, with non-uniform spectral response
Sample rate for each parameter:
3 Room Occupancy
Room occupancy is an important indicator of building energy consumption, both in daily operation and in energy modelling during the design phase. For energy modelling in design, the conventional way of representing room occupancy is an hourly schedule (Fig.1), in which a number from 0 to 1 indicates how full (1) or empty (0) the space is. Standards and guidelines provide schedules for lookup, but because these generalize historical records, they do not necessarily reflect reality, and their resolution is often too coarse for a realistic estimate. With a statistical model, we can increase the resolution to less than one hour and obtain a more realistic representation of occupancy. In this project, the time step for each occupancy record is 5 minutes (Fig.2).
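To relate the two resolutions, the sketch below aggregates 5-minute binary occupancy records into an hourly fraction between 0 and 1. The record layout and the function name `hourly_schedule` are assumptions made for illustration, not part of the project code.

```python
from collections import defaultdict

def hourly_schedule(records):
    """Aggregate 5-minute binary occupancy records into hourly fractions (0..1).

    `records` is a list of (hour, minute, occupied) tuples with `occupied`
    equal to 0 or 1. This layout is an assumption made for illustration.
    """
    per_hour = defaultdict(list)
    for hour, _minute, occupied in records:
        per_hour[hour].append(occupied)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(per_hour.items())}

# Hypothetical data: hour 9 is occupied for its last 30 minutes, hour 10 throughout.
records = [(9, m, 1 if m >= 30 else 0) for m in range(0, 60, 5)]
records += [(10, m, 1) for m in range(0, 60, 5)]
schedule = hourly_schedule(records)
```

The hourly value hides when within the hour the room was occupied, which is the detail the 5-minute records keep.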
Fig.1 Occupancy Schedule (1hour timestep)
Fig.2 Occupancy Schedule (5min timestep)
4 Data Collection
Two sensors were placed at two locations in the room in case one of them produced faulty readings. The readings from the two sensors were also averaged to better represent the physical environment of the entire room. The true occupancy was recorded manually from a video camera placed in the room.
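One way to combine the two sensors is sketched below: average the pair when both readings are present, and fall back to the remaining one when the other is missing. Representing a faulty reading as `None` and the name `combine_sensors` are assumptions for illustration.

```python
def combine_sensors(a, b):
    """Average two readings; fall back to the valid one if the other is None."""
    if a is None and b is None:
        return None
    if a is None:
        return b
    if b is None:
        return a
    return (a + b) / 2.0

# Hypothetical temperature readings from the two sensors at four time steps.
sensor_1 = [21.0, 21.4, None, 22.0]
sensor_2 = [21.4, None, 21.8, 22.4]
room = [combine_sensors(a, b) for a, b in zip(sensor_1, sensor_2)]
```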
Data were collected through a regular collect-and-restart routine, approximately every 1-2 days. The whole data collection period lasted about four weeks (Nov 6th through Dec 6th).
In each group of data, the different sensor readings were originally collected into separate CSV files, and their lengths might differ because the parameters were sampled at different rates. This issue is resolved by downsampling in the later data processing step.
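A minimal sketch of the downsampling step, assuming each parameter is a list of (timestamp, value) pairs and that readings are averaged within 5-minute bins. The bin-mean rule, the helper names, and the sample timestamps are assumptions; the report does not state the exact method.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def floor_to_bin(ts, minutes=5):
    """Round a timestamp down to the start of its `minutes`-wide bin."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

def downsample(readings, minutes=5):
    """Average (timestamp, value) readings within each time bin."""
    bins = defaultdict(list)
    for ts, value in readings:
        bins[floor_to_bin(ts, minutes)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in bins.items()}

def align(series_by_name):
    """Keep only the bins present in every series, one row per bin."""
    common = set.intersection(*(set(s) for s in series_by_name.values()))
    return [
        {"time": start, **{name: s[start] for name, s in series_by_name.items()}}
        for start in sorted(common)
    ]

# Hypothetical readings with different sample rates (arbitrary start time):
# temperature every minute, light every 2.5 minutes.
t0 = datetime(2000, 1, 3, 9, 0, 0)
temperature = [(t0 + timedelta(minutes=i), 20.0 + i) for i in range(10)]
light = [(t0 + timedelta(seconds=150 * i), 100 * i) for i in range(4)]
rows = align({
    "temperature": downsample(temperature),
    "light": downsample(light),
})
```

After this step every parameter has one value per 5-minute bin, so series of unequal length line up row by row.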
5 Exploratory Data Analysis
To prepare the data for training, an overall sanity check and cleaning pass was done to remove faulty readings. It turned out that the sensors worked well: only a few hundred records were faulty, and these were dropped directly.
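A sanity check of this kind could look like the sketch below, which drops any record with a missing value or a value outside the measurement range given in the sensor specifications. Treating "out of range" as the definition of a faulty reading is an assumption; the report does not state the criterion that was used.

```python
# Valid ranges taken from the sensor specifications in the project brief.
VALID_RANGES = {
    "temperature": (-20.0, 60.0),   # DHT12, degrees C
    "humidity": (20.0, 95.0),       # DHT12, %RH
    "pressure": (300.0, 1100.0),    # BMP280, hPa
    "light": (0, 4095),             # 12-bit photoresistor scale
}

def is_valid(record):
    """Return True if every parameter is present and within its range."""
    for name, (low, high) in VALID_RANGES.items():
        value = record.get(name)
        if value is None or not (low <= value <= high):
            return False
    return True

def drop_ill_readings(records):
    """Keep only the records that pass the range check."""
    return [r for r in records if is_valid(r)]

# Hypothetical records: the second has an impossible humidity, the third lacks light.
records = [
    {"temperature": 21.5, "humidity": 40.0, "pressure": 1013.0, "light": 2048},
    {"temperature": 21.6, "humidity": 140.0, "pressure": 1013.2, "light": 2050},
    {"temperature": 21.7, "humidity": 41.0, "pressure": 1013.1},
]
clean = drop_ill_readings(records)
```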
The selected predictors were:
1 Hour of the day (integer value 0~23)
2 Minute of an hour (integer value 0~59)
3 Temperature (real value)
4 Relative Humidity (real value, 20~95)
5 Pressure (real value)
6 Light (integer 0~4095)
7 IsWeekday (binary, 0 or 1)
Target : Occupancy (binary, 0 or 1)
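The time-based predictors can be derived from each record's timestamp. The sketch below builds one feature row in the order listed above; the function name `feature_row` and the sample values are hypothetical.

```python
from datetime import datetime

def feature_row(ts, temperature, humidity, pressure, light):
    """Build one predictor row in the order of the list above."""
    is_weekday = 1 if ts.weekday() < 5 else 0  # Monday=0 ... Sunday=6
    return [ts.hour, ts.minute, temperature, humidity, pressure, light, is_weekday]

# Hypothetical records: a Wednesday morning and a Saturday afternoon.
weekday_row = feature_row(datetime(2024, 1, 3, 9, 5), 21.5, 40.0, 1013.0, 2048)
weekend_row = feature_row(datetime(2024, 1, 6, 14, 35), 22.0, 38.0, 1012.0, 3100)
```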
Sample Data Plots
Fig.3 Data Snippet
Fig.4 Sample Plots
In this project, I tested three models on the dataset and picked the one that gave the best ROC curve.
A) A Logistic Regression Model
Since the problem here is binary classification, a logistic regression model with regularization is a natural choice.
The model was tuned with grid search cross-validation over different values of C (the regularization coefficient); the best performance was achieved with C=0.1.
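A grid search of this kind can be sketched as follows, assuming scikit-learn (the report does not name the library). In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization. The data here is synthetic, and the grid values, the scaling step, and the `roc_auc` scoring are illustrative choices rather than the settings actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the real data: 7 predictors, minority positive class.
X = rng.normal(size=(400, 7))
y = (X[:, 5] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=400) > 1.0).astype(int)

# Scale the predictors, then fit a regularized logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation, scoring each candidate C by ROC AUC.
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
best_c = search.best_params_["logisticregression__C"]
```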
B) A Random Forest Classifier
To better interpret feature importance, I also trained a random forest classifier. Grid search was again used for model selection.
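The sketch below shows how a random forest can be tuned with grid search and then queried for feature importance, again assuming scikit-learn. The data is synthetic, with the target driven only by the "light" column so the ranking is easy to check; the parameter grid is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

FEATURES = ["hour", "minute", "temperature", "humidity",
            "pressure", "light", "is_weekday"]

rng = np.random.default_rng(0)
X = rng.normal(size=(300, len(FEATURES)))
y = (X[:, 5] > 0.8).astype(int)  # synthetic target driven by "light" only

# Hypothetical grid; the values actually searched are not given in the report.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

# Impurity-based importances of the best model, highest first.
importances = dict(zip(FEATURES, search.best_estimator_.feature_importances_))
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
```

The impurity-based importances sum to 1, so each value can be read as a share of the total.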
C) A Feed Forward Neural Network
The architecture of the neural network is shown in Fig.5.
Two dropout layers were added for regularization to prevent the model from overfitting.
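To illustrate how dropout acts during training, here is a NumPy sketch of a forward pass through a small feed-forward network with two dropout layers. The layer widths, the dropout rate, and the placement of dropout after each hidden layer are assumptions; the actual architecture is the one in the figure, and a deep learning framework would normally handle this step.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout(x, rate, rng, training):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def forward(x, params, rng, training=False, rate=0.5):
    """Two hidden layers, each followed by dropout, then a sigmoid output."""
    h1 = dropout(relu(x @ params["W1"] + params["b1"]), rate, rng, training)
    h2 = dropout(relu(h1 @ params["W2"] + params["b2"]), rate, rng, training)
    return sigmoid(h2 @ params["W3"] + params["b3"])

rng = np.random.default_rng(0)
sizes = [7, 16, 8, 1]  # 7 predictors in; hidden widths are hypothetical
params = {}
for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:]), start=1):
    params[f"W{i}"] = rng.normal(scale=0.1, size=(n_in, n_out))
    params[f"b{i}"] = np.zeros(n_out)

x = rng.normal(size=(4, 7))                       # four synthetic records
p_eval = forward(x, params, rng, training=False)  # dropout disabled
p_train = forward(x, params, rng, training=True)  # random units dropped
```

At prediction time dropout is switched off, so the output for a given input is deterministic; during training each pass sees a different random subnetwork, which discourages units from co-adapting.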
Fig.5 NN Architecture
6 Model Comparison
For a typical classification model, accuracy alone can be enough for evaluation, but that is not the case here because the dataset is imbalanced: most of the time the room was not occupied. To address this, different classification thresholds need to be tested. Recall and precision both need to be considered as well, and they are equally important in our case. The F1 score is therefore a better metric of model performance.
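The threshold sweep can be sketched in plain Python as follows. F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The labels and predicted probabilities are made up for illustration, and the helper names are hypothetical.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Compute precision, recall, and F1 for predictions at one threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def best_threshold(y_true, y_prob, thresholds):
    """Return the threshold with the highest F1 score."""
    return max(thresholds, key=lambda t: precision_recall_f1(y_true, y_prob, t)[2])

# Hypothetical imbalanced example: 2 occupied records out of 10.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.12, 0.08, 0.22, 0.18, 0.31, 0.36, 0.45, 0.42, 0.91]
thresholds = [k / 10 for k in range(1, 10)]

best = best_threshold(y_true, y_prob, thresholds)
_, _, best_f1 = precision_recall_f1(y_true, y_prob, best)
_, _, default_f1 = precision_recall_f1(y_true, y_prob, 0.5)
```

In this made-up example the default threshold of 0.5 reaches 90% accuracy while missing one of the two occupied records; lowering the threshold to 0.4 recovers it and raises F1 from about 0.67 to 0.8.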
Fig.6 shows the F1 score of the three models under different threshold values. Fig.7 shows the ROC AUC of the three models. Fig.8 shows the feature importance based on the tree model.
Based on the two plots, the random forest classifier and the neural network perform similarly, and both outperform logistic regression.
Considering interpretability, the random forest could be the better choice.