This section compares several classification approaches for identifying robbery incidents using temporal, spatial, and neighbourhood-level predictors. The goal is to evaluate which models offer the strongest predictive performance and the clearest interpretation of robbery risk.
Each model was evaluated on training accuracy, test accuracy, precision, recall, and F1 score, so that performance can be compared from several angles. The scores are shown in the table below, Performance Metrics of Different Models.
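As a minimal sketch of how these metrics relate to one another, the snippet below derives accuracy, precision, recall, and F1 score from confusion-matrix counts. The names `ConfusionCounts` and `classification_metrics` are hypothetical helpers introduced for illustration. The counts in the example are an assumption: they were back-calculated to be consistent with the reported Random Forest test metrics and are not taken from a confusion matrix in the source.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # robberies correctly predicted as robbery
    fp: int  # non-robberies predicted as robbery
    fn: int  # robberies predicted as non-robbery
    tn: int  # non-robberies correctly predicted as non-robbery


def classification_metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = c.tp + c.fp + c.fn + c.tn
    if total == 0:
        raise ValueError("confusion counts must not all be zero")
    predicted_pos = c.tp + c.fp
    actual_pos = c.tp + c.fn
    # Guard against division by zero when a class is never predicted or absent.
    precision = c.tp / predicted_pos if predicted_pos else 0.0
    recall = c.tp / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (c.tp + c.tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Assumed counts, back-calculated for illustration only (not from the source).
example = ConfusionCounts(tp=380, fp=112, fn=468, tn=4024)
metrics = classification_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.7f}")
```

Because F1 is the harmonic mean of precision and recall, a model with high precision but very low recall still ends up with a low F1 score.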
In the table, Bagging has the highest training accuracy and Random Forest the second highest, both above 0.99. This is expected, because both are ensembles of many decision trees that can fit the training data almost perfectly. Their test accuracies are lower by roughly 0.11 to 0.12, which indicates some overfitting. Even so, Random Forest and Bagging achieve the best test accuracy of all models, at 0.8836276 and 0.8733949, respectively.
On precision, recall, and F1 score, Random Forest and Bagging again outperform the other models. Random Forest has the higher precision (0.772 versus 0.682) and a marginally higher F1 score (0.567 versus 0.563), while Bagging has the higher recall (0.480 versus 0.448). Recall matters for our primary objective because it measures how many actual robbery cases are identified. However, Bagging's advantage in recall is smaller than Random Forest's advantage in precision. Precision measures the proportion of predicted robberies that are actual robberies, so given our goal of helping people avoid becoming victims of robbery, higher precision may be more effective in reducing unnecessary panic. Consequently, we selected Random Forest as the final model.
| Model | Train_Accuracy | Test_Accuracy | Precision | Recall | F1_Score |
|---|---|---|---|---|---|
| GLM | 0.8382214 | 0.8324639 | 0.6756757 | 0.0294811 | 0.0564972 |
| Classification Tree | 0.8417477 | 0.8342697 | 0.5916667 | 0.0837264 | 0.1466942 |
| Bagging | 0.9980218 | 0.8733949 | 0.6817420 | 0.4799528 | 0.5633218 |
| Random Forest | 0.9973338 | 0.8836276 | 0.7723577 | 0.4481132 | 0.5671642 |
| Boosting | 0.8395975 | 0.8338684 | 0.7631579 | 0.0341981 | 0.0654628 |
| XGBoosting | 0.8879333 | 0.8430979 | 0.5993976 | 0.2346698 | 0.3372881 |
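The comparison discussed in this section can be checked directly from the reported values. The sketch below copies the numbers from the table, recomputes each F1 score as the harmonic mean of precision and recall, and computes the gap between training and test accuracy as a rough indicator of overfitting. The names `REPORTED`, `f1_from`, and `summarize` are hypothetical and introduced only for this illustration.

```python
# Reported values copied from the table above:
# (train accuracy, test accuracy, precision, recall, F1 score)
REPORTED = {
    "GLM": (0.8382214, 0.8324639, 0.6756757, 0.0294811, 0.0564972),
    "Classification Tree": (0.8417477, 0.8342697, 0.5916667, 0.0837264, 0.1466942),
    "Bagging": (0.9980218, 0.8733949, 0.6817420, 0.4799528, 0.5633218),
    "Random Forest": (0.9973338, 0.8836276, 0.7723577, 0.4481132, 0.5671642),
    "Boosting": (0.8395975, 0.8338684, 0.7631579, 0.0341981, 0.0654628),
    "XGBoosting": (0.8879333, 0.8430979, 0.5993976, 0.2346698, 0.3372881),
}


def f1_from(precision: float, recall: float) -> float:
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def summarize(reported):
    """Return the train-test accuracy gap and recomputed F1 for each model."""
    rows = {}
    for model, (train_acc, test_acc, precision, recall, f1) in reported.items():
        rows[model] = {
            "gap": train_acc - test_acc,
            "f1_recomputed": f1_from(precision, recall),
            "f1_reported": f1,
        }
    return rows


summary = summarize(REPORTED)

# List models from the largest to the smallest train-test accuracy gap.
for model, row in sorted(summary.items(), key=lambda kv: kv[1]["gap"], reverse=True):
    print(f"{model:20s} gap={row['gap']:.4f}  F1={row['f1_recomputed']:.4f}")
```

A large gap alone does not disqualify a model: the two tree ensembles show the largest gaps yet still have the highest test accuracy, which is why the selection rests on test-set metrics.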
The variable importance of the fitted Random Forest model is shown in Figure 7 below. The spatial indicators Longitude, Latitude, and Premises_Type are the three most important variables in the model. The temporal factors Hour, Month, Day_of_Week, and Season form the next most important group. The socio-economic characteristics of neighbourhoods follow, in descending order of importance: Number_of_Rental_Properties, Permanent_Job_and_Labour_Force_Ratio, Transportation_Service_Worker_and_Population_Ratio, HealthCare_Service_Worker_and_Population_Ratio, and so on. Notably, the newly created variable Difference_in_Individual_Median_Average_Income ranks higher than both Individual_Median_Income and Individual_Average_Income, which suggests that the engineered variable captures information beyond what either original income variable provides on its own.