Big data is growing not only in size, but also in complexity. Predicting when and where crime will occur poses a particularly interesting data challenge: the data are sparse, have complex internal dependencies, and may be affected by many different types of features, including weather, city infrastructure, population demographics, public events, and government policy. Here, I offer an approach for dealing with datasets that are not straightforwardly amenable to classic machine learning techniques. I show that a combination of machine learning, time series modeling, and geostatistics is more effective at predicting future crime than any of these techniques alone.
Using a variety of public data sets, including police reports, the US census, Foursquare, newspapers, and weather data, I discuss how to merge, visualize, model, and deploy this type of multi-dimensional data. Specifically, I engineer spatial features using PostGIS and spatial mapping, and employ targeted statistical techniques (e.g., Bayesian time series decomposition and spatial kriging) and machine learning (e.g., XGBoost and artificial neural networks) to predict future crime. I then deploy this model using a public REST API, allowing real-time modeling of crime “hotspots” over the next week. I consider the challenges of deploying complex “ensembled” models, and discuss techniques to support scalability. Finally, I discuss the features that are most predictive of future crime, and ask how we can use these types of models to understand where crime will occur next, what predicts it, and what we can do to prevent it in the future.
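As a rough illustration of why blending temporal and spatial signals can help with sparse incident data, the toy sketch below bins incidents into a grid and scores one cell for the coming week. It is not the pipeline described above: the function names, the grid size, the four-week window, and the equal weighting are all hypothetical, and the simple lagged averages stand in for the time series decomposition, kriging, and machine learning models named in the abstract.

```python
from collections import Counter


def bin_incidents(incidents, cell_size):
    """Count incidents per (week, grid cell). Each incident is (x, y, week)."""
    counts = Counter()
    for x, y, week in incidents:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[(week, cell)] += 1
    return counts


def neighbors(cell):
    """Return the eight grid cells surrounding `cell`."""
    cx, cy = cell
    return [(cx + dx, cy + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]


def hotspot_score(counts, cell, week, n_lags=4, weight=0.5):
    """Blend a temporal lag and a spatial lag into one score for `week`."""
    # Temporal signal: the cell's own mean count over the previous n_lags weeks.
    temporal = sum(counts[(week - k, cell)] for k in range(1, n_lags + 1)) / n_lags
    # Spatial signal: mean count in the eight neighboring cells last week.
    spatial = sum(counts[(week - 1, nb)] for nb in neighbors(cell)) / 8
    return weight * temporal + (1 - weight) * spatial


# Made-up incidents as (x, y, week); coordinates are in arbitrary units.
incidents = [(0.5, 0.5, 1), (0.7, 0.2, 2), (1.5, 0.5, 3),
             (0.1, 0.9, 3), (0.3, 0.3, 3)]
counts = bin_incidents(incidents, cell_size=1.0)
score = hotspot_score(counts, cell=(0, 0), week=4)
```

A cell with no history of its own can still receive a nonzero score from activity in adjacent cells, which is the kind of information a purely temporal model would miss. In a fuller system, each of the two terms would be replaced by a fitted model and the weight learned rather than fixed.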