
This data science use case was done by Waiz Khan, Bhagya SH, and Saba Nazneen.

Introduction

“The Power of A360 AI” is a series of articles written by Andromeda 360, Inc. data scientists in which we showcase machine learning-powered business applications we have built for our clients to improve business outcomes. A360 AI is an open AI delivery platform that enables enterprises to build and deploy machine learning (ML) models securely into production within minutes. Read this post to find out why A360 AI's capabilities are more effective than those of most cloud-specific AI platforms such as AWS SageMaker. A360 AI Community Edition is freely available to data scientists. You can sign up here and try these sample use cases yourself.

Use Case

Crime is a major issue that every country in the world tackles in its own way. Crime affects residents' mental well-being as well as their finances on a larger scale. In this project, we aim to find an optimal solution that can reduce the number of crimes in the Boston metropolitan area by (1) building an ML model that predicts the number of crimes each day and (2) analyzing the data to find insights, information, and root causes. Our ultimate goal is to help decrease the crime rate.

Dataset

The dataset of emergencies reported in Boston contains the following 17 columns:

(1) INCIDENT_NUMBER: The unique ID of every reported emergency. 
(2) OFFENSE_CODE: The unique code associated with each type of emergency. 
(3) OFFENSE_CODE_GROUP: The group of emergencies to which the OFFENSE_CODE belongs. 
(4) OFFENSE_DESCRIPTION: A description of the reported emergency. 
(5) DISTRICT: One of the 12 unique district codes into which Boston is divided. 
(6) REPORTING_AREA: The area code from which the emergency was reported. 
(7) SHOOTING: Whether or not a shooting was reported at the location of the emergency. 
(8) OCCURRED_ON_DATE: The date and time at which the emergency occurred. 
(9) YEAR: Year of the reported emergency. 
(10) MONTH: Month of the reported emergency. 
(11) DAY_OF_WEEK: The day of the week on which the emergency was reported. 
(12) HOUR: The hour of day of the reported emergency. 
(13) UCR_PART: The Uniform Crime Reports part, i.e., Part One (serious crimes), Part Two (intermediate-level crimes), or Part Three (low-level crimes). 
(14) STREET: Name of the street where the emergency occurred. 
(15) Lat: Latitude of the emergency. 
(16) Long: Longitude of the emergency. 
(17) Location: A tuple of Lat and Long, i.e., (Lat, Long). 

Workflow

In this project, we followed the standard data science workflow: 

  1. Exploratory data analysis (EDA)
  2. Data preprocessing and feature engineering
  3. Model training, experiments, and evaluation
  4. Model deployment
  5. Inference
  6. Monitoring
  7. Further analysis 

Note that the notebooks for each step above are provided in the A360 AI examples GitHub repository.

1. Exploratory data analysis

Since the business values and goals are clear, the first step is to clean the data. Data is king in data science, so we need to keep it clean in order to extract the best insights and information from it and generate the best solution.
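As a rough illustration, a minimal cleaning pass on this dataset might look like the sketch below; the file name and exact rules are illustrative, not the project's actual code:

```python
import pandas as pd

# Load the incident-level data (file name assumed for illustration).
df = pd.read_csv("boston_crime.csv")

# Typical cleaning steps for this dataset: drop duplicate rows,
# parse the timestamp, and remove rows with no usable location.
df = df.drop_duplicates()
df["OCCURRED_ON_DATE"] = pd.to_datetime(df["OCCURRED_ON_DATE"])
df = df.dropna(subset=["DISTRICT", "Lat", "Long"])

# SHOOTING is blank when no shooting was reported; make that explicit.
df["SHOOTING"] = df["SHOOTING"].fillna("N")
```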

After cleaning the data, we get to know the dataset by generating visualizations and seeing whether we can find insights. Look at the graphs below:

On the upper left, we see that the crime rate increased from 2015 to 2017 and has decreased every year since.

On the bottom right, we see that Friday has the highest crime rate, followed by Wednesday and Thursday. Since the next day is the weekend, people may enjoy going out, which makes them easier targets for criminals and also leads to other types of emergencies, such as minor accidents.

Here we can see that, on every day of the week, the most frequently reported emergencies fall under UCR Part Three, the non-lethal emergencies/crimes; we assume many of these crimes happen out of necessity to survive.

From the above graph, we can see that summer has the most reported emergencies, perhaps because people are mostly out of the house, which makes them easy targets for theft and robbery. Winter has the fewest reported emergencies, perhaps because snowfall makes it harder for offenders to commit crimes and because people are less likely to be outside in freezing temperatures.

We found more insights, but unfortunately not every insight can be included on a webpage. Nevertheless, you can look here.

2. Data Preprocessing and Feature Engineering

Now that the EDA is done, we need to curate the data for modeling. We created additional features from the existing data that expose stronger patterns: for example, we derived SEASON from the MONTH column, and we added columns such as snowfall (inches), precipitation, and external factors that contribute to crimes and emergencies, such as the unemployment rate. For categorical columns like UCR_PART and SEASON, we performed one-hot encoding, which increased our number of features from 17 to 42. For the datetime column OCCURRED_ON_DATE, we converted the values into ordinal numbers, since most ML algorithms only work with numbers.
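In pandas, these two transformations might look like the following sketch; the exact column list is illustrative:

```python
import pandas as pd

# One-hot encode the categorical columns; pd.get_dummies expands each
# into a set of 0/1 indicator columns (17 -> 42 features in our case).
df = pd.get_dummies(df, columns=["UCR_PART", "SEASON", "DISTRICT", "DAY_OF_WEEK"])

# Convert the datetime column into an ordinal day number so that
# standard ML algorithms can consume it as a plain integer.
df["OCCURRED_ON_DATE"] = pd.to_datetime(df["OCCURRED_ON_DATE"]).map(
    pd.Timestamp.toordinal
)
```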

Up to this point, we did not have a feature for the number of emergencies occurring each day, which is what we want to predict. So we needed to create our target variable. We grouped the data by date, weekday, and district, and counted the number of emergencies reported each day in each district using the existing INCIDENT_NUMBER feature, which left us with ~32,000 data points.
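A minimal sketch of this aggregation, assuming `raw` is the cleaned incident-level frame and `NUM_EMERGENCIES` is our illustrative name for the target:

```python
import pandas as pd

# Count incidents per (date, weekday, district); the daily count per
# district becomes the regression target.
raw["DATE"] = pd.to_datetime(raw["OCCURRED_ON_DATE"]).dt.date
daily = (
    raw.groupby(["DATE", "DAY_OF_WEEK", "DISTRICT"])["INCIDENT_NUMBER"]
       .count()
       .reset_index(name="NUM_EMERGENCIES")  # target variable
)
```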

Highlight of A360 AI functionality

The A360 AI platform provides an easy-to-use API, called the MDK (model development kit), to connect cloud storage to the working JupyterLab environment. In SageMaker, you would have to use the Python package boto3 and write 10-20 lines of code to download/upload your data from/to an S3 bucket. With the A360 MDK API (a360ai), you can load the CSV file directly from S3 and write the feature-engineered data frame back to S3 with a single line of code, using a360ai.load_dataset and a360ai.write_dataset.
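For example, the round trip might look like the sketch below; only the two function names come from the MDK as described above, and the arguments are illustrative:

```python
import a360ai

# Load the raw CSV straight from cloud storage in one call,
# instead of writing boto3 download boilerplate.
df = a360ai.load_dataset("boston_crime.csv")  # argument illustrative

# ... cleaning and feature engineering on df ...

# Write the feature-engineered frame back to S3 in one call.
a360ai.write_dataset(df, "boston_crime_features.csv")  # arguments illustrative
```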

3. Model training

After feature engineering, our training data was ready. We treat this as a regression problem because we are predicting the number of crimes that may happen on a particular day of the week in a particular district. We split the training data into train/test sets with a 70/30 ratio. We built models with many algorithms, including SVM, XGBoost, AdaBoost, LinearRegression, LGBMRegressor, and RandomForestRegressor, and performed hyperparameter tuning with GridSearchCV and RandomizedSearchCV on the XGBoost model, since it produced the best r2_score among all the regression algorithms we tried. We also tried AutoML techniques such as TPOTRegressor and AutoKeras and did some feature selection. Below is a graphical representation of the applied algorithms' performance.
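A condensed sketch of the split and the XGBoost grid search, assuming `X` and `y` are the engineered features and target from step 2 (grid values illustrative):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

# 70/30 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Small illustrative hyperparameter grid for XGBoost.
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(XGBRegressor(), param_grid, cv=3, scoring="r2")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test r2:", r2_score(y_test, search.best_estimator_.predict(X_test)))
```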

Highlight of A360 AI functionality

With the A360 MDK, you can easily track your model experiments and hyperparameter tuning. By adding a few lines of code to log your hyperparameters and metrics, the MDK tracks each experiment and provides a clean table showing the metrics corresponding to each hyperparameter configuration, so you can quickly see which model performs best.

With this proof-of-concept model, the best accuracy score is around 88%. We then deployed the model as a cloud endpoint that can be used to make predictions in the client's business application.

4. Model Deployment


Highlight of A360 AI functionality

Starpack is the key technology of A360 AI for Model Deployment as Code (MDaC), building on the concepts of Terraform's Infrastructure as Code (IaC) approach. Starpack uses a declarative language (a YAML specification) to deploy ML models automatically via GitOps. Together with a UI console, A360 AI completely abstracts away the infrastructure complexity of ML deployment for data scientists.

5. Inference

Once our REST API is available, we can invoke it with new input data. The endpoint URL can be retrieved easily from the Deployment Hub UI. In the notebook, we simply used the Python requests library with an API key to send input data in JSON format to the endpoint and got the prediction result back. The inference process on the A360 AI platform is very straightforward.
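Concretely, the call might look like the sketch below; the endpoint URL, header name, and payload fields are placeholders, not the actual deployment's schema:

```python
import requests

ENDPOINT = "https://<your-endpoint-from-deployment-hub>/predict"  # placeholder
API_KEY = "<your-api-key>"                                        # placeholder

# Payload fields are illustrative; send whatever features the model expects.
payload = {"DATE": "2021-07-04", "DAY_OF_WEEK": "Sunday", "DISTRICT": "B2"}

resp = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},  # auth scheme assumed
)
print(resp.json())  # predicted number of emergencies
```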

6. Monitoring - Data Drift

Deploying the model is a big milestone: the model can now be used in business applications. However, the job is not done yet. Data scientists will want to closely monitor model performance as new data continues to come in.

The left panel shows the accumulated sigma value for each feature. If the sigma value rises above 3, the color turns red, alerting data scientists to possible data drift. The right panel shows how sigma changes over time, which is useful for examining time-series drift and for distinguishing a single anomalous data point from drift that persists across multiple data points.
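To make the idea concrete, here is a toy sketch of the 3-sigma intuition; this is not the platform's implementation:

```python
import numpy as np

def drift_sigma(reference: np.ndarray, new_batch: np.ndarray) -> float:
    """How many reference standard deviations the new batch's mean
    has shifted away from the reference (training-time) mean."""
    mu, sd = reference.mean(), reference.std()
    return abs(new_batch.mean() - mu) / sd

ref = np.random.normal(10, 2, size=5000)  # a feature at training time
new = np.random.normal(17, 2, size=200)   # the same feature in production
print(drift_sigma(ref, new))              # ~3.5, above 3 -> drift alert
```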

7. Further Analysis

As we have seen, the performance of our best model is around 78.80%, which is not considered very good. But not every dataset is meant only for model building; every dataset hides information that helps in targeting the problem. Don't forget that our main aim is still to decrease the number of crimes.

So, we did some research to help fight the crime rate and play a role in creating a better and safer community.

We started by researching the number of prisoners in Boston jails but, unfortunately, could not find data for Boston specifically. Since Boston is a city in Massachusetts, we searched for prison data for Massachusetts instead and found that there are ~14,000 prisoners in total, on whom total spending is ~$650M per year (roughly $46,000 per prisoner per year).

Next, we divided the prisoners by race, and this is where it started getting interesting.

From the above table, we can see that white prisoners are the most numerous, followed by Black and Hispanic prisoners. (You might conclude that most wrongdoers are white, but wait!) Let's look at the percentage of each race's total population that is in prison.

From here we can conclude that Black residents have the highest percentage of their population in jail, and white residents the lowest. Now, we suspect there is some correlation between race and crime rates.

Remember that while performing EDA we found that district B2 has the highest crime rate, followed by its surrounding districts. Let's look at Boston's population density by race and check whether the assumption we made earlier about the connection between race and crime rate is correct.

We found some demographic data from a research study, shown below.

From the above image, we see that Black residents are concentrated in the center of the city and its surrounding districts. So we can confirm our assumption about the connection between race and crime rates.

Are we done now? No. The solution is still pending!

Major question: why Black residents?

With a little more research, we found the following information about Black residents.

Here we can see that the unemployment rate among Black residents is the highest, if we ignore the "other race" category, which combines the unemployment rates of several other races.

Assumption alert! Unemployment leads to poverty, and poverty makes crime a necessity.

Why is unemployment so high among Black residents, despite their having the second-highest population density after white residents?

Further, we found the reason hidden in the education rates of each race.

We found that, up through high school, the education rate is almost the same for every race, but look at the bachelor's degree rate (yes, the green bar): Black residents have the lowest rate of bachelor's degrees.

Now everything is making sense. Let's connect the dots!

The low higher-education rate among Black residents drives their high unemployment rate; unemployment increases poverty; and poverty pushes people to commit non-lethal crimes for survival.

So, we finally conclude that low educational attainment indirectly leads to an increase in the crime rate.

If the responsible authorities focus on finding the reasons for the low education rate among Black residents and provide them access to higher education, the crime rate should decrease, which will benefit the community and save the millions of taxpayer dollars spent on prisoners each year.

Conclusion

In this project, we showcased a business use case on decreasing the crime rate in Boston and walked you through the data science process you can follow on the A360 AI platform to tackle this problem. We also demonstrated how our MDK and Starpack make data scientists more efficient at building and deploying ML models, as well as at processing data from the cloud and monitoring data drift.
