
This data science use case was developed by Mamta Maduginki, Raeena Firdous, and Syeda Ayesha.

Introduction

“The Power of A360 AI” is a series of articles written by Andromeda 360, Inc. data scientists where we showcase machine learning-powered business applications we have built for our clients to improve business outcomes. A360 AI is an open AI delivery platform that enables enterprises to build and deploy machine learning (ML) models securely into production within minutes. Read this post to find out why A360 AI capabilities are more effective than most cloud-specific AI platforms such as AWS SageMaker. A360 AI Community Edition is freely available to data scientists. You can sign up here and try these sample use cases yourself. 

Use Case

“Attrition in human resources refers to the gradual loss of employees over time. In general, relatively high attrition is problematic for companies. HR professionals often assume a leadership role in designing company compensation programs, work culture, and motivation systems that help the organization retain top employees.”

Our goal is to uncover the factors that lead to employee attrition through exploratory data analysis, and then to use those factors in various classification models to predict whether an employee is likely to quit. Such predictions could greatly improve HR's ability to intervene in time and prevent attrition.

Dataset

  • The dataset is publicly available from here 
  • The data columns are:   
  1. BUSINESS_TRAVEL – How often the employee travels for work or business purposes.
  2. DAILY_RATE – The prescribed amount of pay for a given job, paid by the day or hour.
  3. DEPARTMENT – The department the employee belongs to.
  4. EDUCATION – The number of years of education completed.
  5. EDUCATION_FIELD – The field in which the employee was educated.
  6. ENVIRONMENT_SATISFACTION – A numerical value for satisfaction with the work environment.
  7. HOURLY_RATE – The hourly salary.
  8. JOB_INVOLVEMENT – A person with a high level of job involvement usually obtains major life satisfaction from the job (1 means High, 4 means Low).
  9. JOB_LEVEL – Categories with different titles and salary ranges within a workplace.
  10. JOB_ROLE – The job position.
  11. JOB_SATISFACTION – The feeling of fulfillment or enjoyment that an employee derives from their job (1 means High, 4 means Low).
  12. MONTHLY_INCOME – The amount paid to the employee per month.
  13. NUM_COMPANIES_WORKED – The number of companies the employee has worked for, as a measure of prior experience.
  14. OVER_TIME – Whether the employee works overtime.
  15. PERCENT_SALARY_HIKE – (New salary – Old salary) * 100 / (Old salary).
  16. PERFORMANCE_RATING – A recorded rating of the employee's performance.
  17. STOCK_OPTION_LEVEL – The level of stock options issued by the company to encourage employee ownership. Company shares are offered to employees at discounted rates.
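As a quick worked example of the PERCENT_SALARY_HIKE formula in item 15, here is a minimal sketch. The function name and the salary figures are illustrative assumptions, not values taken from the dataset.

```python
def percent_salary_hike(old_salary: float, new_salary: float) -> float:
    """Return the salary increase as a percentage of the old salary."""
    if old_salary <= 0:
        raise ValueError("old_salary must be positive")
    return (new_salary - old_salary) * 100 / old_salary


# Hypothetical figures: a raise from 5,000 to 5,750 per month.
hike = percent_salary_hike(5000, 5750)
print(hike)  # 15.0
```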

The dataset consists of 1,000,000 data points (1,000,000 rows) and 35 features (35 columns). Our target variable is ‘Attrition’, which takes two values: ‘Yes’ and ‘No’.

Workflow

In this project, we followed the standard data science workflow:  

  1. Exploratory data analysis (EDA)
  2. Data preprocessing and feature engineering
  3. Model training, experiments, and evaluation
  4. Model deployment
  5. Inference
  6. Monitoring  

 

Note that the notebooks for each step above are provided in the Employee Attrition GitHub repository.

1. Exploratory data analysis

Since the business value and goals are clear, the first step is to understand the dataset by generating visualizations from the data and seeing whether we can find some insights.

Fig-1: Analysis of Attrition Rate

This is a classification problem, and the graph above shows an equal number of data points for each class, so our dataset is balanced.

Fig-2: Impact of Business Travel and Age on Attrition

From the graph above, we can observe that employees aged 30-40 or 50-60 who travel rarely are the most likely to leave.

Fig-3: Analysis of Attrition Rate across Departments

From the graph above, we can observe that the attrition rate is roughly the same in every department. No specific department stands out with a high or low attrition rate.
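The checks behind Fig-1 and Fig-3 come down to counting labels overall and per group. The following is a minimal sketch using only the Python standard library. The rows, the ATTRITION column name, and the helper names are assumptions made for illustration, and in practice a dataframe library would do the same job on the full dataset.

```python
from collections import Counter, defaultdict

# Hypothetical rows; the real data comes from the public dataset.
rows = [
    {"DEPARTMENT": "Sales", "ATTRITION": "Yes"},
    {"DEPARTMENT": "Sales", "ATTRITION": "No"},
    {"DEPARTMENT": "Research & Development", "ATTRITION": "Yes"},
    {"DEPARTMENT": "Research & Development", "ATTRITION": "No"},
    {"DEPARTMENT": "Human Resources", "ATTRITION": "Yes"},
    {"DEPARTMENT": "Human Resources", "ATTRITION": "No"},
]


def class_balance(rows, target="ATTRITION"):
    """Share of rows in each target class."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def attrition_rate_by(rows, column, target="ATTRITION", positive="Yes"):
    """Fraction of rows with the positive label within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[column]] += 1
        positives[row[column]] += row[target] == positive
    return {group: positives[group] / totals[group] for group in totals}


balance = class_balance(rows)
rates = attrition_rate_by(rows, "DEPARTMENT")
print(balance)
print(rates)
```

A balanced dataset shows shares close to 0.5 for each class, and similar per-department rates correspond to the flat picture described for Fig-3.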

2. Data Preprocessing and Feature Engineering

After visualizing the data, we preprocessed it by removing outliers and cleaning out inconsistent records, such as an employee whose age is 25 but who has supposedly worked at the company for 27 years. After data cleaning, the dataset shape is 546,221 rows by 28 columns. For categorical features such as business travel and department, we applied one-hot encoding. We also normalized the data. After feature engineering, the number of columns grew from 28 to 46.
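The three preprocessing steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code from our notebooks: the rows, the column names AGE and YEARS_AT_COMPANY, the minimum working age of 18, and the choice of min-max scaling for "normalization" are all assumptions. Outlier removal is not shown.

```python
# Hypothetical rows and column names, for illustration only.
rows = [
    {"AGE": 25, "YEARS_AT_COMPANY": 27, "BUSINESS_TRAVEL": "Travel_Rarely", "MONTHLY_INCOME": 4000},
    {"AGE": 35, "YEARS_AT_COMPANY": 10, "BUSINESS_TRAVEL": "Travel_Frequently", "MONTHLY_INCOME": 6000},
    {"AGE": 50, "YEARS_AT_COMPANY": 20, "BUSINESS_TRAVEL": "Non-Travel", "MONTHLY_INCOME": 9000},
]

MIN_WORKING_AGE = 18  # assumed threshold, not stated in the article


def is_consistent(row):
    """Tenure cannot exceed the years elapsed since the minimum working age."""
    return row["YEARS_AT_COMPANY"] <= row["AGE"] - MIN_WORKING_AGE


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


def min_max_scale(rows, column):
    """Rescale a numeric column to the [0, 1] range."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid division by zero for constant columns
    return [{**row, column: (row[column] - low) / span} for row in rows]


clean = [row for row in rows if is_consistent(row)]  # drops the 25-year-old with 27 years of tenure
encoded = one_hot(clean, "BUSINESS_TRAVEL")
scaled = min_max_scale(encoded, "MONTHLY_INCOME")
```

One-hot encoding is what makes the column count grow: each categorical column is replaced by one indicator column per category.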

Highlight of A360 AI functionality

The A360 AI platform provides an easy-to-use API, called MDK (model development kit), to connect cloud storage with the working Jupyter Lab environment. In SageMaker, you would have to use the Python package boto3 and write at least 10-20 lines of code to download or upload your data from or to an S3 bucket. With the A360 MDK API (a360ai), you can load a CSV file directly from S3 and write the feature-engineered dataframe back to S3 with one line of code each, using a360ai.load_dataset and a360ai.write_dataset.

3. Model training

After preprocessing, our data was ready for model training. We treat this as a classification problem because we are predicting whether or not an employee will leave. We split the data into train and test sets with an 80/20 ratio. We first applied EvalML, an AutoML library, to examine various algorithms, and then manually trained models with several classification algorithms, including logistic regression, decision tree, random forest, XGBoost, LightGBM, and CatBoost. Of the algorithms we tried, LightGBM obtained the best accuracy.
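The 80/20 split and the accuracy metric mentioned above can be sketched with the standard library alone. The helper names, the seed, and the toy rows are assumptions for illustration. The models themselves (LightGBM and the others) are not reproduced here; a trivial always-"No" baseline stands in to show how accuracy is computed on the held-out 20%.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical data: 100 labelled rows, half "Yes" and half "No".
rows = [{"id": i, "ATTRITION": "Yes" if i % 2 else "No"} for i in range(100)]
train, test = split_train_test(rows)

# A trivial baseline that always predicts "No"; any real model should beat it.
y_true = [row["ATTRITION"] for row in test]
baseline_accuracy = accuracy(y_true, ["No"] * len(test))
```

Fixing the seed keeps the split reproducible, so different algorithms are compared on the same held-out rows.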

This first model did not perform well, reaching an accuracy of only 69%, so we developed a second model. We created a new variable, “attrition_within_a_year”, from the “YearsAtCompany” feature to indicate whether an employee leaves within a year. We used “attrition_within_a_year” as the target variable and dropped the “attrition” column from the dataset. The new target was imbalanced, so we handled it with an oversampling technique (SMOTE). Below are the graphical results for model-1, “Employee Attrition Prediction”, and model-2, “Employee Attrition within a Year Prediction”.
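To make the two ideas above concrete, here is a simplified sketch. The exact rule for “attrition_within_a_year” is an assumption (the employee left and had at most one year of tenure), and the oversampling function is a bare-bones illustration of the SMOTE idea, interpolating between a minority point and one of its nearest minority neighbours. It is not the imbalanced-learn implementation, and the feature values are made up.

```python
import random


def attrition_within_a_year(row):
    """Assumed rule: the employee left and had at most one year of tenure."""
    return int(row["Attrition"] == "Yes" and row["YearsAtCompany"] <= 1)


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical numeric features for the majority (0) and minority (1) classes.
majority = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.2), (0.1, 0.4)]
minority = [(0.8, 0.9), (0.9, 0.8), (0.7, 0.7)]

new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

Oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.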

Fig-4: Attrition

Fig-5: Attrition within a year

As the ROC curves above show, model-2, “Employee Attrition within a Year Prediction”, performed better with almost all of the algorithms we applied.

With this proof-of-concept model-2, the best accuracy score is around 96%. We then deployed this model as a cloud endpoint, which can be used to make predictions in the client's business application.
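ROC AUC and accuracy measure different things: AUC scores how well the model ranks leavers above stayers across all thresholds, while accuracy depends on one chosen threshold. A minimal sketch of both, using made-up labels and scores and hypothetical helper names:

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the share of positive/negative
    pairs in which the positive gets the higher score (ties count as half)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def accuracy_at(y_true, scores, threshold=0.5):
    """Accuracy after turning scores into labels at a fixed threshold."""
    predictions = [int(s >= threshold) for s in scores]
    return sum(t == p for t, p in zip(y_true, predictions)) / len(y_true)


# Made-up example: 1 = leaves, 0 = stays.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.35, 0.8]

auc = roc_auc(y_true, scores)        # 0.75
acc = accuracy_at(y_true, scores)    # 0.5
```

The pairwise loop is quadratic in the number of rows, which is fine for a demonstration but not for a million-row dataset, where a sort-based computation from a library would be the usual choice.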

Highlight of A360 AI functionality

With A360 MDK, you can easily track your model experiments and hyperparameter tuning. After you add a few lines of code to log your hyperparameters and metrics, MDK tracks each experiment and provides a clean table of metrics against hyperparameters, so you can quickly see which model has the best result.
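The idea of logging hyperparameters and metrics per run and then comparing them can be sketched generically. This is not the MDK API: the log_run helper, the run names, and the metric values are placeholders invented for illustration and are not results from this project.

```python
runs = []


def log_run(model_name, hyperparameters, metrics):
    """Record one experiment so that runs can be compared later."""
    runs.append({"model": model_name, "params": hyperparameters, "metrics": metrics})


# Hypothetical runs; the numbers are placeholders, not results from the article.
log_run("logistic_regression", {"C": 1.0}, {"accuracy": 0.81})
log_run("random_forest", {"n_estimators": 200, "max_depth": 8}, {"accuracy": 0.88})
log_run("lightgbm", {"num_leaves": 31, "learning_rate": 0.1}, {"accuracy": 0.91})

# Pick the run with the highest accuracy.
best = max(runs, key=lambda run: run["metrics"]["accuracy"])

# Print a small comparison table, best run first.
for run in sorted(runs, key=lambda r: r["metrics"]["accuracy"], reverse=True):
    print(f'{run["model"]:<20} {run["metrics"]["accuracy"]:.2f}  {run["params"]}')
```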

4. Model Deployment

On the A360 AI platform, deployment is fast and easy. You only need to set a final run with the MDK API in the modeling notebook. A360 AI's packaging technology, called Starpack, then fetches the model artifacts and the training data baseline and packages the model as a Docker container. This container is deployed automatically into a scalable, secure Kubernetes pod and exposed as a cloud endpoint REST API. All that is left is a few clicks in the platform UI (A360 Deployment Hub). Below are a few screenshots of the deployment process in the UI. Saving the endpoint API key during deployment is a required step, because the API key is needed to invoke the cloud endpoint. The whole deployment process took only about 5 minutes to complete.

Fig-6: A360 platform for deployment

Highlight of A360 AI functionality

Starpack is the key technology behind A360 AI's Model Deployment as Code (MDaC) approach, which builds on the concepts of Terraform, an Infrastructure as Code (IaC) tool. Starpack uses a declarative language (a YAML specification) to deploy ML models automatically, leveraging GitOps. Together with a UI console, A360 AI completely abstracts the infrastructure complexity of ML deployment away from data scientists.

In addition, we created a web app and deployed it on Streamlit.

5. Inference

Once our REST API is available, we can invoke it with new input data. The endpoint URL can be easily retrieved from the Deployment Hub UI. In the notebook, we simply used the Python requests library with the API key to send the input data in JSON format to the endpoint, and received the prediction result back. The inference process on the A360 AI platform is very straightforward.
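The call described above amounts to a JSON POST with an API key. Here is a minimal sketch that builds such a request with the standard library's urllib instead of requests, so that it runs without extra packages. The endpoint URL, the "x-api-key" header name, and the input fields are all placeholders and assumptions; use the URL and key shown in the Deployment Hub and the input schema your model expects.

```python
import json
import urllib.request

# Placeholders: replace with the values shown in the Deployment Hub.
ENDPOINT_URL = "https://example.invalid/predict"
API_KEY = "YOUR_API_KEY"


def build_request(features: dict) -> urllib.request.Request:
    """Build a POST request carrying the input features as JSON."""
    body = json.dumps(features).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "x-api-key": API_KEY,  # assumed header name, not confirmed by the platform docs
    }
    return urllib.request.Request(ENDPOINT_URL, data=body, headers=headers, method="POST")


# Hypothetical input record.
request = build_request({"AGE": 35, "MONTHLY_INCOME": 6000, "OVER_TIME": 1})

# Sending is left commented out because the URL above is a placeholder.
# with urllib.request.urlopen(request, timeout=10) as response:
#     prediction = json.loads(response.read())
```

With the requests library, the equivalent call would be requests.post(url, json=payload, headers=headers).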

6. Monitoring - Data Drift

Deploying the model is a big milestone, because the model can now actually be used in the business application. However, the job is not done yet. Data scientists need to monitor model performance closely as new data keeps coming in.

Highlight of A360 AI functionality

A360 AI has a pre-built monitoring dashboard that helps data scientists monitor data drift. By default, the dashboard uses the sigma metric (the mean value of the standard deviation of the training data) to monitor drift. If the sigma value exceeds a threshold of about 2-3, the dashboard flags data drift, and data scientists should examine the new incoming data to see whether model re-training is required.
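One plausible reading of the sigma metric is sketched below: for each feature, measure how many training standard deviations the mean of the new data has moved from the training mean, then average across features. This interpretation, the function name, the threshold of 2, and the sample values are all assumptions; the dashboard's actual computation may differ.

```python
from statistics import mean, stdev


def drift_sigma(train_columns, new_columns):
    """Average, over features, of how many training standard deviations
    the new data's mean has moved away from the training mean."""
    shifts = []
    for name, train_values in train_columns.items():
        spread = stdev(train_values)
        if spread == 0:
            continue  # a constant training column gives no usable scale
        shifts.append(abs(mean(new_columns[name]) - mean(train_values)) / spread)
    return mean(shifts) if shifts else 0.0


# Hypothetical training baseline and two batches of incoming data.
train = {"AGE": [30, 40, 50], "MONTHLY_INCOME": [4000, 5000, 6000]}
similar = {"AGE": [35, 40, 45], "MONTHLY_INCOME": [4500, 5000, 5500]}
drifted = {"AGE": [60, 70, 80], "MONTHLY_INCOME": [8000, 9000, 10000]}

THRESHOLD = 2  # assumed, taken from the lower end of the 2-3 range above

sigma_similar = drift_sigma(train, similar)   # 0.0, no drift
sigma_drifted = drift_sigma(train, drifted)   # 3.5, above the threshold
needs_review = sigma_drifted > THRESHOLD
```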

Conclusion

Here we showcased a use case aimed at preventing future loss of talent through attrition, and walked you through the data science process you can follow on the A360 AI platform to tackle this problem. We also demonstrated how A360's MDK and Starpack can make data scientists more efficient at building and deploying ML models, processing data from the cloud, and monitoring data drift.
