IBM HR Analytics - Employee Attrition & Performance
Thank you to Kaggle user pavansubhash for contributing this dataset to experiment with. This is presented as just a topic exploration and skill-building, so all code is visible - this is not in a condensed "Final Presentation" format nor is it fully exhaustive.
The Data in Question
This fictional dataset, created by IBM data scientists, represents employee data from one business including numerous data points regarding the employees and. The goal is to analyze this data and gain insight into patterns and factors that affect employee performance within and attrition from the business.
The data file includes:
See Appendix for information on codified data.
This is not an exhuastive list and I will pare down the available data to make this project more concise and readable.
Age - The employee's age at time of data collection. - integerAttrition - Whether this employee left or stayed employed - string, "Yes" or "No"BusinessTravel - A category representing how often an employee travels for work - stringDepartment - The company's internal department that an employee works in - stringDistanceFromHome - The employee's daily commute one-way, in miles- integerEducation - A codified representation of an employee's highest attained education - intEducationField - The employee's general studied field - stringEmployeeNumber - A unique identifier for the employee - intEnvironmentSatisfaction- A codified representation of the employee's satisfaction with their work environment from an assessment - intGender - The employee's gender - stringHourlyRate - The employee's hourly wage in dollars - intJobLevel - A codified representation of the employee's rank in the organization - intJobRole - The employee's title - stringJobSatisfaction - A codified representation of the employee's satisfaction from an assessment - intMaritalStatus - The employee's marital status - stringMonthlyIncome - The employee's monthly income - intNumCompaniesWorked - The number of companies the employee has worked at previously - intOver18 - A flag showing the employee's age over 18 years - string, "Yes" or "No"OverTime - A flag showing the employee's eligibility for overtime pay - string, "Yes" or "No"PerformanceRating - A codified representation of the employee's assessed work performance - intRelationshipSatisfaction - A codified representation of the employee's satisfaction of their personal relationship - intStockOptionLevel - Unknown - intTotalWorkingYears - The employee's accumulated years of experience at any employer - intTrainingTimesLastYear - The number of trainings this employee completed in the last year - intWorkLifeBalance - A codified representation of the employee's work-life balance from an assessment - intYearsAtCompany - The employee's accumulated years of experience at this employer - intYearsInCurrentRole - The employee's accumulated years worked in their current role - intYearsSinceLastPromotion - The employee's accumulated years in this particular role without a promotion - intYearsWithCurrManager - The employee's accumulated years working with their current manager - intLet's get started
There is a lot of data here that needs to be categorized, labeled, or re-categorized.
import os
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import seaborn.objects as so
import matplotlib as mpl
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None
#List files in directory
print('The files in our directory are:')
for file_path in os.scandir('Data/'):
if file_path.is_file():
print(file_path.name)
The files in our directory are: IBM-HR-Employee-Attrition - Workbook.xlsx IBM-HR-Employee-Attrition.csv
#import the data
import pandas as pd
pd.options.mode.chained_assignment = None
#Specify our file names
imb_hr_csv = 'Data/IBM-HR-Employee-Attrition.csv'
#import Alfred data, and add filename column
df_ibm_hr = pd.read_csv(imb_hr_csv, low_memory=False) #Read file into dataframe
#Verify the data fields in each file
print('File Name:',imb_hr_csv)
df_ibm_hr.info()
File Name: Data/IBM-HR-Employee-Attrition.csv <class 'pandas.core.frame.DataFrame'> RangeIndex: 1470 entries, 0 to 1469 Data columns (total 35 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 1470 non-null int64 1 Attrition 1470 non-null object 2 BusinessTravel 1470 non-null object 3 DailyRate 1470 non-null int64 4 Department 1470 non-null object 5 DistanceFromHome 1470 non-null int64 6 Education 1470 non-null int64 7 EducationField 1470 non-null object 8 EmployeeCount 1470 non-null int64 9 EmployeeNumber 1470 non-null int64 10 EnvironmentSatisfaction 1470 non-null int64 11 Gender 1470 non-null object 12 HourlyRate 1470 non-null int64 13 JobInvolvement 1470 non-null int64 14 JobLevel 1470 non-null int64 15 JobRole 1470 non-null object 16 JobSatisfaction 1470 non-null int64 17 MaritalStatus 1470 non-null object 18 MonthlyIncome 1470 non-null int64 19 MonthlyRate 1470 non-null int64 20 NumCompaniesWorked 1470 non-null int64 21 Over18 1470 non-null object 22 OverTime 1470 non-null object 23 PercentSalaryHike 1470 non-null int64 24 PerformanceRating 1470 non-null int64 25 RelationshipSatisfaction 1470 non-null int64 26 StandardHours 1470 non-null int64 27 StockOptionLevel 1470 non-null int64 28 TotalWorkingYears 1470 non-null int64 29 TrainingTimesLastYear 1470 non-null int64 30 WorkLifeBalance 1470 non-null int64 31 YearsAtCompany 1470 non-null int64 32 YearsInCurrentRole 1470 non-null int64 33 YearsSinceLastPromotion 1470 non-null int64 34 YearsWithCurrManager 1470 non-null int64 dtypes: int64(26), object(9) memory usage: 402.1+ KB
pd.set_option('display.max_columns', None) #Display all of the columns
display(df_ibm_hr)
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1465 | 36 | No | Travel_Frequently | 884 | Research & Development | 23 | 2 | Medical | 1 | 2061 | 3 | Male | 41 | 4 | 2 | Laboratory Technician | 4 | Married | 2571 | 12290 | 4 | Y | No | 17 | 3 | 3 | 80 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 |
| 1466 | 39 | No | Travel_Rarely | 613 | Research & Development | 6 | 1 | Medical | 1 | 2062 | 4 | Male | 42 | 2 | 3 | Healthcare Representative | 1 | Married | 9991 | 21457 | 4 | Y | No | 15 | 3 | 1 | 80 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |
| 1467 | 27 | No | Travel_Rarely | 155 | Research & Development | 4 | 3 | Life Sciences | 1 | 2064 | 2 | Male | 87 | 4 | 2 | Manufacturing Director | 2 | Married | 6142 | 5174 | 1 | Y | Yes | 20 | 4 | 2 | 80 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
| 1468 | 49 | No | Travel_Frequently | 1023 | Sales | 2 | 3 | Medical | 1 | 2065 | 4 | Male | 63 | 2 | 2 | Sales Executive | 2 | Married | 5390 | 13243 | 2 | Y | No | 14 | 3 | 4 | 80 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |
| 1469 | 34 | No | Travel_Rarely | 628 | Research & Development | 8 | 3 | Medical | 1 | 2068 | 2 | Male | 82 | 4 | 2 | Laboratory Technician | 3 | Married | 4404 | 10228 | 2 | Y | No | 12 | 3 | 1 | 80 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 |
1470 rows Γ 35 columns
What data do we need to wrangle?
Data represented in strings is very useful for us as humans to draw charts and understand easily, but if we want to do scientific analysis on most of this data, we need to transform the string fields to integers that represent those strings. For instance, Attrition as a "Yes" or a "No" is much easier to discern when reading a chart, but a computer will not understand it to analyze it in the way we want to. We can transform a "No" to a 0 and a "Yes" to a 1, then we can run algorithms like regressions on those easily.
We can also categorize Attrition the opposite way, "Yes" being 0 and "No" being 1. Whichever perspective we take we just have to ensure that we understand that while we're analyzing the data. In this case we will stick with 1 meaning "Yes".
We will need to categorize the following variables:
AttritionBusinessTravelDepartmentEducationFieldJobRoleMaritalStatusOverTimeLet's understand those data fields a little better.
df_ibm_hr_attrition = df_ibm_hr.groupby(df_ibm_hr.Attrition)['Attrition'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
display(df_ibm_hr_attrition)
df_ibm_hr_businesstravel = df_ibm_hr.groupby(df_ibm_hr.BusinessTravel)['BusinessTravel'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
display(df_ibm_hr_businesstravel)
df_ibm_hr_department = df_ibm_hr.groupby(df_ibm_hr.Department)['Department'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
display(df_ibm_hr_department)
df_ibm_hr_educationfield = df_ibm_hr.groupby(df_ibm_hr.EducationField)['EducationField'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
display(df_ibm_hr_educationfield)
df_ibm_hr_jobrole = df_ibm_hr.groupby(df_ibm_hr.JobRole)['JobRole'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
display(df_ibm_hr_jobrole)
df_ibm_hr_maritalstatus = df_ibm_hr.groupby(df_ibm_hr.MaritalStatus)['MaritalStatus'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
display(df_ibm_hr_maritalstatus)
df_ibm_hr_overtime = df_ibm_hr.groupby(df_ibm_hr.OverTime)['OverTime'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
display(df_ibm_hr_overtime)
| Attrition | Rows | |
|---|---|---|
| 0 | No | 1233 |
| 1 | Yes | 237 |
| BusinessTravel | Rows | |
|---|---|---|
| 0 | Non-Travel | 150 |
| 1 | Travel_Frequently | 277 |
| 2 | Travel_Rarely | 1043 |
| Department | Rows | |
|---|---|---|
| 0 | Human Resources | 63 |
| 1 | Research & Development | 961 |
| 2 | Sales | 446 |
| EducationField | Rows | |
|---|---|---|
| 0 | Human Resources | 27 |
| 1 | Life Sciences | 606 |
| 2 | Marketing | 159 |
| 3 | Medical | 464 |
| 4 | Other | 82 |
| 5 | Technical Degree | 132 |
| JobRole | Rows | |
|---|---|---|
| 0 | Healthcare Representative | 131 |
| 1 | Human Resources | 52 |
| 2 | Laboratory Technician | 259 |
| 3 | Manager | 102 |
| 4 | Manufacturing Director | 145 |
| 5 | Research Director | 80 |
| 6 | Research Scientist | 292 |
| 7 | Sales Executive | 326 |
| 8 | Sales Representative | 83 |
| MaritalStatus | Rows | |
|---|---|---|
| 0 | Divorced | 327 |
| 1 | Married | 673 |
| 2 | Single | 470 |
| OverTime | Rows | |
|---|---|---|
| 0 | No | 1054 |
| 1 | Yes | 416 |
I think we can leave out some of those variables for now.
Let's get rid of those variables with multiple. The only exception is BusinessTravel as that is easily categorized in ascending order with Non-Travel, Travel-Rarely, and Travel-Frequently.
#Make a new dataframe with only useful data
df_ibm_hr_filtered = df_ibm_hr[[
'Age'
,'Attrition'
,'BusinessTravel'
#,'Department'
,'DistanceFromHome'
,'Education'
,'EnvironmentSatisfaction'
,'Gender'
,'HourlyRate'
,'JobLevel'
,'JobSatisfaction'
#,'MaritalStatus'
,'MonthlyIncome'
,'NumCompaniesWorked'
,'OverTime'
,'PerformanceRating'
,'RelationshipSatisfaction'
,'StockOptionLevel'
,'TotalWorkingYears'
,'TrainingTimesLastYear'
,'WorkLifeBalance'
,'YearsAtCompany'
,'YearsInCurrentRole'
,'YearsSinceLastPromotion'
,'YearsWithCurrManager'
]]
df_ibm_hr_cat = df_ibm_hr_filtered
#Replace string data with integer data
df_ibm_hr_cat['Attrition'] = np.where(df_ibm_hr_cat['Attrition'] == 'Yes', 1, 0) #Categorize Attrition as "Yes" = 1, "No = 0
df_ibm_hr_cat['Gender'] = np.where(df_ibm_hr_cat['Gender'] == 'Female', 0, 1) #Categorize Gender alphabetically as Female = 0, Male = 1
df_ibm_hr_cat['OverTime'] = np.where(df_ibm_hr_cat['OverTime'] == 'Yes', 1, 0) #Categorize OverTime as "Yes" = 1, "No" = 0
df_ibm_hr_cat['BusinessTravel'] = df_ibm_hr_cat['BusinessTravel'].map({'Non-Travel': 0, 'Travel_Rarely': 1, 'Travel_Frequently': 2})
#df_ibm_hr_cat['BusinessTravel'] = np.where(df_ibm_hr_cat['BusinessTravel'] == 'Non-Travel', 0) #Categorize BusinessTravel
#df_ibm_hr_cat['BusinessTravel'] = np.where(df_ibm_hr_cat['BusinessTravel'] == 'Travel_Rarely', 1) #Categorize BusinessTravel
#df_ibm_hr_cat['BusinessTravel'] = np.where(df_ibm_hr_cat['BusinessTravel'] == 'Travel_Frequently', 2) #Categorize BusinessTravel
print("Our original dataset contains:")
display(df_ibm_hr_filtered)
print('\n')
print("And our codified dataset contains:")
display(df_ibm_hr_cat)
Our original dataset contains:
| Age | Attrition | BusinessTravel | DistanceFromHome | Education | EnvironmentSatisfaction | Gender | HourlyRate | JobLevel | JobSatisfaction | MonthlyIncome | NumCompaniesWorked | OverTime | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1 | 1 | 1 | 2 | 2 | 0 | 94 | 2 | 4 | 5993 | 8 | 1 | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | 0 | 2 | 8 | 1 | 3 | 1 | 61 | 2 | 2 | 5130 | 1 | 0 | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | 1 | 1 | 2 | 2 | 4 | 1 | 92 | 1 | 3 | 2090 | 6 | 1 | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | 0 | 2 | 3 | 4 | 4 | 0 | 56 | 1 | 3 | 2909 | 1 | 1 | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | 0 | 1 | 2 | 1 | 1 | 1 | 40 | 1 | 2 | 3468 | 9 | 0 | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1465 | 36 | 0 | 2 | 23 | 2 | 3 | 1 | 41 | 2 | 4 | 2571 | 4 | 0 | 3 | 3 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 |
| 1466 | 39 | 0 | 1 | 6 | 1 | 4 | 1 | 42 | 3 | 1 | 9991 | 4 | 0 | 3 | 1 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |
| 1467 | 27 | 0 | 1 | 4 | 3 | 2 | 1 | 87 | 2 | 2 | 6142 | 1 | 1 | 4 | 2 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
| 1468 | 49 | 0 | 2 | 2 | 3 | 4 | 1 | 63 | 2 | 2 | 5390 | 2 | 0 | 3 | 4 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |
| 1469 | 34 | 0 | 1 | 8 | 3 | 2 | 1 | 82 | 2 | 3 | 4404 | 2 | 0 | 3 | 1 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 |
1470 rows Γ 23 columns
And our codified dataset contains:
| Age | Attrition | BusinessTravel | DistanceFromHome | Education | EnvironmentSatisfaction | Gender | HourlyRate | JobLevel | JobSatisfaction | MonthlyIncome | NumCompaniesWorked | OverTime | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1 | 1 | 1 | 2 | 2 | 0 | 94 | 2 | 4 | 5993 | 8 | 1 | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | 0 | 2 | 8 | 1 | 3 | 1 | 61 | 2 | 2 | 5130 | 1 | 0 | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | 1 | 1 | 2 | 2 | 4 | 1 | 92 | 1 | 3 | 2090 | 6 | 1 | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | 0 | 2 | 3 | 4 | 4 | 0 | 56 | 1 | 3 | 2909 | 1 | 1 | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | 0 | 1 | 2 | 1 | 1 | 1 | 40 | 1 | 2 | 3468 | 9 | 0 | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1465 | 36 | 0 | 2 | 23 | 2 | 3 | 1 | 41 | 2 | 4 | 2571 | 4 | 0 | 3 | 3 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 |
| 1466 | 39 | 0 | 1 | 6 | 1 | 4 | 1 | 42 | 3 | 1 | 9991 | 4 | 0 | 3 | 1 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |
| 1467 | 27 | 0 | 1 | 4 | 3 | 2 | 1 | 87 | 2 | 2 | 6142 | 1 | 1 | 4 | 2 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
| 1468 | 49 | 0 | 2 | 2 | 3 | 4 | 1 | 63 | 2 | 2 | 5390 | 2 | 0 | 3 | 4 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |
| 1469 | 34 | 0 | 1 | 8 | 3 | 2 | 1 | 82 | 2 | 3 | 4404 | 2 | 0 | 3 | 1 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 |
1470 rows Γ 23 columns
Let's see if any data is correlated to another
I like to start with this step because we can gain some great insight into our data, and in turn our organization. For instance, maybe we hadn't thought about it before but Employee's with higher education levels might be correlated with one particular gender, meaning they are more likely to be one gender than the other.
What is correlation?
Correlation, in statistics, is a representation of the relationship between two things, or variables - they are represented on a scale from negative 1 to positive 1 based on the strength of the relationship in something called a "correlation coefficient". The closer the correlation coefficient is to positive or negative 1 the stronger that relationship is. The positivity or negativity depends on the context of the data itself.
Credit to Boston University School of Public Health for the following 2 images
from IPython.display import Image, display
display(Image(filename='Images/Correlation_Coefficients.png'))
display(Image(filename='Images/Correlation_Coefficient_Examples.png'))
# Compute the correlation matrix
corr = df_ibm_hr_cat.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 18))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(20, 230, sep=7, s=95, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
heatmap_tri = sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0, annot=True, fmt='.3f',
square=True, linewidths=.5, cbar_kws={"shrink": .5})
heatmap_tri.set_title('Triangle Correlation Heatmap - IBM HR Data', fontdict={'fontsize':18}, pad=16);
Does anything interesting come out here?
It sure does! Here are some items of note:
EnvironmentSatisfaction scoresJobSatisfaction scoresLet's look right at our attrition data
The moment we've all been waiting for. Understanding our employees more completely is great, but right now we are very interested in understanding what factors lead to employee attrition.
df_ibm_hr_cat.corr()[['Attrition']].sort_values(by='Attrition', ascending=False)
plt.figure(figsize=(1, 6))
heatmap_attrition = sns.heatmap(df_ibm_hr_cat.corr()[['Attrition']].sort_values(by='Attrition', ascending=False), vmin=-1, vmax=1, annot=True, fmt='.3f', cmap=cmap)
heatmap_attrition.set_title('Features Correlating with Attrition', fontdict={'fontsize':18}, pad=16);
Very interesting data here
My first note is that nothing is outwardly strong in correlation with Attrition. This might discourage some but what I have found through other research is that if we know this, and we know our attrition rates for different groups of employees, say attrition rates of men and women, that if they show a significant gap to each other that we can still find these otherwise "weak" correlations as somewhat telling. What a weak correlation tends to mean in this case is that, those attrition rates may be close, maybe within a few percentage points, but we could still want to pay attention to those factors. All we have to do is reframe what we call "weak" and "strong" correlations to our results - so if our strongest correlation is 0.30, then that is now our "strong" correlation. I have found previously that my cutoff tends to be 0.05 for a relationship worthy of noting.
My second note is that it seems odd that OverTime is positively correlated with Attrition. When I think of something positive, I usually think of something being "good". That is until I remember that our context for Attrition is that "Yes", leaving the company, is represented by 1. In this case a 1 for an employee being authorized to work overtime correlate to 1 for "Yes". So what we are seeing is that employee's who are authorized to work overtime are, at some level, leaving the company more often. This is why it important to codify our data in a way that makes sense to us when we are thinking about the data and always ensure that when we are looking at and referencing analysis results that we keep those contexts in mind.
Looking at the rest, it seems like everything makes common sense:
StockOptionLevel stay more oftenInterestingly, there was no appreciable change in attrition for employees who had been rated higher or lower in terms of their performance.
Let's dig into that OverTime data a little further
We can visualize our data in another way that might make it more clear what we are talking about with correlations.
I will switch to aggregating the data and creating charts in Microsoft Excel as that is a familiar platform for me where I have done this type of research before.
display(Image(filename='Images/employee_attrition_by_overtime_auth.png'))
We can see here a large disparity in both the total number of employees who have been authorized for overtime, as well as a disparity in the proportion of employees authorized or not authorized for overtime who left the company. Let's look just at the attrition rates themselves, which would be the number of employees who left out of the total number of employees - in this case we are separating that by Overtime Authorization.
display(Image(filename='Images/employee_attrition_by_overtime_auth.png'))
Wow! It sure seems as if employees who are authorized to work overtime leave the company at a higher rate, 3 times higher, in fact, than employees who are not. This is a visual representation of our correlation coefficients - and keep in mind that the correlation factor between OverTime and Attrition was only 0.246, a correlation coefficient that is considered "weak" according to statistics literature.
So we can use correlation matrixes/heatmaps to quickly identify groups of employees that we want to investigate further, that we can see some connection between, and then dive into the data from there, using Pivot Tables to quickly manipulate the data into any many different frames of reference.
Let's dive in further, then.
We can also see a very weak correlation between OverTime and Gender so let's investigate that further.
display(Image(filename='Images/overtime_auth_by_gender.png'))
We don's see much of a difference here between the ratio of men and women who are authorized to work overtime, which is why the correlation coefficient is so weak, only -0.042. But let's see if there is a difference in attrition rates for men and women who have been authorized for overtime.
display(Image(filename='Images/attrition_rate_by_overtime_auth_and_gender.png'))
Now that is quite interesting, for employees who were not authorized for overtime there is a negligible different in attrition rates. But our men who are authorized for overtime leave at a considerably higher rate than our women who are authorized to work overtime, by 8 percentage points!
Conclusions
This came to be an interesting analysis. We found that there is a disparity in attrition rates between our company's employee's who are authorized to work overtime, and especilly a disparity in those employees based on their gender. This would warrant some further research into the experiences of men and women who are authorized to work overtime to find any reasons for that.
That is one area that this research lacks is that it only sees relationships, but not necessarily causal relationships. Someone may need to add more questions to an employee exit interview to investigate this further to determine a reason for this disparity we found.
In the future it is possible to look into every variable and every combination of variables. Of course we will strategically choose which to look further into to save time for ourselves based on some of our preliminary work along the way.