Naive Bayes
Predict Attrition & Salary
Case Study 2: Attrition and Salary EDA & Analysis
Vo Nguyen
2022-12-06
Introduction
YouTube Presentation: https://www.youtube.com/watch?v=3SXL037Iprc
RShiny App: https://vochannguyen.shinyapps.io/VoShiny2/
We are working with DDSAnalytics to create a model that predicts employee turnover from the employee data. We will fit multiple models to find the best one and to identify the factors that lead to attrition. We will also identify the top three factors that contribute to turnover. Additionally, we will predict salary for the given test dataset.
Case Study 2 Analysis Agenda:
1) Explore Graphs and Trends in the Data for Possible Factors of Attrition, for Numerical and Categorical Variables
2) Determine Influential Factors in Attrition
3) Find the Best Model for Attrition - KNN or Naive Bayes
4) Run the Attrition Model using a Test Set
5) Run a Multiple Linear Regression for Salary Prediction using All Predictors, then Use the Statistically Significant Predictors
My Top Three Predictors for Attrition
1) Monthly Income
2) Job Level
3) Overtime
Salary Predictors:
Data Overview
- Our dataset contains numerical and categorical variables. We will be using dummy variables for our categorical variables.
- We will remove Over18 from our analysis because it has only one value, “Y”
- We will remove StandardHours from our analysis because it has only one value, 80
- ID is a numerical identifier for each observation, so we will not use it in our analysis
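The single-value columns noted above can also be found programmatically. The following is a minimal sketch on a made-up data frame (the `toy` object and its values are assumptions, not the case-study data). Run against the real data, it may flag further columns, such as EmployeeCount, if they turn out to be constant across all rows.

```r
# Toy data frame standing in for employeeData (values are made up)
toy <- data.frame(
  ID = 1:4,
  Age = c(32, 40, 35, 24),
  Over18 = c("Y", "Y", "Y", "Y"),
  StandardHours = c(80, 80, 80, 80),
  stringsAsFactors = FALSE
)

# TRUE for every column that holds a single distinct value
is_constant <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))
constant_cols <- names(toy)[is_constant]
print(constant_cols)

# Drop the constant columns plus the row identifier
toy_clean <- toy[, setdiff(names(toy), c(constant_cols, "ID")), drop = FALSE]
print(names(toy_clean))
```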
#Needed Libraries
library(XML)
library(dplyr)
library(tidyr)
library(stringi)
library(ggplot2)
library(class)
library(caret)
library(e1071)
library(stringr)
library(naniar)
library(rmarkdown)
library(readxl)
library(GGally)
#Read Data
employeeData <- read.csv("CaseStudy2-data.csv")
#Data Overview
str(employeeData)
## 'data.frame': 870 obs. of 36 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 32 40 35 32 24 27 41 37 34 34 ...
## $ Attrition : chr "No" "No" "No" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" ...
## $ DailyRate : int 117 1308 200 801 567 294 1283 309 1333 653 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Sales" ...
## $ DistanceFromHome : int 13 14 18 1 2 10 5 10 10 10 ...
## $ Education : int 4 3 2 4 1 2 5 4 4 4 ...
## $ EducationField : chr "Life Sciences" "Medical" "Life Sciences" "Marketing" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
## $ EnvironmentSatisfaction : int 2 3 3 3 1 4 2 4 3 4 ...
## $ Gender : chr "Male" "Male" "Male" "Female" ...
## $ HourlyRate : int 73 44 60 48 32 32 90 88 87 92 ...
## $ JobInvolvement : int 3 2 3 3 3 3 4 2 3 2 ...
## $ JobLevel : int 2 5 3 3 1 3 1 2 1 2 ...
## $ JobRole : chr "Sales Executive" "Research Director" "Manufacturing Director" "Sales Executive" ...
## $ JobSatisfaction : int 4 3 4 4 4 1 3 4 3 3 ...
## $ MaritalStatus : chr "Divorced" "Single" "Single" "Married" ...
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ MonthlyRate : int 9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
## $ NumCompaniesWorked : int 2 1 2 1 1 1 2 2 1 1 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "No" "No" "No" "No" ...
## $ PercentSalaryHike : int 11 14 11 19 13 21 12 14 19 14 ...
## $ PerformanceRating : int 3 3 3 3 3 4 3 3 3 3 ...
## $ RelationshipSatisfaction: int 3 1 3 3 3 3 1 3 4 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 1 0 0 2 0 2 0 3 1 1 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ TrainingTimesLastYear : int 3 2 2 3 2 4 5 5 2 3 ...
## $ WorkLifeBalance : int 2 4 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
## $ YearsSinceLastPromotion : int 0 4 2 5 1 1 0 0 0 7 ...
## $ YearsWithCurrManager : int 3 9 2 7 3 7 3 0 0 7 ...
Checking for NA’s
There are no NAs in our dataset
#Check for NA's in each column dataset
colSums(is.na(employeeData))
## ID Age Attrition
## 0 0 0
## BusinessTravel DailyRate Department
## 0 0 0
## DistanceFromHome Education EducationField
## 0 0 0
## EmployeeCount EmployeeNumber EnvironmentSatisfaction
## 0 0 0
## Gender HourlyRate JobInvolvement
## 0 0 0
## JobLevel JobRole JobSatisfaction
## 0 0 0
## MaritalStatus MonthlyIncome MonthlyRate
## 0 0 0
## NumCompaniesWorked Over18 OverTime
## 0 0 0
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 0 0 0
## StandardHours StockOptionLevel TotalWorkingYears
## 0 0 0
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 0 0 0
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 0 0 0
Summary of Attrition
- There are more “No” than “Yes” in the Attrition column
- No (730) and Yes (140)
- Our first step was to compare Attrition against the numerical predictors related to money. We found that Job Level and Monthly Income had different means across the two Attrition groups, which could contribute to Attrition.
#The count of Attrition of Yes and No
employeeData %>% count(Attrition)
## Attrition n
## 1 No 730
## 2 Yes 140
#Attrition Plot Count
employeeData %>% ggplot(aes(x=Attrition,fill=Attrition)) +
geom_bar()+
ggtitle("Attrition Count") +
xlab("Attrition")+ylab("Count")
### Pairs Plot for Attrition to Numerical Values
employeeData %>% select_if(is.numeric) %>% mutate(Attrition=employeeData$Attrition) %>% select(c(3,9,11,13,14,28)) %>% ggpairs(aes(colour = Attrition))
1st Influential Predictor: Monthly Income
Exploring Monthly Income
- According to our exploratory data analysis, monthly income is a strong indicator of Attrition.
- The histogram of Monthly Income by Attrition is right-skewed, and the distribution has a similar shape for both Yes and No.
- I performed Welch’s two-sample t-test to compare the group means, and the result was that the difference in means is not zero (p-value = 2.412e-07).
- In addition, the mean income of No (6702) is greater than the mean income of Yes (4765). Additionally, I created a graph to compare Attrition to the predictors that are influenced by money.
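To make the Welch test above less of a black box, here is a rough sketch that computes the statistic and its approximate degrees of freedom by hand and compares them with `t.test()`. The two income vectors are simulated and are not the case-study data; the variable names are assumptions chosen for illustration.

```r
set.seed(1)
# Simulated incomes for two groups; these are NOT the case-study data
income_no  <- rnorm(200, mean = 6700, sd = 4500)
income_yes <- rnorm(60,  mean = 4800, sd = 3500)

# Welch statistic: difference in means over the unpooled standard error
se2_no  <- var(income_no)  / length(income_no)
se2_yes <- var(income_yes) / length(income_yes)
t_manual <- (mean(income_no) - mean(income_yes)) / sqrt(se2_no + se2_yes)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (se2_no + se2_yes)^2 /
  (se2_no^2 / (length(income_no) - 1) + se2_yes^2 / (length(income_yes) - 1))

# t.test() uses the Welch version by default (var.equal = FALSE)
fit <- t.test(income_no, income_yes)
print(c(manual = t_manual, t.test = unname(fit$statistic)))
print(c(manual = df_manual, t.test = unname(fit$parameter)))
```

Because the group variances are not pooled, the degrees of freedom are usually fractional, which is why the reports in this analysis show values such as 228.45.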
### Attrition Vs. MonthlyIncome
employeeData %>% ggplot(aes(x=MonthlyIncome,fill=Attrition))+
geom_histogram()+
ggtitle("Attrition Vs. MonthlyIncome")
### Mean Monthly Income of Attrition
employeeData %>% group_by(Attrition) %>% summarise(compareincomes=mean(MonthlyIncome))
## # A tibble: 2 × 2
## Attrition compareincomes
## <chr> <dbl>
## 1 No 6702
## 2 Yes 4765.
### Welch's Two-Sample T-test to determine Difference in means for Monthly Income
t.test(employeeData$MonthlyIncome~employeeData$Attrition,data=employeeData)
##
## Welch Two Sample t-test
##
## data: employeeData$MonthlyIncome by employeeData$Attrition
## t = 5.3249, df = 228.45, p-value = 2.412e-07
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## 1220.382 2654.047
## sample estimates:
## mean in group No mean in group Yes
## 6702.000 4764.786
2nd Influential Predictor: Job Level
Exploring Job Level
- Job level has an effect on our model: the histogram is right-skewed, with more “Yes” at the lower job levels. This makes sense because employees at the bottom job levels are more likely to quit, whereas those who have moved up to higher job levels are probably less likely to quit.
- We made a jitter plot of Monthly Income vs. Job Level and found noticeably more “Yes” at the lower end of both Monthly Income and Job Level.
- In addition, we performed Welch’s two-sample t-test and determined that the difference in means is not zero (p-value = 4.042e-07).
### Attrition Vs. Job Level Histogram
employeeData %>% ggplot(aes(x=JobLevel,fill=Attrition))+
geom_histogram()+
ggtitle("Attrition Vs. JobLevel")
### Monthly Income Vs. Job Level Jitter Plot
employeeData %>% ggplot(aes(x=JobLevel,y=MonthlyIncome,fill=Attrition, color=Attrition))+
geom_jitter(stat="identity")+
ggtitle("MonthlyIncome Vs. JobLevel")
### Welch's Two Sample T-test for Job Level
t.test(employeeData$JobLevel~employeeData$Attrition,data=employeeData)
##
## Welch Two Sample t-test
##
## data: employeeData$JobLevel by employeeData$Attrition
## t = 5.231, df = 211.76, p-value = 4.042e-07
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## 0.2995698 0.6618784
## sample estimates:
## mean in group No mean in group Yes
## 2.116438 1.635714
3rd Influential Predictor: OverTime
Exploring Overtime
- Our third influential predictor is the categorical variable “OverTime.”
- OverTime has the response “Yes” or “No”
- OverTime is clearly skewed: employees who work overtime tend to quit more often.
- Compared to the other categorical variables, OverTime shows a larger difference in the share of “Yes” between its levels.
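The comparison of attrition rates across OverTime levels can also be read from a table of row proportions, which is what a filled bar chart displays. This is a small sketch on invented counts (the `toy` data frame is an assumption, not the case-study data).

```r
# Toy data; the counts are invented for illustration only
toy <- data.frame(
  OverTime  = c("No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  Attrition = c("No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes")
)

counts <- table(OverTime = toy$OverTime, Attrition = toy$Attrition)
# margin = 1 normalizes each row, giving the attrition rate within each OverTime level
rates <- prop.table(counts, margin = 1)
print(round(rates, 2))
```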
### Attrition Vs. OverTime
employeeData %>%
ggplot(aes(x=OverTime,fill=Attrition))+
geom_bar(position="fill")+ggtitle("Attrition Vs. Overtime")+
scale_y_continuous(labels = scales::percent)
EDA on Other Categorical Variables
When we graphed the rest of the categorical variables, we saw some interesting trends. Sales Representatives tend to quit more than the other job roles. Job Satisfaction is fairly even among the “Yes” group. Single people tend to quit more than Married or Divorced people. Relationship Satisfaction is fairly even as well. Business Travel has its largest “Yes” count in Travel Rarely, but I believe that Overtime plays a more important role in Attrition.
### Percentage Compares for Job Role
ggplot(employeeData, aes(x = JobRole, fill = Attrition)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
### Attrition Vs. Job Satisfaction
employeeData %>%
ggplot(aes(x=JobSatisfaction,fill=Attrition))+
geom_bar()+
ggtitle("Attrition Vs. Job Satisfaction")
### Attrition vs Marital Status
employeeData %>%
ggplot(aes(x=MaritalStatus,fill=Attrition))+
geom_bar(position="fill")+
ggtitle("Attrition Vs. Marital Status")
### Attrition Vs. RelationshipSatisfaction
employeeData %>%
ggplot(aes(x=RelationshipSatisfaction,fill=Attrition))+
geom_histogram()+ggtitle("Attrition Vs. RelationshipSatisfaction")+
ylab("Count") # bars are counts, so the axis is labeled as counts rather than percent
### Attrition Vs. BusinessTravel
employeeData %>%
ggplot(aes(x=BusinessTravel,fill=Attrition))+
geom_bar()+ggtitle("Attrition Vs. BusinessTravel")+
ylab("Count") # bars are counts, so the axis is labeled as counts rather than percent
Data Prep - Cleanup and Wrangling
- Create dummy variables for the categorical variables
- OverTime is changed to 0 or 1
- Scaled Age, Monthly Income, Hourly Rate, Monthly Rate, Percent Salary hike, and Daily Rate
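As an alternative to writing one `ifelse()` per level, dummy columns can be generated in one call. The sketch below uses base R's `model.matrix()` on a toy column (the `toy` data frame is an assumption). It also shows why keeping every level together with an intercept is redundant, which is consistent with the "not defined because of singularities" note that `lm()` prints later for BTNone, DepRD, EFOther, Female, and Married.

```r
# Toy column using the BusinessTravel levels that appear in the dataset
toy <- data.frame(
  BusinessTravel = c("Travel_Rarely", "Travel_Frequently", "Non-Travel", "Travel_Rarely")
)

# "- 1" drops the intercept so every level gets its own 0/1 column
all_levels <- model.matrix(~ BusinessTravel - 1, data = toy)
print(colnames(all_levels))

# With the intercept kept, the first level becomes the reference and is omitted
with_reference <- model.matrix(~ BusinessTravel, data = toy)
print(colnames(with_reference))
```

In the full set of dummies each row sums to 1, so the columns add up to the intercept column; a regression can only estimate all but one of them.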
# Created Dataset for Naive Bayes
employeeData3 = read.csv("CaseStudy2-data.csv")
# Make the OverTime column binary (1 = Yes, 0 = No)
employeeData3$OverTime = ifelse(employeeData$OverTime=="Yes",1,0)
# Scaled Age, Monthly Income, Hourly Rate, Monthly Rate, Percent Salary hike, and Daily Rate
employeeData3$NAge=scale(employeeData3$Age)
employeeData3$NMonthylyIncome=scale(employeeData3$MonthlyIncome)
employeeData3$NHourlyRate=scale(employeeData3$HourlyRate)
employeeData3$NMonthlyRate=scale(employeeData3$MonthlyRate)
employeeData3$NPercentSalaryHike=scale(employeeData3$PercentSalaryHike)
employeeData3$NDailyRate=scale(employeeData3$DailyRate)
# Created Dummy Variables for Business Travel
employeeData3$BTNone = ifelse(employeeData$BusinessTravel=="Non-Travel",1,0)
employeeData3$BTRare=ifelse(employeeData$BusinessTravel=="Travel_Rarely",1,0)
employeeData3$BTFreq=ifelse(employeeData$BusinessTravel=="Travel_Frequently",1,0)
# Created Dummy Variables for Departments
employeeData3$DepHR=ifelse(employeeData$Department=="Human Resources",1,0)
employeeData3$DepSales=ifelse(employeeData$Department=="Sales",1,0)
employeeData3$DepRD=ifelse(employeeData$Department=="Research & Development",1,0)
# Created Dummy Variables for Education Field
employeeData3$EFHR=ifelse(employeeData$EducationField=="Human Resources",1,0)
employeeData3$EFLS=ifelse(employeeData$EducationField=="Life Sciences",1,0)
employeeData3$EFM=ifelse(employeeData$EducationField=="Marketing",1,0)
employeeData3$EFMed=ifelse(employeeData$EducationField=="Medical",1,0)
employeeData3$EFT=ifelse(employeeData$EducationField=="Technical Degree",1,0)
employeeData3$EFOther=ifelse(employeeData$EducationField=="Other",1,0)
# Created Dummy Variables for Gender
employeeData3$Male=ifelse(employeeData$Gender=="Male",1,0)
employeeData3$Female=ifelse(employeeData$Gender=="Female",1,0)
# Created Dummy Variables for Job Roles
employeeData3$JRHR=ifelse(employeeData$JobRole=="Healthcare Representative",1,0)
employeeData3$JRLT=ifelse(employeeData$JobRole=="Laboratory Technician",1,0)
employeeData3$JRManager=ifelse(employeeData$JobRole=="Manager",1,0)
employeeData3$JRMD=ifelse(employeeData$JobRole=="Manufacturing Director",1,0)
employeeData3$JRRD=ifelse(employeeData$JobRole=="Research Director",1,0)
employeeData3$JRRS=ifelse(employeeData$JobRole=="Research Scientist",1,0)
employeeData3$JRSE=ifelse(employeeData$JobRole=="Sales Executive",1,0)
employeeData3$JRSR=ifelse(employeeData$JobRole=="Sales Representative",1,0)
# Created Dummy Variables for Marital Status
employeeData3$Divorced=ifelse(employeeData$MaritalStatus=="Divorced",1,0)
employeeData3$Single=ifelse(employeeData$MaritalStatus=="Single",1,0)
employeeData3$Married=ifelse(employeeData$MaritalStatus=="Married",1,0)
# Created Dummy Variables for Supervisor roles vs Non-Supervisor Roles
employeeData3$JR1 = ifelse(employeeData$JobRole=="Manager"|employeeData$JobRole=="Research Director",1,0)
Attrition Model: Naive Bayes
Initial Analysis: All Predictors
In our data modeling comparisons, we found that Naive Bayes was the best model. We first fit Naive Bayes using all the predictors. This gave a Sensitivity of 0.9264, a Specificity of 0.3469, and an Accuracy of 0.7088, which fails our requirement of at least 60% for both sensitivity and specificity. We used a 70-30 train-test split of the dataset. Next, we do a forward selection by hand, adding predictors one at a time to find the best set.
set.seed(13)
# Naive Bayes Model (selecting all variables, including scaled continuous and categorical variables), ignoring the variables already ruled out above
naive_data=employeeData3
model2 = naive_data %>% select(c("NAge","NDailyRate","DistanceFromHome", "EnvironmentSatisfaction", "NHourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction", "NMonthylyIncome", "NMonthlyRate", "NumCompaniesWorked", "OverTime", "NPercentSalaryHike", "PerformanceRating", "RelationshipSatisfaction", "YearsAtCompany", "YearsInCurrentRole","YearsSinceLastPromotion", "YearsWithCurrManager", "BTRare","BTFreq","BTNone","DepHR","DepSales","EFHR","EFLS","EFM","EFMed","EFT","EFOther","Male","Female","JRHR","JRLT" ,"JRManager","JRMD","JRRD","JRRS","JRSE","JRSR", "Divorced","Single","Married","StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear","WorkLifeBalance","Attrition"))
model2$Attrition = as.factor(model2$Attrition)
trainIndices = sample(1:dim(model2)[1],round(.70 * dim(model2)[1]))
train = model2[trainIndices,]
test = model2[-trainIndices,]
# Note: the classifier is fit on all rows of model2, not only on `train`
classifier1 = naiveBayes(model2[,c(1:47)],model2$Attrition)
pred = predict(classifier1,newdata=test)
confusionMatrix(table(test$Attrition,pred))
## Confusion Matrix and Statistics
##
## pred
## No Yes
## No 151 64
## Yes 12 34
##
## Accuracy : 0.7088
## 95% CI : (0.6496, 0.7632)
## No Information Rate : 0.6245
## P-Value [Acc > NIR] : 0.002628
##
## Kappa : 0.3057
##
## Mcnemar's Test P-Value : 4.913e-09
##
## Sensitivity : 0.9264
## Specificity : 0.3469
## Pos Pred Value : 0.7023
## Neg Pred Value : 0.7391
## Prevalence : 0.6245
## Detection Rate : 0.5785
## Detection Prevalence : 0.8238
## Balanced Accuracy : 0.6367
##
## 'Positive' Class : No
##
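The orientation of the table passed to `confusionMatrix()` affects which numbers are labeled Sensitivity and Specificity. caret documents that, for a table input, rows are treated as predictions and columns as the reference. The call above passes `table(test$Attrition, pred)`, so the actual classes sit in the rows. The following base-R sketch uses the counts printed above and a hypothetical helper named `metrics` to show both orientations; it is an illustration, not a replacement for the reported results.

```r
# Counts copied from the printed table: rows = actual class, columns = predicted class
cm <- matrix(c(151, 64,
               12, 34),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("No", "Yes"), pred = c("No", "Yes")))

# Helper (hypothetical name): metrics for a table whose ROWS are predictions
# and whose COLUMNS are the reference, with "No" as the positive class
metrics <- function(tab) {
  c(sensitivity = tab["No", "No"] / sum(tab[, "No"]),
    specificity = tab["Yes", "Yes"] / sum(tab[, "Yes"]),
    accuracy = sum(diag(tab)) / sum(tab))
}

# As passed in the call above: the predicted labels end up in the reference role
as_passed <- metrics(cm)
# Transposed so that rows are predictions and columns are the actual classes
transposed <- metrics(t(cm))

print(round(as_passed, 4))
print(round(transposed, 4))
```

The first result reproduces the Sensitivity and Specificity printed above; the second matches the values printed there as Pos Pred Value and Neg Pred Value. Accuracy is the same either way. The KNN sections call `table(classifications, test$Attrition)`, which already has predictions in the rows.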
Final Attrition Model Analysis
Best Model: Naive Bayes
In our data modeling comparisons, we found that Naive Bayes gave us the best results: Sensitivity 0.8941, Specificity 0.8400, and Accuracy 0.8889. We used a 70-30 train-test split of the dataset. We found these predictors by forward selection: I added one variable at a time to the model, and if it increased Sensitivity, Specificity, and Accuracy, I kept it and went on to the next variable. Best Predictor Variables: JobLevel, NMonthylyIncome, NumCompaniesWorked, NMonthlyRate, OverTime, PerformanceRating, YearsWithCurrManager, RelationshipSatisfaction, YearsAtCompany, YearsSinceLastPromotion, BTRare
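The models in this report pick predictors by column position, for example `c(7,9,12,6,...)`. Positional indices depend on column order, and the data frame used in the KNN sections selects an extra DepRD column that the Naive Bayes data frame does not. The sketch below lists the first 28 column names in each order (copied from the `select()` calls) to show which names the same indices resolve to, and then shows selection by name on a toy data frame. The objects `cols_nb`, `cols_knn`, `chosen`, and `toy` are illustrative assumptions.

```r
# Column order of model2 in the Naive Bayes sections (first 28 columns only)
cols_nb <- c("NAge", "NDailyRate", "DistanceFromHome", "EnvironmentSatisfaction",
             "NHourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction",
             "NMonthylyIncome", "NMonthlyRate", "NumCompaniesWorked", "OverTime",
             "NPercentSalaryHike", "PerformanceRating", "RelationshipSatisfaction",
             "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion",
             "YearsWithCurrManager", "BTRare", "BTFreq", "BTNone", "DepHR",
             "DepSales", "EFHR", "EFLS", "EFM", "EFMed")

# Same order with DepRD inserted after DepSales, as in the KNN sections
cols_knn <- append(cols_nb, "DepRD", after = 24)[1:28]

# The positional indices used for the reduced models
idx <- c(7, 9, 12, 6, 11, 10, 13, 15, 20, 21, 23, 24, 26, 28)
print(cols_nb[idx])
print(cols_knn[idx])

# Selecting by name is independent of column order
chosen <- c("JobLevel", "NMonthylyIncome", "OverTime")
toy <- as.data.frame(matrix(0, nrow = 2, ncol = length(cols_knn),
                            dimnames = list(NULL, cols_knn)))
print(names(toy[, chosen]))
```

Printing the resolved names is a quick way to confirm that the indices correspond to the predictor list intended for the model.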
set.seed(13)
# Naive Bayes Model (selecting all variables, including scaled continuous and categorical variables), ignoring the variables already ruled out above
naive_data=employeeData3
model2 = naive_data %>% select(c("NAge","NDailyRate","DistanceFromHome", "EnvironmentSatisfaction", "NHourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction", "NMonthylyIncome", "NMonthlyRate", "NumCompaniesWorked", "OverTime", "NPercentSalaryHike", "PerformanceRating", "RelationshipSatisfaction", "YearsAtCompany", "YearsInCurrentRole","YearsSinceLastPromotion", "YearsWithCurrManager", "BTRare","BTFreq","BTNone","DepHR","DepSales","EFHR","EFLS","EFM","EFMed","EFT","EFOther","Male","Female","JRHR","JRLT" ,"JRManager","JRMD","JRRD","JRRS","JRSE","JRSR", "Divorced","Single","Married","StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear","WorkLifeBalance","Attrition"))
model2$Attrition = as.factor(model2$Attrition)
trainIndices = sample(1:dim(model2)[1],round(.70 * dim(model2)[1]))
train = model2[trainIndices,]
test = model2[-trainIndices,]
# Note: the classifier is fit on all rows of model2, not only on `train`
classifier1 = naiveBayes(model2[,c(7,9,12,6,11,10,13,15,20,21,23,24,26,28)],model2$Attrition)
pred = predict(classifier1,newdata=test)
confusionMatrix(table(test$Attrition,pred))
## Confusion Matrix and Statistics
##
## pred
## No Yes
## No 211 4
## Yes 25 21
##
## Accuracy : 0.8889
## 95% CI : (0.8443, 0.9243)
## No Information Rate : 0.9042
## P-Value [Acc > NIR] : 0.8290315
##
## Kappa : 0.5337
##
## Mcnemar's Test P-Value : 0.0002041
##
## Sensitivity : 0.8941
## Specificity : 0.8400
## Pos Pred Value : 0.9814
## Neg Pred Value : 0.4565
## Prevalence : 0.9042
## Detection Rate : 0.8084
## Detection Prevalence : 0.8238
## Balanced Accuracy : 0.8670
##
## 'Positive' Class : No
##
2nd Attrition Model: KNN
Initial Analysis: All Predictor Variables
We fit a KNN model on the training set, but it failed to meet the requirement of 60% for both sensitivity and specificity. Our KNN model gave us a Sensitivity of 0.9865, a Specificity of 0.1579, and an Accuracy of 0.8659. Thus, we will move on to the Naive Bayes model.
model = employeeData3 %>% select(c("NAge","NDailyRate","DistanceFromHome", "EnvironmentSatisfaction", "NHourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction", "NMonthylyIncome", "NMonthlyRate", "NumCompaniesWorked", "OverTime", "NPercentSalaryHike", "PerformanceRating", "RelationshipSatisfaction", "YearsAtCompany", "YearsInCurrentRole","YearsSinceLastPromotion", "YearsWithCurrManager", "BTRare","BTFreq","BTNone","DepHR","DepSales","DepRD","EFHR","EFLS","EFM","EFMed","EFT","EFOther","Male","Female","JRHR","JRLT" ,"JRManager","JRMD","JRRD","JRRS","JRSE","JRSR", "Divorced","Single","Married","StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear","WorkLifeBalance","Attrition"))
set.seed(13)
iterations = 200
numks = 20
splitPerc = .70
masterAcc = matrix(nrow = iterations, ncol = numks)
for(j in 1:iterations)
{
trainIndices = sample(1:dim(model)[1],round(splitPerc * dim(model)[1]))
train = model[trainIndices,]
test = model[-trainIndices,]
for(i in 1:numks)
{
classifications = knn(train[,c(1:47)],test[,c(1:47)],train$Attrition, prob = TRUE, k = i)
table(classifications,test$Attrition)
CM = confusionMatrix(table(classifications,test$Attrition))
masterAcc[j,i] = CM$overall[1]
}
}
MeanAcc = colMeans(masterAcc)
plot(seq(1,numks,1),MeanAcc, type = "l")
which.max(MeanAcc)
## [1] 11
classifications = knn(train[,c(1:47)],test[,c(1:47)],train$Attrition, prob = TRUE, k = which.max(MeanAcc))
table(classifications,test$Attrition)
##
## classifications No Yes
## No 220 32
## Yes 3 6
confusionMatrix(table(classifications,test$Attrition))
## Confusion Matrix and Statistics
##
##
## classifications No Yes
## No 220 32
## Yes 3 6
##
## Accuracy : 0.8659
## 95% CI : (0.8185, 0.9048)
## No Information Rate : 0.8544
## P-Value [Acc > NIR] : 0.3366
##
## Kappa : 0.2113
##
## Mcnemar's Test P-Value : 2.214e-06
##
## Sensitivity : 0.9865
## Specificity : 0.1579
## Pos Pred Value : 0.8730
## Neg Pred Value : 0.6667
## Prevalence : 0.8544
## Detection Rate : 0.8429
## Detection Prevalence : 0.9655
## Balanced Accuracy : 0.5722
##
## 'Positive' Class : No
##
KNN Model
Final Analysis: The best predictor variables are: JobLevel, NMonthylyIncome, NumCompaniesWorked, NMonthlyRate, OverTime, PerformanceRating, YearsWithCurrManager, RelationshipSatisfaction, YearsAtCompany, YearsSinceLastPromotion, BTRare
model = employeeData3 %>% select(c("NAge","NDailyRate","DistanceFromHome", "EnvironmentSatisfaction", "NHourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction", "NMonthylyIncome", "NMonthlyRate", "NumCompaniesWorked", "OverTime", "NPercentSalaryHike", "PerformanceRating", "RelationshipSatisfaction", "YearsAtCompany", "YearsInCurrentRole","YearsSinceLastPromotion", "YearsWithCurrManager", "BTRare","BTFreq","BTNone","DepHR","DepSales","DepRD","EFHR","EFLS","EFM","EFMed","EFT","EFOther","Male","Female","JRHR","JRLT" ,"JRManager","JRMD","JRRD","JRRS","JRSE","JRSR", "Divorced","Single","Married","StockOptionLevel","TotalWorkingYears","TrainingTimesLastYear","WorkLifeBalance","Attrition"))
head(model)
## NAge NDailyRate DistanceFromHome EnvironmentSatisfaction NHourlyRate
## 1 -0.5409772 -1.74071116 13 2 0.3669771
## 2 0.3552859 1.22850265 14 3 -1.0738619
## 3 -0.2048785 -1.53378862 18 3 -0.2789163
## 4 -0.5409772 -0.03546998 1 3 -0.8751255
## 5 -1.4372403 -0.61884196 2 1 -1.6700711
## 6 -1.1011416 -1.29944261 10 4 -1.6700711
## JobInvolvement JobLevel JobSatisfaction NMonthylyIncome NMonthlyRate
## 1 3 2 4 -0.4322305 -0.7140332
## 2 2 5 3 2.8787757 0.4527584
## 3 3 3 4 0.6463532 0.7903879
## 4 3 3 4 0.8769035 1.3654837
## 5 3 1 4 -0.5720831 0.4068970
## 6 3 3 1 0.5225956 -1.3387886
## NumCompaniesWorked OverTime NPercentSalaryHike PerformanceRating
## 1 2 0 -1.1427203 3
## 2 1 0 -0.3264915 3
## 3 2 0 -1.1427203 3
## 4 1 0 1.0338898 3
## 5 1 1 -0.5985678 3
## 6 1 0 1.5780423 4
## RelationshipSatisfaction YearsAtCompany YearsInCurrentRole
## 1 3 5 2
## 2 1 20 7
## 3 3 2 2
## 4 3 14 10
## 5 3 6 3
## 6 3 9 7
## YearsSinceLastPromotion YearsWithCurrManager BTRare BTFreq BTNone DepHR
## 1 0 3 1 0 0 0
## 2 4 9 1 0 0 0
## 3 2 2 0 1 0 0
## 4 5 7 1 0 0 0
## 5 1 3 0 1 0 0
## 6 1 7 0 1 0 0
## DepSales DepRD EFHR EFLS EFM EFMed EFT EFOther Male Female JRHR JRLT
## 1 1 0 0 1 0 0 0 0 1 0 0 0
## 2 0 1 0 0 0 1 0 0 1 0 0 0
## 3 0 1 0 1 0 0 0 0 1 0 0 0
## 4 1 0 0 0 1 0 0 0 0 1 0 0
## 5 0 1 0 0 0 0 1 0 0 1 0 0
## 6 0 1 0 1 0 0 0 0 1 0 0 0
## JRManager JRMD JRRD JRRS JRSE JRSR Divorced Single Married StockOptionLevel
## 1 0 0 0 0 1 0 1 0 0 1
## 2 0 0 1 0 0 0 0 1 0 0
## 3 0 1 0 0 0 0 0 1 0 0
## 4 0 0 0 0 1 0 0 0 1 2
## 5 0 0 0 1 0 0 0 1 0 0
## 6 0 1 0 0 0 0 1 0 0 2
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance Attrition
## 1 8 3 2 No
## 2 21 2 4 No
## 3 10 2 3 No
## 4 14 3 3 No
## 5 6 2 3 No
## 6 9 4 2 No
iterations = 1
numks = 20
splitPerc = .70
masterAcc = matrix(nrow = iterations, ncol = numks)
for(j in 1:iterations)
{
trainIndices = sample(1:dim(model)[1],round(splitPerc * dim(model)[1]))
train = model[trainIndices,]
test = model[-trainIndices,]
for(i in 1:numks)
{
classifications = knn(train[,c(7,9,12,6,11,10,13,15,20,21,23,24,26,28)],test[,c(7,9,12,6,11,10,13,15,20,21,23,24,26,28)],train$Attrition, prob = TRUE, k = i)
table(classifications,test$Attrition)
CM = confusionMatrix(table(classifications,test$Attrition))
masterAcc[j,i] = CM$overall[1]
}
}
MeanAcc = colMeans(masterAcc)
plot(seq(1,numks,1),MeanAcc, type = "l")
which.max(MeanAcc)
## [1] 4
max(MeanAcc)
## [1] 0.8237548
classifications = knn(train[,c(7,9,12,6,11,10,13,15,20,21,23,24,26,28)],test[,c(7,9,12,6,11,10,13,15,20,21,23,24,26,28)],train$Attrition, prob = TRUE, k = which.max(MeanAcc))
table(classifications,test$Attrition)
##
## classifications No Yes
## No 208 38
## Yes 5 10
confusionMatrix(table(classifications,test$Attrition))
## Confusion Matrix and Statistics
##
##
## classifications No Yes
## No 208 38
## Yes 5 10
##
## Accuracy : 0.8352
## 95% CI : (0.7846, 0.8781)
## No Information Rate : 0.8161
## P-Value [Acc > NIR] : 0.2386
##
## Kappa : 0.2519
##
## Mcnemar's Test P-Value : 1.061e-06
##
## Sensitivity : 0.9765
## Specificity : 0.2083
## Pos Pred Value : 0.8455
## Neg Pred Value : 0.6667
## Prevalence : 0.8161
## Detection Rate : 0.7969
## Detection Prevalence : 0.9425
## Balanced Accuracy : 0.5924
##
## 'Positive' Class : No
##
Attrition Model Conclusion
Naive Bayes with the predictors JobLevel, MonthlyIncome, NumCompaniesWorked, MonthlyRate, OverTime, PerformanceRating, YearsWithCurrManager, RelationshipSatisfaction, YearsAtCompany, YearsSinceLastPromotion, and BTRare is the best predictive model of Attrition.
Top Three Factors of Attrition:
- Monthly Income
- Job Level
- Overtime
Predicting Attrition using Test Set
Naive Bayes Model Predictors: JobLevel, MonthlyIncome, NumCompaniesWorked, MonthlyRate, OverTime, PerformanceRating, YearsWithCurrManager, RelationshipSatisfaction, YearsAtCompany, YearsSinceLastPromotion, BTRare
### Adjusting the No Attrition Dataset to Fit our Model
employeetestdata=read.csv("CaseStudy2CompSetNoAttrition.csv")
employeetestdata$NAge=scale(employeetestdata$Age)
employeetestdata$NMonthylyIncome=scale(employeetestdata$MonthlyIncome)
employeetestdata$NHourlyRate=scale(employeetestdata$HourlyRate)
employeetestdata$NMonthlyRate=scale(employeetestdata$MonthlyRate)
employeetestdata$NPercentSalaryHike=scale(employeetestdata$PercentSalaryHike)
employeetestdata$NDailyRate=scale(employeetestdata$DailyRate)
employeetestdata$OverTime = ifelse(employeetestdata$OverTime=="Yes",1,0)
employeetestdata$BTNone = ifelse(employeetestdata$BusinessTravel=="Non-Travel",1,0)
employeetestdata$BTRare=ifelse(employeetestdata$BusinessTravel=="Travel_Rarely",1,0)
employeetestdata$BTFreq=ifelse(employeetestdata$BusinessTravel=="Travel_Frequently",1,0)
employeetestdata$DepHR=ifelse(employeetestdata$Department=="Human Resources",1,0)
employeetestdata$DepSales=ifelse(employeetestdata$Department=="Sales",1,0)
employeetestdata$EFHR=ifelse(employeetestdata$EducationField=="Human Resources",1,0)
employeetestdata$EFLS=ifelse(employeetestdata$EducationField=="Life Sciences",1,0)
employeetestdata$EFM=ifelse(employeetestdata$EducationField=="Marketing",1,0)
employeetestdata$EFMed=ifelse(employeetestdata$EducationField=="Medical",1,0)
employeetestdata$EFT=ifelse(employeetestdata$EducationField=="Technical Degree",1,0)
employeetestdata$EFOther=ifelse(employeetestdata$EducationField=="Other",1,0)
employeetestdata$Male=ifelse(employeetestdata$Gender=="Male",1,0)
employeetestdata$Female=ifelse(employeetestdata$Gender=="Female",1,0)
employeetestdata$JRHR=ifelse(employeetestdata$JobRole=="Healthcare Representative",1,0)
employeetestdata$JRLT=ifelse(employeetestdata$JobRole=="Laboratory Technician",1,0)
employeetestdata$JRManager=ifelse(employeetestdata$JobRole=="Manager",1,0)
employeetestdata$JRMD=ifelse(employeetestdata$JobRole=="Manufacturing Director",1,0)
employeetestdata$JRRD=ifelse(employeetestdata$JobRole=="Research Director",1,0)
employeetestdata$JRRS=ifelse(employeetestdata$JobRole=="Research Scientist",1,0)
employeetestdata$JRSE=ifelse(employeetestdata$JobRole=="Sales Executive",1,0)
employeetestdata$JRSR=ifelse(employeetestdata$JobRole=="Sales Representative",1,0)
employeetestdata$Divorced=ifelse(employeetestdata$MaritalStatus=="Divorced",1,0)
employeetestdata$Single=ifelse(employeetestdata$MaritalStatus=="Single",1,0)
employeetestdata$Married=ifelse(employeetestdata$MaritalStatus=="Married",1,0)
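The preparation above calls `scale()` on the test set by itself, so each test column is centered and scaled with the test set's own mean and standard deviation. A common alternative is to reuse the training set's center and spread, so that a given raw value maps to the same scaled value in both sets. The following is a minimal sketch of that alternative on invented numbers; the variable names are assumptions and this is not the method used in this report.

```r
# Toy training and test values; numbers are invented for illustration
train_income <- c(4000, 5000, 6000, 9000)
test_income  <- c(5000, 12000)

# Fit the scaling on the training data and keep its center and spread
train_scaled <- scale(train_income)
center <- attr(train_scaled, "scaled:center")
spread <- attr(train_scaled, "scaled:scale")

# Apply the SAME center and spread to the test data
test_scaled <- scale(test_income, center = center, scale = spread)
print(as.numeric(test_scaled))

# For comparison: scaling the test data on its own statistics
print(as.numeric(scale(test_income)))
```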
Naive Bayes Prediction Model with the No Attrition Test Set
prednoattrition = predict(classifier1,newdata=employeetestdata)
employeetestdata$Attrition=unlist(prednoattrition)
Case2PredictionsAttritionNguyen = data.frame(c(employeetestdata$ID),c(employeetestdata$Attrition))
#Prediction Attrition Set - See CaseStudyPredictionAttrition in Github for full R code
head(Case2PredictionsAttritionNguyen)
## c.employeetestdata.ID. c.employeetestdata.Attrition.
## 1 1171 No
## 2 1172 No
## 3 1173 No
## 4 1174 No
## 5 1175 No
## 6 1176 No
### Prediction Attrition of Test Set
employeetestdata %>% ggplot(aes(x=Attrition,fill=Attrition)) +
geom_bar()+
ggtitle("Predicted Attrition Count") +
xlab("Attrition")+ylab("Count")
###Count of "Yes" and "No"
employeetestdata %>% count(Attrition)
## Attrition n
## 1 No 274
## 2 Yes 26
Predicting Monthly Income - Multiple Linear Regression
For the second part of our assignment, we will predict Salary. We are going to run a multiple linear regression on all predictors to find the ones that are statistically significant. From there, we will keep those predictors and refit the model to see which remain statistically significant.
Multiple Linear Regression Model of All Other Variables (Excluding Attrition)
### Multiple Linear Regression using All Predictors
lmsalary_model1=lm(MonthlyIncome~.,data=lm_salarydf)
### Summary of the Linear Model
summary(lmsalary_model1)
##
## Call:
## lm(formula = MonthlyIncome ~ ., data = lm_salarydf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3650.2 -667.7 0.8 629.1 4147.2
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.980e+01 7.287e+02 0.082 0.93461
## Age -2.207e+00 5.593e+00 -0.395 0.69320
## DailyRate 1.468e-01 9.135e-02 1.607 0.10837
## DistanceFromHome -6.960e+00 4.568e+00 -1.524 0.12798
## EnvironmentSatisfaction -3.612e+00 3.367e+01 -0.107 0.91459
## HourlyRate -3.674e-01 1.827e+00 -0.201 0.84069
## JobInvolvement 1.737e+01 5.327e+01 0.326 0.74449
## JobLevel 2.787e+03 8.351e+01 33.371 < 2e-16 ***
## JobSatisfaction 2.662e+01 3.338e+01 0.798 0.42532
## MonthlyRate -9.082e-03 5.144e-03 -1.765 0.07785 .
## NumCompaniesWorked 3.117e+00 1.681e+01 0.185 0.85296
## OverTime -1.294e+01 8.441e+01 -0.153 0.87822
## PercentSalaryHike 2.467e+01 1.582e+01 1.559 0.11928
## PerformanceRating -3.185e+02 1.615e+02 -1.972 0.04890 *
## RelationshipSatisfaction 1.705e+01 3.329e+01 0.512 0.60875
## YearsAtCompany -4.482e+00 1.363e+01 -0.329 0.74240
## YearsInCurrentRole 5.708e+00 1.703e+01 0.335 0.73759
## YearsSinceLastPromotion 2.983e+01 1.532e+01 1.947 0.05184 .
## YearsWithCurrManager -2.674e+01 1.667e+01 -1.604 0.10900
## BTRare 3.747e+02 1.201e+02 3.119 0.00188 **
## BTFreq 1.939e+02 1.422e+02 1.364 0.17303
## BTNone NA NA NA NA
## DepHR -1.290e+02 4.773e+02 -0.270 0.78708
## DepSales -5.645e+02 3.312e+02 -1.705 0.08865 .
## DepRD NA NA NA NA
## EFHR -8.487e+01 3.950e+02 -0.215 0.82995
## EFLS 5.762e+01 1.600e+02 0.360 0.71879
## EFM 2.342e+01 1.977e+02 0.118 0.90576
## EFMed -4.998e+01 1.634e+02 -0.306 0.75973
## EFT 1.235e+01 1.955e+02 0.063 0.94964
## EFOther NA NA NA NA
## Male 1.111e+02 7.453e+01 1.491 0.13625
## Female NA NA NA NA
## JRHR 1.825e+02 5.150e+02 0.354 0.72315
## JRLT -4.160e+02 4.960e+02 -0.839 0.40189
## JRManager 4.459e+03 4.943e+02 9.022 < 2e-16 ***
## JRMD 3.585e+02 5.117e+02 0.701 0.48377
## JRRD 4.223e+03 5.523e+02 7.646 5.76e-14 ***
## JRRS -1.686e+02 4.951e+02 -0.340 0.73360
## JRSE 6.920e+02 5.107e+02 1.355 0.17582
## JRSR 2.699e+02 5.168e+02 0.522 0.60171
## Divorced -6.652e+01 1.001e+02 -0.665 0.50640
## Single -5.373e+01 1.028e+02 -0.523 0.60138
## Married NA NA NA NA
## StockOptionLevel 3.185e+00 5.693e+01 0.056 0.95540
## TotalWorkingYears 5.176e+01 1.098e+01 4.716 2.82e-06 ***
## TrainingTimesLastYear 2.473e+01 2.914e+01 0.849 0.39630
## WorkLifeBalance -3.669e+01 5.168e+01 -0.710 0.47801
## AttritionYes 8.447e+01 1.155e+02 0.731 0.46485
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1057 on 826 degrees of freedom
## Multiple R-squared: 0.9498, Adjusted R-squared: 0.9471
## F-statistic: 363.2 on 43 and 826 DF, p-value: < 2.2e-16
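The coefficient table above can also be screened programmatically rather than by eye. The following is a minimal sketch, not the author's code: it uses the built-in `mtcars` data as a stand-in for `lm_salarydf`, and `significant_terms` is a hypothetical helper name introduced here for illustration.

```r
# Illustrative data only: mtcars ships with base R and stands in for lm_salarydf.
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# coef(summary(fit)) is a matrix with one row per estimated term and the
# columns "Estimate", "Std. Error", "t value", and "Pr(>|t|)".
coef_table <- coef(summary(fit))

# Hypothetical helper: names of non-intercept terms whose p-value is below alpha.
significant_terms <- function(tab, alpha = 0.05) {
  keep <- tab[, "Pr(>|t|)"] < alpha & rownames(tab) != "(Intercept)"
  rownames(tab)[keep]
}

sig_05 <- significant_terms(coef_table, alpha = 0.05)
sig_10 <- significant_terms(coef_table, alpha = 0.10)

print(sig_05)                   # terms below .05
print(setdiff(sig_10, sig_05))  # terms that are only "close" (between .05 and .10)
```

Separating the two thresholds makes it explicit which terms are significant at .05 and which are only borderline.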
Statistically Significant P-values (< .05):
JobLevel, PerformanceRating, BTRare, JRManager, JRRD, and TotalWorkingYears had p-values below .05; MonthlyRate, YearsSinceLastPromotion, and DepSales were close, with p-values between .05 and .10.
We now fit a model with those variables; the formula also includes JRMD.
Results: RMSE = 1062, Adjusted R^2 = 0.9466
lmsalary_model2=lm(MonthlyIncome~JobLevel+MonthlyRate+PerformanceRating+BTRare+YearsSinceLastPromotion+DepSales+JRMD+JRManager+JRRD+TotalWorkingYears,data=lm_salarydf)
summary(lmsalary_model2)
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + MonthlyRate + PerformanceRating +
## BTRare + YearsSinceLastPromotion + DepSales + JRMD + JRManager +
## JRRD + TotalWorkingYears, data = lm_salarydf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3895.0 -601.3 -50.3 640.0 4216.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.338e+02 3.483e+02 -0.958 0.338207
## JobLevel 2.940e+03 6.764e+01 43.466 < 2e-16 ***
## MonthlyRate -9.172e-03 5.096e-03 -1.800 0.072232 .
## PerformanceRating -1.152e+02 1.010e+02 -1.141 0.254093
## BTRare 2.704e+02 7.981e+01 3.388 0.000735 ***
## YearsSinceLastPromotion 1.695e+01 1.283e+01 1.321 0.186922
## DepSales 9.721e+01 8.806e+01 1.104 0.269932
## JRMD 3.594e+02 1.360e+02 2.644 0.008349 **
## JRManager 3.943e+03 2.066e+02 19.089 < 2e-16 ***
## JRRD 4.054e+03 2.027e+02 20.005 < 2e-16 ***
## TotalWorkingYears 4.161e+01 8.312e+00 5.006 6.75e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1062 on 859 degrees of freedom
## Multiple R-squared: 0.9472, Adjusted R-squared: 0.9466
## F-statistic: 1541 on 10 and 859 DF, p-value: < 2.2e-16
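The error figures quoted for the models in this section appear to be taken from the "Residual standard error" line of each summary. That quantity divides the residual sum of squares by the residual degrees of freedom, while the plain in-sample RMSE divides by the number of observations, so the two are close but not identical. A minimal sketch of the difference, using `mtcars` as an assumed stand-in for `lm_salarydf`:

```r
# Illustrative data only: mtcars stands in for lm_salarydf.
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- residuals(fit)
n <- length(res)
p <- length(coef(fit))   # number of estimated coefficients, intercept included

# Residual standard error as printed by summary(): sqrt(RSS / (n - p)).
rse <- sqrt(sum(res^2) / (n - p))

# In-sample RMSE: sqrt(RSS / n), with no degrees-of-freedom correction.
rmse <- sqrt(mean(res^2))

print(c(rse = rse, rmse = rmse, sigma = sigma(fit)))
```

With many observations and few coefficients the gap between the two is small, which is why either one works for ranking the candidate models here.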
Next we fit another model with those variables and all of their pairwise interactions.
Results: RMSE = 1047, Adjusted R^2 = 0.9481
lmsalary_model3=lm(MonthlyIncome~(JobLevel+MonthlyRate+PerformanceRating+BTRare+YearsSinceLastPromotion+DepSales+JRMD+JRManager+JRRD+TotalWorkingYears)^2,data=lm_salarydf)
summary(lmsalary_model3)
##
## Call:
## lm(formula = MonthlyIncome ~ (JobLevel + MonthlyRate + PerformanceRating +
## BTRare + YearsSinceLastPromotion + DepSales + JRMD + JRManager +
## JRRD + TotalWorkingYears)^2, data = lm_salarydf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3338.8 -631.5 -70.9 607.0 4323.1
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 8.990e+02 1.289e+03 0.697
## JobLevel 1.621e+03 7.682e+02 2.110
## MonthlyRate 1.774e-02 4.920e-02 0.360
## PerformanceRating -3.266e+02 3.913e+02 -0.835
## BTRare -4.122e+01 7.672e+02 -0.054
## YearsSinceLastPromotion 5.808e+00 1.367e+02 0.042
## DepSales 3.387e+02 9.243e+02 0.366
## JRMD -8.929e+02 1.378e+03 -0.648
## JRManager 6.713e+03 2.756e+03 2.436
## JRRD 5.981e+03 2.545e+03 2.351
## TotalWorkingYears 6.540e+01 8.943e+01 0.731
## JobLevel:MonthlyRate 1.292e-02 1.013e-02 1.276
## JobLevel:PerformanceRating 2.166e+02 2.323e+02 0.932
## JobLevel:BTRare 1.407e+01 1.578e+02 0.089
## JobLevel:YearsSinceLastPromotion 4.050e+01 2.038e+01 1.987
## JobLevel:DepSales 3.222e+02 1.569e+02 2.053
## JobLevel:JRMD 3.690e+02 2.651e+02 1.392
## JobLevel:JRManager -1.682e+02 3.141e+02 -0.535
## JobLevel:JRRD 5.409e+01 2.759e+02 0.196
## JobLevel:TotalWorkingYears 1.417e+01 7.890e+00 1.796
## MonthlyRate:PerformanceRating -5.022e-03 1.438e-02 -0.349
## MonthlyRate:BTRare 4.395e-03 1.139e-02 0.386
## MonthlyRate:YearsSinceLastPromotion 1.068e-03 1.884e-03 0.567
## MonthlyRate:DepSales 4.719e-03 1.272e-02 0.371
## MonthlyRate:JRMD -4.697e-03 1.931e-02 -0.243
## MonthlyRate:JRManager 2.319e-02 3.227e-02 0.719
## MonthlyRate:JRRD -3.992e-03 3.226e-02 -0.124
## MonthlyRate:TotalWorkingYears -3.853e-03 1.236e-03 -3.118
## PerformanceRating:BTRare 5.521e+01 2.301e+02 0.240
## PerformanceRating:YearsSinceLastPromotion -1.604e+01 3.940e+01 -0.407
## PerformanceRating:DepSales -3.064e+02 2.735e+02 -1.120
## PerformanceRating:JRMD 1.053e+02 3.861e+02 0.273
## PerformanceRating:JRManager -5.558e+02 6.594e+02 -0.843
## PerformanceRating:JRRD -3.377e+02 7.157e+02 -0.472
## PerformanceRating:TotalWorkingYears 2.785e-01 2.686e+01 0.010
## BTRare:YearsSinceLastPromotion 1.053e+01 2.944e+01 0.358
## BTRare:DepSales -3.285e+01 1.980e+02 -0.166
## BTRare:JRMD -1.524e+02 3.057e+02 -0.498
## BTRare:JRManager -1.043e+02 5.415e+02 -0.193
## BTRare:JRRD -1.977e+02 4.685e+02 -0.422
## BTRare:TotalWorkingYears 4.321e+00 1.898e+01 0.228
## YearsSinceLastPromotion:DepSales -5.521e+00 3.238e+01 -0.171
## YearsSinceLastPromotion:JRMD -5.975e+01 4.762e+01 -1.255
## YearsSinceLastPromotion:JRManager -5.040e+00 5.422e+01 -0.093
## YearsSinceLastPromotion:JRRD -6.130e+01 6.383e+01 -0.960
## YearsSinceLastPromotion:TotalWorkingYears -3.452e+00 2.072e+00 -1.666
## DepSales:JRMD NA NA NA
## DepSales:JRManager -1.678e+03 4.535e+02 -3.701
## DepSales:JRRD NA NA NA
## DepSales:TotalWorkingYears 2.159e+01 2.014e+01 1.072
## JRMD:JRManager NA NA NA
## JRMD:JRRD NA NA NA
## JRMD:TotalWorkingYears 4.421e+01 2.830e+01 1.562
## JRManager:JRRD NA NA NA
## JRManager:TotalWorkingYears -6.519e+00 3.642e+01 -0.179
## JRRD:TotalWorkingYears -1.721e+01 3.429e+01 -0.502
## Pr(>|t|)
## (Intercept) 0.485908
## JobLevel 0.035184 *
## MonthlyRate 0.718598
## PerformanceRating 0.404064
## BTRare 0.957168
## YearsSinceLastPromotion 0.966118
## DepSales 0.714138
## JRMD 0.517133
## JRManager 0.015068 *
## JRRD 0.018975 *
## TotalWorkingYears 0.464828
## JobLevel:MonthlyRate 0.202446
## JobLevel:PerformanceRating 0.351539
## JobLevel:BTRare 0.928972
## JobLevel:YearsSinceLastPromotion 0.047240 *
## JobLevel:DepSales 0.040371 *
## JobLevel:JRMD 0.164304
## JobLevel:JRManager 0.592538
## JobLevel:JRRD 0.844618
## JobLevel:TotalWorkingYears 0.072816 .
## MonthlyRate:PerformanceRating 0.726933
## MonthlyRate:BTRare 0.699729
## MonthlyRate:YearsSinceLastPromotion 0.570875
## MonthlyRate:DepSales 0.710742
## MonthlyRate:JRMD 0.807937
## MonthlyRate:JRManager 0.472596
## MonthlyRate:JRRD 0.901531
## MonthlyRate:TotalWorkingYears 0.001886 **
## PerformanceRating:BTRare 0.810458
## PerformanceRating:YearsSinceLastPromotion 0.683960
## PerformanceRating:DepSales 0.262959
## PerformanceRating:JRMD 0.785207
## PerformanceRating:JRManager 0.399524
## PerformanceRating:JRRD 0.637152
## PerformanceRating:TotalWorkingYears 0.991730
## BTRare:YearsSinceLastPromotion 0.720688
## BTRare:DepSales 0.868299
## BTRare:JRMD 0.618334
## BTRare:JRManager 0.847276
## BTRare:JRRD 0.673202
## BTRare:TotalWorkingYears 0.820022
## YearsSinceLastPromotion:DepSales 0.864653
## YearsSinceLastPromotion:JRMD 0.209899
## YearsSinceLastPromotion:JRManager 0.925959
## YearsSinceLastPromotion:JRRD 0.337156
## YearsSinceLastPromotion:TotalWorkingYears 0.096025 .
## DepSales:JRMD NA
## DepSales:JRManager 0.000229 ***
## DepSales:JRRD NA
## DepSales:TotalWorkingYears 0.284234
## JRMD:JRManager NA
## JRMD:JRRD NA
## JRMD:TotalWorkingYears 0.118583
## JRManager:JRRD NA
## JRManager:TotalWorkingYears 0.858004
## JRRD:TotalWorkingYears 0.615798
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1047 on 819 degrees of freedom
## Multiple R-squared: 0.9511, Adjusted R-squared: 0.9481
## F-statistic: 318.7 on 50 and 819 DF, p-value: < 2.2e-16
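The `( ... )^2` formula expands to every main effect plus every pairwise interaction, and the summary above reports five coefficients as "not defined because of singularities". A likely cause, at least for the pairs of job-role dummies, is that the two indicators are never 1 on the same row, so their product is a column of zeros. The sketch below reproduces that situation on simulated data; the column names are hypothetical stand-ins, not the columns of `lm_salarydf`.

```r
# Simulated data; manager and director mimic mutually exclusive role dummies.
set.seed(1)
n <- 200
role <- sample(c("Manager", "Director", "Other"), n, replace = TRUE)
d <- data.frame(
  level    = sample(1:5, n, replace = TRUE),
  manager  = as.integer(role == "Manager"),
  director = as.integer(role == "Director")
)
d$income <- 1000 + 2800 * d$level + 3900 * d$manager +
  4000 * d$director + rnorm(n, sd = 500)

# (a + b + c)^2 expands to all main effects plus all pairwise interactions.
fit <- lm(income ~ (level + manager + director)^2, data = d)
term_labels <- attr(terms(fit), "term.labels")
print(term_labels)

# No row is both manager and director, so their product is zero everywhere.
print(all(d$manager * d$director == 0))

# lm() cannot estimate a coefficient for that all-zero column and reports NA.
na_terms <- names(coef(fit))[is.na(coef(fit))]
print(na_terms)
```

Such NA rows are harmless for prediction, but they can be avoided by leaving the impossible combinations out of the formula.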
Now we want to run the model again with just the terms whose p-values were near or below .05: JobLevel, JRManager, JRRD, JobLevel:YearsSinceLastPromotion, JobLevel:DepSales, JobLevel:TotalWorkingYears, MonthlyRate:TotalWorkingYears, YearsSinceLastPromotion:TotalWorkingYears, DepSales:JRManager. The fitted formula drops JobLevel:YearsSinceLastPromotion and JobLevel:DepSales and adds YearsSinceLastPromotion:BTRare.
Results: RMSE = 1063
lmsalary_model4=lm(MonthlyIncome~JobLevel+JRManager+JRRD++JobLevel:TotalWorkingYears+MonthlyRate:TotalWorkingYears+YearsSinceLastPromotion:TotalWorkingYears+DepSales:JRManager+YearsSinceLastPromotion:BTRare,data=lm_salarydf)
summary(lmsalary_model4)
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + JRManager + JRRD + +JobLevel:TotalWorkingYears +
## MonthlyRate:TotalWorkingYears + YearsSinceLastPromotion:TotalWorkingYears +
## DepSales:JRManager + YearsSinceLastPromotion:BTRare, data = lm_salarydf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3785.1 -606.9 -106.4 663.1 4341.1
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -2.224e+02 1.130e+02 -1.968
## JobLevel 2.816e+03 7.950e+01 35.424
## JRManager 3.585e+03 2.415e+02 14.847
## JRRD 3.701e+03 1.898e+02 19.501
## JobLevel:TotalWorkingYears 1.830e+01 3.204e+00 5.711
## TotalWorkingYears:MonthlyRate -7.918e-04 3.420e-04 -2.315
## TotalWorkingYears:YearsSinceLastPromotion -7.530e-01 7.943e-01 -0.948
## JRManager:DepSales -3.999e+02 3.053e+02 -1.310
## YearsSinceLastPromotion:BTRare 5.293e+01 1.745e+01 3.034
## Pr(>|t|)
## (Intercept) 0.04935 *
## JobLevel < 2e-16 ***
## JRManager < 2e-16 ***
## JRRD < 2e-16 ***
## JobLevel:TotalWorkingYears 1.55e-08 ***
## TotalWorkingYears:MonthlyRate 0.02083 *
## TotalWorkingYears:YearsSinceLastPromotion 0.34340
## JRManager:DepSales 0.19059
## YearsSinceLastPromotion:BTRare 0.00249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1063 on 861 degrees of freedom
## Multiple R-squared: 0.947, Adjusted R-squared: 0.9465
## F-statistic: 1924 on 8 and 861 DF, p-value: < 2.2e-16
Our last candidate model includes every pairwise interaction in the dataset.
Results: RMSE = 981
lmsalary_model5=lm(MonthlyIncome~(.)^2,data=lm_salarydf)
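In-sample error can only stay the same or go down when terms are added to a linear model, so the lower error of the full-interaction fit does not by itself show that it predicts better. One way to check is k-fold cross-validation. The sketch below is an assumption-laden illustration, not part of the original analysis: the data are simulated, and `cv_rmse` and `in_sample_rmse` are hypothetical helper names.

```r
# Simulated data in which the true relationship has no interactions.
set.seed(42)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
d$y <- 3 * d$x1 + 2 * d$x2 + rnorm(n)

# Hypothetical helper: k-fold cross-validated RMSE for an lm formula.
cv_rmse <- function(formula, data, k = 5) {
  response_name <- all.vars(formula)[1]
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  squared_errors <- numeric(nrow(data))
  for (i in seq_len(k)) {
    test_rows <- folds == i
    fit <- lm(formula, data = data[!test_rows, ])
    predicted <- predict(fit, newdata = data[test_rows, ])
    squared_errors[test_rows] <- (data[[response_name]][test_rows] - predicted)^2
  }
  sqrt(mean(squared_errors))
}

# Hypothetical helper: RMSE on the same rows used for fitting.
in_sample_rmse <- function(formula, data) {
  sqrt(mean(residuals(lm(formula, data = data))^2))
}

results <- c(
  main_in_sample  = in_sample_rmse(y ~ ., d),
  pairs_in_sample = in_sample_rmse(y ~ (.)^2, d),
  main_cv         = cv_rmse(y ~ ., d),
  pairs_cv        = cv_rmse(y ~ (.)^2, d)
)
print(round(results, 3))
```

Comparing the cross-validated values rather than the in-sample ones gives a fairer basis for preferring a smaller model over the full-interaction one.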
For my final model, I took the variables that were significant in the full-interaction model and added them to the previous model. This combination gave us the best balance, with RMSE = 1063.
lmsalary_model6=lm(MonthlyIncome~JobLevel+JRManager+JRRD++JobLevel:TotalWorkingYears+MonthlyRate:TotalWorkingYears+YearsSinceLastPromotion:TotalWorkingYears+DepSales:JRManager+YearsSinceLastPromotion:BTRare+EnvironmentSatisfaction:WorkLifeBalance,data=lm_salarydf)
summary(lmsalary_model6)
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + JRManager + JRRD + +JobLevel:TotalWorkingYears +
## MonthlyRate:TotalWorkingYears + YearsSinceLastPromotion:TotalWorkingYears +
## DepSales:JRManager + YearsSinceLastPromotion:BTRare + EnvironmentSatisfaction:WorkLifeBalance,
## data = lm_salarydf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3839.0 -609.7 -87.1 658.6 4323.8
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.434e+02 1.321e+02 -1.086
## JobLevel 2.818e+03 7.950e+01 35.447
## JRManager 3.584e+03 2.414e+02 14.844
## JRRD 3.690e+03 1.900e+02 19.424
## JobLevel:TotalWorkingYears 1.830e+01 3.203e+00 5.713
## TotalWorkingYears:MonthlyRate -7.857e-04 3.419e-04 -2.298
## TotalWorkingYears:YearsSinceLastPromotion -7.581e-01 7.942e-01 -0.955
## JRManager:DepSales -4.216e+02 3.058e+02 -1.378
## YearsSinceLastPromotion:BTRare 5.341e+01 1.745e+01 3.061
## EnvironmentSatisfaction:WorkLifeBalance -1.100e+01 9.540e+00 -1.153
## Pr(>|t|)
## (Intercept) 0.27797
## JobLevel < 2e-16 ***
## JRManager < 2e-16 ***
## JRRD < 2e-16 ***
## JobLevel:TotalWorkingYears 1.53e-08 ***
## TotalWorkingYears:MonthlyRate 0.02182 *
## TotalWorkingYears:YearsSinceLastPromotion 0.34008
## JRManager:DepSales 0.16843
## YearsSinceLastPromotion:BTRare 0.00227 **
## EnvironmentSatisfaction:WorkLifeBalance 0.24939
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1063 on 860 degrees of freedom
## Multiple R-squared: 0.9471, Adjusted R-squared: 0.9465
## F-statistic: 1711 on 9 and 860 DF, p-value: < 2.2e-16
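Because this model differs from the previous one by a single added term, the two are nested and can be compared with a partial F-test. The sketch below shows the idea on simulated data; the variable names are hypothetical and only loosely mirror the ones in `lm_salarydf`.

```r
# Simulated data; the added satisfaction:balance term has no true effect.
set.seed(7)
n <- 250
d <- data.frame(
  level        = sample(1:5, n, replace = TRUE),
  years        = runif(n, 0, 30),
  satisfaction = sample(1:4, n, replace = TRUE),
  balance      = sample(1:4, n, replace = TRUE)
)
d$income <- 500 + 2800 * d$level + 18 * d$level * d$years + rnorm(n, sd = 1000)

smaller <- lm(income ~ level + level:years, data = d)
larger  <- update(smaller, . ~ . + satisfaction:balance)

# Partial F-test: does the extra term reduce the residual sum of squares
# by more than chance would explain?
cmp <- anova(smaller, larger)
print(cmp)

# Adjusted R-squared penalizes the extra term, unlike plain R-squared.
adj_r2 <- c(smaller = summary(smaller)$adj.r.squared,
            larger  = summary(larger)$adj.r.squared)
print(adj_r2)
```

When only one coefficient is added, the F-test p-value equals the t-test p-value shown for that coefficient in the larger model's summary.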
Forward Selection Model
We ran forward selection first over the statistically significant variables without interactions, then ran it again with interactions among those variables.
train1 = employeeData %>% select(c("Age", "DailyRate", "DistanceFromHome", "EnvironmentSatisfaction", "HourlyRate",
  "JobInvolvement", "JobLevel", "JobSatisfaction", "MonthlyIncome", "MonthlyRate",
  "NumCompaniesWorked", "OverTime", "PercentSalaryHike", "PerformanceRating", "RelationshipSatisfaction",
  "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager", "JobRole",
  "BusinessTravel", "Department", "EducationField", "Gender", "StockOptionLevel",
  "TotalWorkingYears", "TrainingTimesLastYear", "WorkLifeBalance", "Attrition"))
fit1 = lm(MonthlyIncome~.,data=train1)
fit2 = lm(MonthlyIncome~(JobLevel + JobRole + TotalWorkingYears + BusinessTravel + Gender + DailyRate + MonthlyRate + YearsWithCurrManager + YearsSinceLastPromotion + DistanceFromHome + PerformanceRating + PercentSalaryHike + Department)^2,data=train1)
fit3 = lm(MonthlyIncome~+ JobLevel + JobLevel:JobRole + JobLevel:BusinessTravel + JobLevel:Gender + JobLevel:PerformanceRating + JobLevel:Department + JobRole:TotalWorkingYears + JobRole:Department + DailyRate + DistanceFromHome + YearsWithCurrManager + PerformanceRating + TotalWorkingYears + MonthlyRate + PercentSalaryHike + Gender + YearsSinceLastPromotion + BusinessTravel + Department + TotalWorkingYears:MonthlyRate + JobRole + YearsSinceLastPromotion:DistanceFromHome + DailyRate:DistanceFromHome + JobRole:PercentSalaryHike + TotalWorkingYears:BusinessTravel + TotalWorkingYears:YearsSinceLastPromotion + JobLevel:YearsSinceLastPromotion + DailyRate:PerformanceRating + JobLevel:MonthlyRate + JobRole:Gender + TotalWorkingYears:DailyRate + YearsSinceLastPromotion:PerformanceRating + DailyRate:MonthlyRate + BusinessTravel:PercentSalaryHike + DailyRate:PercentSalaryHike,data=train1)
summary(fit3)
##
## Call:
## lm(formula = MonthlyIncome ~ +JobLevel + JobLevel:JobRole + JobLevel:BusinessTravel +
## JobLevel:Gender + JobLevel:PerformanceRating + JobLevel:Department +
## JobRole:TotalWorkingYears + JobRole:Department + DailyRate +
## DistanceFromHome + YearsWithCurrManager + PerformanceRating +
## TotalWorkingYears + MonthlyRate + PercentSalaryHike + Gender +
## YearsSinceLastPromotion + BusinessTravel + Department + TotalWorkingYears:MonthlyRate +
## JobRole + YearsSinceLastPromotion:DistanceFromHome + DailyRate:DistanceFromHome +
## JobRole:PercentSalaryHike + TotalWorkingYears:BusinessTravel +
## TotalWorkingYears:YearsSinceLastPromotion + JobLevel:YearsSinceLastPromotion +
## DailyRate:PerformanceRating + JobLevel:MonthlyRate + JobRole:Gender +
## TotalWorkingYears:DailyRate + YearsSinceLastPromotion:PerformanceRating +
## DailyRate:MonthlyRate + BusinessTravel:PercentSalaryHike +
## DailyRate:PercentSalaryHike, data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2871.3 -609.6 -82.1 536.7 3990.7
##
## Coefficients: (16 not defined because of singularities)
## Estimate
## (Intercept) 4.212e+03
## JobLevel 1.125e+03
## DailyRate -1.446e+00
## DistanceFromHome -1.975e+01
## YearsWithCurrManager -1.668e+01
## PerformanceRating -8.341e+02
## TotalWorkingYears 1.809e+02
## MonthlyRate -2.390e-03
## PercentSalaryHike 1.662e+02
## GenderMale -5.995e+01
## YearsSinceLastPromotion 1.793e+02
## BusinessTravelTravel_Frequently 1.201e+03
## BusinessTravelTravel_Rarely 8.172e+02
## DepartmentResearch & Development -5.200e+03
## DepartmentSales -5.381e+03
## JobRoleHuman Resources -2.973e+03
## JobRoleLaboratory Technician 2.970e+03
## JobRoleManager 5.622e+03
## JobRoleManufacturing Director -4.055e+02
## JobRoleResearch Director 5.669e+03
## JobRoleResearch Scientist 2.438e+03
## JobRoleSales Executive 1.303e+03
## JobRoleSales Representative 2.818e+03
## JobLevel:JobRoleHuman Resources 1.126e+03
## JobLevel:JobRoleLaboratory Technician -1.369e+03
## JobLevel:JobRoleManager 2.489e+01
## JobLevel:JobRoleManufacturing Director 2.411e+02
## JobLevel:JobRoleResearch Director 4.501e+01
## JobLevel:JobRoleResearch Scientist -4.400e+02
## JobLevel:JobRoleSales Executive 2.749e+02
## JobLevel:JobRoleSales Representative -7.996e+02
## JobLevel:BusinessTravelTravel_Frequently 2.411e+02
## JobLevel:BusinessTravelTravel_Rarely 1.130e+02
## JobLevel:GenderMale 1.807e+01
## JobLevel:PerformanceRating 8.632e+01
## JobLevel:DepartmentResearch & Development 1.157e+03
## JobLevel:DepartmentSales 1.092e+03
## JobRoleHuman Resources:TotalWorkingYears -7.849e+01
## JobRoleLaboratory Technician:TotalWorkingYears -4.032e+01
## JobRoleManager:TotalWorkingYears -1.217e+01
## JobRoleManufacturing Director:TotalWorkingYears 1.797e+01
## JobRoleResearch Director:TotalWorkingYears -3.436e+01
## JobRoleResearch Scientist:TotalWorkingYears -5.840e+01
## JobRoleSales Executive:TotalWorkingYears -1.224e-01
## JobRoleSales Representative:TotalWorkingYears -3.676e+01
## JobRoleHuman Resources:DepartmentResearch & Development NA
## JobRoleLaboratory Technician:DepartmentResearch & Development NA
## JobRoleManager:DepartmentResearch & Development NA
## JobRoleManufacturing Director:DepartmentResearch & Development NA
## JobRoleResearch Director:DepartmentResearch & Development NA
## JobRoleResearch Scientist:DepartmentResearch & Development NA
## JobRoleSales Executive:DepartmentResearch & Development NA
## JobRoleSales Representative:DepartmentResearch & Development NA
## JobRoleHuman Resources:DepartmentSales NA
## JobRoleLaboratory Technician:DepartmentSales NA
## JobRoleManager:DepartmentSales NA
## JobRoleManufacturing Director:DepartmentSales NA
## JobRoleResearch Director:DepartmentSales NA
## JobRoleResearch Scientist:DepartmentSales NA
## JobRoleSales Executive:DepartmentSales NA
## JobRoleSales Representative:DepartmentSales NA
## TotalWorkingYears:MonthlyRate -3.083e-03
## DistanceFromHome:YearsSinceLastPromotion -2.072e+00
## DailyRate:DistanceFromHome 2.417e-02
## JobRoleHuman Resources:PercentSalaryHike -1.161e+02
## JobRoleLaboratory Technician:PercentSalaryHike -7.485e+01
## JobRoleManager:PercentSalaryHike -1.120e+02
## JobRoleManufacturing Director:PercentSalaryHike -3.940e+01
## JobRoleResearch Director:PercentSalaryHike -8.998e+01
## JobRoleResearch Scientist:PercentSalaryHike -9.996e+01
## JobRoleSales Executive:PercentSalaryHike -1.118e+02
## JobRoleSales Representative:PercentSalaryHike -1.073e+02
## BusinessTravelTravel_Frequently:TotalWorkingYears -6.151e+01
## BusinessTravelTravel_Rarely:TotalWorkingYears -3.264e+01
## TotalWorkingYears:YearsSinceLastPromotion -5.172e+00
## JobLevel:YearsSinceLastPromotion 2.889e+01
## PerformanceRating:DailyRate 6.181e-01
## JobLevel:MonthlyRate 1.023e-02
## JobRoleHuman Resources:GenderMale 7.485e+01
## JobRoleLaboratory Technician:GenderMale -5.864e+01
## JobRoleManager:GenderMale -1.833e+02
## JobRoleManufacturing Director:GenderMale 7.371e+02
## JobRoleResearch Director:GenderMale -1.328e+02
## JobRoleResearch Scientist:GenderMale 5.964e+01
## JobRoleSales Executive:GenderMale 2.202e+02
## JobRoleSales Representative:GenderMale 4.940e+01
## TotalWorkingYears:DailyRate -1.580e-02
## PerformanceRating:YearsSinceLastPromotion -4.257e+01
## DailyRate:MonthlyRate 1.409e-05
## BusinessTravelTravel_Frequently:PercentSalaryHike -5.677e+01
## BusinessTravelTravel_Rarely:PercentSalaryHike -2.621e+01
## DailyRate:PercentSalaryHike -4.098e-02
## Std. Error
## (Intercept) 4.354e+03
## JobLevel 1.020e+03
## DailyRate 9.157e-01
## DistanceFromHome 1.013e+01
## YearsWithCurrManager 1.228e+01
## PerformanceRating 4.308e+02
## TotalWorkingYears 3.654e+01
## MonthlyRate 1.453e-02
## PercentSalaryHike 5.496e+01
## GenderMale 3.884e+02
## YearsSinceLastPromotion 1.168e+02
## BusinessTravelTravel_Frequently 6.671e+02
## BusinessTravelTravel_Rarely 5.911e+02
## DepartmentResearch & Development 4.167e+03
## DepartmentSales 4.221e+03
## JobRoleHuman Resources 4.341e+03
## JobRoleLaboratory Technician 8.494e+02
## JobRoleManager 1.666e+03
## JobRoleManufacturing Director 9.313e+02
## JobRoleResearch Director 1.248e+03
## JobRoleResearch Scientist 8.553e+02
## JobRoleSales Executive 2.267e+03
## JobRoleSales Representative 2.382e+03
## JobLevel:JobRoleHuman Resources 9.892e+02
## JobLevel:JobRoleLaboratory Technician 3.091e+02
## JobLevel:JobRoleManager 4.075e+02
## JobLevel:JobRoleManufacturing Director 3.186e+02
## JobLevel:JobRoleResearch Director 3.193e+02
## JobLevel:JobRoleResearch Scientist 3.247e+02
## JobLevel:JobRoleSales Executive 5.589e+02
## JobLevel:JobRoleSales Representative 8.314e+02
## JobLevel:BusinessTravelTravel_Frequently 2.305e+02
## JobLevel:BusinessTravelTravel_Rarely 2.032e+02
## JobLevel:GenderMale 1.273e+02
## JobLevel:PerformanceRating 1.399e+02
## JobLevel:DepartmentResearch & Development 8.859e+02
## JobLevel:DepartmentSales 9.014e+02
## JobRoleHuman Resources:TotalWorkingYears 7.256e+01
## JobRoleLaboratory Technician:TotalWorkingYears 2.930e+01
## JobRoleManager:TotalWorkingYears 3.517e+01
## JobRoleManufacturing Director:TotalWorkingYears 3.017e+01
## JobRoleResearch Director:TotalWorkingYears 3.289e+01
## JobRoleResearch Scientist:TotalWorkingYears 2.898e+01
## JobRoleSales Executive:TotalWorkingYears 2.648e+01
## JobRoleSales Representative:TotalWorkingYears 3.820e+01
## JobRoleHuman Resources:DepartmentResearch & Development NA
## JobRoleLaboratory Technician:DepartmentResearch & Development NA
## JobRoleManager:DepartmentResearch & Development NA
## JobRoleManufacturing Director:DepartmentResearch & Development NA
## JobRoleResearch Director:DepartmentResearch & Development NA
## JobRoleResearch Scientist:DepartmentResearch & Development NA
## JobRoleSales Executive:DepartmentResearch & Development NA
## JobRoleSales Representative:DepartmentResearch & Development NA
## JobRoleHuman Resources:DepartmentSales NA
## JobRoleLaboratory Technician:DepartmentSales NA
## JobRoleManager:DepartmentSales NA
## JobRoleManufacturing Director:DepartmentSales NA
## JobRoleResearch Director:DepartmentSales NA
## JobRoleResearch Scientist:DepartmentSales NA
## JobRoleSales Executive:DepartmentSales NA
## JobRoleSales Representative:DepartmentSales NA
## TotalWorkingYears:MonthlyRate 1.054e-03
## DistanceFromHome:YearsSinceLastPromotion 1.350e+00
## DailyRate:DistanceFromHome 1.053e-02
## JobRoleHuman Resources:PercentSalaryHike 6.287e+01
## JobRoleLaboratory Technician:PercentSalaryHike 3.964e+01
## JobRoleManager:PercentSalaryHike 5.526e+01
## JobRoleManufacturing Director:PercentSalaryHike 4.210e+01
## JobRoleResearch Director:PercentSalaryHike 5.387e+01
## JobRoleResearch Scientist:PercentSalaryHike 3.933e+01
## JobRoleSales Executive:PercentSalaryHike 3.686e+01
## JobRoleSales Representative:PercentSalaryHike 5.184e+01
## BusinessTravelTravel_Frequently:TotalWorkingYears 2.858e+01
## BusinessTravelTravel_Rarely:TotalWorkingYears 2.505e+01
## TotalWorkingYears:YearsSinceLastPromotion 1.925e+00
## JobLevel:YearsSinceLastPromotion 1.424e+01
## PerformanceRating:DailyRate 3.844e-01
## JobLevel:MonthlyRate 7.377e-03
## JobRoleHuman Resources:GenderMale 4.918e+02
## JobRoleLaboratory Technician:GenderMale 3.248e+02
## JobRoleManager:GenderMale 4.411e+02
## JobRoleManufacturing Director:GenderMale 3.138e+02
## JobRoleResearch Director:GenderMale 4.197e+02
## JobRoleResearch Scientist:GenderMale 3.214e+02
## JobRoleSales Executive:GenderMale 2.745e+02
## JobRoleSales Representative:GenderMale 4.010e+02
## TotalWorkingYears:DailyRate 1.167e-02
## PerformanceRating:YearsSinceLastPromotion 3.475e+01
## DailyRate:MonthlyRate 1.215e-05
## BusinessTravelTravel_Frequently:PercentSalaryHike 3.714e+01
## BusinessTravelTravel_Rarely:PercentSalaryHike 3.240e+01
## DailyRate:PercentSalaryHike 3.676e-02
## t value Pr(>|t|)
## (Intercept) 0.967 0.333613
## JobLevel 1.103 0.270318
## DailyRate -1.580 0.114593
## DistanceFromHome -1.950 0.051579
## YearsWithCurrManager -1.359 0.174575
## PerformanceRating -1.936 0.053172
## TotalWorkingYears 4.950 9.05e-07
## MonthlyRate -0.164 0.869407
## PercentSalaryHike 3.023 0.002581
## GenderMale -0.154 0.877384
## YearsSinceLastPromotion 1.535 0.125217
## BusinessTravelTravel_Frequently 1.801 0.072132
## BusinessTravelTravel_Rarely 1.382 0.167225
## DepartmentResearch & Development -1.248 0.212482
## DepartmentSales -1.275 0.202735
## JobRoleHuman Resources -0.685 0.493632
## JobRoleLaboratory Technician 3.496 0.000498
## JobRoleManager 3.375 0.000773
## JobRoleManufacturing Director -0.435 0.663395
## JobRoleResearch Director 4.544 6.38e-06
## JobRoleResearch Scientist 2.850 0.004484
## JobRoleSales Executive 0.575 0.565668
## JobRoleSales Representative 1.183 0.237193
## JobLevel:JobRoleHuman Resources 1.138 0.255394
## JobLevel:JobRoleLaboratory Technician -4.429 1.08e-05
## JobLevel:JobRoleManager 0.061 0.951305
## JobLevel:JobRoleManufacturing Director 0.757 0.449310
## JobLevel:JobRoleResearch Director 0.141 0.887941
## JobLevel:JobRoleResearch Scientist -1.355 0.175840
## JobLevel:JobRoleSales Executive 0.492 0.622948
## JobLevel:JobRoleSales Representative -0.962 0.336472
## JobLevel:BusinessTravelTravel_Frequently 1.046 0.295988
## JobLevel:BusinessTravelTravel_Rarely 0.556 0.578188
## JobLevel:GenderMale 0.142 0.887177
## JobLevel:PerformanceRating 0.617 0.537302
## JobLevel:DepartmentResearch & Development 1.306 0.191958
## JobLevel:DepartmentSales 1.212 0.226033
## JobRoleHuman Resources:TotalWorkingYears -1.082 0.279730
## JobRoleLaboratory Technician:TotalWorkingYears -1.376 0.169200
## JobRoleManager:TotalWorkingYears -0.346 0.729389
## JobRoleManufacturing Director:TotalWorkingYears 0.596 0.551672
## JobRoleResearch Director:TotalWorkingYears -1.045 0.296466
## JobRoleResearch Scientist:TotalWorkingYears -2.015 0.044250
## JobRoleSales Executive:TotalWorkingYears -0.005 0.996312
## JobRoleSales Representative:TotalWorkingYears -0.962 0.336163
## JobRoleHuman Resources:DepartmentResearch & Development NA NA
## JobRoleLaboratory Technician:DepartmentResearch & Development NA NA
## JobRoleManager:DepartmentResearch & Development NA NA
## JobRoleManufacturing Director:DepartmentResearch & Development NA NA
## JobRoleResearch Director:DepartmentResearch & Development NA NA
## JobRoleResearch Scientist:DepartmentResearch & Development NA NA
## JobRoleSales Executive:DepartmentResearch & Development NA NA
## JobRoleSales Representative:DepartmentResearch & Development NA NA
## JobRoleHuman Resources:DepartmentSales NA NA
## JobRoleLaboratory Technician:DepartmentSales NA NA
## JobRoleManager:DepartmentSales NA NA
## JobRoleManufacturing Director:DepartmentSales NA NA
## JobRoleResearch Director:DepartmentSales NA NA
## JobRoleResearch Scientist:DepartmentSales NA NA
## JobRoleSales Executive:DepartmentSales NA NA
## JobRoleSales Representative:DepartmentSales NA NA
## TotalWorkingYears:MonthlyRate -2.925 0.003546
## DistanceFromHome:YearsSinceLastPromotion -1.535 0.125278
## DailyRate:DistanceFromHome 2.296 0.021945
## JobRoleHuman Resources:PercentSalaryHike -1.847 0.065072
## JobRoleLaboratory Technician:PercentSalaryHike -1.888 0.059401
## JobRoleManager:PercentSalaryHike -2.027 0.042984
## JobRoleManufacturing Director:PercentSalaryHike -0.936 0.349682
## JobRoleResearch Director:PercentSalaryHike -1.670 0.095242
## JobRoleResearch Scientist:PercentSalaryHike -2.541 0.011232
## JobRoleSales Executive:PercentSalaryHike -3.034 0.002489
## JobRoleSales Representative:PercentSalaryHike -2.070 0.038748
## BusinessTravelTravel_Frequently:TotalWorkingYears -2.152 0.031680
## BusinessTravelTravel_Rarely:TotalWorkingYears -1.303 0.192991
## TotalWorkingYears:YearsSinceLastPromotion -2.686 0.007376
## JobLevel:YearsSinceLastPromotion 2.029 0.042763
## PerformanceRating:DailyRate 1.608 0.108259
## JobLevel:MonthlyRate 1.387 0.165785
## JobRoleHuman Resources:GenderMale 0.152 0.879056
## JobRoleLaboratory Technician:GenderMale -0.181 0.856783
## JobRoleManager:GenderMale -0.416 0.677856
## JobRoleManufacturing Director:GenderMale 2.349 0.019080
## JobRoleResearch Director:GenderMale -0.316 0.751754
## JobRoleResearch Scientist:GenderMale 0.186 0.852812
## JobRoleSales Executive:GenderMale 0.802 0.422586
## JobRoleSales Representative:GenderMale 0.123 0.902001
## TotalWorkingYears:DailyRate -1.354 0.176180
## PerformanceRating:YearsSinceLastPromotion -1.225 0.220874
## DailyRate:MonthlyRate 1.159 0.246603
## BusinessTravelTravel_Frequently:PercentSalaryHike -1.529 0.126772
## BusinessTravelTravel_Rarely:PercentSalaryHike -0.809 0.418727
## DailyRate:PercentSalaryHike -1.115 0.265265
##
## (Intercept)
## JobLevel
## DailyRate
## DistanceFromHome .
## YearsWithCurrManager
## PerformanceRating .
## TotalWorkingYears ***
## MonthlyRate
## PercentSalaryHike **
## GenderMale
## YearsSinceLastPromotion
## BusinessTravelTravel_Frequently .
## BusinessTravelTravel_Rarely
## DepartmentResearch & Development
## DepartmentSales
## JobRoleHuman Resources
## JobRoleLaboratory Technician ***
## JobRoleManager ***
## JobRoleManufacturing Director
## JobRoleResearch Director ***
## JobRoleResearch Scientist **
## JobRoleSales Executive
## JobRoleSales Representative
## JobLevel:JobRoleHuman Resources
## JobLevel:JobRoleLaboratory Technician ***
## JobLevel:JobRoleManager
## JobLevel:JobRoleManufacturing Director
## JobLevel:JobRoleResearch Director
## JobLevel:JobRoleResearch Scientist
## JobLevel:JobRoleSales Executive
## JobLevel:JobRoleSales Representative
## JobLevel:BusinessTravelTravel_Frequently
## JobLevel:BusinessTravelTravel_Rarely
## JobLevel:GenderMale
## JobLevel:PerformanceRating
## JobLevel:DepartmentResearch & Development
## JobLevel:DepartmentSales
## JobRoleHuman Resources:TotalWorkingYears
## JobRoleLaboratory Technician:TotalWorkingYears
## JobRoleManager:TotalWorkingYears
## JobRoleManufacturing Director:TotalWorkingYears
## JobRoleResearch Director:TotalWorkingYears
## JobRoleResearch Scientist:TotalWorkingYears *
## JobRoleSales Executive:TotalWorkingYears
## JobRoleSales Representative:TotalWorkingYears
## JobRoleHuman Resources:DepartmentResearch & Development
## JobRoleLaboratory Technician:DepartmentResearch & Development
## JobRoleManager:DepartmentResearch & Development
## JobRoleManufacturing Director:DepartmentResearch & Development
## JobRoleResearch Director:DepartmentResearch & Development
## JobRoleResearch Scientist:DepartmentResearch & Development
## JobRoleSales Executive:DepartmentResearch & Development
## JobRoleSales Representative:DepartmentResearch & Development
## JobRoleHuman Resources:DepartmentSales
## JobRoleLaboratory Technician:DepartmentSales
## JobRoleManager:DepartmentSales
## JobRoleManufacturing Director:DepartmentSales
## JobRoleResearch Director:DepartmentSales
## JobRoleResearch Scientist:DepartmentSales
## JobRoleSales Executive:DepartmentSales
## JobRoleSales Representative:DepartmentSales
## TotalWorkingYears:MonthlyRate **
## DistanceFromHome:YearsSinceLastPromotion
## DailyRate:DistanceFromHome *
## JobRoleHuman Resources:PercentSalaryHike .
## JobRoleLaboratory Technician:PercentSalaryHike .
## JobRoleManager:PercentSalaryHike *
## JobRoleManufacturing Director:PercentSalaryHike
## JobRoleResearch Director:PercentSalaryHike .
## JobRoleResearch Scientist:PercentSalaryHike *
## JobRoleSales Executive:PercentSalaryHike **
## JobRoleSales Representative:PercentSalaryHike *
## BusinessTravelTravel_Frequently:TotalWorkingYears *
## BusinessTravelTravel_Rarely:TotalWorkingYears
## TotalWorkingYears:YearsSinceLastPromotion **
## JobLevel:YearsSinceLastPromotion *
## PerformanceRating:DailyRate
## JobLevel:MonthlyRate
## JobRoleHuman Resources:GenderMale
## JobRoleLaboratory Technician:GenderMale
## JobRoleManager:GenderMale
## JobRoleManufacturing Director:GenderMale *
## JobRoleResearch Director:GenderMale
## JobRoleResearch Scientist:GenderMale
## JobRoleSales Executive:GenderMale
## JobRoleSales Representative:GenderMale
## TotalWorkingYears:DailyRate
## PerformanceRating:YearsSinceLastPromotion
## DailyRate:MonthlyRate
## BusinessTravelTravel_Frequently:PercentSalaryHike
## BusinessTravelTravel_Rarely:PercentSalaryHike
## DailyRate:PercentSalaryHike
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 970.6 on 794 degrees of freedom
## Multiple R-squared: 0.9593, Adjusted R-squared: 0.9554
## F-statistic: 249.4 on 75 and 794 DF, p-value: < 2.2e-16
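The forward selection step described at the start of this section is not shown in the code. One common way to run it in base R is `step()` with `direction = "forward"`, which adds one term at a time and chooses by AIC rather than by p-value. The sketch below is an assumption about how such a step could look, using `mtcars` in place of `train1`; it is not the author's procedure.

```r
# Illustrative data only: mtcars stands in for train1.
candidates <- c("wt", "hp", "qsec", "drat", "disp", "cyl")

# Start from the intercept-only model.
null_model <- lm(mpg ~ 1, data = mtcars)

# Upper scope: the largest model forward selection is allowed to reach.
upper_scope <- reformulate(candidates)   # one-sided formula: ~ wt + hp + ...

forward_fit <- step(null_model,
                    scope = list(lower = ~ 1, upper = upper_scope),
                    direction = "forward",
                    trace = 0)           # trace = 0 suppresses the step log

selected <- attr(terms(forward_fit), "term.labels")
print(formula(forward_fit))
print(selected)
```

To search over interactions as well, the upper scope could be written with `^2`, for example `~ (wt + hp + qsec)^2`, at the cost of a much larger search space.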
Conclusion of Salary Prediction Model
The multiple linear regression with interaction terms is the best model, giving the best balance with an RMSE of 1063.
The best predictors are JobLevel, JobRole, TotalWorkingYears, BusinessTravel, Gender, DailyRate, MonthlyRate, YearsWithCurrManager, YearsSinceLastPromotion, DistanceFromHome, PerformanceRating, PercentSalaryHike, and Department.
Ending Conclusion
Attrition Model: Naive Bayes
+ Naive Bayes with the selected predictors performed better than the KNN model.
+ The predictor variables we used for Naive Bayes make sense, and they gave us the best accuracy, sensitivity, and specificity.
+ Some error may remain because I selected the models by hand, adding one variable at a time.
+ The top three predictors of attrition are JobLevel, MonthlyIncome, and OverTime.
Salary Model: Multiple Linear Regression
+ Multiple linear regression using the predictors selected from the overall regression and the interaction models produced statistics that make sense.
+ Several interaction terms had very small p-values, so we used them in our model.
+ Job Level, Total Working Years, and Job Role were our top three salary predictors.