Contract renewal and cancellation are essential aspects of any service-based business. To retain customers, a business must ensure they are satisfied with the service they receive. That satisfaction is what leads to contract renewal, which means a continued revenue stream for the business. Customers who are dissatisfied with the service, on the other hand, may opt to cancel their contracts, causing a loss of revenue; this is commonly referred to as **customer churn**. In this post, we use R to predict service contract cancellation based on prior complaints.

R is well suited to predicting contract renewal rates: it is a popular programming language and environment for statistical computing and graphics, with a wide range of built-in and external libraries for data analysis and modeling.

To predict contract renewal rates in R (or any statistical package), models such as logistic regression, decision trees, and random forests, among others, can be used. These models are trained on historical data covering contract renewals and related variables such as customer demographics, usage patterns, and satisfaction levels. Once trained, a model can predict the probability of contract renewal for new customers based on their characteristics.

## Example Dataset to Predict Contract Renewal Rate in R

To understand contract renewal and cancellation better, let's look at a dataset that contains information on customer complaints, incidents, feedback scores, and service contract details. We will then use this data to predict customer cancellation based on prior complaints.

**The Dataset**

The dataset we will be using contains the following columns:

**customer_id**: The unique ID of each customer

**contract_type**: The type of service contract (Basic, Standard, or Premium)

**incidents**: The number of incidents reported by each customer

**complaints**: The number of complaints filed by each customer

**complaint_date**: The date when the complaint was filed

**time_to_resolve**: The number of days it took to resolve the complaint

**feedback_score**: The customer's feedback score on the service received

**cancelled**: A binary variable indicating whether the customer cancelled their contract (0 = not cancelled, 1 = cancelled)

The following lines in R will create a dataframe with 42 rows of sample service data of customers.
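A sketch along these lines could be the following (the seed and random draws here are my assumptions, so the generated values will not exactly match the sample output shown below):

```r
# Generate 42 rows of hypothetical service data (random values, for illustration)
set.seed(42)
n <- 42
service_data <- data.frame(
  customer_id     = 1:n,
  contract_type   = sample(c("Basic", "Standard", "Premium"), n, replace = TRUE),
  incidents       = sample(0:10, n, replace = TRUE),
  complaints      = sample(0:5, n, replace = TRUE),
  complaint_date  = as.Date("2021-01-01") + sample(0:89, n, replace = TRUE),
  time_to_resolve = sample(1:10, n, replace = TRUE),
  feedback_score  = sample(1:5, n, replace = TRUE),
  cancelled       = sample(0:1, n, replace = TRUE)
)
head(service_data, 10)
```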

After running the code above, the first ten rows of `service_data` will look like this:

```
customer_id contract_type incidents complaints complaint_date time_to_resolve feedback_score cancelled
          1       Premium         8          3     2021-03-01               6              4         0
          2       Premium         2          4     2021-01-16               2              4         1
          3       Premium         7          4     2021-01-06               7              1         1
          4      Standard         9          2     2021-03-13               8              3         0
          5       Premium         6          5     2021-03-27               2              3         0
          6      Standard         9          0     2021-03-27               6              1         1
          7      Standard         8          1     2021-02-08               2              3         1
          8      Standard         2          4     2021-01-31               6              5         0
          9       Premium         3          4     2021-03-22               5              2         0
         10         Basic         0          3     2021-02-19               9              3         1
```

## Exploring the Data

To understand the data better, we will perform some basic data analysis. First, let’s look at the distribution of the cancelled contracts:

```r
library(ggplot2)

ggplot(service_data, aes(x = cancelled)) +
  geom_bar(fill = c("steelblue", "red"), width = 0.2) +
  scale_x_continuous(breaks = seq(0, 1, 1)) +
  labs(title = "Distribution of Cancelled Contracts", x = "Cancelled", y = "Count") +
  theme(plot.title = element_text(size = 10), axis.text = element_text(size = 8))
```

From the plot, we can see that the majority of the contracts were not cancelled.

Next, let’s look at the relationship between the number of incidents and the feedback score:

```r
ggplot(service_data, aes(x = incidents, y = feedback_score)) +
  geom_jitter(alpha = 0.5, size = 3, color = "darkred") +
  scale_x_continuous(breaks = seq(0, 10, 2)) +
  labs(title = "Relationship Between Incidents and Feedback Score",
       x = "Incidents", y = "Feedback Score")
```

From the plot, we can see that there is no clear relationship between the number of incidents and the feedback score. You might expect fewer incidents to result in higher feedback scores, but that is clearly not the case here. The weak correlation between these two variables means each carries somewhat independent information for the prediction model.
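To put a number on this visual impression, we can compute the Pearson correlation. The sketch below uses only the ten sample rows shown earlier, so the value is merely illustrative:

```r
# Correlation between incidents and feedback score, using the ten sample rows above
incidents      <- c(8, 2, 7, 9, 6, 9, 8, 2, 3, 0)
feedback_score <- c(4, 4, 1, 3, 3, 1, 3, 5, 2, 3)
cor(incidents, feedback_score)  # about -0.40: modest, far from a strong pattern
```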

Now, let’s look at the relationship between the time to resolve complaints and the feedback score:

```r
ggplot(service_data, aes(x = time_to_resolve, y = feedback_score)) +
  geom_point(alpha = 0.5, size = 3, color = "red") +
  scale_x_continuous(breaks = seq(0, 10, 2)) +
  labs(title = "Relationship Between Time to Resolve and Feedback Score",
       x = "Time to Resolve (days)", y = "Feedback Score")
```

Similar to the previous plot, there is no definitive relationship between the time to resolve complaints and the feedback score. One might assume that complaints resolved in two days or less would earn higher scores. Yet only twice did a customer give a score of five, the highest possible, for a service call resolved in two days or less, and in one case a customer gave a score of 1 for a call resolved in just two days. Clearly, there is no strong pattern here.

## Predicting Customer Cancellation

The fun begins here. Now, let's use the dataset to predict customer cancellation based on prior complaints. We will first preprocess the data by converting the `cancelled` outcome variable into a factor and imputing missing values in `time_to_resolve` and `feedback_score` with their medians.

```r
# sample.split() comes from the caTools package
library(caTools)

# Preprocessing data
service_data$Cancelled <- factor(service_data$cancelled)
service_data$time_to_resolve[is.na(service_data$time_to_resolve)] <-
  median(service_data$time_to_resolve, na.rm = TRUE)
service_data$feedback_score[is.na(service_data$feedback_score)] <-
  median(service_data$feedback_score, na.rm = TRUE)

# Splitting data into training and testing sets
set.seed(123)
split <- sample.split(service_data$cancelled, SplitRatio = 0.7)
train <- subset(service_data, split == TRUE)
test <- subset(service_data, split == FALSE)
```

Here is a short explanation of the preprocessing code above:

`service_data$Cancelled <- factor(service_data$cancelled)`: converts the `cancelled` variable into a categorical variable, or factor. `cancelled` is a binary variable taking the values 0 or 1, and it is used as the outcome variable in the logistic regression model.

`service_data$time_to_resolve[is.na(service_data$time_to_resolve)] <- median(service_data$time_to_resolve, na.rm = TRUE)`: imputes missing values in the `time_to_resolve` variable with the median of its non-missing values. The `is.na()` function identifies the missing values, the `median()` function with the `na.rm = TRUE` argument computes the median while ignoring those missing values, and the `[]` subsetting assigns the imputed value to the missing entries.

`service_data$feedback_score[is.na(service_data$feedback_score)] <- median(service_data$feedback_score, na.rm = TRUE)`: does the same thing as the previous line, but for the `feedback_score` variable, imputing its missing values with the median of the non-missing values.

Overall, these lines of code are preparing the dataset for the logistic regression model by converting the categorical variables into factors and imputing missing values in the numeric variables with the median of the non-missing values.
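To see the median-imputation pattern in isolation, here is a tiny self-contained example on a hypothetical dataframe `df`:

```r
# Median imputation on a toy dataframe with one NA per column
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
df$a[is.na(df$a)] <- median(df$a, na.rm = TRUE)  # NA in 'a' becomes 2
df$b[is.na(df$b)] <- median(df$b, na.rm = TRUE)  # NA in 'b' becomes 4.5
colSums(is.na(df))  # both columns now contain zero missing values
```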

Next, we will train a logistic regression model on the training set using the complaints, time_to_resolve, and feedback_score variables as predictors:

```r
# Training logistic regression model
model <- glm(Cancelled ~ time_to_resolve + complaints + feedback_score,
             data = train, family = "binomial")
summary(model)
```

The summary of the logistic regression model is:

```
Call:
glm(formula = Cancelled ~ time_to_resolve + complaints + feedback_score,
    family = "binomial", data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3737  -0.6490  -0.1764   0.4844   1.8096

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      13.9165     6.5995   2.109   0.0350 *
time_to_resolve  -1.0317     0.5039  -2.047   0.0406 *
complaints       -0.5941     0.3499  -1.698   0.0895 .
feedback_score   -2.3091     1.0438  -2.212   0.0270 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 41.054  on 29  degrees of freedom
Residual deviance: 25.396  on 26  degrees of freedom
AIC: 33.396

Number of Fisher Scoring iterations: 6
```

From the coefficients, we can see that the time_to_resolve and feedback_score variables are significant predictors of customer cancellation, as their p-values are less than 0.05, while complaints is only marginally significant (p ≈ 0.09). A *p-value* less than 0.05 suggests that there is less than a 5% chance of observing an effect this strong by chance alone, assuming the null hypothesis is true.
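Since logistic regression coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are easier to interpret. A quick sketch using the coefficient estimates from the summary above (hard-coded here for illustration; in practice you would call `exp(coef(model))`):

```r
# Odds ratios from the fitted coefficients (values copied from the summary above)
coefs <- c(time_to_resolve = -1.0317, complaints = -0.5941, feedback_score = -2.3091)
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
# Each one-point increase in feedback_score multiplies the odds of
# cancellation by about 0.099, i.e. reduces them roughly tenfold.
```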

Now, let’s use the trained model to predict customer cancellation on the test set:

```r
# Predicting cancellation on test set
pred <- predict(model, newdata = test, type = "response")

# Converting probabilities to binary predictions
pred[pred >= 0.5] <- 1
pred[pred < 0.5] <- 0

# Calculating accuracy of predictions
accuracy <- sum(pred == test$Cancelled) / nrow(test)
print(paste0("Accuracy: ", accuracy))
```

Two lines of code deserve special attention: they convert the predicted probabilities from the logistic regression model into binary predictions of cancellation (0 or 1). Here's how they work:

`pred[pred >= 0.5] <- 1`: replaces every value of `pred` greater than or equal to 0.5 with 1. In logistic regression, the predicted probability is interpreted as the probability of the positive outcome (cancellation, in this case), so a predicted probability of 0.5 or more is treated as a predicted cancellation, and those predictions are set to 1.

`pred[pred < 0.5] <- 0`: replaces every value of `pred` below 0.5 with 0. A predicted probability under 0.5 means the model considers non-cancellation the more likely outcome, so those predictions are set to 0.

After applying these two lines of code, all the predicted probabilities have been converted to binary predictions of cancellation (0 or 1), which can be used to evaluate the performance of the model.
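As a side note, the same thresholding is often written as a single `ifelse()` call; a small standalone illustration:

```r
# One-line threshold: probabilities at or above 0.5 become 1, the rest 0
pred <- c(0.12, 0.85, 0.50, 0.33)   # example probabilities
pred_class <- ifelse(pred >= 0.5, 1, 0)
pred_class  # 0 1 1 0
```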

The accuracy of the logistic regression model on the test set is:

```
[1] "Accuracy: 0.75"
```

This indicates that the model is able to predict customer cancellation with an accuracy of 75% based on prior complaints.

## Model Performance through Confusion Matrix

After making the predictions on the test set, we can calculate the **confusion matrix** to evaluate the performance of the model. The confusion matrix is a table that shows the number of true positives, true negatives, false positives, and false negatives of the predictions.

```r
library(caret)
confusionMatrix(factor(pred), test$Cancelled)
```

The confusion matrix of the logistic regression model on the test set is:

```
Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 5 1
         1 2 4

               Accuracy : 0.75
                 95% CI : (0.4281, 0.9451)
    No Information Rate : 0.5833
    P-Value [Acc > NIR] : 0.1916

                  Kappa : 0.5

 Mcnemar's Test P-Value : 1.0000

            Sensitivity : 0.7143
            Specificity : 0.8000
         Pos Pred Value : 0.8333
         Neg Pred Value : 0.6667
             Prevalence : 0.5833
         Detection Rate : 0.4167
   Detection Prevalence : 0.5000
      Balanced Accuracy : 0.7571
```

From the confusion matrix, we can see that the model correctly predicted 5 true negatives and 4 true positives, while making 2 false positives and 1 false negative.

- True negatives (TN): the model correctly predicted 5 non-cancellations.
- True positives (TP): the model correctly predicted 4 cancellations.
- False positives (FP): the model incorrectly predicted 2 cancellations when the actual outcome was non-cancellation.
- False negatives (FN): the model incorrectly predicted 1 non-cancellation when the actual outcome was cancellation.

The accuracy of the model is 0.75, which means that 75% of the predictions were correct. Note that `caret` treats the first factor level ("0", non-cancellation) as the positive class by default. The sensitivity of 0.7143 therefore indicates that 71.43% of the actual non-cancellations (5 of 7) were identified correctly, and the specificity of 0.80 indicates that 80% of the actual cancellations (4 of 5) were identified correctly.

Likewise, the positive predictive value (PPV) of 0.8333 means that 83.33% of the predicted non-cancellations (5 of 6) were correct, and the negative predictive value (NPV) of 0.6667 means that 66.67% of the predicted cancellations (4 of 6) were correct. To report these statistics with cancellation as the positive class, pass `positive = "1"` to `confusionMatrix()`.
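As a sanity check, the headline statistics can be recomputed directly from the four counts in the matrix, remembering that `caret` takes the first factor level ("0") as the positive class by default:

```r
# Cell counts from the confusion matrix above (prediction x reference)
p0r0 <- 5; p0r1 <- 1; p1r0 <- 2; p1r1 <- 4
accuracy    <- (p0r0 + p1r1) / (p0r0 + p0r1 + p1r0 + p1r1)  # 9/12 = 0.75
sensitivity <- p0r0 / (p0r0 + p1r0)   # class "0" correctly found: 5/7 ~ 0.7143
specificity <- p1r1 / (p0r1 + p1r1)   # class "1" correctly found: 4/5 = 0.80
round(c(accuracy, sensitivity, specificity), 4)
```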

In summary, the logistic regression model performed reasonably well on the test set, with fair accuracy, sensitivity, and specificity. However, it should be noted that the dataset is relatively small (42 records) and may not be representative of the larger population of customers, so the results should be interpreted with caution.

## Conclusion

In this post, I covered contract renewal and cancellation and showed how to predict customer cancellation of a service contract based on prior complaints using R. I used a dataset containing customer IDs, service contract details, incidents, complaints, complaint dates, time to resolve, feedback scores, and a binary outcome variable called `Cancelled`, and performed some basic data analysis to understand the data better. We then trained a logistic regression model on the data and used it to predict customer cancellation on a test set, achieving an accuracy of 75%. The model developed here can be extended to predict service cancellations using more sophisticated algorithms such as **XGBoost** and **neural networks**, which I will cover in detail in a future post with a real-world dataset.
