Predict Contract Renewal Rate in R


Contract renewal and cancellation are essential aspects of any service-based business. To retain customers, a business needs to ensure that they are satisfied with the service they receive. Satisfaction leads to contract renewal, which means a continued revenue stream for the business. Dissatisfied customers, on the other hand, may opt to cancel their contracts, leading to a loss of revenue. This is commonly referred to as customer churn. This post is an attempt to predict service contract cancellations in R based on prior complaints.

R can be used for predicting contract renewal rates. R is a popular programming language and environment for statistical computing and graphics, and has a wide range of built-in and external libraries for data analysis and modeling.

To predict contract renewal rates in R or any other statistical package, models such as logistic regression, decision trees, or random forests can be used, among others. These models can be trained using historical data on contract renewals and related variables such as customer demographics, usage patterns, and satisfaction levels. Once the model is trained, it can be used to predict the probability of contract renewal for new customers based on their characteristics.

Example Dataset to Predict Contract Renewal Rate in R

To understand contract renewal and cancellation better, let's look at a dataset that contains information on customer complaints, incidents, feedback scores, and service contract details. We will then use this data to predict customer cancellation based on prior complaints.

The Dataset

The dataset we will be using contains the following columns:

customer_id: The unique ID of each customer

contract_type: The type of service contract (Basic, Standard, or Premium)

incidents: The number of incidents reported by each customer

complaints: The number of complaints filed by each customer

complaint_date: The date when the complaint was filed

time_to_resolve: The number of days it took to resolve the complaint

feedback_score: The customer’s feedback score on the service received

cancelled: A binary variable indicating whether the customer cancelled their contract (0 = not cancelled, 1 = cancelled)

The following lines in R will create a data frame with 20 rows of sample customer service data.
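As a minimal sketch, a data frame with the columns listed above could be built as follows. The values are randomly simulated for illustration only; they are an assumption and not the sample data used in this post. The seed, the start date, and the value ranges are arbitrary choices.

```r
# Simulated stand-in for the sample data: the values are random and are
# NOT the original data from this post.
set.seed(123)  # arbitrary seed so the simulated values are reproducible
n <- 20        # number of rows, as stated in the text

service_data <- data.frame(
  customer_id     = 1:n,
  contract_type   = sample(c("Basic", "Standard", "Premium"), n, replace = TRUE),
  incidents       = sample(0:5, n, replace = TRUE),
  complaints      = sample(0:3, n, replace = TRUE),
  # arbitrary start date plus a random offset in days
  complaint_date  = as.Date("2022-01-01") + sample(0:364, n, replace = TRUE),
  time_to_resolve = sample(1:10, n, replace = TRUE),
  feedback_score  = sample(1:5, n, replace = TRUE),
  cancelled       = rbinom(n, size = 1, prob = 0.3)
)

str(service_data)   # column types
head(service_data)  # first six rows
```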

Running the above lines of code, the contents of the service_data data frame will look like:

Let's Do Some Data Exploration

To understand the data better, we will perform some basic data analysis. First, let’s look at the distribution of the cancelled contracts:

Figure: Bar Plot of Cancelled Contracts

From the plot, we can see that the majority of the contracts were not cancelled.
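A bar plot of this kind could be produced with base R as sketched below. The `cancelled` vector is a small made-up example used only for illustration, not the data from this post.

```r
# Hypothetical cancellation flags (0 = not cancelled, 1 = cancelled)
cancelled <- c(0, 0, 1, 0, 1, 0, 0, 0, 1, 0)

# Count how many contracts fall into each class
counts <- table(factor(cancelled, levels = c(0, 1),
                       labels = c("Not cancelled", "Cancelled")))

barplot(counts,
        main = "Distribution of Cancelled Contracts",
        xlab = "Contract status",
        ylab = "Number of contracts")
```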

Next, let’s look at the relationship between the number of incidents and the feedback score:

Figure: Scatter plot – Number of Incidents and Feedback Score

From the plot, we can see that there is no clear relationship between the number of incidents and the feedback score. You might expect fewer incidents to result in a higher feedback score, but that is clearly not the case here. A weak correlation between these two variables also means they carry largely separate information, so using both as predictors should not cause collinearity problems in the model.
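A visual impression like this can be backed up with a correlation coefficient. The sketch below uses made-up values (not the data from this post) to show how the scatter plot and the Pearson correlation could be obtained in base R.

```r
# Hypothetical values, for illustration only
incidents      <- c(1, 3, 0, 2, 5, 4, 1, 2, 0, 3)
feedback_score <- c(4, 2, 3, 5, 3, 1, 2, 4, 5, 3)

plot(incidents, feedback_score,
     main = "Number of Incidents vs. Feedback Score",
     xlab = "Number of incidents",
     ylab = "Feedback score")

# Pearson correlation: values near 0 indicate a weak linear relationship
r <- cor(incidents, feedback_score)
print(r)
```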

Now, let’s look at the relationship between the time to resolve complaints and the feedback score:

Figure: Scatter Plot of Time to Resolve and Feedback Scores

Similar to the previous plot, there is no definitive relationship between the time to resolve complaints and feedback scores. One would assume that when a complaint is resolved in two days or less, the customer will give a higher score. Yet only in two instances did a customer give a score of five, the highest possible score, on service calls that took two days or less to complete. The pattern is not consistent either: in other cases the customer gave a 1 even though the resolution took only two days. Clearly, there is no indication of a strong pattern here.
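One way to inspect the quick resolutions directly is to filter them out and tabulate their scores. This is a minimal sketch on made-up values; the `service_calls` data frame and its contents are hypothetical.

```r
# Hypothetical values, for illustration only
service_calls <- data.frame(
  time_to_resolve = c(1, 2, 2, 5, 7, 1, 3, 2),
  feedback_score  = c(5, 1, 5, 3, 2, 4, 4, 2)
)

# Keep only the calls resolved in two days or less
quick <- subset(service_calls, time_to_resolve <= 2)

# How the feedback scores are spread across those quick resolutions
table(quick$feedback_score)
```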

Predicting Customer Cancellation

The fun begins here. Now, let’s use the dataset to predict customer cancellation based on prior complaints. We will first preprocess the data by converting the Complaint and Cancelled variables into factors and imputing missing values with the median.

Here is a short explanation of the above nine lines of code:

  1. service_data$Cancelled <- factor(service_data$cancelled): This line of code is converting the Cancelled variable into a categorical variable or factor. This is because cancelled is also a binary variable, which takes the values 0 or 1, and is used as the outcome variable in the logistic regression model.
  2. service_data$time_to_resolve[is.na(service_data$time_to_resolve)]<-median(service_data$time_to_resolve, na.rm = TRUE): This line of code is imputing missing values in the time_to_resolve variable with the median of the non-missing values in the same variable. The is.na() function identifies the missing values in the time_to_resolve variable, and the median of the non-missing values is calculated using the median() function with the na.rm = TRUE argument to ignore any missing values in the calculation. The [] notation is used to subset the time_to_resolve variable, and the imputed values are assigned to the missing values.
  3. service_data$feedback_score[is.na(service_data$feedback_score)] <- median(service_data$feedback_score, na.rm = TRUE): This line of code does the same thing as the previous line, but for the feedback_score variable. It imputes the missing values in feedback_score with the median of the non-missing values in the same variable, using the same approach as above.

    Overall, these lines of code are preparing the dataset for the logistic regression model by converting the categorical variables into factors and imputing missing values in the numeric variables with the median of the non-missing values.
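The three steps described above can be tried on a toy example. The data frame below is hypothetical and deliberately contains a few missing values so that the effect of the imputation is visible.

```r
# Hypothetical data with a few missing values, for illustration only
service_data <- data.frame(
  time_to_resolve = c(2, NA, 5, 3, 8),
  feedback_score  = c(4, 3, NA, 5, 1),
  cancelled       = c(0, 1, 0, 0, 1)
)

# Convert the binary outcome into a factor
service_data$Cancelled <- factor(service_data$cancelled)

# Replace missing values with the median of the observed values
service_data$time_to_resolve[is.na(service_data$time_to_resolve)] <-
  median(service_data$time_to_resolve, na.rm = TRUE)
service_data$feedback_score[is.na(service_data$feedback_score)] <-
  median(service_data$feedback_score, na.rm = TRUE)

# Confirm that no missing values remain in any column
colSums(is.na(service_data))
```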

    Next, we will train a logistic regression model on the training set using the complaints, incidents, time_to_resolve, and feedback_score variables as predictors:
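A minimal sketch of this step, under stated assumptions: the data are simulated (not the data from this post), the 70/30 train/test split is an assumed ratio, and the relationship used to generate the outcome is invented purely so the model has something to fit. Only `glm()` with `family = binomial` is the standard way to fit a logistic regression in base R.

```r
# Simulated data, for illustration only (not the data from this post)
set.seed(42)  # arbitrary seed
n <- 200
service_data <- data.frame(
  complaints      = rpois(n, 1),
  incidents       = rpois(n, 2),
  time_to_resolve = sample(1:10, n, replace = TRUE),
  feedback_score  = sample(1:5, n, replace = TRUE)
)

# Invented relationship: more complaints and slower resolution raise
# the odds of cancellation
p <- plogis(-2 + 0.6 * service_data$complaints +
              0.15 * service_data$time_to_resolve)
service_data$cancelled <- rbinom(n, 1, p)

# Assumed 70/30 split into training and test sets
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- service_data[train_idx, ]
test  <- service_data[-train_idx, ]

# Logistic regression: binomial family with the default logit link
model <- glm(cancelled ~ complaints + incidents + time_to_resolve + feedback_score,
             data = train, family = binomial)
summary(model)
```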

    The summary of the logistic regression model is:

    From the coefficients, we can see that the complaints and time_to_resolve variables are significant predictors of customer cancellation, as their p-values are less than 0.05. A p-value below 0.05 means that, if a variable truly had no effect on cancellation (the null hypothesis), a coefficient at least this far from zero would be expected in fewer than 5% of samples.

    Now, let’s use the trained model to predict customer cancellation on the test set:

    Two lines of the prediction code deserve special attention. They convert the predicted probabilities from the logistic regression model into binary predictions of cancellation (0 or 1). Here's how it works:

    pred[pred >= 0.5] <- 1:

    This line of code replaces all values of pred that are greater than or equal to 0.5 with a value of 1. In logistic regression, the predicted probabilities are usually interpreted as the probability of the positive outcome (i.e., cancellation in this case). So, if the predicted probability of cancellation is greater than or equal to 0.5, we can assume that the model is predicting a cancellation. Therefore, we set those predictions to 1.

    pred[pred < 0.5] <- 0:

    This line of code replaces all values of pred that are less than 0.5 with a value of 0. A predicted probability of cancellation below 0.5 means the model considers non-cancellation the more likely outcome, so we treat it as a prediction of non-cancellation and set it to 0.

    After applying these two lines of code, all the predicted probabilities have been converted to binary predictions of cancellation (0 or 1), which can be used to evaluate the performance of the model.
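The thresholding can be checked on a handful of values. In the sketch below the probabilities and the actual outcomes are made up for illustration; with a fitted model, the probabilities would typically come from `predict(model, newdata = test, type = "response")`.

```r
# Hypothetical predicted probabilities and true outcomes, for illustration only
pred   <- c(0.91, 0.12, 0.50, 0.49, 0.73, 0.05)
actual <- c(1,    0,    0,    0,    1,    1)

# Threshold at 0.5: probabilities >= 0.5 become 1, the rest become 0
pred[pred >= 0.5] <- 1
pred[pred < 0.5]  <- 0
print(pred)

# Accuracy: share of predictions that match the actual outcomes
accuracy <- mean(pred == actual)
print(accuracy)
```

The same conversion can be written in one step as `ifelse(pred >= 0.5, 1, 0)`.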

    The accuracy of the logistic regression model on the test set is:

    This indicates that the model is able to predict customer cancellation with an accuracy of 75% based on prior complaints.

    Model Performance through Confusion Matrix

    After making the predictions on the test set, we can calculate the confusion matrix to evaluate the performance of the model. The confusion matrix is a table that shows the number of true positives, true negatives, false positives, and false negatives of the predictions.
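In base R, a confusion matrix can be built by cross-tabulating predictions against actual outcomes with `table()`. The two vectors below are hypothetical and serve only to show the layout.

```r
# Hypothetical predictions and outcomes, for illustration only
predicted <- c(1, 0, 1, 0, 1, 0, 0, 1)
actual    <- c(1, 0, 0, 0, 1, 1, 0, 1)

# Rows are predicted classes, columns are actual classes
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Individual cells can be read by class label
conf_mat["1", "1"]  # true positives
conf_mat["1", "0"]  # false positives
```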

    The confusion matrix of the logistic regression model on the test set is:

    From the confusion matrix, we can see that the model produced 5 true negatives and 4 true positives, along with 2 false positives and 1 false negative.

    1. True negatives (TN): the model correctly predicted 5 non-cancellations.
    2. True positives (TP): the model correctly predicted 4 cancellations.
    3. False positives (FP): the model incorrectly predicted 2 cancellations when the actual outcome was non-cancellation.
    4. False negatives (FN): the model incorrectly predicted 1 non-cancellation when the actual outcome was cancellation.

    The accuracy of the model is 0.75, which means that 75% of the predictions (9 of 12) were correct. Treating cancellation as the positive class, the sensitivity of the model is 0.80, which means that it correctly predicted 80% of the cancellations (4 of 5). The specificity of the model is 0.7143, which means that it correctly predicted 71.43% of the non-cancellations (5 of 7).

    The positive predictive value (PPV) of the model is 0.667, which means that 66.7% of the positive predictions (i.e., predicted cancellations) were correct (4 of 6). The negative predictive value (NPV) of the model is 0.833, which means that 83.3% of the negative predictions (i.e., predicted non-cancellations) were correct (5 of 6).
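These metrics follow directly from the four counts in the confusion matrix. The sketch below assumes cancellation (1) is treated as the positive class; if non-cancellation were treated as the positive class instead, sensitivity would swap with specificity and PPV with NPV. The variable names are illustrative.

```r
# Counts reported above, with cancellation (1) as the positive class
TN <- 5; TP <- 4; FP <- 2; FN <- 1

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)  # share of actual cancellations caught
specificity <- TN / (TN + FP)  # share of actual non-cancellations caught
ppv         <- TP / (TP + FP)  # share of predicted cancellations that were right
npv         <- TN / (TN + FN)  # share of predicted non-cancellations that were right

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv), 4)
```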

    In summary, the logistic regression model performed reasonably well on the test set, with fair accuracy, sensitivity, and specificity. However, the dataset is relatively small (42 records) and may not be representative of the larger population of customers, so the results should be interpreted with caution.

    Conclusion

    In this post, I covered contract renewal and cancellation and showed how to predict customer cancellation of a service contract based on prior complaints using R. I used a dataset containing customer, service contract, incident, complaint, complaint date, time to resolve, and feedback score information, along with a binary outcome variable called Cancelled (0 or 1), and performed some basic data analysis to understand the data better. I then trained a logistic regression model on the data and used it to predict customer cancellation on a test set, achieving an accuracy of 75%. The model developed here can be extended to predict service cancellations using more sophisticated algorithms such as XGBoost and neural networks. I will cover them in detail in a future post with a real-world dataset.
