Introduction
In today’s bustling hospitality landscape, managing hotel reservations efficiently is crucial for maximizing profits and ensuring guest satisfaction. However, the persistent challenge of reservation cancellations remains a headache for hotel owners, impacting their bottom line and complicating planning efforts. To address this issue head-on, our project aims to leverage data-driven methods to predict hotel reservation cancellations accurately. With various factors influencing booking decisions, from guest demographics to reservation specifics, understanding these dynamics can be both complex and rewarding.
Our project delves into the “Hotel Demand” dataset to shed light on the factors driving demand in the hotel industry and the methodologies employed in modelling this phenomenon. Through our analysis, we seek to uncover patterns and correlations within the dataset to answer key questions such as what factors impact hotel demand and how various approaches contribute to model development. By exploring variables such as Average Daily Rate (ADR), booking channel, and lead time, we aim to discern the nuances of demand dynamics. Additionally, we’ll investigate factors like previous booking behaviour, deposit types, and special requests to gain insights into guest preferences and behaviours.
Through this investigation, our aim is not only to address current queries but also to discover fresh insights that can improve our understanding of hotel demand and enhance predictive modelling techniques. By leveraging data-driven approaches, we strive to provide valuable insights that empower hoteliers to better manage reservations, optimize revenue generation, and enhance the guest experience.
Literature Review
The project “Using API with Logistic Regression Model to Predict Hotel Reservation Cancellation by Detecting the Cancellation Factors” aims to develop a predictive model for hotel reservation cancellations by leveraging insights from existing literature on hotel cancellation policies, predictive modelling techniques, and logistic regression. By synthesizing key insights from studies investigating factors influencing cancellations and predictive modelling approaches, the project seeks to provide a comprehensive solution utilizing data mining techniques. Through the integration of logistic regression models into an API, hotel management systems will gain a predictive tool to assess the likelihood of reservation cancellations based on various parameters. This approach aligns with the broader goal of enhancing revenue management strategies, optimizing resource allocation, and improving customer service in the hospitality industry. By leveraging insights from diverse sources and disciplines, the project endeavours to offer a robust framework for decision-making, ultimately enhancing customer satisfaction and operational efficiency for hoteliers.
The study “An Analysis of Hotel Booking Cancellations and Factors Affecting Revenue Generation” by Aditya Dole explores the intricacies of cancelled hotel reservations in both City and Resort Hotel segments, aiming to identify causes and provide strategic guidance for revenue enhancement. Through a comprehensive literature review, previous research indicates the significant impact of factors such as pricing, location, and personal reasons on cancellation rates. Methodologically, the study adopts a systematic approach to data analysis, employing techniques like data pre-processing and advanced imputation methods. The research hypotheses focus on untangling factors influencing cancellations and orienting pricing decisions for hotel operators. Insights gleaned from the analysis underscore the importance of factors like booking lead time, client demographics, and accommodation types in shaping cancellation rates, suggesting tailored pricing strategies and customer-centric policies to mitigate revenue loss. Ultimately, the study highlights the critical role of data-driven decision-making in addressing challenges posed by booking cancellations and suggests future research directions to further advance understanding in this area. These findings closely align with our project’s goal of optimizing revenue management strategies in the hospitality sector, offering valuable insights that resonate with our objectives. Integrating insights from this research into our project framework will enable us to develop more robust predictive models and implement tailored pricing strategies, ultimately enhancing revenue generation and operational efficiency for hotels.
The research outlined in “Performance Comparison of Classification Algorithms in Hotel Booking Cancellation Prediction” directly correlates with our project’s focus on optimizing revenue management strategies in the hospitality sector. The study evaluates six classification algorithms on a dataset of 119,390 reservation records spanning 2015-2017, with 30 attributes including average daily rate and booking changes, and employs cross-validation and Principal Component Analysis (PCA) to assess algorithm performance. k-Nearest Neighbors (kNN) and Random Forest (RF) achieved the highest accuracy rates of 85%, underscoring the critical role of accurate prediction in revenue management effectiveness and offering valuable insights into potential approaches for enhancing our project’s predictive models. Moreover, the exploration of dataset attributes and the impact of applying PCA on performance aligns with our project’s emphasis on leveraging diverse data sources and advanced techniques to improve prediction accuracy. The study’s emphasis on accuracy metrics, comparative analysis, and practical implications resonates with our project’s commitment to rigorous evaluation and real-world application. By leveraging insights from this research, our project aims to advance revenue management practices, driving improved profitability and operational efficiency for hotels.
The article titled “Identifying Critical Hotel Cancellations Using Artificial Intelligence” explores the application of artificial intelligence (AI) techniques to forecast hotel booking cancellations, offering significant opportunities for revenue optimization and management in the hospitality sector. Leveraging Passenger Name Record (PNR) data, the study achieves promising results, with an impressive 80% accuracy in predicting cancellations made seven days in advance. This research addresses a crucial gap in existing literature by focusing on individual cancellations that have a substantial impact on revenue loss, particularly those occurring close to the time of service. The methodology employed simplifies data preparation and minimizes reliance on external data sources, enhancing its practical applicability for hotel managers. Practical implications of the study include the optimization of booking management systems and cancellation policies based on accurate forecasts, ultimately leading to improved revenue generation and enhanced customer satisfaction. Overall, this research underscores the potential of AI in addressing critical challenges in the hotel industry and emphasizes the importance of proactive revenue management strategies. This study aligns closely with our project on “Hotel Demand,” as it highlights the significance of accurately forecasting cancellations, which is essential for optimizing hotel operations and maximizing revenue. By leveraging AI techniques and PNR data, similar to the approach outlined in the article, our project aims to develop predictive models that can anticipate fluctuations in hotel demand. The methodologies and findings presented in this literature review provide valuable insights that will guide our project’s efforts to enhance revenue management strategies and improve customer satisfaction in the hospitality sector.
The study “Big Data in Hotel Revenue Management: Exploring Cancellation Drivers to Gain Insights Into Booking Cancellation Behavior” directly aligns with our project objectives, which aim to optimize revenue management strategies in the hospitality industry. By addressing the challenge of booking cancellations through predictive analytics, the research offers valuable insights that resonate with our goals of improving demand forecasting accuracy and operational efficiency. Implementing similar machine learning techniques and leveraging diverse data sources will empower us to proactively identify and manage cancellations, thereby enhancing revenue generation and customer satisfaction. Additionally, the emphasis on dynamic cancellation policies underscores the importance of flexibility in adapting to changing booking patterns, a key aspect that we aim to integrate into our revenue management framework. Overall, this study provides a solid foundation and practical insights that inform and enrich our project’s approach to optimizing revenue management practices in the dynamic landscape of the hospitality sector.
References
• Almotiri, S. H., Alosaimi, N., & Abdullah, B. (2021). Using API with logistic regression model to predict hotel reservation cancellation by detecting the cancellation factors. International Journal of Advanced Computer Science & Applications, 12(6). https://doi.org/10.14569/ijacsa.2021.0120688
• Dole, A. (2023). An analysis of hotel booking cancellations and factors affecting revenue generation. International Journal for Research in Applied Science and Engineering Technology, 11(10), 2141–2148. https://doi.org/10.22214/ijraset.2023.56407
• Tekin, M., & Gök, M. (2021, April 30). Performance comparison of classification algorithms in hotel booking cancellation prediction. https://dergipark.org.tr/en/pub/aita/issue/70741/1137735
• Mejia, C., Medina, A. J. S., & Pellejero, M. (2020). Identifying critical hotel cancellations using artificial intelligence. Tourism Management Perspectives, 35, 100718. https://doi.org/10.1016/j.tmp.2020.100718
• António, N., De Almeida, A., & Nunes, L. (2019). Big data in hotel revenue management: Exploring cancellation drivers to gain insights into booking cancellation behavior. Cornell Hospitality Quarterly, 60(4), 298–319. https://doi.org/10.1177/1938965519851466
library(readr)
df = read.csv("Dataset 20-hotel_bookings.csv", stringsAsFactors = TRUE)
str(df)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
## $ country : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
## $ market_segment : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
## $ distribution_channel : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
## $ assigned_room_type : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ agent : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
## $ company : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
## $ reservation_status_date : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...
head(df)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
# Replacing NULL values in "agent" with the most frequent (mode) value in that column.
cleaned.df = df
ourtable = table(cleaned.df$agent)
posmode = which.max(ourtable)
ourmode = names(posmode)
cleaned.df$agent[cleaned.df$agent == "NULL"] = ourmode
# Replacing NULL values in "company" with the most frequent real value.
# "NULL" is itself the most frequent level in this column, so the
# second-ranked entry of the sorted frequency table serves as the mode.
ourmodeco = names(sort(table(cleaned.df$company), decreasing = TRUE))[2]
cleaned.df$company[cleaned.df$company == "NULL"] = ourmodeco
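The two imputation steps above share one pattern; a small helper makes the intent explicit. This is an illustrative sketch only: `impute_mode` and its `exclude` argument are names introduced here, not part of the original script. The `exclude` argument guarantees the placeholder level itself can never be chosen as the mode.

```r
# Replace a placeholder level in a factor with the most frequent real level.
# "exclude" lists levels that must never be chosen as the mode (e.g. "NULL").
impute_mode <- function(x, placeholder = "NULL", exclude = placeholder) {
  counts <- table(x)
  counts <- counts[!names(counts) %in% exclude]  # drop the placeholder's count
  mode_level <- names(which.max(counts))         # most frequent real level
  x[x == placeholder] <- mode_level
  droplevels(x)
}

# Toy example: the two "NULL" entries are replaced by the modal code "9".
agent_clean <- impute_mode(factor(c("9", "NULL", "9", "240", "NULL")))
table(agent_clean)
```

Applied to the real columns, this would collapse both blocks above into single calls, and it avoids the subtle pitfall of the placeholder winning the frequency count.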
# Binary Variables:
# Getting to know the data
# Percentage of cancelled vs. non-cancelled bookings:
(showups = sum(cleaned.df$is_canceled == 0)/length(cleaned.df$is_canceled) * 100)
## [1] 62.95837
(noshowups = sum(cleaned.df$is_canceled == 1)/ length(cleaned.df$is_canceled) * 100)
## [1] 37.04163
# Visualization
pie_data <- c("Not Cancelled - 62.96%" = showups, "Cancelled - 37.04%" = noshowups)
pie(pie_data, main = "Cancellation Distribution - Guests who show up vs. guests who cancel", col = c("lightblue", "lightgreen"))
# 62.96% of guests show up at the hotel without cancelling, while 37.04% cancel their bookings.
# Percentage of repeated vs. first-time guests:
(repeatedguest = sum(cleaned.df$is_repeated_guest == 1)/length(cleaned.df$is_repeated_guest) * 100)
## [1] 3.191222
(notrepeatedguest = sum(cleaned.df$is_repeated_guest == 0)/length(cleaned.df$is_repeated_guest) * 100)
## [1] 96.80878
# 96.81% of guests were first-time visitors; only 3.19% were repeat customers. Such a low repeat rate says a lot about the hotel's services: why are there so few returning visitors?
# Visualization
pie_data_RG <- c("First time visitors - 96.81%" = notrepeatedguest, "Repeated Guests - 3.19%" = repeatedguest)
pie(pie_data_RG, main = "Distribution of first-time vs. repeated guests", col = c("lightblue", "lightgreen"))
# Numerical Variables
# ADR
# Getting to know the variables, summary statistics:
summary(cleaned.df$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.38 69.29 94.58 101.83 126.00 5400.00
# ADR (Average Daily Rate) is calculated by dividing the sum of all lodging transactions by the total number of staying nights; essentially, it is the average rate per night.
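As a quick numeric illustration of that definition (the figures below are made up, not drawn from the dataset):

```r
# ADR = total lodging revenue / total number of staying nights.
nightly_revenue <- c(120, 95, 110, 75)  # four nights of lodging transactions
adr <- sum(nightly_revenue) / length(nightly_revenue)
adr  # 100: the average rate per night for this stay
```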
# Visualization
boxplot(summary(cleaned.df$adr),ylim = c(0, 6000), main = "Statistics summary for ADR", xlab = "Stats", ylab = "ADR")
# There is one extreme outlier: a single observation with an ADR of 5,400 dollars. There is not enough data to support that single observation, so we may decide to drop that row in the future.
boxplot(summary(cleaned.df$adr),ylim = c(0, 200), main = "Statistics summary for ADR", xlab = "Stats", ylab = "ADR")
# On average, the rate for one night at the hotel/resort is a little below 100 dollars.
# Lead time
# Summary Statistics
# Lead time is the number of days that elapse between the date the booking is entered into the PMS and the arrival date; in other words, the time from the booking date until the guest checks in.
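For instance, lead time can be recovered from raw dates like this (the dates below are illustrative, not taken from the dataset):

```r
# Lead time = days between entering the booking and the arrival date.
booking_date <- as.Date("2015-06-24")
arrival_date <- as.Date("2015-07-01")
lead_time <- as.numeric(arrival_date - booking_date)
lead_time  # 7 days from booking until check-in
```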
summary(cleaned.df$lead_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 18 69 104 160 737
# Visualization
boxplot(summary(cleaned.df$lead_time), main = "Statistics summary for Lead Time", xlab = "Stats", ylab = "Lead Time")
# There are outliers in the data set.
# Categorical Variables
# Summary statistics
# Counting the frequency of each agent through whom bookings were made.
table(cleaned.df$agent)
##
## 1 10 103 104 105 106 107 11 110 111 112 114 115
## 7191 260 21 53 14 2 2 395 12 16 15 1 225
## 117 118 119 12 121 122 126 127 128 129 13 132 133
## 1 69 304 578 37 2 14 45 23 14 82 143 56
## 134 135 138 139 14 141 142 143 144 146 147 148 149
## 482 2 287 8 3640 6 137 172 1 124 156 4 28
## 15 150 151 152 153 154 155 156 157 158 159 16 162
## 402 5 56 183 25 193 94 190 61 1 89 246 37
## 163 165 167 168 17 170 171 173 174 175 177 179 180
## 7 1 3 184 241 93 607 29 22 195 347 2 4
## 181 182 183 184 185 187 19 191 192 193 195 196 197
## 59 8 45 52 78 24 1061 198 41 15 193 301 1
## 2 20 201 205 208 21 210 211 213 214 215 216 219
## 162 540 42 27 173 875 7 2 1 5 15 1 13
## 22 220 223 227 229 23 232 234 235 236 24 240 241
## 382 104 18 2 786 25 2 128 29 247 22 13922 1721
## 242 243 244 245 247 248 249 25 250 251 252 253 254
## 780 514 4 37 1 131 51 3 2870 220 29 87 29
## 256 257 258 26 261 262 265 267 269 27 270 273 275
## 24 24 3 401 38 22 1 1 2 450 6 349 8
## 276 278 28 280 281 282 283 285 286 287 288 289 29
## 8 1 1666 1 82 2 2 1 45 8 14 1 683
## 290 291 294 295 296 298 299 3 30 300 301 302 303
## 19 1 1 4 42 472 1 1336 484 1 1 3 2
## 304 305 306 307 308 31 310 313 314 315 32 321 323
## 1 45 35 14 54 162 25 36 927 284 15 3 25
## 324 325 326 327 328 33 330 331 332 333 334 335 336
## 9 6 165 20 9 31 125 2 55 1 28 4 23
## 337 339 34 341 344 346 348 35 350 352 354 355 358
## 1 77 294 4 8 1 22 109 28 1 14 4 1
## 359 36 360 363 364 367 368 37 370 371 375 378 38
## 21 100 15 6 19 1 45 1230 3 4 40 36 274
## 384 385 387 388 39 390 391 393 394 397 4 40 403
## 2 60 32 1 127 57 2 13 33 1 47 1039 4
## 404 405 406 408 41 410 411 414 416 418 42 420 423
## 2 5 1 1 75 133 16 2 1 8 211 3 19
## 425 426 427 429 430 431 432 433 434 436 438 44 440
## 16 3 3 5 4 1 1 1 33 49 2 292 56
## 441 444 446 449 45 450 451 453 454 455 459 461 464
## 7 1 1 2 32 1 1 1 2 19 16 2 98
## 467 468 469 47 472 474 475 476 479 480 481 483 484
## 39 49 2 50 1 17 8 2 32 1 8 1 11
## 492 493 495 497 5 50 502 508 509 510 52 526 527
## 28 35 57 1 330 20 24 6 10 2 137 10 35
## 53 531 535 54 55 56 57 58 59 6 60 61 63
## 18 68 3 1 16 375 28 335 1 3290 19 2 29
## 64 66 67 68 69 7 70 71 72 73 74 75 77
## 23 44 127 211 90 3539 1 73 6 1 20 73 33
## 78 79 8 81 82 83 85 86 87 88 89 9 90
## 37 47 1514 6 77 696 554 338 77 19 99 48301 1
## 91 92 93 94 95 96 98 99 NULL
## 58 7 1 114 135 537 124 68 0
# Note: as.numeric() on a factor returns the internal level index (1-334 here),
# not the original agent code, so the bins below group level indices.
cleaned.df$agent1 <- as.numeric(cleaned.df$agent)
cleaned.df$agent_category <- cut(cleaned.df$agent1, breaks = c(-Inf, 100, 200, 300, 400, Inf), labels = c("0-100", "100-200", "200-300", "300-400", "400+"))
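cut() assigns each value to the half-open interval that contains it (right-closed by default, so a value of exactly 100 would land in the first bin). A minimal demonstration with toy codes, not actual agent values:

```r
# Each value lands in the interval (lower, upper] that contains it.
codes <- c(5, 150, 250, 350, 999)
cut(codes, breaks = c(-Inf, 100, 200, 300, 400, Inf),
    labels = c("0-100", "100-200", "200-300", "300-400", "400+"))
```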
# Visualization
library(ggplot2)
ggplot(cleaned.df, aes(x = agent_category, fill = agent_category)) +
geom_bar() +
labs(title = "Agent Code Frequency Distribution", x = "Agent Category", y = "Frequency") +
theme_minimal()
# Counting the frequency of hotel bookings in each month across all years (2015-2017).
table(cleaned.df$arrival_date_month)
##
## April August December February January July June March
## 11089 13877 6780 8068 5929 12661 10939 9794
## May November October September
## 11791 6794 11160 10508
library(ggplot2)
ggplot(cleaned.df, aes(x = arrival_date_month)) + geom_bar(fill = "lightblue", color = "black") + labs(title = "Monthly Distribution of Bookings", x = "Month (2015-2017)", y = "Frequency of bookings") + theme_minimal() + scale_fill_brewer(palette = "Set3")
# Counting the frequency of hotel bookings from various companies.
table(cleaned.df$company)
##
## 10 100 101 102 103 104 105 106 107 108 109
## 1 1 1 1 16 1 8 2 9 11 1
## 11 110 112 113 115 116 118 12 120 122 126
## 1 52 13 36 4 6 7 14 14 18 1
## 127 130 132 135 137 139 14 140 142 143 144
## 15 12 1 66 4 3 9 1 1 17 27
## 146 148 149 150 153 154 158 159 16 160 163
## 3 37 5 19 215 133 2 6 5 1 17
## 165 167 168 169 174 178 179 18 180 183 184
## 3 7 2 65 149 27 24 1 5 16 1
## 185 186 192 193 195 197 20 200 202 203 204
## 4 12 4 16 38 47 50 3 38 13 34
## 207 209 210 212 213 215 216 217 218 219 22
## 9 19 2 1 1 8 21 2 43 141 6
## 220 221 222 223 224 225 227 229 230 232 233
## 4 27 2 784 3 7 24 1 3 2 114
## 234 237 238 240 242 243 245 246 250 251 253
## 1 1 33 3 62 2 3 3 2 18 1
## 254 255 257 258 259 260 263 264 268 269 270
## 10 6 1 1 2 3 14 2 14 33 43
## 271 272 273 274 275 277 278 279 28 280 281
## 2 3 1 14 3 5 2 8 5 48 138
## 282 284 286 287 288 289 29 290 291 292 293
## 4 1 21 5 1 2 2 17 12 18 5
## 297 301 302 304 305 307 308 309 31 311 312
## 7 1 5 2 1 36 33 1 17 2 3
## 313 314 316 317 318 319 32 320 321 323 324
## 1 1 2 9 1 3 1 1 2 10 9
## 325 329 330 331 332 333 334 337 338 34 341
## 2 12 4 61 2 11 3 25 12 8 5
## 342 343 346 347 348 349 35 350 351 352 353
## 48 29 14 1 59 2 1 3 2 1 4
## 355 356 357 358 360 361 362 364 365 366 367
## 13 10 5 7 12 2 2 6 29 24 14
## 368 369 37 370 371 372 373 376 377 378 379
## 1 5 10 2 11 3 1 1 5 3 9
## 38 380 382 383 384 385 386 388 39 390 391
## 51 12 5 6 9 30 1 7 8 13 2
## 392 393 394 395 396 397 398 399 40 400 401
## 4 1 6 4 18 15 1 11 113520 2 1
## 402 403 405 407 408 409 410 411 412 413 415
## 1 2 119 22 15 12 5 2 1 1 1
## 416 417 418 419 42 420 421 422 423 424 425
## 1 1 25 1 5 1 9 1 2 24 1
## 426 428 429 43 433 435 436 437 439 442 443
## 4 13 2 29 2 12 2 7 6 1 5
## 444 445 446 447 448 45 450 451 452 454 455
## 5 4 1 2 4 250 10 6 4 1 1
## 456 457 458 459 46 460 461 465 466 47 470
## 2 3 2 5 26 3 1 12 3 72 5
## 477 478 479 48 481 482 483 484 485 486 487
## 23 2 1 5 1 2 2 2 14 2 1
## 489 49 490 491 492 494 496 497 498 499 501
## 1 5 5 2 2 4 1 1 58 1 1
## 504 506 507 51 511 512 513 514 515 516 518
## 11 1 23 99 6 3 2 2 6 1 2
## 52 520 521 523 525 528 53 530 531 534 539
## 2 1 7 19 15 2 8 5 1 2 2
## 54 541 543 59 6 61 62 64 65 67 68
## 1 1 2 7 1 2 47 1 1 267 46
## 71 72 73 76 77 78 8 80 81 82 83
## 2 30 3 1 1 22 1 1 23 14 9
## 84 85 86 88 9 91 92 93 94 96 99
## 3 2 32 22 37 48 13 3 87 1 12
## NULL
## 0
# Note: as.numeric() on a factor returns the internal level index, not the
# original company code, so the bins below group level indices.
cleaned.df$company1 <- as.numeric(cleaned.df$company)
cleaned.df$company_category <- cut(cleaned.df$company1, breaks = c(-Inf, 92, 94, 96, 98, 100, 102, Inf), labels = c("90-92", "93-94", "95-96", "97-98", "99-100", "100-101", "101-102"))
# Visualizing the binned company categories
ggplot(cleaned.df, aes(x = company_category, fill = company_category)) +
geom_bar() +
labs(title = "Company code Frequency Distribution", x = "Company Category", y = "Frequency") +
theme_minimal()
# Further analyzing the data
# Deposit Type
table(cleaned.df$deposit_type)
##
## No Deposit Non Refund Refundable
## 104641 14587 162
t <- 104641 + 14587 + 162
(n <- 104641/t * 100)
## [1] 87.64637
(nr <- 14587/t * 100)
## [1] 12.21794
(r <- 162/t * 100)
## [1] 0.1356898
# Load the RColorBrewer package
library(RColorBrewer)
# Define the data
categories <- c("No Deposit - 87.65%", "Non Refund - 12.22%", "Refundable - 0.14%")
values <- c(104641, 14587, 162)
# Create a color palette with three colors
colors <- brewer.pal(3, "Set3")
# Create the pie chart with custom colors
pie(values, labels = categories, main = "Distribution of Deposit Type", col = colors)
# The majority of bookings (87.65%) were made without any deposit, while a smaller portion (12.22%) involved a non-refundable deposit equal to or exceeding the total stay cost. A negligible percentage (0.14%) of bookings were made with a refundable deposit amounting to less than the total stay cost.
# Integer Variables
# Adults
summary(cleaned.df$adults)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 2.000 1.856 2.000 55.000
plot(cleaned.df$arrival_date_month, cleaned.df$adults, xlab = "Months", ylab = "Number of adults coming at once", las = 2)
# There are a few outliers, but we cannot remove them because together they form a sizable chunk of the dataset. They also demonstrate the hotel's ability to cater to large groups arriving at once.
# Children
summary(cleaned.df$children)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1039 0.0000 10.0000 4
plot(cleaned.df$arrival_date_month, cleaned.df$children, xlab = "Months", ylab = "Number of children coming at once", las = 2)
# Stays in week nights
summary(cleaned.df$stays_in_week_nights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 1.0 2.0 2.5 3.0 50.0
boxplot(summary(cleaned.df$stays_in_week_nights), main = "Statistics summary for week nights", xlab = "Stats", ylab = "Stays in week nights")
# One outlier; most bookings are for fewer than 10 week nights. Very few bookings exceed 45 nights, approaching the maximum of 50.
# Stays in weekend nights
summary(cleaned.df$stays_in_weekend_nights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.9276 2.0000 19.0000
boxplot(summary(cleaned.df$stays_in_weekend_nights), main = "Statistics summary for weekend nights", xlab = "Stats", ylab = "Stays in weekend nights")
# One outlier; most bookings are for fewer than 5 weekend nights. Very few bookings are for more than 15 weekend nights.
# Month Variable
table(cleaned.df$arrival_date_month)
##
## April August December February January July June March
## 11089 13877 6780 8068 5929 12661 10939 9794
## May November October September
## 11791 6794 11160 10508
barplot(sort(table(cleaned.df$arrival_date_month), decreasing = TRUE), main = "Distribution of bookings over various months", ylab = "Distribution", las = 2, col = c("lightblue"))
# July and August show the highest number of bookings across the entire three-year span.
# Year
(table(cleaned.df$arrival_date_year))
##
## 2015 2016 2017
## 21996 56707 40687
x <- as.numeric(names(table(cleaned.df$arrival_date_year)))
class(x)
## [1] "numeric"
y <- as.numeric(c("21996", "56707", "40687"))
class(y)
## [1] "numeric"
plot(x, y, type = "o", main = "Distribution of bookings over various years", xlab = "Year", ylab = "Distribution")
# We can see that bookings dipped in 2017, with 2016 being the year with the most hotel bookings.
table(cleaned.df$hotel)
##
## City Hotel Resort Hotel
## 79330 40060
# Percentage of hotels vs resorts:
(hotels = sum(cleaned.df$hotel == "City Hotel")/length(cleaned.df$hotel) * 100)
## [1] 66.4461
(resorts = sum(cleaned.df$hotel == "Resort Hotel")/length(cleaned.df$hotel) * 100)
## [1] 33.5539
# 66.46% of guests are visiting the City Hotel, whereas 33.55% are visiting the Resort Hotel.
# Visualization
stayat <- c("Hotel visitors - 66.46%" = hotels, "Resort Visitors - 33.55%" = resorts)
pie(stayat, main = "Distribution of guests between the City Hotel and the Resort Hotel", col = c("lightblue", "lightgreen"))
# At this stage, we have dropped all the unnecessary categorical variables, and we have divided the dataset into two parts, since the hotel data covers two types of stay: Resort Hotel and City Hotel.
# In the previous phase, we had already removed and replaced NULL and undefined values to get a clean dataset.
### RESORT HOTEL DATASET
# Converting the categorical variable to numeric
cleaned.df$is_canceled <- as.numeric(cleaned.df$is_canceled)
View(cleaned.df)
# Running logistic regression on all 32 variables is time-consuming and exceeds our system's capacity.
# Filtering the dataset, removing the unnecessary columns
finaldf <- subset(cleaned.df, select = -c(4, 5, 6, 7, 13, 14, 15, 16, 20, 21, 23, 24, 25, 27, 31, 32, 33, 34, 35, 36))
head(finaldf)
## hotel is_canceled lead_time stays_in_weekend_nights
## 1 Resort Hotel 0 342 0
## 2 Resort Hotel 0 737 0
## 3 Resort Hotel 0 7 0
## 4 Resort Hotel 0 13 0
## 5 Resort Hotel 0 14 0
## 6 Resort Hotel 0 14 0
## stays_in_week_nights adults children babies is_repeated_guest
## 1 0 2 0 0 0
## 2 0 2 0 0 0
## 3 1 1 0 0 0
## 4 1 1 0 0 0
## 5 2 2 0 0 0
## 6 2 2 0 0 0
## previous_cancellations previous_bookings_not_canceled booking_changes
## 1 0 0 3
## 2 0 0 4
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## days_in_waiting_list adr required_car_parking_spaces
## 1 0 0 0
## 2 0 0 0
## 3 0 75 0
## 4 0 75 0
## 5 0 98 0
## 6 0 98 0
## total_of_special_requests
## 1 0
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
str(finaldf)
## 'data.frame': 119390 obs. of 16 variables:
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ is_canceled : num 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
# Association between the binary (0/1) variables: cross-tabulating cancellation against repeated-guest status
count = table(finaldf$is_canceled, finaldf$is_repeated_guest, dnn = c("iscancelled", "isrepeatedguest"))
round(prop.table(count, margin = 2), 2)
## isrepeatedguest
## iscancelled 0 1
## 0 0.62 0.86
## 1 0.38 0.14
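The proportion table suggests repeated guests cancel far less often (14% vs. 38%). A chi-squared test of independence is a standard follow-up, not performed in the original analysis; the counts below are illustrative stand-ins that roughly mirror the proportions above, not the dataset's actual cross-tabulation:

```r
# 2x2 contingency table: rows = cancelled (0/1), columns = repeated guest (0/1).
# Counts are illustrative, chosen to roughly match the proportions reported above.
count <- matrix(c(900, 550, 130, 20), nrow = 2,
                dimnames = list(iscancelled = c("0", "1"),
                                isrepeatedguest = c("0", "1")))
chisq.test(count)  # a small p-value indicates the two variables are associated
```

On the real `count` table built above, the same call would quantify whether the cancellation gap between repeat and first-time guests is statistically significant.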
# Correlation Matrix for final df
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.3.3
head(finaldf)
## hotel is_canceled lead_time stays_in_weekend_nights
## 1 Resort Hotel 0 342 0
## 2 Resort Hotel 0 737 0
## 3 Resort Hotel 0 7 0
## 4 Resort Hotel 0 13 0
## 5 Resort Hotel 0 14 0
## 6 Resort Hotel 0 14 0
## stays_in_week_nights adults children babies is_repeated_guest
## 1 0 2 0 0 0
## 2 0 2 0 0 0
## 3 1 1 0 0 0
## 4 1 1 0 0 0
## 5 2 2 0 0 0
## 6 2 2 0 0 0
## previous_cancellations previous_bookings_not_canceled booking_changes
## 1 0 0 3
## 2 0 0 4
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## days_in_waiting_list adr required_car_parking_spaces
## 1 0 0 0
## 2 0 0 0
## 3 0 75 0
## 4 0 75 0
## 5 0 98 0
## 6 0 98 0
## total_of_special_requests
## 1 0
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
# converting hotel column into a dummy variable
finaldf$hotel = ifelse(finaldf$hotel == "Resort Hotel", 1, 0)
str(finaldf)
## 'data.frame': 119390 obs. of 16 variables:
## $ hotel : num 1 1 1 1 1 1 1 1 1 1 ...
## $ is_canceled : num 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
# Correlation Matrix
ggcorrplot(cor(finaldf), type = "lower", lab = TRUE,
method = "circle", title = "Correlation Matrix Heatmap")
# The correlation heatmap shows no strong pairwise correlations among the predictors, so multicollinearity is not a concern; this is a good starting point for regression.
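The heatmap only shows pairwise correlations; collinearity can also arise from a combination of predictors. As a complementary check, variance inflation factors (VIFs) can be computed in base R. A minimal sketch, where the helper `vif_check` is illustrative and not part of the analysis above:

```r
# VIF for each column of a numeric data frame: regress the column on all
# the other columns and compute 1 / (1 - R^2).
# Rule of thumb: VIF > 5 indicates problematic collinearity.
vif_check <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# e.g. vif_check(finaldf[sapply(finaldf, is.numeric)])
```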
# Dividing the data into a training dataset and a testing dataset:
# Randomizing the dataset
finaldf.randomized = finaldf[sample(nrow(finaldf), replace = FALSE), ]
# Determine the number of rows in the dataset
nrows = nrow(finaldf.randomized)
# Calculate the number of rows for the training set (80%)
trainsize = round(0.8 * nrows)
# Create the training and testing sets
finaldf.train = finaldf.randomized[1:trainsize, ]
finaldf.test = finaldf.randomized[(trainsize + 1): nrows, ]
# Logistic regression on the training dataset
mod1 = glm(is_canceled ~ ., data = finaldf.train, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(mod1)
##
## Call:
## glm(formula = is_canceled ~ ., family = "binomial", data = finaldf.train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.412e+00 3.391e-02 -41.640 < 2e-16 ***
## hotel -2.454e-01 1.799e-02 -13.640 < 2e-16 ***
## lead_time 4.780e-03 7.918e-05 60.363 < 2e-16 ***
## stays_in_weekend_nights -5.658e-03 8.863e-03 -0.638 0.523
## stays_in_week_nights 2.313e-02 4.814e-03 4.805 1.54e-06 ***
## adults 1.104e-01 1.652e-02 6.685 2.30e-11 ***
## children 8.048e-02 2.050e-02 3.925 8.67e-05 ***
## babies 9.270e-02 8.930e-02 1.038 0.299
## is_repeated_guest -1.071e+00 9.440e-02 -11.350 < 2e-16 ***
## previous_cancellations 2.939e+00 6.358e-02 46.227 < 2e-16 ***
## previous_bookings_not_canceled -5.755e-01 2.917e-02 -19.729 < 2e-16 ***
## booking_changes -6.409e-01 1.762e-02 -36.363 < 2e-16 ***
## days_in_waiting_list -1.923e-03 4.006e-04 -4.799 1.59e-06 ***
## adr 6.379e-03 1.896e-04 33.642 < 2e-16 ***
## required_car_parking_spaces -4.044e+01 6.850e+01 -0.590 0.555
## total_of_special_requests -7.142e-01 1.156e-02 -61.803 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 125824 on 95510 degrees of freedom
## Residual deviance: 99795 on 95495 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 99827
##
## Number of Fisher Scoring iterations: 14
finaldf.train$is_canceled = as.factor(finaldf.train$is_canceled)
head(finaldf.train)
## hotel is_canceled lead_time stays_in_weekend_nights stays_in_week_nights
## 59829 0 1 482 0 2
## 94226 0 0 4 1 1
## 94020 0 0 25 0 1
## 6126 1 1 101 0 3
## 114184 0 0 124 2 5
## 30682 1 0 0 0 1
## adults children babies is_repeated_guest previous_cancellations
## 59829 2 0 0 0 0
## 94226 1 0 0 1 0
## 94020 2 0 0 0 0
## 6126 2 0 0 0 0
## 114184 2 0 0 0 0
## 30682 1 0 0 0 0
## previous_bookings_not_canceled booking_changes days_in_waiting_list
## 59829 0 0 0
## 94226 8 0 0
## 94020 0 0 0
## 6126 0 0 0
## 114184 0 0 0
## 30682 0 0 0
## adr required_car_parking_spaces total_of_special_requests
## 59829 62.80 0 0
## 94226 65.00 0 1
## 94020 101.00 0 1
## 6126 58.00 0 1
## 114184 136.29 0 0
## 30682 38.00 1 1
# Remove column "previous_bookings_not_canceled" from finaldf.train
finaldf.train <- finaldf.train[, !colnames(finaldf.train) %in% "previous_bookings_not_canceled"]
# Display the modified dataframe
head(finaldf.train)
## hotel is_canceled lead_time stays_in_weekend_nights stays_in_week_nights
## 59829 0 1 482 0 2
## 94226 0 0 4 1 1
## 94020 0 0 25 0 1
## 6126 1 1 101 0 3
## 114184 0 0 124 2 5
## 30682 1 0 0 0 1
## adults children babies is_repeated_guest previous_cancellations
## 59829 2 0 0 0 0
## 94226 1 0 0 1 0
## 94020 2 0 0 0 0
## 6126 2 0 0 0 0
## 114184 2 0 0 0 0
## 30682 1 0 0 0 0
## booking_changes days_in_waiting_list adr required_car_parking_spaces
## 59829 0 0 62.80 0
## 94226 0 0 65.00 0
## 94020 0 0 101.00 0
## 6126 0 0 58.00 0
## 114184 0 0 136.29 0
## 30682 0 0 38.00 1
## total_of_special_requests
## 59829 0
## 94226 1
## 94020 1
## 6126 1
## 114184 0
## 30682 1
mod2 = glm(is_canceled ~ ., data = finaldf.train, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(mod2)
##
## Call:
## glm(formula = is_canceled ~ ., family = "binomial", data = finaldf.train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.442e+00 3.393e-02 -42.504 < 2e-16 ***
## hotel -2.585e-01 1.791e-02 -14.437 < 2e-16 ***
## lead_time 4.919e-03 7.883e-05 62.399 < 2e-16 ***
## stays_in_weekend_nights -6.065e-03 8.818e-03 -0.688 0.492
## stays_in_week_nights 2.333e-02 4.794e-03 4.867 1.13e-06 ***
## adults 1.336e-01 1.662e-02 8.037 9.19e-16 ***
## children 8.241e-02 2.054e-02 4.013 6.00e-05 ***
## babies 1.097e-01 8.869e-02 1.237 0.216
## is_repeated_guest -1.829e+00 8.030e-02 -22.781 < 2e-16 ***
## previous_cancellations 2.074e+00 4.701e-02 44.112 < 2e-16 ***
## booking_changes -6.569e-01 1.769e-02 -37.133 < 2e-16 ***
## days_in_waiting_list -2.035e-03 4.009e-04 -5.077 3.83e-07 ***
## adr 6.331e-03 1.897e-04 33.369 < 2e-16 ***
## required_car_parking_spaces -2.917e+01 3.580e+01 -0.815 0.415
## total_of_special_requests -7.295e-01 1.153e-02 -63.242 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 125824 on 95510 degrees of freedom
## Residual deviance: 100949 on 95496 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 100979
##
## Number of Fisher Scoring iterations: 14
# How well does the model fit the training data?
predicted.train.prob.resort = predict(mod2, newdata = finaldf.train, type = "response")
levels(factor(finaldf.train$is_canceled))
## [1] "0" "1"
# Use cutoff point of 0.5 to convert the probabilities to class
predicted.train.class.resort = ifelse(predicted.train.prob.resort > 0.5, 1, 0)
#Building the confusion matrix
library(caret)
## Loading required package: lattice
confusionMatrix(as.factor(predicted.train.class.resort),as.factor(finaldf.train$is_canceled))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 52663 17994
## 1 7557 17297
##
## Accuracy : 0.7325
## 95% CI : (0.7297, 0.7353)
## No Information Rate : 0.6305
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3884
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8745
## Specificity : 0.4901
## Pos Pred Value : 0.7453
## Neg Pred Value : 0.6959
## Prevalence : 0.6305
## Detection Rate : 0.5514
## Detection Prevalence : 0.7398
## Balanced Accuracy : 0.6823
##
## 'Positive' Class : 0
##
# Accuracy: 73.25% of the model's predictions on the training set are correct.
# 95% Confidence Interval (CI): the CI for the accuracy is (72.97%, 73.53%), meaning we are 95% confident the true accuracy falls within this range.
# Sensitivity (True Positive Rate): 87.45% of the 'positive' cases (class 0, i.e. not canceled) are identified correctly.
# Specificity (True Negative Rate): only 49.01% of the actual cancellations (class 1) are identified correctly.
# Interpretation: the model has high sensitivity but modest specificity, so it misses roughly half of the cancellations; it is biased towards predicting that a booking will not be canceled.
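For reference, the headline metrics can be reproduced by hand from the printed confusion matrix (note that caret treats class 0, not canceled, as the 'positive' class here):

```r
# Rebuild caret's training confusion matrix and derive the metrics manually
cm <- matrix(c(52663, 7557,     # reference class 0: predicted 0 / predicted 1
               17994, 17297),   # reference class 1: predicted 0 / predicted 1
             nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))
accuracy    <- sum(diag(cm)) / sum(cm)        # correct predictions / total
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # share of actual 0s predicted 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # share of actual 1s predicted 1
round(c(accuracy, sensitivity, specificity), 4)  # 0.7325 0.8745 0.4901
```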
# Checking the performance of the model on the testing dataset.
predicted.test.prob.resort = predict(mod1, newdata = finaldf.test, type = "response")
# First 5 actual and predicted records
data.frame(actual = finaldf.test$is_canceled[1:5], predicted = predicted.test.prob.resort[1:5])
## actual predicted
## 25155 0 0.1028243
## 113435 0 0.3159719
## 88879 0 0.3645734
## 46378 1 0.4697018
## 74978 1 0.9683660
# Use cutoff point of 0.5 to convert probabilities to class
predicted.test.class.resort = ifelse(predicted.test.prob.resort > 0.5, 1, 0)
# Confusion Matrix of the said data
confusionMatrix(as.factor(predicted.test.class.resort), as.factor(finaldf.test$is_canceled))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13091 4470
## 1 1855 4459
##
## Accuracy : 0.7351
## 95% CI : (0.7294, 0.7407)
## No Information Rate : 0.626
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3988
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8759
## Specificity : 0.4994
## Pos Pred Value : 0.7455
## Neg Pred Value : 0.7062
## Prevalence : 0.6260
## Detection Rate : 0.5483
## Detection Prevalence : 0.7355
## Balanced Accuracy : 0.6876
##
## 'Positive' Class : 0
##
# Accuracy: 73.51% of the model's predictions on the testing set are correct, essentially the same as on the training set, which suggests the model is not overfitting.
# 95% Confidence Interval (CI): the CI for the accuracy is (72.94%, 74.07%).
# Sensitivity (True Positive Rate): 87.59% of the 'positive' cases (class 0, not canceled) are identified correctly.
# Specificity (True Negative Rate): only 49.94% of the actual cancellations are identified correctly.
# Interpretation: as on the training data, high sensitivity paired with low specificity means many cancellations slip through as false "will not cancel" predictions; the model is biased towards the majority class.
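Since the low specificity persists on the test set, the default 0.5 cutoff is arguably too high for this imbalanced outcome. A common remedy is to pick the cutoff that maximizes Youden's J statistic (sensitivity + specificity - 1). A base-R sketch, where the helper `best_cutoff` is illustrative rather than part of the original analysis:

```r
# Scan candidate cutoffs and return the one maximizing Youden's J.
# `probs`: predicted probabilities of class 1; `labels`: the 0/1 ground truth.
best_cutoff <- function(probs, labels, grid = seq(0.05, 0.95, by = 0.01)) {
  j <- sapply(grid, function(t) {
    pred <- as.integer(probs > t)
    sens <- sum(pred == 1 & labels == 1) / sum(labels == 1)  # true positive rate
    spec <- sum(pred == 0 & labels == 0) / sum(labels == 0)  # true negative rate
    sens + spec - 1
  })
  grid[which.max(j)]
}

# e.g. best_cutoff(predicted.test.prob.resort, finaldf.test$is_canceled)
```

The pROC package offers the same idea via `coords(roc_obj, "best", best.method = "youden")`.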
# Compute the AUC for the logistic regression model
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Compute ROC curve
roc.resort <- roc(finaldf.test$is_canceled, predicted.test.prob.resort)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Compute AUC
auc_value <- round(auc(roc.resort), 3)
# Plot ROC curve
plot(roc.resort)
# Add AUC value to the plot
text(0.8, 0.2, paste("AUC =", auc_value), adj = c(0, 1), cex = 1.2)
# The area under the curve is 0.7995, which is acceptable but leaves room for improvement.
# Run the CART algorithm
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
model.cart = rpart(is_canceled ~ ., data = finaldf.train, method = "class")
prp(model.cart)
# Make predictions with default cut-off point of 0.5
predict_class.cart = predict(model.cart, newdata = finaldf.test, type = "class")
# Build the confusion Matrix
library(caret)
confusionMatrix(data = predict_class.cart, reference = as.factor(finaldf.test$is_canceled))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13242 4101
## 1 1704 4831
##
## Accuracy : 0.7569
## 95% CI : (0.7514, 0.7623)
## No Information Rate : 0.6259
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4512
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8860
## Specificity : 0.5409
## Pos Pred Value : 0.7635
## Neg Pred Value : 0.7393
## Prevalence : 0.6259
## Detection Rate : 0.5546
## Detection Prevalence : 0.7263
## Balanced Accuracy : 0.7134
##
## 'Positive' Class : 0
##
# Make predictions with a different cut-off point
predict_prob.cart = predict(model.cart, newdata = finaldf.test, type = "prob")
predict_class.cart = ifelse(predict_prob.cart[,2] >= 0.7, 1, 0)
# Confusion Matrix for the different cut-off point
levels(predict_class.cart)
## NULL
levels(finaldf.test$is_canceled)
## NULL
library(caret)
confusionMatrix(data = as.factor(predict_class.cart), reference = as.factor(finaldf.test$is_canceled))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 14648 6856
## 1 298 2076
##
## Accuracy : 0.7004
## 95% CI : (0.6945, 0.7062)
## No Information Rate : 0.6259
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2493
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9801
## Specificity : 0.2324
## Pos Pred Value : 0.6812
## Neg Pred Value : 0.8745
## Prevalence : 0.6259
## Detection Rate : 0.6135
## Detection Prevalence : 0.9006
## Balanced Accuracy : 0.6062
##
## 'Positive' Class : 0
##
# Compute AUC
library(pROC)
roc.curve = roc(response = finaldf.test$is_canceled, predictor = predict_prob.cart[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc.curve)
## Area under the curve: 0.7854
plot(roc.curve)
# The area under the curve is 0.7854, slightly below the 0.7995 of the logistic regression model.
# Next, tune the complexity parameter with cross-validation to reduce overfitting.
# Define cross-validation experiment
library(caret)
library(e1071)
numFolds = trainControl(method = "cv", number = 5)
cpGrid = expand.grid(.cp = seq(0.001, 0.5,0.005))
# Perform cross-validation
finaldf.train = finaldf.train[complete.cases(finaldf.train),]
train(is_canceled ~ ., data = finaldf.train, method = "rpart", trControl = numFolds, tuneGrid = cpGrid)
## CART
##
## 95511 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 76409, 76409, 76409, 76409, 76408
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001 0.7722147 0.48669666
## 0.006 0.7557769 0.44151257
## 0.011 0.7445320 0.43309321
## 0.016 0.7445320 0.43309321
## 0.021 0.7411396 0.43375148
## 0.026 0.7321251 0.43128498
## 0.031 0.7321251 0.43128498
## 0.036 0.7321251 0.43128498
## 0.041 0.7321251 0.43128498
## 0.046 0.7321251 0.43128498
## 0.051 0.7198123 0.37280794
## 0.056 0.6961923 0.25875287
## 0.061 0.6961923 0.25875287
## 0.066 0.6961923 0.25875287
## 0.071 0.6961923 0.25875287
## 0.076 0.6961923 0.25875287
## 0.081 0.6961923 0.25875287
## 0.086 0.6961923 0.25875287
## 0.091 0.6706243 0.16502847
## 0.096 0.6569189 0.09055975
## 0.101 0.6569189 0.09055975
## 0.106 0.6569189 0.09055975
## 0.111 0.6569189 0.09055975
## 0.116 0.6569189 0.09055975
## 0.121 0.6569189 0.09055975
## 0.126 0.6305033 0.00000000
## 0.131 0.6305033 0.00000000
## 0.136 0.6305033 0.00000000
## 0.141 0.6305033 0.00000000
## 0.146 0.6305033 0.00000000
## 0.151 0.6305033 0.00000000
## 0.156 0.6305033 0.00000000
## 0.161 0.6305033 0.00000000
## 0.166 0.6305033 0.00000000
## 0.171 0.6305033 0.00000000
## 0.176 0.6305033 0.00000000
## 0.181 0.6305033 0.00000000
## 0.186 0.6305033 0.00000000
## 0.191 0.6305033 0.00000000
## 0.196 0.6305033 0.00000000
## 0.201 0.6305033 0.00000000
## 0.206 0.6305033 0.00000000
## 0.211 0.6305033 0.00000000
## 0.216 0.6305033 0.00000000
## 0.221 0.6305033 0.00000000
## 0.226 0.6305033 0.00000000
## 0.231 0.6305033 0.00000000
## 0.236 0.6305033 0.00000000
## 0.241 0.6305033 0.00000000
## 0.246 0.6305033 0.00000000
## 0.251 0.6305033 0.00000000
## 0.256 0.6305033 0.00000000
## 0.261 0.6305033 0.00000000
## 0.266 0.6305033 0.00000000
## 0.271 0.6305033 0.00000000
## 0.276 0.6305033 0.00000000
## 0.281 0.6305033 0.00000000
## 0.286 0.6305033 0.00000000
## 0.291 0.6305033 0.00000000
## 0.296 0.6305033 0.00000000
## 0.301 0.6305033 0.00000000
## 0.306 0.6305033 0.00000000
## 0.311 0.6305033 0.00000000
## 0.316 0.6305033 0.00000000
## 0.321 0.6305033 0.00000000
## 0.326 0.6305033 0.00000000
## 0.331 0.6305033 0.00000000
## 0.336 0.6305033 0.00000000
## 0.341 0.6305033 0.00000000
## 0.346 0.6305033 0.00000000
## 0.351 0.6305033 0.00000000
## 0.356 0.6305033 0.00000000
## 0.361 0.6305033 0.00000000
## 0.366 0.6305033 0.00000000
## 0.371 0.6305033 0.00000000
## 0.376 0.6305033 0.00000000
## 0.381 0.6305033 0.00000000
## 0.386 0.6305033 0.00000000
## 0.391 0.6305033 0.00000000
## 0.396 0.6305033 0.00000000
## 0.401 0.6305033 0.00000000
## 0.406 0.6305033 0.00000000
## 0.411 0.6305033 0.00000000
## 0.416 0.6305033 0.00000000
## 0.421 0.6305033 0.00000000
## 0.426 0.6305033 0.00000000
## 0.431 0.6305033 0.00000000
## 0.436 0.6305033 0.00000000
## 0.441 0.6305033 0.00000000
## 0.446 0.6305033 0.00000000
## 0.451 0.6305033 0.00000000
## 0.456 0.6305033 0.00000000
## 0.461 0.6305033 0.00000000
## 0.466 0.6305033 0.00000000
## 0.471 0.6305033 0.00000000
## 0.476 0.6305033 0.00000000
## 0.481 0.6305033 0.00000000
## 0.486 0.6305033 0.00000000
## 0.491 0.6305033 0.00000000
## 0.496 0.6305033 0.00000000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001.
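An already-grown rpart tree can also be cut back to any complexity parameter with `prune()`, without refitting. A minimal sketch on R's built-in iris data rather than the hotel data:

```r
library(rpart)

# Grow a deliberately deep tree, then prune it: prune() removes every split
# whose complexity contribution falls below the supplied cp
full_tree   <- rpart(Species ~ ., data = iris, method = "class", cp = 0.001)
pruned_tree <- prune(full_tree, cp = 0.05)

# The pruned tree can never have more nodes than the full tree
nrow(pruned_tree$frame) <= nrow(full_tree$frame)  # TRUE
```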
# Creating a new CART model; cp = 0.006 is used rather than the CV-optimal 0.001, trading a little accuracy for a simpler tree
predict_class.cart.CV = rpart(is_canceled ~ ., data = finaldf.train, method = "class", cp = 0.006)
# Make Predictions
predict_class.cart = predict(predict_class.cart.CV, newdata = finaldf.test, type = "class")
levels(predict_class.cart)
## [1] "0" "1"
levels(finaldf.test$is_canceled)
## NULL
confusionMatrix(data = predict_class.cart, reference = as.factor(finaldf.test$is_canceled))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13242 4101
## 1 1704 4831
##
## Accuracy : 0.7569
## 95% CI : (0.7514, 0.7623)
## No Information Rate : 0.6259
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4512
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8860
## Specificity : 0.5409
## Pos Pred Value : 0.7635
## Neg Pred Value : 0.7393
## Prevalence : 0.6259
## Detection Rate : 0.5546
## Detection Prevalence : 0.7263
## Balanced Accuracy : 0.7134
##
## 'Positive' Class : 0
##
# RANDOM FOREST
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
sum(is.na(finaldf.train))
## [1] 0
finaldf.train = na.omit(finaldf.train)
finaldf.train$is_canceled = as.factor(finaldf.train$is_canceled)
# Building Random Forest
model.rf = randomForest(is_canceled ~ ., data = finaldf.train, ntree = 50, importance = TRUE, mtry = 5)
# Make Predictions
predict_class.rf = predict(model.rf, newdata = finaldf.test)
levels(predict_class.rf)
## [1] "0" "1"
levels(finaldf.test$is_canceled)
## NULL
confusionMatrix(data = predict_class.rf, reference =as.factor(finaldf.test$is_canceled))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13663 2699
## 1 1283 6230
##
## Accuracy : 0.8332
## 95% CI : (0.8284, 0.8379)
## No Information Rate : 0.626
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6321
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9142
## Specificity : 0.6977
## Pos Pred Value : 0.8350
## Neg Pred Value : 0.8292
## Prevalence : 0.6260
## Detection Rate : 0.5723
## Detection Prevalence : 0.6853
## Balanced Accuracy : 0.8059
##
## 'Positive' Class : 0
##
# Random Forest with cross-validation
#search_grid = expand.grid(.mtry = c(1:14))
#train(is_canceled ~ ., data = finaldf.train, method = "rf", trControl = numFolds, tuneGrid = search_grid, ntree = 50, nodesize = 10)
# Variable Importance
varImpPlot(model.rf, type = 1)
# Calculate probabilities for the positive class
probabilities <- predict(model.rf, newdata = finaldf.test, type = "prob")[, "1"]
# Create ROC curve object
roc_obj <- roc(finaldf.test$is_canceled, probabilities)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_obj, main = "ROC Curve for Random Forest Model",
xlab = "False Positive Rate", ylab = "True Positive Rate")
# Add AUC value to the plot
auc_val <- auc(roc_obj)
legend("bottomright", legend = paste("AUC =", round(auc_val, 2)), bty = "n")
# Print AUC value
print(paste("AUC:", round(auc_val, 2)))
## [1] "AUC: 0.89"
# K-means Clustering
# Hierarchical clustering would require a full Euclidean distance matrix, which is infeasible at this dataset's size; it is only sketched on a sample below, and we proceed with k-means instead.
# Standardize the input variables
# Converting the Y variable from factor to numeric (the factor levels map to 1/2, which is harmless once standardized)
finaldf.train$is_canceled = as.numeric(finaldf.train$is_canceled)
finaldf.train.std = sapply(finaldf.train, scale)
str(finaldf.train)
## 'data.frame': 95511 obs. of 15 variables:
## $ hotel : num 0 0 0 1 0 1 0 0 0 1 ...
## $ is_canceled : num 2 1 1 2 1 1 2 2 2 1 ...
## $ lead_time : int 482 4 25 101 124 0 69 120 133 75 ...
## $ stays_in_weekend_nights : int 0 1 0 0 2 0 2 2 0 2 ...
## $ stays_in_week_nights : int 2 1 1 3 5 1 4 3 3 1 ...
## $ adults : int 2 1 2 2 2 1 1 1 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ is_repeated_guest : int 0 1 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ booking_changes : int 0 0 0 0 0 0 0 2 0 0 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ adr : num 62.8 65 101 58 136.3 ...
## $ required_car_parking_spaces: int 0 0 0 0 0 1 0 0 0 0 ...
## $ total_of_special_requests : int 0 1 1 1 0 1 0 0 0 0 ...
# Carry over the row names from the training data
# row.names(finaldf.train.std) = row.names(finaldf.train)
# Compute standardized distance
# df.dist = finaldf.train.std[sample(nrow(finaldf.train.std), nrow(finaldf.train.std) * 0.25), ]
# dist.std = dist(df.dist, method = "euclidean")
# hClustering = hclust(dist.std, method = "ward.D2")
# Non-hierarchical
narows = apply(finaldf.train.std, 1, function(row) any(is.na(row) | is.nan(row) | is.infinite(row)))
# Remove rows
finaldf.train.std = finaldf.train.std[!narows, ]
# K-means clustering
k = 7
set.seed(123)
kmClustering = kmeans(finaldf.train.std, centers = k)
# Show Cluster membership
s = kmClustering$cluster
# Cluster statistics
kmClustering$centers
## hotel is_canceled lead_time stays_in_weekend_nights
## 1 1.16374757 -0.105647455 0.4176796 1.51111170
## 2 -0.40946745 -0.582914641 -0.2059924 -0.05346063
## 3 0.66277133 -0.681514538 -0.5946749 -0.27289318
## 4 0.01563604 -0.759758988 -0.4471041 -0.28067558
## 5 0.33522331 0.005767307 -0.1379478 0.21627090
## 6 -0.51865283 0.909807385 2.0019031 -0.39834349
## 7 -0.29783452 1.306279525 -0.1089238 -0.13787909
## stays_in_week_nights adults children babies is_repeated_guest
## 1 1.8264433 0.14678106 -0.1832751 -0.07007546 -0.1614260
## 2 -0.1256073 0.48264076 -0.1434408 -0.07958246 -0.1813142
## 3 -0.3307525 -0.28591340 -0.1466892 -0.03802210 1.9490224
## 4 -0.3598094 -0.44526020 -0.2223697 -0.07968009 -0.1813142
## 5 0.2210426 0.19090650 3.5032865 1.37204260 -0.1572152
## 6 -0.2173627 0.08307433 -0.2505135 -0.08010403 -0.1016734
## 7 -0.1900797 0.01336179 -0.1982927 -0.08010403 -0.1813142
## previous_cancellations booking_changes days_in_waiting_list adr
## 1 -0.07524925 0.24424879 -0.11731732 -0.1405052
## 2 -0.09678149 -0.06457517 -0.12370772 0.3296197
## 3 0.08688877 0.14148488 -0.12709505 -0.2164521
## 4 -0.10059071 0.09659335 -0.09278039 -0.3701471
## 5 -0.09547695 0.38369645 -0.12888438 1.4470065
## 6 0.61588641 -0.21136720 1.08881070 -0.4017119
## 7 -0.03301575 -0.21919510 -0.11280256 0.0905418
## required_car_parking_spaces total_of_special_requests
## 1 -0.2063743 0.05352444
## 2 -0.2520843 1.16071374
## 3 2.5712547 0.20207500
## 4 -0.2546143 -0.49924713
## 5 0.1833781 0.38168342
## 6 -0.2537052 -0.58176909
## 7 -0.2546143 -0.44444007
kmClustering$size
## [1] 10025 19276 7645 23715 4964 8941 20945
# Define a function to score observations by their distance to the nearest
# cluster centroid: observations close to a centroid get a higher score
assign_probabilities <- function(data, kmClustering) {
  centers <- kmClustering$centers
  k <- nrow(centers)
  # Pairwise distances; keep only the data-row-to-centroid block
  d <- as.matrix(stats::dist(rbind(centers, data)))[-(1:k), 1:k, drop = FALSE]
  # Distance from each observation to its nearest centroid
  min_dist <- apply(d, 1, min)
  # Invert the distances (the +1 guards against division by zero) and normalize
  probabilities <- 1 / (1 + min_dist)
  probabilities / sum(probabilities)
}
# Apply the function to the training data
# train_probabilities <- assign_probabilities(finaldf.train.std, kmClustering)
# Create ROC curve object
# roc_obj <- roc(finaldf.train.std[, "is_canceled"], train_probabilities)
# Plot ROC curve
# plot(roc_obj, main = "ROC Curve for K-means Clustering",
# xlab = "False Positive Rate", ylab = "True Positive Rate")
# Add AUC value to the plot
# auc_val <- auc(roc_obj)
# legend("bottomright", legend = paste("AUC =", round(auc_val, 2)), bty = "n")
# Print AUC value
# print(paste("AUC:", round(auc_val, 2)))
# Plot an empty scatter plot
plot(c(0), xaxt = 'n', ylab = "", type = "l", ylim = c(min(kmClustering$centers), max(kmClustering$centers)), xlim = c(0,8))
# label x-axis
axis(1, at = c(1:8), labels = names(finaldf)[1:8])
# Plot centroids
for (i in c(1:k)) lines(kmClustering$centers[i,], lty = i, lwd = 2, col = ifelse( i %in% c(1,3,5), "black", "dark grey"))
# Name Clusters
text(x = 0.5, y = kmClustering$centers[, 1], labels = paste("Cluster", c(1:k)))
# Determining the optimal value of k
# Calculate total within-cluster sum of squares (WSS) for different values of k
wss <- numeric(10)
for (i in 1:10) {
kmClustering <- kmeans(finaldf.train.std, centers = i)
wss[i] <- sum(kmClustering$withinss)
}
# Plot the elbow chart
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters (k)", ylab = "Total within-cluster sum of squares (WSS)",
main = "Elbow Method for Optimal k")
# Add a line to indicate the 'elbow'
elbow_point <- which(diff(wss) < mean(diff(wss)))[1]
points(elbow_point, wss[elbow_point], col = "red", cex = 2, pch = 19)
text(elbow_point, wss[elbow_point], labels = c("Elbow Point"), pos = 3, col = "red")