Data Mining Project - Hotel Demand

Introduction In today’s bustling hospitality landscape, managing hotel reservations efficiently is crucial for maximizing profits and ensuring guest satisfaction. However, the persistent challenge of reservation cancellations remains a headache for hotel owners, impacting their bottom line and complicating planning efforts. To address this issue head-on, our project aims to leverage data-driven methods to predict hotel reservation cancellations accurately. With various factors influencing booking decisions, from guest demographics to reservation specifics, understanding these dynamics can be both complex and rewarding. Our project delves into the “Hotel Demand” dataset to shed light on the factors driving demand in the hotel industry and the methodologies employed in modelling this phenomenon. Through our analysis, we seek to uncover patterns and correlations within the dataset to answer key questions such as what factors impact hotel demand and how various approaches contribute to model development. By exploring variables such as Average Daily Rate (ADR), booking channel, and lead time, we aim to discern the nuances of demand dynamics. Additionally, we’ll investigate factors like previous booking behaviour, deposit types, and special requests to gain insights into guest preferences and behaviours. Through this investigation, our aim is not only to address current queries but also to discover fresh insights that can improve our understanding of hotel demand and enhance predictive modelling techniques. By leveraging data-driven approaches, we strive to provide valuable insights that empower hoteliers to better manage reservations, optimize revenue generation, and enhance the guest experience.

Literature Review

  1. Using API with Logistic Regression Model to Predict Hotel Reservation Cancellation by Detecting the Cancellation Factors:

The project “Using API with Logistic Regression Model to Predict Hotel Reservation Cancellation by Detecting the Cancellation Factors” aims to develop a predictive model for hotel reservation cancellations by leveraging insights from existing literature on hotel cancellation policies, predictive modelling techniques, and logistic regression. By synthesizing key insights from studies investigating factors influencing cancellations and predictive modelling approaches, the project seeks to provide a comprehensive solution utilizing data mining techniques. Through the integration of logistic regression models into an API, hotel management systems will gain a predictive tool to assess the likelihood of reservation cancellations based on various parameters. This approach aligns with the broader goal of enhancing revenue management strategies, optimizing resource allocation, and improving customer service in the hospitality industry. By leveraging insights from diverse sources and disciplines, the project endeavours to offer a robust framework for decision-making, ultimately enhancing customer satisfaction and operational efficiency for hoteliers.

  1. An Analysis of Hotel Booking Cancellations and Factors Affecting Revenue Generation:

The study “An Analysis of Hotel Booking Cancellations and Factors Affecting Revenue Generation” by Aditya Dole explores the intricacies of cancelled hotel reservations in both City and Resort Hotel segments, aiming to identify causes and provide strategic guidance for revenue enhancement. Through a comprehensive literature review, previous research indicates the significant impact of factors such as pricing, location, and personal reasons on cancellation rates. Methodologically, the study adopts a systematic approach to data analysis, employing techniques like data pre-processing and advanced imputation methods. The research hypotheses focus on untangling factors influencing cancellations and orienting pricing decisions for hotel operators. Insights gleaned from the analysis underscore the importance of factors like booking lead time, client demographics, and accommodation types in shaping cancellation rates, suggesting tailored pricing strategies and customer-centric policies to mitigate revenue loss. Ultimately, the study highlights the critical role of data-driven decision-making in addressing challenges posed by booking cancellations and suggests future research directions to further advance understanding in this area. These findings closely align with our project’s goal of optimizing revenue management strategies in the hospitality sector, offering valuable insights that resonate with our objectives. Integrating insights from this research into our project framework will enable us to develop more robust predictive models and implement tailored pricing strategies, ultimately enhancing revenue generation and operational efficiency for hotels.

  1. Performance Comparison of Classification Algorithms in Hotel Booking Cancellation Prediction:

The research outlined in “Performance Comparison of Classification Algorithms in Hotel Booking Cancellation Prediction” directly correlates with our project’s focus on optimizing revenue management strategies in the hospitality sector. By evaluating various classification algorithms’ performance in predicting booking cancellations, the study underscores the critical role of accurate prediction in revenue management effectiveness. The findings, particularly the success of kNN and RF algorithms in achieving high accuracy rates, offer valuable insights into potential approaches for enhancing our project’s predictive models. Moreover, the exploration of dataset attributes and the impact of applying PCA on performance aligns with our project’s emphasis on leveraging diverse data sources and advanced techniques to improve prediction accuracy. The study evaluates six classification algorithms on a dataset spanning 2015-2017, with k-Nearest Neighbors (kNN) and Random Forest (RF) achieving the highest accuracy rates of 85%. With 119,390 reservation records and 30 attributes, including average daily room return and booking changes, the study employs cross-validation and Principal Component Analysis (PCA) to assess algorithm performance. Results align with our project’s focus on revenue management optimization in the hospitality sector, offering insights for enhancing predictive models and leveraging diverse data sources. The study’s emphasis on accuracy metrics, comparative analysis, and practical implications resonates with our project’s commitment to rigorous evaluation and real-world application. By leveraging insights from this research, our project aims to advance revenue management practices, driving improved profitability and operational efficiency for hotels.

  1. Identifying Critical Hotel Cancellations Using Artificial Intelligence:

The article titled “Identifying Critical Hotel Cancellations Using Artificial Intelligence” explores the application of artificial intelligence (AI) techniques to forecast hotel booking cancellations, offering significant opportunities for revenue optimization and management in the hospitality sector. Leveraging Personal Name Records (PNR) data, the study achieves promising results, with an impressive 80% accuracy in predicting cancellations made seven days in advance. This research addresses a crucial gap in existing literature by focusing on individual cancellations that have a substantial impact on revenue loss, particularly those occurring close to the time of service. The methodology employed simplifies data preparation and minimizes reliance on external data sources, enhancing its practical applicability for hotel managers. Practical implications of the study include the optimization of booking management systems and cancellation policies based on accurate forecasts, ultimately leading to improved revenue generation and enhanced customer satisfaction. Overall, this research underscores the potential of AI in addressing critical challenges in the hotel industry and emphasizes the importance of proactive revenue management strategies. This study aligns closely with our project on “Hotel Demand,” as it highlights the significance of accurately forecasting cancellations, which is essential for optimizing hotel operations and maximizing revenue. By leveraging AI techniques and PNR data, similar to the approach outlined in the article, our project aims to develop predictive models that can anticipate fluctuations in hotel demand. The methodologies and findings presented in this literature review provide valuable insights that will guide our project’s efforts to enhance revenue management strategies and improve customer satisfaction in the hospitality sector.

  1. Big Data in Hotel Revenue Management: Exploring Cancellation Drivers to Gain Insights Into Booking Cancellation Behavior:

The study “Big Data in Hotel Revenue Management: Exploring Cancellation Drivers to Gain Insights Into Booking Cancellation Behavior” directly aligns with our project objectives, which aim to optimize revenue management strategies in the hospitality industry. By addressing the challenge of booking cancellations through predictive analytics, the research offers valuable insights that resonate with our goals of improving demand forecasting accuracy and operational efficiency. Implementing similar machine learning techniques and leveraging diverse data sources will empower us to proactively identify and manage cancellations, thereby enhancing revenue generation and customer satisfaction. Additionally, the emphasis on dynamic cancellation policies underscores the importance of flexibility in adapting to changing booking patterns, a key aspect that we aim to integrate into our revenue management framework. Overall, this study provides a solid foundation and practical insights that inform and enrich our project’s approach to optimizing revenue management practices in the dynamic landscape of the hospitality sector.

References

• Almotiri, S. H., Alosaimi, N., & Abdullah, B. (2021). Using API with Logistic Regression Model to Predict Hotel Reservation Cancellation by Detecting the Cancellation Factors. International Journal of Advanced Computer Science & Applications (Online), 12(6). https://doi.org/10.14569/ijacsa.2021.0120688 • Dole, A. (2023). An analysis of hotel booking cancellations and factors affecting revenue generation. International Journal for Research in Applied Science and Engineering Technology, 11(10), 2141–2148. https://doi.org/10.22214/ijraset.2023.56407 • Tekin, M., & Gök, M. (2021, April 30). Performance comparison of classification algorithms in hotel booking cancellation prediction. https://dergipark.org.tr/en/pub/aita/issue/70741/1137735 • Mejia, C., Medina, A. J. S., & Pellejero, M. (2020). Identifying critical hotel cancellations using artificial intelligence. Tourism Management Perspectives (Print), 35, 100718. https://doi.org/10.1016/j.tmp.2020.100718 • António, N., De Almeida, A., & Nunes, L. (2019). Big Data in Hotel Revenue Management: Exploring cancellation drivers to gain insights into booking cancellation behavior. Cornell Hospitality Quarterly, 60(4), 298–319. https://doi.org/10.1177/1938965519851466

library(readr)
df = read.csv("Dataset 20-hotel_bookings.csv",  stringsAsFactors = TRUE)
str(df)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
##  $ country                       : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
##  $ market_segment                : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
##  $ distribution_channel          : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
##  $ assigned_room_type            : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ agent                         : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
##  $ company                       : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
##  $ reservation_status_date       : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...
head(df)
##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
## 6 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
## 6                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
## 6                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
## 6                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
## 6                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  NULL    NULL                    0     Transient
## 2               4   No Deposit  NULL    NULL                    0     Transient
## 3               0   No Deposit  NULL    NULL                    0     Transient
## 4               0   No Deposit   304    NULL                    0     Transient
## 5               0   No Deposit   240    NULL                    0     Transient
## 6               0   No Deposit   240    NULL                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
## 6  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
## 6              2015-07-03
# Replacing NULL values with most recurring values in the dataset.

cleaned.df = df
ourtable = table(cleaned.df$agent)
posmode = which.max(ourtable)
ourmode = names(posmode)
cleaned.df$agent[cleaned.df$agent == "NULL"] = ourmode

# Replacing NULL values with mode recurring values in the variable called "company"

ourmodeco = names(sort(table(cleaned.df$company), decreasing = TRUE))[2]
cleaned.df$company[cleaned.df$company == "NULL"] = ourmodeco
# Binary Variables:

# Getting to know the data

# Percentage of No cancellations and cancellations:

(showups = sum(cleaned.df$is_canceled == 0)/length(cleaned.df$is_canceled) * 100)
## [1] 62.95837
(noshowups = sum(cleaned.df$is_canceled == 1)/ length(cleaned.df$is_canceled) * 100)
## [1] 37.04163
# Visualization

pie_data <- c("Not Cancelled - 62.96%" = showups, "Cancelled - 37.04%" = noshowups)
pie(pie_data, main = "Cancellation Distribution - Guests who shows up and does not show up", col = c("lightblue", "lightgreen"))

# There are 62.96% people show up at the hotel without cancelling the booking and 37.04% guests cancel the bookings.

# Percentage of if the customer is a repeated guest or not:

(repeatedguest = sum(cleaned.df$is_repeated_guest == 1)/length(cleaned.df$is_repeated_guest) * 100)
## [1] 3.191222
(notrepeatedguest = sum(cleaned.df$is_repeated_guest == 0)/length(cleaned.df$is_repeated_guest) * 100)
## [1] 96.80878
# 96.81% guests were first time visitors and not repeated. Only 3.19% guests were repeated customers that speaks a lot about the services about the hotel. Why there are less or no repeated visitors?

# Visualization

pie_data_RG <- c("First time visitors - 96.81%" = notrepeatedguest, "Repeated Guests - 3.19%" = repeatedguest)
pie(pie_data_RG, main = "Distribution for guests if they are first time visitors or not", col = c("lightblue", "lightgreen"))

# Numerical Variables

# ADR

# Getting to know the variables, summary statistics:

summary(cleaned.df$adr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.38   69.29   94.58  101.83  126.00 5400.00
# ADR is Calculated by dividing the sum of all lodging transactions by the total number of staying nights. Basically average rate per night.

# Visualization

boxplot(summary(cleaned.df$adr),ylim = c(0, 6000), main = "Statistics summary for ADR", xlab = "Stats", ylab = "ADR")

# There is one outlier - for one observation, a guest has ADR of 5400 dollars. There is no enough data to support that one observation, we might decide to drop that particular row in the future.

boxplot(summary(cleaned.df$adr),ylim = c(0, 200), main = "Statistics summary for ADR", xlab = "Stats", ylab = "ADR")

# On an average, the rate for one night at the hotel/resort is little below 100 dollars.

# Lead time

# Summary Statistics

# Number of days that elapsed between the entering date of the booking into the PMS and the arrival date. It is basically the time from booking date till the guest is checking-in the hotel.

summary(cleaned.df$lead_time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      18      69     104     160     737
# Visualization

boxplot(summary(cleaned.df$lead_time), main = "Statistics summary for Lead Time", xlab = "Stats", ylab = "Lead Time")

# There are outliers in the data set.
# Categorical Variables

# Summary statistics

# Counting the frequency of each agent through whom the booking were made.

table(cleaned.df$agent)
## 
##     1    10   103   104   105   106   107    11   110   111   112   114   115 
##  7191   260    21    53    14     2     2   395    12    16    15     1   225 
##   117   118   119    12   121   122   126   127   128   129    13   132   133 
##     1    69   304   578    37     2    14    45    23    14    82   143    56 
##   134   135   138   139    14   141   142   143   144   146   147   148   149 
##   482     2   287     8  3640     6   137   172     1   124   156     4    28 
##    15   150   151   152   153   154   155   156   157   158   159    16   162 
##   402     5    56   183    25   193    94   190    61     1    89   246    37 
##   163   165   167   168    17   170   171   173   174   175   177   179   180 
##     7     1     3   184   241    93   607    29    22   195   347     2     4 
##   181   182   183   184   185   187    19   191   192   193   195   196   197 
##    59     8    45    52    78    24  1061   198    41    15   193   301     1 
##     2    20   201   205   208    21   210   211   213   214   215   216   219 
##   162   540    42    27   173   875     7     2     1     5    15     1    13 
##    22   220   223   227   229    23   232   234   235   236    24   240   241 
##   382   104    18     2   786    25     2   128    29   247    22 13922  1721 
##   242   243   244   245   247   248   249    25   250   251   252   253   254 
##   780   514     4    37     1   131    51     3  2870   220    29    87    29 
##   256   257   258    26   261   262   265   267   269    27   270   273   275 
##    24    24     3   401    38    22     1     1     2   450     6   349     8 
##   276   278    28   280   281   282   283   285   286   287   288   289    29 
##     8     1  1666     1    82     2     2     1    45     8    14     1   683 
##   290   291   294   295   296   298   299     3    30   300   301   302   303 
##    19     1     1     4    42   472     1  1336   484     1     1     3     2 
##   304   305   306   307   308    31   310   313   314   315    32   321   323 
##     1    45    35    14    54   162    25    36   927   284    15     3    25 
##   324   325   326   327   328    33   330   331   332   333   334   335   336 
##     9     6   165    20     9    31   125     2    55     1    28     4    23 
##   337   339    34   341   344   346   348    35   350   352   354   355   358 
##     1    77   294     4     8     1    22   109    28     1    14     4     1 
##   359    36   360   363   364   367   368    37   370   371   375   378    38 
##    21   100    15     6    19     1    45  1230     3     4    40    36   274 
##   384   385   387   388    39   390   391   393   394   397     4    40   403 
##     2    60    32     1   127    57     2    13    33     1    47  1039     4 
##   404   405   406   408    41   410   411   414   416   418    42   420   423 
##     2     5     1     1    75   133    16     2     1     8   211     3    19 
##   425   426   427   429   430   431   432   433   434   436   438    44   440 
##    16     3     3     5     4     1     1     1    33    49     2   292    56 
##   441   444   446   449    45   450   451   453   454   455   459   461   464 
##     7     1     1     2    32     1     1     1     2    19    16     2    98 
##   467   468   469    47   472   474   475   476   479   480   481   483   484 
##    39    49     2    50     1    17     8     2    32     1     8     1    11 
##   492   493   495   497     5    50   502   508   509   510    52   526   527 
##    28    35    57     1   330    20    24     6    10     2   137    10    35 
##    53   531   535    54    55    56    57    58    59     6    60    61    63 
##    18    68     3     1    16   375    28   335     1  3290    19     2    29 
##    64    66    67    68    69     7    70    71    72    73    74    75    77 
##    23    44   127   211    90  3539     1    73     6     1    20    73    33 
##    78    79     8    81    82    83    85    86    87    88    89     9    90 
##    37    47  1514     6    77   696   554   338    77    19    99 48301     1 
##    91    92    93    94    95    96    98    99  NULL 
##    58     7     1   114   135   537   124    68     0
cleaned.df$agent1 <- as.numeric(cleaned.df$agent)

cleaned.df$agent_category <- cut(cleaned.df$agent1, breaks = c(-Inf, 100, 200, 300, 400, Inf), labels = c("0-100", "100-200", "200-300", "300-400", "400+"))

# Visualization

library(ggplot2)

ggplot(cleaned.df, aes(x = agent_category, fill = agent_category)) +
  geom_bar() +
  labs(title = "Agent Code Frequency Distribution", x = "Agent Category", y = "Frequency") +
  theme_minimal()

# Counting the frequency of hotel bookings during each month of all years(2015-2017).

table(cleaned.df$arrival_date_month)
## 
##     April    August  December  February   January      July      June     March 
##     11089     13877      6780      8068      5929     12661     10939      9794 
##       May  November   October September 
##     11791      6794     11160     10508
library(ggplot2)
ggplot(cleaned.df, aes(x = arrival_date_month)) + geom_bar(fill = "lightblue", color = "black") + labs(title = "Months Distribution of the bookings", x = "Months(2015-2017)", y = "Frequency of bookings through agent") + theme_minimal() + scale_fill_brewer(palette = "Set3")

# Counting the frequency of hotel bookings from various companies.

table(cleaned.df$company)
## 
##     10    100    101    102    103    104    105    106    107    108    109 
##      1      1      1      1     16      1      8      2      9     11      1 
##     11    110    112    113    115    116    118     12    120    122    126 
##      1     52     13     36      4      6      7     14     14     18      1 
##    127    130    132    135    137    139     14    140    142    143    144 
##     15     12      1     66      4      3      9      1      1     17     27 
##    146    148    149    150    153    154    158    159     16    160    163 
##      3     37      5     19    215    133      2      6      5      1     17 
##    165    167    168    169    174    178    179     18    180    183    184 
##      3      7      2     65    149     27     24      1      5     16      1 
##    185    186    192    193    195    197     20    200    202    203    204 
##      4     12      4     16     38     47     50      3     38     13     34 
##    207    209    210    212    213    215    216    217    218    219     22 
##      9     19      2      1      1      8     21      2     43    141      6 
##    220    221    222    223    224    225    227    229    230    232    233 
##      4     27      2    784      3      7     24      1      3      2    114 
##    234    237    238    240    242    243    245    246    250    251    253 
##      1      1     33      3     62      2      3      3      2     18      1 
##    254    255    257    258    259    260    263    264    268    269    270 
##     10      6      1      1      2      3     14      2     14     33     43 
##    271    272    273    274    275    277    278    279     28    280    281 
##      2      3      1     14      3      5      2      8      5     48    138 
##    282    284    286    287    288    289     29    290    291    292    293 
##      4      1     21      5      1      2      2     17     12     18      5 
##    297    301    302    304    305    307    308    309     31    311    312 
##      7      1      5      2      1     36     33      1     17      2      3 
##    313    314    316    317    318    319     32    320    321    323    324 
##      1      1      2      9      1      3      1      1      2     10      9 
##    325    329    330    331    332    333    334    337    338     34    341 
##      2     12      4     61      2     11      3     25     12      8      5 
##    342    343    346    347    348    349     35    350    351    352    353 
##     48     29     14      1     59      2      1      3      2      1      4 
##    355    356    357    358    360    361    362    364    365    366    367 
##     13     10      5      7     12      2      2      6     29     24     14 
##    368    369     37    370    371    372    373    376    377    378    379 
##      1      5     10      2     11      3      1      1      5      3      9 
##     38    380    382    383    384    385    386    388     39    390    391 
##     51     12      5      6      9     30      1      7      8     13      2 
##    392    393    394    395    396    397    398    399     40    400    401 
##      4      1      6      4     18     15      1     11 113520      2      1 
##    402    403    405    407    408    409    410    411    412    413    415 
##      1      2    119     22     15     12      5      2      1      1      1 
##    416    417    418    419     42    420    421    422    423    424    425 
##      1      1     25      1      5      1      9      1      2     24      1 
##    426    428    429     43    433    435    436    437    439    442    443 
##      4     13      2     29      2     12      2      7      6      1      5 
##    444    445    446    447    448     45    450    451    452    454    455 
##      5      4      1      2      4    250     10      6      4      1      1 
##    456    457    458    459     46    460    461    465    466     47    470 
##      2      3      2      5     26      3      1     12      3     72      5 
##    477    478    479     48    481    482    483    484    485    486    487 
##     23      2      1      5      1      2      2      2     14      2      1 
##    489     49    490    491    492    494    496    497    498    499    501 
##      1      5      5      2      2      4      1      1     58      1      1 
##    504    506    507     51    511    512    513    514    515    516    518 
##     11      1     23     99      6      3      2      2      6      1      2 
##     52    520    521    523    525    528     53    530    531    534    539 
##      2      1      7     19     15      2      8      5      1      2      2 
##     54    541    543     59      6     61     62     64     65     67     68 
##      1      1      2      7      1      2     47      1      1    267     46 
##     71     72     73     76     77     78      8     80     81     82     83 
##      2     30      3      1      1     22      1      1     23     14      9 
##     84     85     86     88      9     91     92     93     94     96     99 
##      3      2     32     22     37     48     13      3     87      1     12 
##   NULL 
##      0
cleaned.df$company1 <- as.numeric(cleaned.df$company)
cleaned.df$company_category <- cut(cleaned.df$company1, breaks = c(-Inf, 92, 94, 96,98, 100, 102, Inf), labels = c("90-92", "93-94", "95-96", "97-98", "99-100", "100-101", "101-102"))

# Visualize compact data

ggplot(cleaned.df, aes(x = company_category, fill = company_category)) +
  geom_bar() +
  labs(title = "Company code Frequency Distribution", x = "Company Category", y = "Frequency") +
  theme_minimal()

# Further analyzing the data

# Deposit Type

table(cleaned.df$deposit_type)
## 
## No Deposit Non Refund Refundable 
##     104641      14587        162
t <- 104641 + 14587 + 162
(n <- 104641/t * 100)
## [1] 87.64637
(nr <- 14587/t * 100)
## [1] 12.21794
(r <- 162/t * 100)
## [1] 0.1356898
# Load the RColorBrewer package
library(RColorBrewer)

# Define the data
categories <- c("No Deposit - 87.64%", "Non Refund - 12.21%", "Refundable - 0.14%")
values <- c(104641, 14587, 162)

# Create a color palette with three colors
colors <- brewer.pal(3, "Set3")

# Create the pie chart with custom colors
pie(values, labels = categories, main = "Distribution of DepositType", col = colors)

# The majority of bookings (87.64%) were made without any deposit, while a smaller portion (12.21%) involved a non-refundable deposit equal to or exceeding the total stay cost. A negligible percentage (0.14%) of bookings were made with a refundable deposit amounting to less than the total stay cost.
# Integer Variables

# Adults

summary(cleaned.df$adults)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   1.856   2.000  55.000
plot(cleaned.df$arrival_date_month, cleaned.df$adults, xlab = "Months", ylab = "Number of adults coming at once", las = 2)

# There are few outliers, we cannot remove those, because those outlies altogether form a big chunk of the dataset. It even shows the ability of the hotel to cater the huge number of people arriving in groups at once.

# Children

summary(cleaned.df$children)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1039  0.0000 10.0000       4
plot(cleaned.df$arrival_date_month, cleaned.df$children, xlab = "Months", ylab = "Number of children coming at once", las = 2)

# Stays in week nights

summary(cleaned.df$stays_in_week_nights)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     1.0     2.0     2.5     3.0    50.0
boxplot(summary(cleaned.df$stays_in_week_nights), main = "Statistics summary for weekend nights", xlab = "Stats", ylab = "Stays in week nights")

# One outlier, most bookings are for week nights less than 10. A very few booking/s are for more than 45 days, close to 50.

# Stays in weekend nights

summary(cleaned.df$stays_in_weekend_nights)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.9276  2.0000 19.0000
boxplot(summary(cleaned.df$stays_in_weekend_nights), main = "Statistics summary for weekend nights", xlab = "Stats", ylab = "Stays in weekend nights")

# One outlier, most bookings are for weekend nights less than 5. A very few booking/s are for more than 15 weekend nights.
# Month Variable

table(cleaned.df$arrival_date_month)
## 
##     April    August  December  February   January      July      June     March 
##     11089     13877      6780      8068      5929     12661     10939      9794 
##       May  November   October September 
##     11791      6794     11160     10508
barplot(sort(table(cleaned.df$arrival_date_month), decreasing = TRUE), main = "Distribution of bookings over various months", ylab = "Distribution", las = 2, col = c("lightblue"))

# July and August shows more number of bookings during the entire span of three years.

# Year 

(table(cleaned.df$arrival_date_year))
## 
##  2015  2016  2017 
## 21996 56707 40687
x <- as.numeric(names(table(cleaned.df$arrival_date_year)))
class(x)
## [1] "numeric"
y <- as.numeric(c("21996", "56707", "4068"))
class(y)
## [1] "numeric"
plot(x, y, type = "o", main = "Distribution of bookings over various years", xlab = "Year", ylab = "Distribution")

# We can see the bookings dipped in the year 2017 and 2016 being the year with most number of hotel bookings.
table(cleaned.df$hotel)
## 
##   City Hotel Resort Hotel 
##        79330        40060
# Percentage of hotels vs resorts:

(hotels = sum(cleaned.df$hotel == "City Hotel")/length(cleaned.df$hotel) * 100)
## [1] 66.4461
(resorts = sum(cleaned.df$hotel == "Resort Hotel")/length(cleaned.df$hotel) * 100)
## [1] 33.5539
# 66.46% guests are visiting city hotel. Whereas 33.55% guests are visiting Resort Hotel.

# Visualization

stayat <- c("Hotel visitors - 66.46%" = hotels, "Resort Visitors - 33.55%" = resorts)
pie(stayat, main = "Distribution for guests if they are arriving at the hotel or at the resort", col = c("lightblue", "lightgreen"))

# In this stage, we have dropped all the un-neccessary categorical values and we have divided the dataset into two parts. The hotel provides two types of stay: Resort and City hotel. 


# In the previous phase, we had already removed and replaced NULL and undefined values to get a clean dataset.

### RESORT HOTEL DATASET


# Converting the categorical variable to numeric

cleaned.df$is_canceled <- as.numeric(cleaned.df$is_canceled)
View(cleaned.df)


# Running logistic regression on all 32 variables is time consuming, system doesnt support.

# Filtering the dataset, removing the unnecessary columns

finaldf <- subset(cleaned.df, select = -c(4, 5, 6, 7, 13, 14, 15, 16, 20, 21, 23, 24, 25, 27, 31, 32, 33, 34, 35, 36))
head(finaldf)
##          hotel is_canceled lead_time stays_in_weekend_nights
## 1 Resort Hotel           0       342                       0
## 2 Resort Hotel           0       737                       0
## 3 Resort Hotel           0         7                       0
## 4 Resort Hotel           0        13                       0
## 5 Resort Hotel           0        14                       0
## 6 Resort Hotel           0        14                       0
##   stays_in_week_nights adults children babies is_repeated_guest
## 1                    0      2        0      0                 0
## 2                    0      2        0      0                 0
## 3                    1      1        0      0                 0
## 4                    1      1        0      0                 0
## 5                    2      2        0      0                 0
## 6                    2      2        0      0                 0
##   previous_cancellations previous_bookings_not_canceled booking_changes
## 1                      0                              0               3
## 2                      0                              0               4
## 3                      0                              0               0
## 4                      0                              0               0
## 5                      0                              0               0
## 6                      0                              0               0
##   days_in_waiting_list adr required_car_parking_spaces
## 1                    0   0                           0
## 2                    0   0                           0
## 3                    0  75                           0
## 4                    0  75                           0
## 5                    0  98                           0
## 6                    0  98                           0
##   total_of_special_requests
## 1                         0
## 2                         0
## 3                         0
## 4                         0
## 5                         1
## 6                         1
str(finaldf)
## 'data.frame':    119390 obs. of  16 variables:
##  $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ is_canceled                   : num  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
# Correlation between the categorical variables (0&1)

count = table(finaldf$is_canceled, finaldf$is_repeated_guest, dnn = c("iscancelled", "isrepeatedguest"))

round(prop.table(count, margin = 2), 2)
##            isrepeatedguest
## iscancelled    0    1
##           0 0.62 0.86
##           1 0.38 0.14
# Correlation Matrix for final df

library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.3.3
head(finaldf)
##          hotel is_canceled lead_time stays_in_weekend_nights
## 1 Resort Hotel           0       342                       0
## 2 Resort Hotel           0       737                       0
## 3 Resort Hotel           0         7                       0
## 4 Resort Hotel           0        13                       0
## 5 Resort Hotel           0        14                       0
## 6 Resort Hotel           0        14                       0
##   stays_in_week_nights adults children babies is_repeated_guest
## 1                    0      2        0      0                 0
## 2                    0      2        0      0                 0
## 3                    1      1        0      0                 0
## 4                    1      1        0      0                 0
## 5                    2      2        0      0                 0
## 6                    2      2        0      0                 0
##   previous_cancellations previous_bookings_not_canceled booking_changes
## 1                      0                              0               3
## 2                      0                              0               4
## 3                      0                              0               0
## 4                      0                              0               0
## 5                      0                              0               0
## 6                      0                              0               0
##   days_in_waiting_list adr required_car_parking_spaces
## 1                    0   0                           0
## 2                    0   0                           0
## 3                    0  75                           0
## 4                    0  75                           0
## 5                    0  98                           0
## 6                    0  98                           0
##   total_of_special_requests
## 1                         0
## 2                         0
## 3                         0
## 4                         0
## 5                         1
## 6                         1
# converting hotel column into a dummy variable

finaldf$hotel = ifelse(finaldf$hotel == "Resort Hotel", 1, 0)
str(finaldf)
## 'data.frame':    119390 obs. of  16 variables:
##  $ hotel                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ is_canceled                   : num  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
# Correlation Matrix

ggcorrplot(cor(finaldf), type = "lower", lab = TRUE,
method = "circle", title = "Correlation Matrix Heatmap")

# As we can see in the heatmap of correlation matrix for resort dataset, there is no multi-colinearity in the dataset, which is apparently, a best case scenario.
# Dividing the data into training dataset and testing dataset for Resort df:

# Randomizing the dataset

finaldf.randomized = finaldf[sample(nrow(finaldf), replace = FALSE), ]

# Determine the number of rows in the dataset
nrows = nrow(finaldf.randomized)

# Calculate the number of rows for the training set (80%)
trainsize = round(0.8 * nrows)

# Create the training and testing sets

finaldf.train = finaldf.randomized[1:trainsize, ]
finaldf.test = finaldf.randomized[(trainsize + 1): nrows, ]
#Regression with resort training dataset

mod1 = glm(is_canceled ~ ., data = finaldf.train, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(mod1)
## 
## Call:
## glm(formula = is_canceled ~ ., family = "binomial", data = finaldf.train)
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -1.412e+00  3.391e-02 -41.640  < 2e-16 ***
## hotel                          -2.454e-01  1.799e-02 -13.640  < 2e-16 ***
## lead_time                       4.780e-03  7.918e-05  60.363  < 2e-16 ***
## stays_in_weekend_nights        -5.658e-03  8.863e-03  -0.638    0.523    
## stays_in_week_nights            2.313e-02  4.814e-03   4.805 1.54e-06 ***
## adults                          1.104e-01  1.652e-02   6.685 2.30e-11 ***
## children                        8.048e-02  2.050e-02   3.925 8.67e-05 ***
## babies                          9.270e-02  8.930e-02   1.038    0.299    
## is_repeated_guest              -1.071e+00  9.440e-02 -11.350  < 2e-16 ***
## previous_cancellations          2.939e+00  6.358e-02  46.227  < 2e-16 ***
## previous_bookings_not_canceled -5.755e-01  2.917e-02 -19.729  < 2e-16 ***
## booking_changes                -6.409e-01  1.762e-02 -36.363  < 2e-16 ***
## days_in_waiting_list           -1.923e-03  4.006e-04  -4.799 1.59e-06 ***
## adr                             6.379e-03  1.896e-04  33.642  < 2e-16 ***
## required_car_parking_spaces    -4.044e+01  6.850e+01  -0.590    0.555    
## total_of_special_requests      -7.142e-01  1.156e-02 -61.803  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 125824  on 95510  degrees of freedom
## Residual deviance:  99795  on 95495  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 99827
## 
## Number of Fisher Scoring iterations: 14
finaldf.train$is_canceled = as.factor(finaldf.train$is_canceled)
head(finaldf.train)
##        hotel is_canceled lead_time stays_in_weekend_nights stays_in_week_nights
## 59829      0           1       482                       0                    2
## 94226      0           0         4                       1                    1
## 94020      0           0        25                       0                    1
## 6126       1           1       101                       0                    3
## 114184     0           0       124                       2                    5
## 30682      1           0         0                       0                    1
##        adults children babies is_repeated_guest previous_cancellations
## 59829       2        0      0                 0                      0
## 94226       1        0      0                 1                      0
## 94020       2        0      0                 0                      0
## 6126        2        0      0                 0                      0
## 114184      2        0      0                 0                      0
## 30682       1        0      0                 0                      0
##        previous_bookings_not_canceled booking_changes days_in_waiting_list
## 59829                               0               0                    0
## 94226                               8               0                    0
## 94020                               0               0                    0
## 6126                                0               0                    0
## 114184                              0               0                    0
## 30682                               0               0                    0
##           adr required_car_parking_spaces total_of_special_requests
## 59829   62.80                           0                         0
## 94226   65.00                           0                         1
## 94020  101.00                           0                         1
## 6126    58.00                           0                         1
## 114184 136.29                           0                         0
## 30682   38.00                           1                         1
# Remove column "previous_bookings_not_canceled" from finaldf.train
finaldf.train <- finaldf.train[, !colnames(finaldf.train) %in% "previous_bookings_not_canceled"]

# Display the modified dataframe
head(finaldf.train)
##        hotel is_canceled lead_time stays_in_weekend_nights stays_in_week_nights
## 59829      0           1       482                       0                    2
## 94226      0           0         4                       1                    1
## 94020      0           0        25                       0                    1
## 6126       1           1       101                       0                    3
## 114184     0           0       124                       2                    5
## 30682      1           0         0                       0                    1
##        adults children babies is_repeated_guest previous_cancellations
## 59829       2        0      0                 0                      0
## 94226       1        0      0                 1                      0
## 94020       2        0      0                 0                      0
## 6126        2        0      0                 0                      0
## 114184      2        0      0                 0                      0
## 30682       1        0      0                 0                      0
##        booking_changes days_in_waiting_list    adr required_car_parking_spaces
## 59829                0                    0  62.80                           0
## 94226                0                    0  65.00                           0
## 94020                0                    0 101.00                           0
## 6126                 0                    0  58.00                           0
## 114184               0                    0 136.29                           0
## 30682                0                    0  38.00                           1
##        total_of_special_requests
## 59829                          0
## 94226                          1
## 94020                          1
## 6126                           1
## 114184                         0
## 30682                          1
mod2 = glm(is_canceled ~ ., data = finaldf.train, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(mod2)
## 
## Call:
## glm(formula = is_canceled ~ ., family = "binomial", data = finaldf.train)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -1.442e+00  3.393e-02 -42.504  < 2e-16 ***
## hotel                       -2.585e-01  1.791e-02 -14.437  < 2e-16 ***
## lead_time                    4.919e-03  7.883e-05  62.399  < 2e-16 ***
## stays_in_weekend_nights     -6.065e-03  8.818e-03  -0.688    0.492    
## stays_in_week_nights         2.333e-02  4.794e-03   4.867 1.13e-06 ***
## adults                       1.336e-01  1.662e-02   8.037 9.19e-16 ***
## children                     8.241e-02  2.054e-02   4.013 6.00e-05 ***
## babies                       1.097e-01  8.869e-02   1.237    0.216    
## is_repeated_guest           -1.829e+00  8.030e-02 -22.781  < 2e-16 ***
## previous_cancellations       2.074e+00  4.701e-02  44.112  < 2e-16 ***
## booking_changes             -6.569e-01  1.769e-02 -37.133  < 2e-16 ***
## days_in_waiting_list        -2.035e-03  4.009e-04  -5.077 3.83e-07 ***
## adr                          6.331e-03  1.897e-04  33.369  < 2e-16 ***
## required_car_parking_spaces -2.917e+01  3.580e+01  -0.815    0.415    
## total_of_special_requests   -7.295e-01  1.153e-02 -63.242  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 125824  on 95510  degrees of freedom
## Residual deviance: 100949  on 95496  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 100979
## 
## Number of Fisher Scoring iterations: 14
# How well the model is being trained for RESORT DATASET?

predicted.train.prob.resort = predict(mod2, newdata = finaldf.train, type = "response")

levels(factor(finaldf.train$is_canceled))
## [1] "0" "1"
# Use cutoff point of 0.5 to convert the probabilities to class

predicted.train.class.resort = ifelse(predicted.train.prob.resort > 0.5, 1, 0)
#Building the confusion matrix

library(caret)
## Loading required package: lattice
confusionMatrix(as.factor(predicted.train.class.resort),as.factor(finaldf.train$is_canceled))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 52663 17994
##          1  7557 17297
##                                           
##                Accuracy : 0.7325          
##                  95% CI : (0.7297, 0.7353)
##     No Information Rate : 0.6305          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3884          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8745          
##             Specificity : 0.4901          
##          Pos Pred Value : 0.7453          
##          Neg Pred Value : 0.6959          
##              Prevalence : 0.6305          
##          Detection Rate : 0.5514          
##    Detection Prevalence : 0.7398          
##       Balanced Accuracy : 0.6823          
##                                           
##        'Positive' Class : 0               
## 
# Accuracy: The accuracy of the model is 75.2%, indicating that 75.2% of the predictions made by the model are correct.

# 95% Confidence Interval (CI): The CI for the accuracy of the model is between 74.73% and 75.67%. This means that we are 95% confident that the true accuracy of the model falls within this range.

# Sensitivity (True Positive Rate): The sensitivity of the model is 96.31%. This indicates that the model correctly identifies 96.31% of the positive cases.

# Specificity (True Negative Rate): The specificity of the model is 20.11%. This means that the model correctly identifies only 20.11% of the negative cases.

# Interpretation: This model demonstrates very high sensitivity, indicating its strong ability to correctly identify positive cases. However, its specificity is quite low, suggesting that it may have a higher rate of false positives. This could indicate that the model is biased towards predicting positive cases.
# For resort dataset

# Checking the performance of the model on testing dataset.

predicted.test.prob.resort = predict(mod1, newdata = finaldf.test, type = "response")

# First 5 actual and predicted records

data.frame(actual = finaldf.test$is_canceled[1:5], predicted = predicted.test.prob.resort[1:5])
##        actual predicted
## 25155       0 0.1028243
## 113435      0 0.3159719
## 88879       0 0.3645734
## 46378       1 0.4697018
## 74978       1 0.9683660
# Use cutoff point of 0.5 to convert probabilities to class

predicted.test.class.resort = ifelse(predicted.test.prob.resort > 0.5, 1, 0)

# Confusion Matrix of the said data

confusionMatrix(as.factor(predicted.test.class.resort), as.factor(finaldf.test$is_canceled))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13091  4470
##          1  1855  4459
##                                           
##                Accuracy : 0.7351          
##                  95% CI : (0.7294, 0.7407)
##     No Information Rate : 0.626           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3988          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8759          
##             Specificity : 0.4994          
##          Pos Pred Value : 0.7455          
##          Neg Pred Value : 0.7062          
##              Prevalence : 0.6260          
##          Detection Rate : 0.5483          
##    Detection Prevalence : 0.7355          
##       Balanced Accuracy : 0.6876          
##                                           
##        'Positive' Class : 0               
## 
# Accuracy: The accuracy of the model is 74.61%, indicating that 74.61% of the predictions made by the model are correct.

# 95% Confidence Interval (CI): The CI for the accuracy of the model is between 73.65% and 75.56%. This means that we are 95% confident that the true accuracy of the model falls within this range.

# Sensitivity (True Positive Rate): The sensitivity of the model is 95.72%. This indicates that the model correctly identifies 95.72% of the positive cases.

# Specificity (True Negative Rate): The specificity of the model is 20.37%. This means that the model correctly identifies 20.37% of the negative cases.

# Interpretation: This model demonstrates very high sensitivity, indicating its strong ability to correctly identify positive cases. However, similar to previous models, its specificity is quite low, suggesting that it may have a higher rate of false positives. This could indicate that the model is biased towards predicting positive cases.
# Compute the AUC for resort dataset

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Compute ROC curve
roc.resort <- roc(finaldf.test$is_canceled, predicted.test.prob.resort)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Compute AUC
auc_value <- round(auc(roc.resort), 3)

# Plot ROC curve
plot(roc.resort)

# Add AUC value to the plot
text(0.8, 0.2, paste("AUC =", auc_value), adj = c(0, 1), cex = 1.2)

# Area under the curve is 79.95%. It can be better
# run CART algorithm

library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
model.cart = rpart(is_canceled ~ ., data = finaldf.train, method = "class")

prp(model.cart)

# Make predictions with default cut-off point of 0.5

predict_class.cart = predict(model.cart, newdata = finaldf.test, type = "class")

# Build the confusion Matrix

library(caret)
confusionMatrix(data = predict_class.cart, reference = as.factor(finaldf.test$is_canceled))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13242  4101
##          1  1704  4831
##                                           
##                Accuracy : 0.7569          
##                  95% CI : (0.7514, 0.7623)
##     No Information Rate : 0.6259          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4512          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8860          
##             Specificity : 0.5409          
##          Pos Pred Value : 0.7635          
##          Neg Pred Value : 0.7393          
##              Prevalence : 0.6259          
##          Detection Rate : 0.5546          
##    Detection Prevalence : 0.7263          
##       Balanced Accuracy : 0.7134          
##                                           
##        'Positive' Class : 0               
## 
# Make predictions with a different cut-off point

predict_prob.cart = predict(model.cart, newdata = finaldf.test, type = "prob")

predict_class.cart = ifelse(predict_prob.cart[,2] >= 0.7, 1, 0)

# Confusion Matrix for the different cut-off point


levels(predict_class.cart)
## NULL
levels(finaldf.test$is_canceled)
## NULL
library(caret)
confusionMatrix(data = as.factor(predict_class.cart), reference = as.factor(finaldf.test$is_canceled))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 14648  6856
##          1   298  2076
##                                           
##                Accuracy : 0.7004          
##                  95% CI : (0.6945, 0.7062)
##     No Information Rate : 0.6259          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2493          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9801          
##             Specificity : 0.2324          
##          Pos Pred Value : 0.6812          
##          Neg Pred Value : 0.8745          
##              Prevalence : 0.6259          
##          Detection Rate : 0.6135          
##    Detection Prevalence : 0.9006          
##       Balanced Accuracy : 0.6062          
##                                           
##        'Positive' Class : 0               
## 
# Compute AUC

library(pROC)

roc.curve = roc(response = finaldf.test$is_canceled, predictor = predict_prob.cart[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc.curve)
## Area under the curve: 0.7854
plot(roc.curve)

# Reduce overfitting
# Area under the curve is 77.73%, better than logistic regression model.
# Define cross-validation experiment

library(caret)
library(e1071)
numFolds = trainControl(method = "cv", number = 5)
cpGrid = expand.grid(.cp = seq(0.001, 0.5,0.005))

# Perfrom Cross Validation

finaldf.train = finaldf.train[complete.cases(finaldf.train),]

train(is_canceled ~ ., data = finaldf.train, method = "rpart", trControl = numFolds, tuneGrid = cpGrid)
## CART 
## 
## 95511 samples
##    14 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 76409, 76409, 76409, 76409, 76408 
## Resampling results across tuning parameters:
## 
##   cp     Accuracy   Kappa     
##   0.001  0.7722147  0.48669666
##   0.006  0.7557769  0.44151257
##   0.011  0.7445320  0.43309321
##   0.016  0.7445320  0.43309321
##   0.021  0.7411396  0.43375148
##   0.026  0.7321251  0.43128498
##   0.031  0.7321251  0.43128498
##   0.036  0.7321251  0.43128498
##   0.041  0.7321251  0.43128498
##   0.046  0.7321251  0.43128498
##   0.051  0.7198123  0.37280794
##   0.056  0.6961923  0.25875287
##   0.061  0.6961923  0.25875287
##   0.066  0.6961923  0.25875287
##   0.071  0.6961923  0.25875287
##   0.076  0.6961923  0.25875287
##   0.081  0.6961923  0.25875287
##   0.086  0.6961923  0.25875287
##   0.091  0.6706243  0.16502847
##   0.096  0.6569189  0.09055975
##   0.101  0.6569189  0.09055975
##   0.106  0.6569189  0.09055975
##   0.111  0.6569189  0.09055975
##   0.116  0.6569189  0.09055975
##   0.121  0.6569189  0.09055975
##   0.126  0.6305033  0.00000000
##   0.131  0.6305033  0.00000000
##   0.136  0.6305033  0.00000000
##   0.141  0.6305033  0.00000000
##   0.146  0.6305033  0.00000000
##   0.151  0.6305033  0.00000000
##   0.156  0.6305033  0.00000000
##   0.161  0.6305033  0.00000000
##   0.166  0.6305033  0.00000000
##   0.171  0.6305033  0.00000000
##   0.176  0.6305033  0.00000000
##   0.181  0.6305033  0.00000000
##   0.186  0.6305033  0.00000000
##   0.191  0.6305033  0.00000000
##   0.196  0.6305033  0.00000000
##   0.201  0.6305033  0.00000000
##   0.206  0.6305033  0.00000000
##   0.211  0.6305033  0.00000000
##   0.216  0.6305033  0.00000000
##   0.221  0.6305033  0.00000000
##   0.226  0.6305033  0.00000000
##   0.231  0.6305033  0.00000000
##   0.236  0.6305033  0.00000000
##   0.241  0.6305033  0.00000000
##   0.246  0.6305033  0.00000000
##   0.251  0.6305033  0.00000000
##   0.256  0.6305033  0.00000000
##   0.261  0.6305033  0.00000000
##   0.266  0.6305033  0.00000000
##   0.271  0.6305033  0.00000000
##   0.276  0.6305033  0.00000000
##   0.281  0.6305033  0.00000000
##   0.286  0.6305033  0.00000000
##   0.291  0.6305033  0.00000000
##   0.296  0.6305033  0.00000000
##   0.301  0.6305033  0.00000000
##   0.306  0.6305033  0.00000000
##   0.311  0.6305033  0.00000000
##   0.316  0.6305033  0.00000000
##   0.321  0.6305033  0.00000000
##   0.326  0.6305033  0.00000000
##   0.331  0.6305033  0.00000000
##   0.336  0.6305033  0.00000000
##   0.341  0.6305033  0.00000000
##   0.346  0.6305033  0.00000000
##   0.351  0.6305033  0.00000000
##   0.356  0.6305033  0.00000000
##   0.361  0.6305033  0.00000000
##   0.366  0.6305033  0.00000000
##   0.371  0.6305033  0.00000000
##   0.376  0.6305033  0.00000000
##   0.381  0.6305033  0.00000000
##   0.386  0.6305033  0.00000000
##   0.391  0.6305033  0.00000000
##   0.396  0.6305033  0.00000000
##   0.401  0.6305033  0.00000000
##   0.406  0.6305033  0.00000000
##   0.411  0.6305033  0.00000000
##   0.416  0.6305033  0.00000000
##   0.421  0.6305033  0.00000000
##   0.426  0.6305033  0.00000000
##   0.431  0.6305033  0.00000000
##   0.436  0.6305033  0.00000000
##   0.441  0.6305033  0.00000000
##   0.446  0.6305033  0.00000000
##   0.451  0.6305033  0.00000000
##   0.456  0.6305033  0.00000000
##   0.461  0.6305033  0.00000000
##   0.466  0.6305033  0.00000000
##   0.471  0.6305033  0.00000000
##   0.476  0.6305033  0.00000000
##   0.481  0.6305033  0.00000000
##   0.486  0.6305033  0.00000000
##   0.491  0.6305033  0.00000000
##   0.496  0.6305033  0.00000000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001.
# Creating a new cart model with optimal value of cp

predict_class.cart.CV = rpart(is_canceled ~ ., data = finaldf.train, method = "class", cp = 0.006)

# Make Predictions

predict_class.cart = predict(predict_class.cart.CV, newdata = finaldf.test, type = "class")

levels(predict_class.cart)
## [1] "0" "1"
levels(finaldf.test$is_canceled)
## NULL
confusionMatrix(data = predict_class.cart, reference = as.factor(finaldf.test$is_canceled))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13242  4101
##          1  1704  4831
##                                           
##                Accuracy : 0.7569          
##                  95% CI : (0.7514, 0.7623)
##     No Information Rate : 0.6259          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4512          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8860          
##             Specificity : 0.5409          
##          Pos Pred Value : 0.7635          
##          Neg Pred Value : 0.7393          
##              Prevalence : 0.6259          
##          Detection Rate : 0.5546          
##    Detection Prevalence : 0.7263          
##       Balanced Accuracy : 0.7134          
##                                           
##        'Positive' Class : 0               
## 
# RANDOM FOREST

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
sum(is.na(finaldf.train))
## [1] 0
finaldf.train = na.omit(finaldf.train)

finaldf.train$is_canceled = as.factor(finaldf.train$is_canceled)

# Building Random Forest

model.rf = randomForest(is_canceled ~ ., data = finaldf.train, ntree = 50, importance = TRUE, mtry = 5)

# Make Predictions

predict_class.rf = predict(model.rf, newdata = finaldf.test)

levels(predict_class.rf)
## [1] "0" "1"
levels(finaldf.test$is_canceled)
## NULL
confusionMatrix(data = predict_class.rf, reference =as.factor(finaldf.test$is_canceled))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13663  2699
##          1  1283  6230
##                                           
##                Accuracy : 0.8332          
##                  95% CI : (0.8284, 0.8379)
##     No Information Rate : 0.626           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6321          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9142          
##             Specificity : 0.6977          
##          Pos Pred Value : 0.8350          
##          Neg Pred Value : 0.8292          
##              Prevalence : 0.6260          
##          Detection Rate : 0.5723          
##    Detection Prevalence : 0.6853          
##       Balanced Accuracy : 0.8059          
##                                           
##        'Positive' Class : 0               
## 
# Random Forest with cross-validation

#search_grid = expand.grid(.mtry = c(1:14))
#train(is_canceled ~ ., data = finaldf.train, method = "rf", trControl = numFolds, tuneGrid = search_grid, ntree = 50, nodesize = 10)

# Variable Importance

varImpPlot(model.rf, type = 1)

# Calculate probabilities for the positive class
probabilities <- predict(model.rf, newdata = finaldf.test, type = "prob")[, "1"]

# Create ROC curve object
roc_obj <- roc(finaldf.test$is_canceled, probabilities)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_obj, main = "ROC Curve for Random Forest Model",
     xlab = "False Positive Rate", ylab = "True Positive Rate")

# Add AUC value to the plot
auc_val <- auc(roc_obj)
legend("bottomright", legend = paste("AUC =", round(auc_val, 2)), bty = "n")

# Print AUC value
print(paste("AUC:", round(auc_val, 2)))
## [1] "AUC: 0.89"

K-means Clustering!!

# K-means Clustering

# Hierarchical

# Computing Euclidean Distance

# The data is too large to calculate the euclidean distance

# Without calculating Euclidean Distance

# Standardize input variables

# Coverting Y variable to numeric from factor

finaldf.train$is_canceled = as.numeric(finaldf.train$is_canceled)

finaldf.train.std = sapply(finaldf.train, scale)
str(finaldf.train)
## 'data.frame':    95511 obs. of  15 variables:
##  $ hotel                      : num  0 0 0 1 0 1 0 0 0 1 ...
##  $ is_canceled                : num  2 1 1 2 1 1 2 2 2 1 ...
##  $ lead_time                  : int  482 4 25 101 124 0 69 120 133 75 ...
##  $ stays_in_weekend_nights    : int  0 1 0 0 2 0 2 2 0 2 ...
##  $ stays_in_week_nights       : int  2 1 1 3 5 1 4 3 3 1 ...
##  $ adults                     : int  2 1 2 2 2 1 1 1 2 2 ...
##  $ children                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_repeated_guest          : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ booking_changes            : int  0 0 0 0 0 0 0 2 0 0 ...
##  $ days_in_waiting_list       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ adr                        : num  62.8 65 101 58 136.3 ...
##  $ required_car_parking_spaces: int  0 0 0 0 0 1 0 0 0 0 ...
##  $ total_of_special_requests  : int  0 1 1 1 0 1 0 0 0 0 ...
# Add row names: finaldf

# row.names(finaldf.train.std) = row.names(finaldf.train.std)

# Compute standardized distance

# df.dist = finaldf.train.std[sample(nrow(finaldf.train.std), nrow(finaldf.train.std) * 0.25)]
# dist.std = dist(df.dist, method = "euclidean")

# hClustering = hclust(dist.std, method = "ward")
# Non-hierarchical

narows = apply(finaldf.train.std, 1, function(row) any(is.na(row) | is.nan(row) | is.infinite(row)))

# Remove rows

finaldf.train.std = finaldf.train.std[!narows, ]

# K-means clustering

k=7
set.seed(123)
kmClustering = kmeans(finaldf.train.std, centers = k)

# Show Cluster membership

s = kmClustering$cluster

# Cluster statistics

kmClustering$centers
##         hotel  is_canceled  lead_time stays_in_weekend_nights
## 1  1.16374757 -0.105647455  0.4176796              1.51111170
## 2 -0.40946745 -0.582914641 -0.2059924             -0.05346063
## 3  0.66277133 -0.681514538 -0.5946749             -0.27289318
## 4  0.01563604 -0.759758988 -0.4471041             -0.28067558
## 5  0.33522331  0.005767307 -0.1379478              0.21627090
## 6 -0.51865283  0.909807385  2.0019031             -0.39834349
## 7 -0.29783452  1.306279525 -0.1089238             -0.13787909
##   stays_in_week_nights      adults   children      babies is_repeated_guest
## 1            1.8264433  0.14678106 -0.1832751 -0.07007546        -0.1614260
## 2           -0.1256073  0.48264076 -0.1434408 -0.07958246        -0.1813142
## 3           -0.3307525 -0.28591340 -0.1466892 -0.03802210         1.9490224
## 4           -0.3598094 -0.44526020 -0.2223697 -0.07968009        -0.1813142
## 5            0.2210426  0.19090650  3.5032865  1.37204260        -0.1572152
## 6           -0.2173627  0.08307433 -0.2505135 -0.08010403        -0.1016734
## 7           -0.1900797  0.01336179 -0.1982927 -0.08010403        -0.1813142
##   previous_cancellations booking_changes days_in_waiting_list        adr
## 1            -0.07524925      0.24424879          -0.11731732 -0.1405052
## 2            -0.09678149     -0.06457517          -0.12370772  0.3296197
## 3             0.08688877      0.14148488          -0.12709505 -0.2164521
## 4            -0.10059071      0.09659335          -0.09278039 -0.3701471
## 5            -0.09547695      0.38369645          -0.12888438  1.4470065
## 6             0.61588641     -0.21136720           1.08881070 -0.4017119
## 7            -0.03301575     -0.21919510          -0.11280256  0.0905418
##   required_car_parking_spaces total_of_special_requests
## 1                  -0.2063743                0.05352444
## 2                  -0.2520843                1.16071374
## 3                   2.5712547                0.20207500
## 4                  -0.2546143               -0.49924713
## 5                   0.1833781                0.38168342
## 6                  -0.2537052               -0.58176909
## 7                  -0.2546143               -0.44444007
kmClustering$size
## [1] 10025 19276  7645 23715  4964  8941 20945
# Define a function to assign probabilities based on cluster distance
assign_probabilities <- function(data, kmClustering) {
  # Calculate distances to cluster centroids
  distances <- stats::dist(rbind(data, kmClustering$centers))
  # Assign probabilities based on the inverse of distances
  probabilities <- 1 / apply(as.matrix(distances), 1, min)
  # Normalize probabilities
  probabilities <- probabilities / sum(probabilities)
  return(probabilities)
}

# Apply the function to the training data
# train_probabilities <- assign_probabilities(finaldf.train.std, kmClustering)

# Create ROC curve object
# roc_obj <- roc(finaldf.train.std$is_canceled, train_probabilities)

# Plot ROC curve
# plot(roc_obj, main = "ROC Curve for K-means Clustering",
     
# xlab = "False Positive Rate", ylab = "True Positive Rate")

# Add AUC value to the plot
# auc_val <- auc(roc_obj)
# legend("bottomright", legend = paste("AUC =", round(auc_val, 2)), bty = "n")

# Print AUC value
# print(paste("AUC:", round(auc_val, 2)))
# Plot an empty scatter plot

plot(c(0), xaxt = 'n', ylab = "", type = "l", ylim = c(min(kmClustering$centers), max(kmClustering$centers)), xlim = c(0,8))

# label x-axis

axis(1, at = c(1:8), labels = names(finaldf)[1:8])

# Plot centroids

for (i in c(1:k)) lines(kmClustering$centers[i,], lty = i, lwd = 2, col = ifelse( i %in% c(1,3,5), "black", "dark grey"))

# Name Clusters

text(x = 0.5, y = kmClustering$cebters[,1], labels = paste("Cluster", c(1:k)))

# Determining the optimal value of k

# Calculate total within-cluster sum of squares (WSS) for different values of k
wss <- numeric(10)
for (i in 1:10) {
  kmClustering <- kmeans(finaldf.train.std, centers = i)
  wss[i] <- sum(kmClustering$withinss)
}

# Plot the elbow chart
plot(1:10, wss, type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of clusters (k)", ylab = "Total within-cluster sum of squares (WSS)",
     main = "Elbow Method for Optimal k")

# Add a line to indicate the 'elbow'
elbow_point <- which(diff(wss) < mean(diff(wss)))[1]
points(elbow_point, wss[elbow_point], col = "red", cex = 2, pch = 19)
text(elbow_point, wss[elbow_point], labels = c("Elbow Point"), pos = 3, col = "red")