Introduction
Airline passenger satisfaction is a crucial metric for firms in the airline industry. Understanding the factors that contribute to customer satisfaction is essential for airlines to improve their services and compete effectively; high market saturation, as well as low profit margins, can magnify the effects of small advantages or disadvantages relative to other firms (Lutz et al., 2012; Hardee, 2023). In this research, we will analyze various factors that affect airline passenger satisfaction and, ultimately, judge their suitability for a regression model predicting passenger satisfaction. We will leverage a Kaggle dataset that includes surveyed passenger characteristics, flight details, and satisfaction ratings for select pre-flight and in-flight components (Klein, 2020). To ensure modeling suitability, we will conduct exploratory data analysis, taking into account variable distributions and types.
With our research, we aim to answer a few main questions. For one, to what extent do certain surveyed passenger characteristics and flight experience components impact the likelihood that a passenger will be satisfied – rather than neutral or dissatisfied – with their trip? This is the key focus of our research; we want to identify meaningful inputs for satisfaction and estimate the magnitude of their effects. Secondly, how can we model the likelihood of passenger satisfaction using surveyed passenger characteristics and flight experience components in a manner that minimizes predictive bias? While assembling our models, we need to ensure that issues such as multicollinearity and overfitting do not jeopardize our models’ predictive validity. Finally, to what extent can we predict the likelihood that a flight passenger will be satisfied with their experience using variables at multiple levels of measurement? Our dataset contains continuous, ordinal, and categorical variables, each of which can require differing assumptions when used in modeling; incorporating these different variable types into a single model is an important step in predicting satisfaction.
The dataset for our research on airline passenger satisfaction contains various variables, which can be categorized into three types: continuous, categorical, and ordinal. Continuous variables include passenger age, flight distance, arrival delays, and departure delays. Categorical variables include gender, customer type (loyalty), the type of travel (business or personal), and the travel class (business, economy, or economy plus). Ordinal variables comprise ratings on a 0-5 scale for specific aspects of the flight experience, with 0 indicating that the question was not applicable. The “Satisfaction” variable represents the airline passenger’s satisfaction level and includes two categories: “satisfied” or “neutral or dissatisfied.” This will be our primary outcome variable for analysis.
Variable limitations
While the analysis and insight generation opportunities are manifold, certain fields in this dataset present challenges that limit a resulting model’s predictive validity. One critical issue is data collection: while some variable-related documentation is available, we cannot discern from the Kaggle source the circumstances under which this survey was distributed (Klein, 2020). The population may have been sampled through methods—such as convenience sampling—that make the resulting data less representative of the overall population despite the large observation count. The overall population in question is also unclear; the survey may have focused on a particular airport or region, limiting potential predictive validity in alternative settings.
Another issue is that the document does not elaborate upon what counts as a “loyal” or “disloyal” customer for the customer type field. This makes it difficult to properly interpret the effects of such a variable in a regression model. The threshold for disloyalty could potentially range from using any other airlines at all to using other airlines a majority of the time, drastically altering any potential real-world applications.
A third—but not final—problematic factor is that ticket prices are not included in this survey, with class serving as a rough proxy; intuitively, such prices could play a major role in shaping passengers’ service expectations and their subsequent ratings. The lack of price ranges associated with seat class also makes it difficult to encode the three categories in a way that accurately captures the disparities between them.
X | id | Gender | Customer.Type | Age | Type.of.Travel | Class | Flight.Distance | Inflight.wifi.service | Departure.Arrival.time.convenient | Ease.of.Online.booking | Gate.location | Food.and.drink | Online.boarding | Seat.comfort | Inflight.entertainment | On.board.service | Leg.room.service | Baggage.handling | Checkin.service | Inflight.service | Cleanliness | Departure.Delay.in.Minutes | Arrival.Delay.in.Minutes | satisfaction |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 70172 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18 | neutral or dissatisfied |
1 | 5047 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6 | neutral or dissatisfied |
2 | 110028 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0 | satisfied |
3 | 24026 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9 | neutral or dissatisfied |
4 | 119299 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0 | satisfied |
Data structure
## 'data.frame': 103904 obs. of 25 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ id : int 70172 5047 110028 24026 119299 111157 82113 96462 79485 65725 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Customer.Type : chr "Loyal Customer" "disloyal Customer" "Loyal Customer" "Loyal Customer" ...
## $ Age : int 13 25 26 25 61 26 47 52 41 20 ...
## $ Type.of.Travel : chr "Personal Travel" "Business travel" "Business travel" "Business travel" ...
## $ Class : chr "Eco Plus" "Business" "Business" "Business" ...
## $ Flight.Distance : int 460 235 1142 562 214 1180 1276 2035 853 1061 ...
## $ Inflight.wifi.service : int 3 3 2 2 3 3 2 4 1 3 ...
## $ Departure.Arrival.time.convenient: int 4 2 2 5 3 4 4 3 2 3 ...
## $ Ease.of.Online.booking : int 3 3 2 5 3 2 2 4 2 3 ...
## $ Gate.location : int 1 3 2 5 3 1 3 4 2 4 ...
## $ Food.and.drink : int 5 1 5 2 4 1 2 5 4 2 ...
## $ Online.boarding : int 3 3 5 2 5 2 2 5 3 3 ...
## $ Seat.comfort : int 5 1 5 2 5 1 2 5 3 3 ...
## $ Inflight.entertainment : int 5 1 5 2 3 1 2 5 1 2 ...
## $ On.board.service : int 4 1 4 2 3 3 3 5 1 2 ...
## $ Leg.room.service : int 3 5 3 5 4 4 3 5 2 3 ...
## $ Baggage.handling : int 4 3 4 3 4 4 4 5 1 4 ...
## $ Checkin.service : int 4 1 4 1 3 4 3 4 4 4 ...
## $ Inflight.service : int 5 4 4 4 3 4 5 5 1 3 ...
## $ Cleanliness : int 5 1 5 2 3 1 2 4 2 2 ...
## $ Departure.Delay.in.Minutes : int 25 1 0 11 0 0 9 4 0 0 ...
## $ Arrival.Delay.in.Minutes : num 18 6 0 9 0 0 23 0 0 0 ...
## $ satisfaction : chr "neutral or dissatisfied" "neutral or dissatisfied" "satisfied" "neutral or dissatisfied" ...
Data dimensions
This is a data frame with 103904 observations (rows) and 25 variables (columns). Assuming that a robust sampling method was utilized, the large number of observations may allow us to conclude that the data is generally representative of the actual population.
An initial description of the data
## data
##
## 25 Variables 103904 Observations
## ------------------------------------------------------------
## X
## n missing distinct Info Mean Gmd
## 103904 0 103904 1 51952 34635
## .05 .10 .25 .50 .75 .90
## 5195 10390 25976 51952 77927 93513
## .95
## 98708
##
## lowest : 0 1 2 3 4
## highest: 103899 103900 103901 103902 103903
## ------------------------------------------------------------
## id
## n missing distinct Info Mean Gmd
## 103904 0 103904 1 64924 43260
## .05 .10 .25 .50 .75 .90
## 6593 13044 32534 64857 97368 116884
## .95
## 123410
##
## lowest : 1 2 3 4 5
## highest: 129874 129875 129878 129879 129880
## ------------------------------------------------------------
## Gender
## n missing distinct
## 103904 0 2
##
## Value Female Male
## Frequency 52727 51177
## Proportion 0.507 0.493
## ------------------------------------------------------------
## Customer.Type
## n missing distinct
## 103904 0 2
##
## Value disloyal Customer Loyal Customer
## Frequency 18981 84923
## Proportion 0.183 0.817
## ------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd
## 103904 0 75 1 39.38 17.32
## .05 .10 .25 .50 .75 .90
## 14 20 27 40 51 59
## .95
## 64
##
## lowest : 7 8 9 10 11, highest: 77 78 79 80 85
## ------------------------------------------------------------
## Type.of.Travel
## n missing distinct
## 103904 0 2
##
## Value Business travel Personal Travel
## Frequency 71655 32249
## Proportion 0.69 0.31
## ------------------------------------------------------------
## Class
## n missing distinct
## 103904 0 3
##
## Value Business Eco Eco Plus
## Frequency 49665 46745 7494
## Proportion 0.478 0.450 0.072
## ------------------------------------------------------------
## Flight.Distance
## n missing distinct Info Mean Gmd
## 103904 0 3802 1 1189 1066
## .05 .10 .25 .50 .75 .90
## 175 236 414 843 1743 2750
## .95
## 3383
##
## lowest : 31 56 67 73 74, highest: 4243 4502 4817 4963 4983
## ------------------------------------------------------------
## Inflight.wifi.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.956 2.73 1.492
##
## Value 0 1 2 3 4 5
## Frequency 3103 17840 25830 25868 19794 11469
## Proportion 0.030 0.172 0.249 0.249 0.191 0.110
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Departure.Arrival.time.convenient
## n missing distinct Info Mean Gmd
## 103904 0 6 0.962 3.06 1.716
##
## Value 0 1 2 3 4 5
## Frequency 5300 15498 17191 17966 25546 22403
## Proportion 0.051 0.149 0.165 0.173 0.246 0.216
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Ease.of.Online.booking
## n missing distinct Info Mean Gmd
## 103904 0 6 0.961 2.757 1.578
##
## Value 0 1 2 3 4 5
## Frequency 4487 17525 24021 24449 19571 13851
## Proportion 0.043 0.169 0.231 0.235 0.188 0.133
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Gate.location
## n missing distinct Info Mean Gmd
## 103904 0 6 0.952 2.977 1.437
##
## Value 0 1 2 3 4 5
## Frequency 1 17562 19459 28577 24426 13879
## Proportion 0.000 0.169 0.187 0.275 0.235 0.134
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Food.and.drink
## n missing distinct Info Mean Gmd
## 103904 0 6 0.956 3.202 1.499
##
## Value 0 1 2 3 4 5
## Frequency 107 12837 21988 22300 24359 22313
## Proportion 0.001 0.124 0.212 0.215 0.234 0.215
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Online.boarding
## n missing distinct Info Mean Gmd
## 103904 0 6 0.951 3.25 1.501
##
## Value 0 1 2 3 4 5
## Frequency 2428 10692 17505 21804 30762 20713
## Proportion 0.023 0.103 0.168 0.210 0.296 0.199
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Seat.comfort
## n missing distinct Info Mean Gmd
## 103904 0 6 0.945 3.439 1.462
##
## Value 0 1 2 3 4 5
## Frequency 1 12075 14897 18696 31765 26470
## Proportion 0.000 0.116 0.143 0.180 0.306 0.255
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Inflight.entertainment
## n missing distinct Info Mean Gmd
## 103904 0 6 0.95 3.358 1.49
##
## Value 0 1 2 3 4 5
## Frequency 14 12478 17637 19139 29423 25213
## Proportion 0.000 0.120 0.170 0.184 0.283 0.243
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## On.board.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.947 3.382 1.433
##
## Value 0 1 2 3 4 5
## Frequency 3 11872 14681 22833 30867 23648
## Proportion 0.000 0.114 0.141 0.220 0.297 0.228
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Leg.room.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.95 3.351 1.471
##
## Value 0 1 2 3 4 5
## Frequency 472 10353 19525 20098 28789 24667
## Proportion 0.005 0.100 0.188 0.193 0.277 0.237
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Baggage.handling
## n missing distinct Info Mean Gmd
## 103904 0 5 0.926 3.632 1.282
##
## Value 1 2 3 4 5
## Frequency 7237 11521 20632 37383 27131
## Proportion 0.070 0.111 0.199 0.360 0.261
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Checkin.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.946 3.304 1.408
##
## Value 0 1 2 3 4 5
## Frequency 1 12890 12893 28446 29055 20619
## Proportion 0.000 0.124 0.124 0.274 0.280 0.198
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Inflight.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.924 3.64 1.274
##
## Value 0 1 2 3 4 5
## Frequency 3 7084 11457 20299 37945 27116
## Proportion 0.000 0.068 0.110 0.195 0.365 0.261
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Cleanliness
## n missing distinct Info Mean Gmd
## 103904 0 6 0.953 3.286 1.471
##
## Value 0 1 2 3 4 5
## Frequency 12 13318 16132 24574 27179 22689
## Proportion 0.000 0.128 0.155 0.237 0.262 0.218
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Departure.Delay.in.Minutes
## n missing distinct Info Mean Gmd
## 103904 0 446 0.82 14.82 24.68
## .05 .10 .25 .50 .75 .90
## 0 0 0 0 12 44
## .95
## 78
##
## lowest : 0 1 2 3 4, highest: 933 978 1017 1305 1592
## ------------------------------------------------------------
## Arrival.Delay.in.Minutes
## n missing distinct Info Mean Gmd
## 103594 310 455 0.823 15.18 25.15
## .05 .10 .25 .50 .75 .90
## 0 0 0 0 13 44
## .95
## 79
##
## lowest : 0 1 2 3 4, highest: 952 970 1011 1280 1584
## ------------------------------------------------------------
## satisfaction
## n missing distinct
## 103904 0 2
##
## Value neutral or dissatisfied satisfied
## Frequency 58879 45025
## Proportion 0.567 0.433
## ------------------------------------------------------------
Data pre-processing
Duplicate values
We first imported the data into R using the read.csv() function; the first few rows of the dataset are included above. The data required some cleaning before use in testing. One issue was that the arrival delays field included a number of NA values; we elected to replace these with the median delay. This method was chosen over other replacement options, such as the mean, due to the skewed distribution of values detailed later on. Apart from that, ratings responses equaling 0 indicate that the question was not applicable; respondents who selected this option for any of the ratings variables were filtered out to ensure that all of the individual ratings are relevant for all observations. While alternatives exist, such as replacement, the large number of initial observations limited our concerns over a potential loss in predictive validity. All steps were repeated for both the training and testing datasets.
Examining variable distributions
Following data pre-processing, we plotted variable distributions to attempt to identify potential trends and correlations. Given a robust sampling method, we can safely assume that these distributions (including the highly skewed ones) are representative of the overall population. Initially, none of the categorical fields appear to be highly correlated, but we intend to confirm this using variance inflation factor (VIF) analysis following initial model creation (“vif: Variance Inflation Factors”, n.d.). Looking at the distribution of class, Eco Plus has a significantly lower observation frequency than the other two. In addition, as noted earlier, the magnitudes of increments between Eco, Eco Plus, and Business are not clear; we noted that some transformation may be required later to ensure modeling suitability.
When plotting continuous variable distributions, flight distance as well as both delay variables have a strong right skew. This makes sense intuitively; we would expect most flights to have minimal to no delays, and shorter flights are likely more frequent. Age appears to be bimodal to a degree, with a small peak around 20-25 and another peak roughly around 35-50. Depending on the type of regression that is ultimately selected, some of these variables may require aggressive transformations to better approximate normal distributions. Many of the distributions for individual ratings variables look quite similar, raising multicollinearity concerns that will be addressed later.
Frequency distributions for categorical variables
Frequency distributions for continuous variables
Frequency distributions for ordinal variables (Ratings)
Distributions with respect to satisfaction
We also used plots to visually discern differences in continuous variables between satisfied and unsatisfied groups, potentially revealing significant model inputs. The first step was to use box-plots for continuous variables. We found that older passengers tend to be more satisfied with their flights compared to their younger counterparts. Also, on average, passengers who embark on longer journeys tend to report higher levels of satisfaction. The basis for this trend is unclear at this time, but further investigation may yield actionable conclusions in this regard. Flights experiencing greater departure delays appear to have a slightly higher proportion of neutral or dissatisfied customers, which supports the intuition that prolonged delays before takeoff may negatively affect passenger contentment. Similarly to departure delays, flights with higher arrival delays tend to exhibit a marginally increased prevalence of neutral or dissatisfied customers. This underscores the potential impact of delays—both at departure and arrival—on passenger satisfaction, although more investigation was required to uncover the exact nature of this relationship. A scatterplot uncovered potential multicollinearity concerns to be addressed later.
Histograms for categorical variables uncovered a distinct trend in terms of customer loyalty. Loyal customers, those who have a history of repeat business with the airline, tend to report higher levels of satisfaction compared to disloyal or infrequent flyers. There is a significant satisfaction discrepancy between individuals traveling for business and personal reasons; a majority of business travelers were satisfied, while an overwhelming proportion of personal travelers expressed dissatisfaction or neutrality. The nature of this relationship, as well as actionable insights that may be drawn from it, are unclear at this point. Business class passengers stand out as notably more satisfied than those in Economy or Economy Plus. If proven to be statistically significant, this factor could spur class-specific service and amenity adjustments for efficient satisfaction gains. It might also warrant future study detailing meaningful distinctions in the flight experience between classes. The notable exception here is gender, across which there were no notable differences in satisfaction.
Continuous variable boxplots
Categorical variable histograms
Continuous variable KDE (Kernel Density Estimation) plots
Arrival and departure delay scatterplot
Correlation matrices
Our final EDA step was to examine multicollinearity; to accomplish this, we built two correlation matrices for continuous and ordinal (ratings) variables, respectively. As observed earlier, arrival and departure delays appear to be highly correlated; certain steps, such as removing one of the two or calculating an average delay variable, would likely be necessary before use in a predictive model. We also found that certain ratings variables have strong positive correlations with each other. If these are included in the model without adjustments, our model may suffer a loss in reliability. To avoid this issue, we elected to combine the ratings variables into two groups—based on the degree of correlation—and use the average ratings from these two groups as model inputs.
Continuous variable correlations
## Age Flight.Distance
## Min. :-0.016 Min. :-0.004
## 1st Qu.:-0.014 1st Qu.:-0.001
## Median : 0.035 Median : 0.042
## Mean : 0.264 Mean : 0.270
## 3rd Qu.: 0.312 3rd Qu.: 0.312
## Max. : 1.000 Max. : 1.000
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes
## Min. :-0.013 Min. :-0.016
## 1st Qu.:-0.003 1st Qu.:-0.007
## Median : 0.480 Median : 0.478
## Mean : 0.487 Mean : 0.485
## 3rd Qu.: 0.970 3rd Qu.: 0.970
## Max. : 1.000 Max. : 1.000
Ratings variable correlations
Aggregated ratings variable inclusions and summary statistics
Ratings Group 1: Pre-Flight & Wi-Fi | Ratings Group 2: In-Flight & Baggage |
---|---|
In-Flight Wifi Service | Food and Drink |
Departure / Arrival Time | Seat Comfort |
Ease of Online Booking | In-Flight Entertainment |
Gate Location | Onboard Service |
Online Boarding | Leg Room Service |
Baggage Handling | |
Check-In Service | |
In-Flight Service | |
Cleanliness |
## Pre_Flight_and_WiFi_Ratings In_Flight_and_Baggage_Ratings
## Min. :1.00 Min. :1.11
## 1st Qu.:2.40 1st Qu.:2.78
## Median :3.00 Median :3.44
## Mean :3.04 Mean :3.41
## 3rd Qu.:3.80 3rd Qu.:4.00
## Max. :5.00 Max. :5.00
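The aggregation into the two averaged ratings groups can be sketched as follows; the two synthetic rows are illustrative, but the column groupings match the table above.

```r
# Column groupings taken from the aggregation table in the report
pre_flight_cols <- c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
                     "Ease.of.Online.booking", "Gate.location", "Online.boarding")
in_flight_cols  <- c("Food.and.drink", "Seat.comfort", "Inflight.entertainment",
                     "On.board.service", "Leg.room.service", "Baggage.handling",
                     "Checkin.service", "Inflight.service", "Cleanliness")

# Two synthetic respondents: all 3s and all 4s across the 14 ratings
df <- as.data.frame(matrix(rep(c(3, 4), each = 14), nrow = 2, byrow = TRUE))
names(df) <- c(pre_flight_cols, in_flight_cols)

# Average each group of ratings into a single model input
df$Pre_Flight_and_WiFi_Ratings   <- rowMeans(df[pre_flight_cols])
df$In_Flight_and_Baggage_Ratings <- rowMeans(df[in_flight_cols])
```

Averaging keeps the aggregated inputs on the same 1-5 scale as the original ratings, which preserves the coefficient interpretation used later.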
Probability and standard OLS estimates
Before engaging in further analysis, we first identified that satisfaction—as a categorical/binary variable—runs into a fundamental interpretation issue under a standard linear model: the model’s predictions are not bounded between 0 and 1 in the same manner as our satisfaction variable. Under certain inputs, the linear model predicts unattainable values outside the range spanned by neutral/dissatisfied and satisfied (encoded as 0 and 1, respectively), and key assumptions of linearity and homoskedasticity are violated.
Despite this restriction, linear probability models remain in widespread use, particularly among social scientists, making this a potentially fruitful avenue for a predictive model (Allison, 2015). This largely stems from ease of interpretation and estimation; unlike logit (to be discussed later), the linear probability model directly predicts changes in probability rather than odds ratios, is easier to run, and approximates logit estimates within the 0.2-0.8 probability range in most cases (Allison, 2020). We generated a linear model and used a t-test with robust standard errors to account for the violated homoskedasticity assumption.
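The fitting approach can be sketched as follows, assuming the sandwich and lmtest packages are installed; the small synthetic data frame (two predictors only) stands in for the survey data.

```r
library(sandwich)  # heteroskedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for robust t-tests

# Synthetic stand-in data with an assumed true relationship
set.seed(1)
n  <- 500
df <- data.frame(
  In_Flight_and_Baggage_Ratings = runif(n, 1, 5),
  Arrival.Delay.in.Minutes      = rexp(n, rate = 1 / 15)
)
p <- pmin(pmax(0.05 + 0.20 * df$In_Flight_and_Baggage_Ratings -
                 0.002 * df$Arrival.Delay.in.Minutes, 0), 1)
df$satisfaction <- rbinom(n, 1, p)

# Linear probability model, then t-tests with robust (HC1) standard errors
lpm <- lm(satisfaction ~ In_Flight_and_Baggage_Ratings +
            Arrival.Delay.in.Minutes, data = df)
coeftest(lpm, vcov = vcovHC(lpm, type = "HC1"))
```

The robust covariance matrix only changes the standard errors and t-values, not the coefficient estimates themselves.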
##
## Call:
## lm(formula = satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings +
## Arrival.Delay.in.Minutes, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0817 -0.2223 0.0047 0.1975 1.4188
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.29e+00 6.59e-03 -196.47
## Gender -3.76e-04 2.13e-03 -0.18
## Customer.Type 3.57e-01 3.39e-03 105.29
## Age 1.72e-04 7.43e-05 2.31
## Type.of.Travel 4.36e-01 3.08e-03 141.47
## Class 1.24e-01 2.95e-03 42.12
## Flight.Distance 5.98e-06 1.24e-06 4.83
## Pre_Flight_and_WiFi_Ratings 9.04e-02 1.18e-03 76.51
## In_Flight_and_Baggage_Ratings 2.28e-01 1.46e-03 156.62
## Arrival.Delay.in.Minutes -4.61e-04 2.75e-05 -16.74
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Gender 0.860
## Customer.Type < 2e-16 ***
## Age 0.021 *
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Flight.Distance 1.4e-06 ***
## Pre_Flight_and_WiFi_Ratings < 2e-16 ***
## In_Flight_and_Baggage_Ratings < 2e-16 ***
## Arrival.Delay.in.Minutes < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.33 on 95694 degrees of freedom
## Multiple R-squared: 0.555, Adjusted R-squared: 0.555
## F-statistic: 1.33e+04 on 9 and 95694 DF, p-value: <2e-16
##
## t test of coefficients:
##
## Estimate Std. Error t value
## (Intercept) -1.29e+00 5.75e-03 -225.32
## Gender -3.76e-04 2.14e-03 -0.18
## Customer.Type 3.57e-01 3.89e-03 91.83
## Age 1.72e-04 7.58e-05 2.27
## Type.of.Travel 4.36e-01 3.39e-03 128.45
## Class 1.24e-01 3.35e-03 37.12
## Flight.Distance 5.98e-06 1.22e-06 4.90
## Pre_Flight_and_WiFi_Ratings 9.04e-02 1.25e-03 72.10
## In_Flight_and_Baggage_Ratings 2.28e-01 1.53e-03 149.51
## Arrival.Delay.in.Minutes -4.61e-04 3.06e-05 -15.06
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Gender 0.860
## Customer.Type < 2e-16 ***
## Age 0.023 *
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Flight.Distance 9.8e-07 ***
## Pre_Flight_and_WiFi_Ratings < 2e-16 ***
## In_Flight_and_Baggage_Ratings < 2e-16 ***
## Arrival.Delay.in.Minutes < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on our linear model, all inputs apart from gender and age have statistically significant impacts on satisfaction likelihood. As mentioned earlier, one major advantage of the linear model is that coefficients can be easily interpreted. For instance, loyal customers display a 0.357 (35.7 percentage point) increase in predicted satisfaction probability relative to others. In a similar vein, the model predicts a 43.6 percentage point higher satisfaction probability for passengers traveling for business relative to others. For the non-binary aggregated ratings, a 1-point increase corresponds to predicted satisfaction probability increases of 9.04 and 22.8 percentage points for the pre-flight and in-flight groups, respectively.
However, to confirm that the linear model is indeed a practically valuable predictor, we can’t rely solely on the dataset used for training; our source provides a second testing dataset for which we can repeat cleaning/encoding steps and apply our model. Since gender and age are not significant, we elected to remove them prior to this step (marking this as a “v2” model). Using a confusion matrix, we determined that the v2 model’s “accuracy”—the proportion of correctly predicted satisfaction values out of all respondents—is over 80% for the testing dataset. Based on this information, we can conclude that the linear model is a reasonably good predictor that isn’t overfitting the training data.
##
## Call:
## lm(formula = satisfaction ~ Customer.Type + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings +
## Arrival.Delay.in.Minutes, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0819 -0.2222 0.0049 0.1976 1.4174
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.29e+00 6.24e-03 -206.66
## Customer.Type 3.59e-01 3.28e-03 109.58
## Type.of.Travel 4.36e-01 3.07e-03 142.36
## Class 1.25e-01 2.95e-03 42.27
## Flight.Distance 5.88e-06 1.24e-06 4.75
## Pre_Flight_and_WiFi_Ratings 9.05e-02 1.18e-03 76.57
## In_Flight_and_Baggage_Ratings 2.28e-01 1.46e-03 156.68
## Arrival.Delay.in.Minutes -4.61e-04 2.75e-05 -16.77
## Pr(>|t|)
## (Intercept) <2e-16 ***
## Customer.Type <2e-16 ***
## Type.of.Travel <2e-16 ***
## Class <2e-16 ***
## Flight.Distance 2e-06 ***
## Pre_Flight_and_WiFi_Ratings <2e-16 ***
## In_Flight_and_Baggage_Ratings <2e-16 ***
## Arrival.Delay.in.Minutes <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.33 on 95696 degrees of freedom
## Multiple R-squared: 0.555, Adjusted R-squared: 0.555
## F-statistic: 1.7e+04 on 7 and 95696 DF, p-value: <2e-16
##
## 0 1
## 0 11937 1653
## 1 1557 8716
- Accuracy: 0.865
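The reported accuracy follows directly from the confusion matrix counts above: correct predictions (the diagonal) divided by all predictions.

```r
# Confusion matrix counts reported above (rows: predicted, cols: actual)
conf <- matrix(c(11937, 1557, 1653, 8716), nrow = 2,
               dimnames = list(predicted = c(0, 1), actual = c(0, 1)))

# Accuracy = correctly classified / total
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.865
```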
However, it is not yet clear that a linear model would be the best predictor available. Logistic regression, which predicts the log odds of satisfaction, is the dominant approach for modeling binary variables (Allison, 2015). Logistic regression models rely on different assumptions relative to linear models, significantly altering the necessary EDA steps: rather than assuming a linear relationship between the parameters and the dependent variable, logistic regression assumes a linear relationship between the parameters and the log odds. Independence of errors and a lack of multicollinearity remain assumptions for both linear and logistic models, while homoskedasticity and normally distributed residuals are not required under logistic regression (“Assumptions of Logistic Regression”, n.d.).
Odds represent the number of favorable outcomes divided by the number of unfavorable outcomes. Put differently, if p represents the probability of a favorable outcome, Odds = p/(1-p). Log odds take the natural log of the odds, expressed as ln(p/(1-p)) (Agarwal, 2019). We used visual tests to examine whether or not this assumption holds true for continuous variables. While it is not sensible to compute log odds for individual data points, we grouped continuous variables into discrete buckets—calculating the log odds of satisfaction within each bucket—to examine whether they might satisfy this assumption.
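The bucketing approach can be sketched as follows; the data are synthetic, with an assumed logistic relationship between flight distance and satisfaction standing in for the survey.

```r
# Synthetic stand-in: 2000 flights over the observed distance range
set.seed(2)
df <- data.frame(Flight.Distance = runif(2000, 31, 4983))
p  <- plogis(-1 + 0.0005 * df$Flight.Distance)  # assumed true relationship
df$satisfaction <- rbinom(2000, 1, p)

# Group the continuous variable into 10 equal-width buckets, then
# compute the empirical log odds of satisfaction within each bucket
df$bucket <- cut(df$Flight.Distance, breaks = 10)
by_bucket <- aggregate(satisfaction ~ bucket, data = df, FUN = mean)
by_bucket$log_odds <- log(by_bucket$satisfaction / (1 - by_bucket$satisfaction))

# plot(by_bucket$log_odds) would then show whether the trend is roughly linear
```

Plotting the per-bucket log odds against the bucket midpoints is the visual test described above: an approximately straight line supports the linearity-in-log-odds assumption.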
Only flight distance, as well as the in-flight and baggage ratings, displayed roughly linear relationships with the log odds of satisfaction in our testing. Age appeared to have a parabolic relationship, peaking in the middle, indicating that an aggressive transformation (such as adding a quadratic age term) may be necessary to reach a linear relationship. Meanwhile, the log odds for both delay statistics quickly dispersed in both directions as delay durations increased (likely in part due to the limited frequency of higher durations), making it difficult to conclude with certainty that a linear relationship exists. Pre-flight and wi-fi ratings appear to have a significantly looser connection relative to the in-flight ratings, with a potential dip in log odds at average rating levels.
Building and testing a logit model
Testing linearity with log odds
Following visual testing, we generated a logit model in order to examine potential differences relative to the prior linear model. Rather than starting with a pared-down variable list, we returned to an expanded variable list to see if there were any distinctions in what the models deemed statistically significant. This proved to be informative; alongside gender and age, flight distance also failed to reach the threshold for statistical significance.
logit_model = glm(satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel + Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings, data = data, family = "binomial")
summary(logit_model)
##
## Call:
## glm(formula = satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.47e+01 9.86e-02 -148.85
## Gender 5.79e-03 2.06e-02 0.28
## Customer.Type 2.48e+00 3.19e-02 77.57
## Age 7.10e-04 7.42e-04 0.96
## Type.of.Travel 3.33e+00 3.24e-02 102.75
## Class 8.32e-01 2.56e-02 32.53
## Flight.Distance 1.45e-05 1.18e-05 1.23
## Pre_Flight_and_WiFi_Ratings 8.30e-01 1.23e-02 67.58
## In_Flight_and_Baggage_Ratings 1.96e+00 1.67e-02 116.80
## Pr(>|z|)
## (Intercept) <2e-16 ***
## Gender 0.78
## Customer.Type <2e-16 ***
## Age 0.34
## Type.of.Travel <2e-16 ***
## Class <2e-16 ***
## Flight.Distance 0.22
## Pre_Flight_and_WiFi_Ratings <2e-16 ***
## In_Flight_and_Baggage_Ratings <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 61179 on 95695 degrees of freedom
## AIC: 61197
##
## Number of Fisher Scoring iterations: 6
To compare this with the linear model, we generated another confusion matrix from the testing data. As with the linear model, we created a “v2” model that removes the statistically insignificant inputs. The accuracy results were better than those of the linear model, but only slightly; it is unclear whether this marginal improvement would hold given further testing with different survey data. The calculated McFadden pseudo-R^2 falls above 0.5.
##
## Call:
## glm(formula = satisfaction ~ Customer.Type + Type.of.Travel +
## Class + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -14.6554 0.0965 -151.9
## Customer.Type 2.4940 0.0298 83.7
## Type.of.Travel 3.3316 0.0321 103.8
## Class 0.8450 0.0236 35.9
## Pre_Flight_and_WiFi_Ratings 0.8302 0.0123 67.6
## In_Flight_and_Baggage_Ratings 1.9558 0.0167 116.9
## Pr(>|z|)
## (Intercept) <2e-16 ***
## Customer.Type <2e-16 ***
## Type.of.Travel <2e-16 ***
## Class <2e-16 ***
## Pre_Flight_and_WiFi_Ratings <2e-16 ***
## In_Flight_and_Baggage_Ratings <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 61181 on 95698 degrees of freedom
## AIC: 61193
##
## Number of Fisher Scoring iterations: 6
##
## 0 1
## 0 12635 955
## 1 2189 8084
## [1] "Accuracy: 0.868"
## [1] "McFadden R^2: 0.531"
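For reference, the accuracy and McFadden pseudo-R^2 reported above can be computed along the following lines; this is a minimal sketch on simulated data, not the project's actual code:

```r
# Sketch: accuracy and McFadden pseudo-R^2 for a logit model
# (simulated data; the real analysis uses the survey's train/test split)
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 2 * x))
train <- data.frame(x = x[1:1500], y = y[1:1500])
test  <- data.frame(x = x[1501:n], y = y[1501:n])

fit <- glm(y ~ x, data = train, family = "binomial")

# Confusion matrix on held-out data at a 0.5 probability cutoff
pred_class <- as.integer(predict(fit, test, type = "response") > 0.5)
conf <- table(actual = test$y, predicted = pred_class)
accuracy <- sum(diag(conf)) / sum(conf)

# McFadden pseudo-R^2: 1 - residual deviance / null deviance
mcfadden <- 1 - fit$deviance / fit$null.deviance
```

Applying the same deviance ratio to the v2 model's output (1 - 61181/130562) reproduces the reported 0.531.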
Constructing the initial logit model
##
## Call:
## glm(formula = satisfaction ~ Age + Type.of.Travel + Class + Inflight.wifi.service +
## Ease.of.Online.booking + Online.boarding + Seat.comfort +
## Inflight.entertainment + On.board.service + Leg.room.service +
## Baggage.handling + Checkin.service + Inflight.service + Cleanliness +
## Arrival.Delay.in.Minutes, family = binomial(), data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.34e+01 9.93e-02 -134.91
## Age 1.20e-02 7.83e-04 15.33
## Type.of.Travel 2.35e+00 3.28e-02 71.69
## Class 1.26e+00 2.66e-02 47.32
## Inflight.wifi.service 6.25e-01 1.29e-02 48.35
## Ease.of.Online.booking -4.75e-02 1.11e-02 -4.27
## Online.boarding 1.03e+00 1.22e-02 84.35
## Seat.comfort 2.83e-02 1.26e-02 2.25
## Inflight.entertainment 3.12e-01 1.48e-02 21.16
## On.board.service 3.06e-01 1.14e-02 26.74
## Leg.room.service 3.62e-01 9.86e-03 36.74
## Baggage.handling 5.58e-02 1.27e-02 4.40
## Checkin.service 2.49e-01 9.39e-03 26.54
## Inflight.service 1.88e-02 1.34e-02 1.41
## Cleanliness 1.10e-01 1.27e-02 8.67
## Arrival.Delay.in.Minutes -3.87e-03 2.83e-04 -13.68
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## Age < 2e-16 ***
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Inflight.wifi.service < 2e-16 ***
## Ease.of.Online.booking 1.9e-05 ***
## Online.boarding < 2e-16 ***
## Seat.comfort 0.024 *
## Inflight.entertainment < 2e-16 ***
## On.board.service < 2e-16 ***
## Leg.room.service < 2e-16 ***
## Baggage.handling 1.1e-05 ***
## Checkin.service < 2e-16 ***
## Inflight.service 0.159
## Cleanliness < 2e-16 ***
## Arrival.Delay.in.Minutes < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 54781 on 95688 degrees of freedom
## AIC: 54813
##
## Number of Fisher Scoring iterations: 6
Observations
Our initial model yields some interesting observations. For one, most variables have p-values less than 0.05, indicating they significantly influence the dependent variable. Positive coefficients (e.g., ‘Age’, ‘Type of Travel’) suggest a positive relationship with the outcome, whereas negative coefficients (e.g., ‘Ease of Online booking’, ‘Arrival Delay in Minutes’) indicate a negative relationship. Variables with larger coefficients and small standard errors, like ‘Online boarding’ and ‘Type of Travel’, may have a more substantial impact on the outcome. The large difference between the null and residual deviance suggests a good model fit. Some variables, like ‘Inflight service’, do not show statistical significance, implying a weaker or no influence on the dependent variable. Finally, the model seems capable of predicting the outcome effectively, given the significance and size of most coefficients.
## log_pred_class
## 0 1
## 0 12151 1439
## 1 1445 8828
## Age Type.of.Travel
## 1.06 1.42
## Class Inflight.wifi.service
## 1.47 1.85
## Ease.of.Online.booking Online.boarding
## 1.68 1.27
## Seat.comfort Inflight.entertainment
## 1.85 2.41
## On.board.service Leg.room.service
## 1.58 1.18
## Baggage.handling Checkin.service
## 1.71 1.16
## Inflight.service Cleanliness
## 1.87 2.02
## Arrival.Delay.in.Minutes
## 1.02
## Area under the curve: 0.948
The model’s performance was evaluated using various metrics:
- Accuracy: 0.879 (The proportion of true results among the total number of cases)
- Precision: 0.859 (The proportion of true positives among all positive predictions)
- Recall: 0.86 (The proportion of true positives among all actual positives)
- F1 Score: 0.86 (The harmonic mean of precision and recall)
- Specificity: 0.894 (The proportion of true negatives among all actual negatives)
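These metrics all follow from a 2x2 confusion matrix. As a quick arithmetic check, here is a sketch using the test-set counts shown above (assuming rows are actual classes and columns predictions; small rounding differences versus the reported figures are expected):

```r
# Counts from the logit model's test-set confusion matrix above
TN <- 12151; FP <- 1439   # actual 0: predicted 0 / predicted 1
FN <- 1445;  TP <- 8828   # actual 1: predicted 0 / predicted 1

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)   # also called sensitivity
f1          <- 2 * precision * recall / (precision + recall)
specificity <- TN / (TN + FP)

round(c(accuracy = accuracy, precision = precision, recall = recall,
        f1 = f1, specificity = specificity), 3)
```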
The accuracy is quite high, at 87.9%. This means that the model correctly predicts whether a customer is satisfied or not in approximately 88 out of 100 cases. It’s a good indicator of overall performance, but it’s important to consider other metrics as well, especially if the data set is imbalanced.
Both precision and recall are also high, around 86%. Precision indicates that when the model predicts customer satisfaction, it is correct 85.9% of the time. Recall tells us that the model successfully identifies 86% of actual satisfied customers. These metrics are particularly important in scenarios where the costs of false positives and false negatives are different.
The F-Measure, which balances precision and recall, is also 0.86. This suggests a good balance between precision and recall in the model, which is crucial for a well-rounded predictive performance.
The specificity is 89.4%, indicating that the model is quite good at identifying true negatives - i.e., it correctly identifies customers who are not satisfied.
The AUC value is 0.948, which is very close to 1. This high value indicates that the model has an excellent ability to discriminate between satisfied and unsatisfied customers. It implies that the model has a high true positive rate and a low false positive rate.
Overall, the model exhibits strong predictive capabilities across various metrics, indicating that it is well-tuned for this particular task. However, it’s always important to consider the context and the potential impact of misclassifications. Also, examining other aspects like model interpretability, feature importance, and the performance on different segments of the data can provide deeper insights.
VIF results are also generally good, indicating that for most of the model’s predictors, multicollinearity is not a significant issue.
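For context, the VIF for predictor j is 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors; values above roughly 5-10 are the usual warning signs. A hand-rolled sketch on a built-in dataset (the actual check relies on the car package's vif, cited below):

```r
# Sketch: computing VIFs by hand on mtcars (illustration only)
predictors <- mtcars[, c("wt", "hp", "disp")]
vif_manual <- sapply(names(predictors), function(j) {
  others <- setdiff(names(predictors), j)
  r2 <- summary(lm(reformulate(others, response = j),
                   data = predictors))$r.squared
  1 / (1 - r2)          # VIF_j = 1 / (1 - R_j^2)
})
round(vif_manual, 2)
```

By this standard, the values reported above (all below about 2.5) give little cause for concern.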
The ROC curve displayed is highly indicative of an excellent predictive model, with an AUC (Area Under the Curve) of 0.95, showing exceptional discrimination ability between the positive and negative classes. The curve stays well above the diagonal line of no-discrimination, signaling strong performance.
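For intuition on how such a curve and its AUC come about, here is a from-scratch sketch on simulated scores (the report's own curve was produced by an ROC package, not this manual sweep):

```r
# Sketch: ROC curve and AUC computed by hand (simulated scores)
set.seed(7)
labels <- rbinom(500, 1, 0.5)
scores <- labels + rnorm(500)   # informative but noisy classifier scores

# Sweep thresholds, recording TPR and FPR at each
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

# AUC via the rank (Mann-Whitney) formulation
r <- rank(scores)
n1 <- sum(labels == 1); n0 <- sum(labels == 0)
auc <- (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate")
abline(0, 1, lty = 2)   # the no-discrimination diagonal
```

The Mann-Whitney formulation also makes the probabilistic reading of AUC concrete: it is the chance that a randomly chosen positive scores above a randomly chosen negative.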
Decision Tree
Data Preparation
Data Type Conversion
- Certain columns in both training and testing datasets are converted to factors to reflect their ordinal nature.
Column Datatype Changes - Testing Data: Conversion of certain columns to factors based on their ordinal nature.
# Column datatype changes - testing data: these rating columns are ordinal,
# so they are converted to factors
ordinal_cols = c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
                 "Ease.of.Online.booking", "Gate.location", "Food.and.drink",
                 "Online.boarding", "Seat.comfort", "Inflight.entertainment",
                 "On.board.service", "Leg.room.service", "Baggage.handling",
                 "Checkin.service", "Inflight.service", "Cleanliness")
data_test[ordinal_cols] = lapply(data_test[ordinal_cols], as.factor)
Column Datatype Changes - Training Data: Similar data type conversions for training data.
# Column datatype changes - training data: the same ordinal rating columns
# are converted to factors
ordinal_cols = c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
                 "Ease.of.Online.booking", "Gate.location", "Food.and.drink",
                 "Online.boarding", "Seat.comfort", "Inflight.entertainment",
                 "On.board.service", "Leg.room.service", "Baggage.handling",
                 "Checkin.service", "Inflight.service", "Cleanliness")
data[ordinal_cols] = lapply(data[ordinal_cols], as.factor)
Decision Tree Model Building
Initial Model Building: A decision tree (tree) is constructed using various predictors such as customer demographics, service ratings, and flight details.
Variable Importance Analysis: The importance of each variable in the decision tree is evaluated to identify significant predictors.
This analysis helps in understanding which variables (predictors) are most influential in determining the target variable, in this case the ‘satisfaction’ of airline passengers.
- The class of travel and type of travel are the most influential factors in determining passenger satisfaction, indicating the importance of service level and travel purpose.
- Online and inflight services (boarding, entertainment, wifi) are also crucial, emphasizing the importance of digital experience and onboard comfort.
- Personal factors like Age have some influence but are overshadowed by service and experience-related factors.
- Several variables have no discernible impact on satisfaction in this model, suggesting that they might not be critical in the context of this specific dataset or the way the model was constructed.
This analysis provides valuable insights into what factors airlines should focus on to improve passenger satisfaction, particularly emphasizing service quality, both digital and onboard.
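For reference, here is a sketch of how a classification tree and its importance scores can be produced with rpart, using rpart's built-in kyphosis data purely for illustration (the report's table, with its ‘Overall’ column, appears to come from a caret-style varImp call on the real model):

```r
# Sketch: fitting a classification tree and reading variable importance
# with rpart (built-in kyphosis data, illustration only)
library(rpart)
tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              method = "class")

# Named vector of importance scores; predictors never used in a split
# (or as a surrogate) simply do not appear, analogous to the zero rows
# in the report's table
tree$variable.importance
```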
## Overall
## Age 168
## Arrival.Delay.in.Minutes 131
## Class 17608
## Ease.of.Online.booking 1671
## Inflight.entertainment 13115
## Inflight.wifi.service 12888
## Leg.room.service 4009
## On.board.service 2179
## Online.boarding 16997
## Type.of.Travel 17087
## Gender 0
## Customer.Type 0
## Flight.Distance 0
## Departure.Arrival.time.convenient 0
## Gate.location 0
## Food.and.drink 0
## Seat.comfort 0
## Baggage.handling 0
## Checkin.service 0
## Inflight.service 0
## Cleanliness 0
## Departure.Delay.in.Minutes 0
Refined Model: A second decision tree (tree1) is built focusing only on the significant variables identified earlier.
Decision Tree Visualization: The structure of the refined decision tree is visualized using prp.
The decision tree shows a simplified model of how different factors contribute to the outcome of passenger satisfaction, which seems to be categorized as either satisfied or neutral.
Interpretation and Implications:
- Online Boarding is a significant determinant of initial satisfaction. A better online boarding experience leads directly to a higher chance of satisfaction, bypassing other factors.
- Inflight Entertainment is the second most crucial factor; however, its impact is nuanced by the previous experience with online boarding.
- Type of Travel being personal indicates a more significant expectation or reliance on Inflight Entertainment for satisfaction.
- It’s worth noting that the tree uses a binary split for satisfied and neutral, implying that dissatisfaction is possibly grouped with neutrality in this analysis, or dissatisfaction was not an outcome in the training data.
Based on this tree, to improve overall passenger satisfaction, an airline should focus on enhancing the online boarding process and the quality of inflight entertainment, especially for those traveling for personal reasons.
The tree simplifies the prediction of satisfaction and does not account for all the nuances or interactions between different factors but provides a quick and interpretable way to understand key drivers of satisfaction.
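A minimal sketch of how such a refined tree might be fit and drawn follows; the data here is simulated, the column names merely mirror the dataset, and prp is assumed to come from the rpart.plot package:

```r
# Sketch: refined tree on the significant predictors (simulated data)
library(rpart)
set.seed(3)
n <- 3000
d <- data.frame(
  Online.boarding        = sample(1:5, n, replace = TRUE),
  Inflight.entertainment = sample(1:5, n, replace = TRUE),
  Type.of.Travel         = rbinom(n, 1, 0.6)
)
# Simulated satisfaction driven mainly by online boarding
p <- plogis(-4 + 1.2 * d$Online.boarding + 0.4 * d$Inflight.entertainment)
d$satisfaction <- factor(rbinom(n, 1, p))

tree1 <- rpart(satisfaction ~ ., data = d, method = "class")

# rpart.plot::prp(tree1)  # draws the fitted tree
printcp(tree1)            # cp table with cross-validated error
```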
Model Tuning and Evaluation
Cross-Validation Setup: A 10-fold cross-validation is defined for tuning the complexity parameter (cp).
Cross-Validation Execution: The model is trained across a range of cp values to find the optimal model.
## CART
##
## 95704 samples
## 15 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 86133, 86134, 86133, 86134, 86133, 86134, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01 0.277 0.686 0.153
## 0.02 0.304 0.622 0.185
## 0.03 0.312 0.601 0.195
## 0.04 0.312 0.601 0.195
## 0.05 0.312 0.601 0.195
## 0.06 0.334 0.544 0.223
## 0.07 0.334 0.544 0.223
## 0.08 0.334 0.544 0.223
## 0.09 0.389 0.380 0.303
## 0.10 0.389 0.380 0.303
## 0.11 0.389 0.380 0.303
## 0.12 0.389 0.380 0.303
## 0.13 0.426 0.259 0.362
## 0.14 0.426 0.259 0.362
## 0.15 0.426 0.259 0.362
## 0.16 0.426 0.259 0.362
## 0.17 0.426 0.259 0.362
## 0.18 0.426 0.259 0.362
## 0.19 0.426 0.259 0.362
## 0.20 0.426 0.259 0.362
## 0.21 0.426 0.259 0.362
## 0.22 0.426 0.259 0.362
## 0.23 0.426 0.259 0.362
## 0.24 0.426 0.259 0.362
## 0.25 0.426 0.259 0.362
## 0.26 0.494 NaN 0.489
## 0.27 0.494 NaN 0.489
## 0.28 0.494 NaN 0.489
## 0.29 0.494 NaN 0.489
## 0.30 0.494 NaN 0.489
## 0.31 0.494 NaN 0.489
## 0.32 0.494 NaN 0.489
## 0.33 0.494 NaN 0.489
## 0.34 0.494 NaN 0.489
## 0.35 0.494 NaN 0.489
## 0.36 0.494 NaN 0.489
## 0.37 0.494 NaN 0.489
## 0.38 0.494 NaN 0.489
## 0.39 0.494 NaN 0.489
## 0.40 0.494 NaN 0.489
## 0.41 0.494 NaN 0.489
## 0.42 0.494 NaN 0.489
## 0.43 0.494 NaN 0.489
## 0.44 0.494 NaN 0.489
## 0.45 0.494 NaN 0.489
## 0.46 0.494 NaN 0.489
## 0.47 0.494 NaN 0.489
## 0.48 0.494 NaN 0.489
## 0.49 0.494 NaN 0.489
## 0.50 0.494 NaN 0.489
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was cp = 0.01.
The complexity parameter is a measure of the cost of adding additional splits to the tree. A smaller cp value allows for more splits (i.e., a more complex tree), whereas a larger cp value results in fewer splits (i.e., a simpler tree). The tuning process tested cp values from 0.01 up to 0.50.
The performance of the model at each cp value is evaluated using three metrics. RMSE (Root Mean Squared Error) measures the standard deviation of the prediction errors, or residuals; lower values are better, as they indicate less deviation between predicted and actual values. Rsquared is the coefficient of determination, indicating the proportion of the variance in the dependent variable that is predictable from the independent variables; higher values (closer to 1) are better. MAE (Mean Absolute Error) measures the average magnitude of the errors in a set of predictions, without considering their direction; lower values are better. (These are regression metrics, which indicates the satisfaction outcome was treated as numeric during tuning rather than as a class label.)
According to the summary, the optimal model was chosen with a cp value of 0.01. This model has the smallest RMSE (0.277), a reasonably high Rsquared (0.686), and the lowest MAE (0.153), suggesting that it has the best predictive performance among the models tested. As cp increases, the RMSE and MAE tend to increase while Rsquared decreases, which may indicate that the model becomes too simple and starts to underfit the data. The optimal cp of 0.01 suggests that a more complex model performs better on this dataset. Beyond a cp value of 0.25, Rsquared values are not available (NaN), which likely means the pruned tree has collapsed to a single node making constant predictions, leaving Rsquared undefined.
The model was trained on a large sample of 95,704 instances and 15 predictors. The use of 10-fold cross-validation helps to ensure that the evaluation of the model’s performance is robust and not overly dependent on a particular split of the data.
In summary, the CART model performs best with a complexity parameter of 0.01, indicating that a model with more splits (thus more complexity) is better suited to this dataset. This model shows a good balance between bias and variance, with a relatively low prediction error and a decent explanation of variance, as per the given performance metrics.
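The same cp-tuning idea can be sketched with rpart's built-in cross-validation, whose cptable reports a cross-validated error (xerror) per cp value; the report itself tunes cp via caret's 10-fold resampling, so this is only an illustrative stand-in on simulated data:

```r
# Sketch: cp tuning via rpart's built-in cross-validation (xval folds)
library(rpart)
set.seed(11)
n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- factor(as.integer(d$x1 + 0.5 * d$x2 + rnorm(n, sd = 0.2) > 0.75))

# Grow a deliberately deep tree, letting rpart cross-validate each cp
fit <- rpart(y ~ ., data = d, method = "class",
             control = rpart.control(cp = 0.001, xval = 10))

# Pick the cp with the smallest cross-validated error, then prune
cptab <- as.data.frame(fit$cptable)
best_cp <- cptab$CP[which.min(cptab$xerror)]
pruned <- prune(fit, cp = best_cp)
```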
Model Performance Analysis
- ROC Curve Plotting: The Receiver Operating Characteristic (ROC) curve is plotted to evaluate the model’s true positive rate vs. false positive rate.
The Receiver Operating Characteristic (ROC) curve displayed is a graphical representation used to assess the performance of the model. The curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings; the TPR is on the y-axis and the FPR on the x-axis, capturing the trade-off between benefiting from true positives and suffering from false positives. The curve shows a relatively steep ascent toward the upper left corner and then runs close to the top left corner, which indicates a good level of discrimination between the positive and negative classes. The area under the curve (AUC) is 0.89, a value close to 1, which suggests that the model has a high ability to correctly classify positive and negative cases. The closer the AUC is to 1, the better the model is at predicting true positives while minimizing false positives. An AUC of 0.89 means that there is an 89% chance that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance, which is indicative of good model performance.
- Confusion Matrix and Accuracy: The confusion matrix is used to calculate the model’s accuracy at an optimal threshold identified from the ROC curve.
##
## FALSE TRUE
## 0 12139 1451
## 1 1720 8553
- Performance Metrics Calculation: Key metrics including Accuracy, Sensitivity (Recall), Precision, F-Measure, and Specificity are calculated.
The model’s performance was evaluated using various metrics. The results are as follows:
- Accuracy: 0.867 (The proportion of true results among the total number of cases)
- Precision: 0.833 (The proportion of true positives among all positive predictions)
- Recall: 0.855 (The proportion of true positives among all actual positives)
- F1 Score: 0.844 (The harmonic mean of precision and recall)
- Specificity: 0.893 (The proportion of true negatives among all actual negatives)
- AUC-ROC Value: The Area Under the Curve (AUC) for the ROC is computed, providing a single measure of the model’s overall performance.
# Testing-data AUC-ROC (Area Under the Curve - Receiver Operating
# Characteristic) value, extracted from the ROCR performance object
AUC = as.numeric(performance(pred, "auc")@y.values)
- AUC-ROC Value: 0.896
Conclusion
Based on accuracy metrics, we can see that multiple models are reasonably good predictors of customer satisfaction. We have been able to limit multicollinearity through correlation matrices and VIF testing, and to avoid overfitting through testing/training data splits. In addition, the model parameters span multiple variable levels (continuous, ordinal, and categorical). Together, these results address our initial research questions.
While predictive validity varies, there is also an interpretability tradeoff between models; in the linear model, for example, the parameters can be intuitively tied to changes in probability for each unit of an independent variable, whereas the decision tree model and its feature-importance results present insights in a different form. This is especially important in the context of practical outcomes; if we were to present such results to an airline executive, different models might provide different amounts of actionable information concerning inputs.
Citations
Klein, TJ (2020). Airline Passenger Satisfaction. Kaggle. https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction?select=train.csv
Lutz, A., & Lubin, G. (2012). Airlines Have An Insanely Small Profit Margin. Business Insider. https://www.businessinsider.com/airlines-have-a-small-profit-margin-2012-6
Hardee, H. (2023). Frontier reports lacklustre Q3 results as it struggles in ‘over-saturated’ core markets. FlightGlobal. https://www.flightglobal.com/strategy/frontier-reports-lacklustre-q3-results-as-it-struggles-in-over-saturated-core-markets/155561.article
vif: Variance Inflation Factors. (n.d.). R Package Documentation. https://rdrr.io/cran/car/man/vif.html
Allison, P. (2015, April 1). What’s So Special About Logit?. Statistical Horizons. https://statisticalhorizons.com/whats-so-special-about-logit/
Assumptions of Logistic Regression. (n.d.). Statistics Solutions. https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/assumptions-of-logistic-regression/
Agarwal, P. (2019, July 8). WHAT and WHY of Log Odds. Towards Data Science. https://towardsdatascience.com/https-towardsdatascience-com-what-and-why-of-log-odds-64ba988bf704