Introduction
Airline passenger satisfaction is a crucial metric for firms in the airline industry. Understanding the factors that contribute to customer satisfaction is essential for airlines to improve their services and compete effectively; high market saturation, as well as low profit margins, can magnify the effects of small advantages or disadvantages relative to other firms (Lutz et al., 2012; Hardee, 2023). In this research, we will analyze various factors that affect airline passenger satisfaction and, ultimately, judge their suitability for a regression model predicting passenger satisfaction. We will leverage a Kaggle dataset that includes surveyed passenger characteristics, flight details, and satisfaction ratings for select pre-flight and in-flight components (Klein, 2020). To ensure modeling suitability, we will conduct exploratory data analysis, taking into account variable distributions and types.
With our research, we aim to answer a few main questions. For one, to what extent do certain surveyed passenger characteristics and flight experience components impact the likelihood that a passenger will be satisfied – rather than neutral or dissatisfied – with their trip? This is the key focus of our research; we want to identify meaningful inputs for satisfaction and estimate the magnitude of their effects. Secondly, how can we model the likelihood of passenger satisfaction using surveyed passenger characteristics and flight experience components in a manner that minimizes predictive bias? While assembling our models, we need to ensure that issues such as multicollinearity and overfitting do not jeopardize our models’ predictive validity. Finally, to what extent can we predict the likelihood that a flight passenger will be satisfied with their experience using variables at multiple levels of measurement? Our dataset contains continuous, ordinal, and categorical variables, each of which can require differing assumptions when used in modeling; incorporating these different variable types into a single model is an important step in predicting satisfaction.
The dataset for our research on airline passenger satisfaction contains various variables, which can be categorized into three types: continuous, categorical, and ordinal. Continuous variables include passenger age, flight distance, arrival delays, and departure delays. Categorical variables include gender, customer type (loyalty), the type of travel (business or personal), and the travel class (business, economy, or economy plus). Ordinal variables comprise ratings on a 0-5 scale for specific aspects of the flight experience, with 0 indicating that the question was not applicable. The “Satisfaction” variable represents the airline passenger’s satisfaction level and includes two categories: “satisfied” or “neutral or dissatisfied.” This will be our primary outcome variable for analysis.
Variable limitations
While the analysis and insight generation opportunities are manifold, certain fields in this dataset present challenges that limit a resulting model’s predictive validity. One critical issue is data collection: while some variable-related documentation is available, we cannot discern from the Kaggle source the circumstances under which this survey was distributed (Klein, 2020). The population may have been sampled through methods—such as convenience sampling—that make the resulting data less representative of the overall population despite the large observation count. The overall population in question is also unclear; the survey may have focused on a particular airport or region, limiting potential predictive validity in alternative settings.
Another issue is that the document does not elaborate upon what counts as a “loyal” or “disloyal” customer for the customer type field. This makes it difficult to properly interpret the effects of such a variable in a regression model. The threshold for disloyalty could potentially range from using any other airlines at all to using other airlines a majority of the time, drastically altering any potential real-world applications.
A third—but not final—problematic factor is that ticket prices are not included in this survey, with class serving as a rough proxy; intuitively, such prices could play a major role in shaping passengers’ service expectations and their subsequent ratings. The lack of price ranges associated with seat class also makes it difficult to encode the three categories in a way that accurately captures the disparities between them.
X | id | Gender | Customer.Type | Age | Type.of.Travel | Class | Flight.Distance | Inflight.wifi.service | Departure.Arrival.time.convenient | Ease.of.Online.booking | Gate.location | Food.and.drink | Online.boarding | Seat.comfort | Inflight.entertainment | On.board.service | Leg.room.service | Baggage.handling | Checkin.service | Inflight.service | Cleanliness | Departure.Delay.in.Minutes | Arrival.Delay.in.Minutes | satisfaction |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 70172 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18 | neutral or dissatisfied |
1 | 5047 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6 | neutral or dissatisfied |
2 | 110028 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0 | satisfied |
3 | 24026 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9 | neutral or dissatisfied |
4 | 119299 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0 | satisfied |
Data structure
## 'data.frame': 103904 obs. of 25 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ id : int 70172 5047 110028 24026 119299 111157 82113 96462 79485 65725 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Customer.Type : chr "Loyal Customer" "disloyal Customer" "Loyal Customer" "Loyal Customer" ...
## $ Age : int 13 25 26 25 61 26 47 52 41 20 ...
## $ Type.of.Travel : chr "Personal Travel" "Business travel" "Business travel" "Business travel" ...
## $ Class : chr "Eco Plus" "Business" "Business" "Business" ...
## $ Flight.Distance : int 460 235 1142 562 214 1180 1276 2035 853 1061 ...
## $ Inflight.wifi.service : int 3 3 2 2 3 3 2 4 1 3 ...
## $ Departure.Arrival.time.convenient: int 4 2 2 5 3 4 4 3 2 3 ...
## $ Ease.of.Online.booking : int 3 3 2 5 3 2 2 4 2 3 ...
## $ Gate.location : int 1 3 2 5 3 1 3 4 2 4 ...
## $ Food.and.drink : int 5 1 5 2 4 1 2 5 4 2 ...
## $ Online.boarding : int 3 3 5 2 5 2 2 5 3 3 ...
## $ Seat.comfort : int 5 1 5 2 5 1 2 5 3 3 ...
## $ Inflight.entertainment : int 5 1 5 2 3 1 2 5 1 2 ...
## $ On.board.service : int 4 1 4 2 3 3 3 5 1 2 ...
## $ Leg.room.service : int 3 5 3 5 4 4 3 5 2 3 ...
## $ Baggage.handling : int 4 3 4 3 4 4 4 5 1 4 ...
## $ Checkin.service : int 4 1 4 1 3 4 3 4 4 4 ...
## $ Inflight.service : int 5 4 4 4 3 4 5 5 1 3 ...
## $ Cleanliness : int 5 1 5 2 3 1 2 4 2 2 ...
## $ Departure.Delay.in.Minutes : int 25 1 0 11 0 0 9 4 0 0 ...
## $ Arrival.Delay.in.Minutes : num 18 6 0 9 0 0 23 0 0 0 ...
## $ satisfaction : chr "neutral or dissatisfied" "neutral or dissatisfied" "satisfied" "neutral or dissatisfied" ...
Data dimensions
This is a data frame with 103904 observations (rows) and 25 variables (columns). Assuming that a robust sampling method was utilized, the large number of observations may allow us to conclude that the data is generally representative of the actual population.
An initial description of the data
## data
##
## 25 Variables 103904 Observations
## ------------------------------------------------------------
## X
## n missing distinct Info Mean Gmd
## 103904 0 103904 1 51952 34635
## .05 .10 .25 .50 .75 .90
## 5195 10390 25976 51952 77927 93513
## .95
## 98708
##
## lowest : 0 1 2 3 4
## highest: 103899 103900 103901 103902 103903
## ------------------------------------------------------------
## id
## n missing distinct Info Mean Gmd
## 103904 0 103904 1 64924 43260
## .05 .10 .25 .50 .75 .90
## 6593 13044 32534 64857 97368 116884
## .95
## 123410
##
## lowest : 1 2 3 4 5
## highest: 129874 129875 129878 129879 129880
## ------------------------------------------------------------
## Gender
## n missing distinct
## 103904 0 2
##
## Value Female Male
## Frequency 52727 51177
## Proportion 0.507 0.493
## ------------------------------------------------------------
## Customer.Type
## n missing distinct
## 103904 0 2
##
## Value disloyal Customer Loyal Customer
## Frequency 18981 84923
## Proportion 0.183 0.817
## ------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd
## 103904 0 75 1 39.38 17.32
## .05 .10 .25 .50 .75 .90
## 14 20 27 40 51 59
## .95
## 64
##
## lowest : 7 8 9 10 11, highest: 77 78 79 80 85
## ------------------------------------------------------------
## Type.of.Travel
## n missing distinct
## 103904 0 2
##
## Value Business travel Personal Travel
## Frequency 71655 32249
## Proportion 0.69 0.31
## ------------------------------------------------------------
## Class
## n missing distinct
## 103904 0 3
##
## Value Business Eco Eco Plus
## Frequency 49665 46745 7494
## Proportion 0.478 0.450 0.072
## ------------------------------------------------------------
## Flight.Distance
## n missing distinct Info Mean Gmd
## 103904 0 3802 1 1189 1066
## .05 .10 .25 .50 .75 .90
## 175 236 414 843 1743 2750
## .95
## 3383
##
## lowest : 31 56 67 73 74, highest: 4243 4502 4817 4963 4983
## ------------------------------------------------------------
## Inflight.wifi.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.956 2.73 1.492
##
## Value 0 1 2 3 4 5
## Frequency 3103 17840 25830 25868 19794 11469
## Proportion 0.030 0.172 0.249 0.249 0.191 0.110
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Departure.Arrival.time.convenient
## n missing distinct Info Mean Gmd
## 103904 0 6 0.962 3.06 1.716
##
## Value 0 1 2 3 4 5
## Frequency 5300 15498 17191 17966 25546 22403
## Proportion 0.051 0.149 0.165 0.173 0.246 0.216
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Ease.of.Online.booking
## n missing distinct Info Mean Gmd
## 103904 0 6 0.961 2.757 1.578
##
## Value 0 1 2 3 4 5
## Frequency 4487 17525 24021 24449 19571 13851
## Proportion 0.043 0.169 0.231 0.235 0.188 0.133
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Gate.location
## n missing distinct Info Mean Gmd
## 103904 0 6 0.952 2.977 1.437
##
## Value 0 1 2 3 4 5
## Frequency 1 17562 19459 28577 24426 13879
## Proportion 0.000 0.169 0.187 0.275 0.235 0.134
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Food.and.drink
## n missing distinct Info Mean Gmd
## 103904 0 6 0.956 3.202 1.499
##
## Value 0 1 2 3 4 5
## Frequency 107 12837 21988 22300 24359 22313
## Proportion 0.001 0.124 0.212 0.215 0.234 0.215
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Online.boarding
## n missing distinct Info Mean Gmd
## 103904 0 6 0.951 3.25 1.501
##
## Value 0 1 2 3 4 5
## Frequency 2428 10692 17505 21804 30762 20713
## Proportion 0.023 0.103 0.168 0.210 0.296 0.199
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Seat.comfort
## n missing distinct Info Mean Gmd
## 103904 0 6 0.945 3.439 1.462
##
## Value 0 1 2 3 4 5
## Frequency 1 12075 14897 18696 31765 26470
## Proportion 0.000 0.116 0.143 0.180 0.306 0.255
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Inflight.entertainment
## n missing distinct Info Mean Gmd
## 103904 0 6 0.95 3.358 1.49
##
## Value 0 1 2 3 4 5
## Frequency 14 12478 17637 19139 29423 25213
## Proportion 0.000 0.120 0.170 0.184 0.283 0.243
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## On.board.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.947 3.382 1.433
##
## Value 0 1 2 3 4 5
## Frequency 3 11872 14681 22833 30867 23648
## Proportion 0.000 0.114 0.141 0.220 0.297 0.228
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Leg.room.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.95 3.351 1.471
##
## Value 0 1 2 3 4 5
## Frequency 472 10353 19525 20098 28789 24667
## Proportion 0.005 0.100 0.188 0.193 0.277 0.237
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Baggage.handling
## n missing distinct Info Mean Gmd
## 103904 0 5 0.926 3.632 1.282
##
## Value 1 2 3 4 5
## Frequency 7237 11521 20632 37383 27131
## Proportion 0.070 0.111 0.199 0.360 0.261
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Checkin.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.946 3.304 1.408
##
## Value 0 1 2 3 4 5
## Frequency 1 12890 12893 28446 29055 20619
## Proportion 0.000 0.124 0.124 0.274 0.280 0.198
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Inflight.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.924 3.64 1.274
##
## Value 0 1 2 3 4 5
## Frequency 3 7084 11457 20299 37945 27116
## Proportion 0.000 0.068 0.110 0.195 0.365 0.261
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Cleanliness
## n missing distinct Info Mean Gmd
## 103904 0 6 0.953 3.286 1.471
##
## Value 0 1 2 3 4 5
## Frequency 12 13318 16132 24574 27179 22689
## Proportion 0.000 0.128 0.155 0.237 0.262 0.218
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Departure.Delay.in.Minutes
## n missing distinct Info Mean Gmd
## 103904 0 446 0.82 14.82 24.68
## .05 .10 .25 .50 .75 .90
## 0 0 0 0 12 44
## .95
## 78
##
## lowest : 0 1 2 3 4, highest: 933 978 1017 1305 1592
## ------------------------------------------------------------
## Arrival.Delay.in.Minutes
## n missing distinct Info Mean Gmd
## 103594 310 455 0.823 15.18 25.15
## .05 .10 .25 .50 .75 .90
## 0 0 0 0 13 44
## .95
## 79
##
## lowest : 0 1 2 3 4, highest: 952 970 1011 1280 1584
## ------------------------------------------------------------
## satisfaction
## n missing distinct
## 103904 0 2
##
## Value neutral or dissatisfied satisfied
## Frequency 58879 45025
## Proportion 0.567 0.433
## ------------------------------------------------------------
Data pre-processing
Duplicate values
We first imported the data into R using the read.csv() function; the first few rows of the dataset are included above. The data required some cleaning before use in testing. One issue was that the arrival delays field included a number of NA values; we elected to replace these with the median delay. This method was chosen over other replacement options, such as the mean, due to the skewed distribution of values detailed later on. Apart from that, ratings responses equaling 0 indicate that the question was not applicable; respondents who selected this option for any of the ratings variables were filtered out to ensure that all of the individual ratings are relevant for all observations. While alternatives exist, such as replacement, the large number of initial observations limited our concerns over a potential loss in predictive validity. All steps were repeated for both the training and testing datasets.
Examining variable distributions
Following data pre-processing, we plotted variable distributions to attempt to identify potential trends and correlations. Given a robust sampling method, we can safely assume that these distributions (including the highly skewed ones) are representative of the overall population. Initially, none of the categorical fields appear to be highly correlated, but we intend to confirm this using variance inflation factor (VIF) analysis following initial model creation (“vif: Variance Inflation Factors”, n.d.). Looking at the distribution of class, Eco Plus has a significantly lower observation frequency than the other two. In addition, as noted earlier, the magnitudes of increments between Eco, Eco Plus, and Business are not clear; we noted that some transformation may be required later to ensure modeling suitability.
When plotting continuous variable distributions, flight distance as well as both delay variables have a strong right skew. This makes sense intuitively; we would expect most flights to have minimal to no delays, and shorter flights are likely more frequent. Age appears to be bimodal to a degree, with a small peak around 20-25 and another peak roughly around 35-50. Depending on the type of regression that is ultimately selected, some of these variables may require aggressive transformations to better approximate normal distributions. Many of the distributions for individual ratings variables look quite similar, raising multicollinearity concerns that will be addressed later.
Frequency distributions for categorical variables
Frequency distributions for continuous variables
Frequency distributions for ordinal variables (Ratings)
Distributions with respect to satisfaction
We also used plots to visually discern differences in continuous variables between satisfied and unsatisfied groups, potentially revealing significant model inputs. The first step was to use box-plots for continuous variables. We found that older passengers tend to be more satisfied with their flights compared to their younger counterparts. Also, on average, passengers who embark on longer journeys tend to report higher levels of satisfaction. The basis for this trend is unclear at this time, but further investigation may yield actionable conclusions in this regard. Flights experiencing greater departure delays appear to have a slightly higher proportion of neutral or dissatisfied customers, which supports the intuition that prolonged delays before takeoff may negatively affect passenger contentment. Similarly to departure delays, flights with higher arrival delays tend to exhibit a marginally increased prevalence of neutral or dissatisfied customers. This underscores the potential impact of delays—both at departure and arrival—on passenger satisfaction, although more investigation was required to uncover the exact nature of this relationship. A scatterplot uncovered potential multicollinearity concerns to be addressed later.
Histograms for categorical variables uncovered a distinct trend in terms of customer loyalty. Loyal customers, those who have a history of repeat business with the airline, tend to report higher levels of satisfaction compared to disloyal or infrequent flyers. There is a significant satisfaction discrepancy between individuals traveling for business and personal reasons; a majority of business travelers were satisfied, while an overwhelming proportion of personal travelers expressed dissatisfaction or neutrality. The nature of this relationship, as well as actionable insights that may be drawn from it, are unclear at this point. Business class passengers stand out as notably more satisfied than those in Economy or Economy Plus. If proven to be statistically significant, this factor could spur class-specific service and amenity adjustments for efficient satisfaction gains. It might also warrant future study detailing meaningful distinctions in the flight experience between classes. The notable exception here is gender, across which there were no notable differences in satisfaction.
Continuous variable boxplots
Categorical variable histograms
Continuous variable KDE (Kernel Density Estimation) plots
Arrival and departure delay scatterplot
Correlation matrices
Our final EDA step was to examine multicollinearity; to accomplish this, we built two correlation matrices for continuous and ordinal (ratings) variables, respectively. As observed earlier, arrival and departure delays appear to be highly correlated; certain steps, such as removing one of the two or calculating an average delay variable, would likely be necessary before use in a predictive model. We also found that certain ratings variables have strong positive correlations with each other. If these are included in the model without adjustments, our model may suffer a loss in reliability. To avoid this issue, we elected to combine the ratings variables into two groups—based on the degree of correlation—and use the average ratings from these two groups as model inputs.
Continuous variable correlations
## Age Flight.Distance
## Min. :-0.016 Min. :-0.004
## 1st Qu.:-0.014 1st Qu.:-0.001
## Median : 0.035 Median : 0.042
## Mean : 0.264 Mean : 0.270
## 3rd Qu.: 0.312 3rd Qu.: 0.312
## Max. : 1.000 Max. : 1.000
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes
## Min. :-0.013 Min. :-0.016
## 1st Qu.:-0.003 1st Qu.:-0.007
## Median : 0.480 Median : 0.478
## Mean : 0.487 Mean : 0.485
## 3rd Qu.: 0.970 3rd Qu.: 0.970
## Max. : 1.000 Max. : 1.000
Ratings variable correlations
Aggregated ratings variable inclusions and summary statistics
Ratings Group 1: Pre-Flight & Wi-Fi | Ratings Group 2: In-Flight & Baggage |
---|---|
In-Flight Wifi Service | Food and Drink |
Departure / Arrival Time | Seat Comfort |
Ease of Online Booking | In-Flight Entertainment |
Gate Location | Onboard Service |
Online Boarding | Leg Room Service |
Baggage Handling | |
Check-In Service | |
In-Flight Service | |
Cleanliness |
## Pre_Flight_and_WiFi_Ratings In_Flight_and_Baggage_Ratings
## Min. :1.00 Min. :1.11
## 1st Qu.:2.40 1st Qu.:2.78
## Median :3.00 Median :3.44
## Mean :3.04 Mean :3.41
## 3rd Qu.:3.80 3rd Qu.:4.00
## Max. :5.00 Max. :5.00
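The aggregation into the two averaged ratings groups can be sketched as follows; the two synthetic rows are illustrative, but the column groupings match the table above.

```r
# Column groupings taken from the aggregation table in the report
pre_flight_cols <- c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
                     "Ease.of.Online.booking", "Gate.location", "Online.boarding")
in_flight_cols  <- c("Food.and.drink", "Seat.comfort", "Inflight.entertainment",
                     "On.board.service", "Leg.room.service", "Baggage.handling",
                     "Checkin.service", "Inflight.service", "Cleanliness")

# Two synthetic respondents: all 3s and all 4s across the 14 ratings
df <- as.data.frame(matrix(rep(c(3, 4), each = 14), nrow = 2, byrow = TRUE))
names(df) <- c(pre_flight_cols, in_flight_cols)

# Average each group of ratings into a single model input
df$Pre_Flight_and_WiFi_Ratings   <- rowMeans(df[pre_flight_cols])
df$In_Flight_and_Baggage_Ratings <- rowMeans(df[in_flight_cols])
```

Averaging keeps the aggregated inputs on the same 1-5 scale as the original ratings, which preserves the coefficient interpretation used later.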
Probability and standard OLS estimates
Before engaging in further analysis, we first identified that satisfaction—as a categorical/binary variable—runs into a fundamental interpretation issue under a standard linear model: the model’s predictions are not bounded between 0 and 1 in the same manner as our satisfaction variable. Under certain inputs, the linear model predicts unattainable values outside the range spanned by neutral/dissatisfied and satisfied (encoded as 0 and 1, respectively), and key assumptions of linearity and homoskedasticity are violated.
Despite this restriction, linear probability models remain in widespread use, particularly among social scientists, making this a potentially fruitful avenue for a predictive model (Allison, 2015). This largely stems from ease of interpretation and estimation; unlike logit (to be discussed later), the linear probability model directly predicts changes in probability rather than odds ratios, is easier to run, and approximates logit estimates within the 0.2-0.8 probability range in most cases (Allison, 2020). We generated a linear model and used a t-test with robust standard errors to account for the violated homoskedasticity assumption.
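The fitting approach can be sketched as follows, assuming the sandwich and lmtest packages are installed; the small synthetic data frame (two predictors only) stands in for the survey data.

```r
library(sandwich)  # heteroskedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for robust t-tests

# Synthetic stand-in data with an assumed true relationship
set.seed(1)
n  <- 500
df <- data.frame(
  In_Flight_and_Baggage_Ratings = runif(n, 1, 5),
  Arrival.Delay.in.Minutes      = rexp(n, rate = 1 / 15)
)
p <- pmin(pmax(0.05 + 0.20 * df$In_Flight_and_Baggage_Ratings -
                 0.002 * df$Arrival.Delay.in.Minutes, 0), 1)
df$satisfaction <- rbinom(n, 1, p)

# Linear probability model, then t-tests with robust (HC1) standard errors
lpm <- lm(satisfaction ~ In_Flight_and_Baggage_Ratings +
            Arrival.Delay.in.Minutes, data = df)
coeftest(lpm, vcov = vcovHC(lpm, type = "HC1"))
```

The robust covariance matrix only changes the standard errors and t-values, not the coefficient estimates themselves.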
##
## Call:
## lm(formula = satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings +
## Arrival.Delay.in.Minutes, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0817 -0.2223 0.0047 0.1975 1.4188
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.29e+00 6.59e-03 -196.47
## Gender -3.76e-04 2.13e-03 -0.18
## Customer.Type 3.57e-01 3.39e-03 105.29
## Age 1.72e-04 7.43e-05 2.31
## Type.of.Travel 4.36e-01 3.08e-03 141.47
## Class 1.24e-01 2.95e-03 42.12
## Flight.Distance 5.98e-06 1.24e-06 4.83
## Pre_Flight_and_WiFi_Ratings 9.04e-02 1.18e-03 76.51
## In_Flight_and_Baggage_Ratings 2.28e-01 1.46e-03 156.62
## Arrival.Delay.in.Minutes -4.61e-04 2.75e-05 -16.74
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Gender 0.860
## Customer.Type < 2e-16 ***
## Age 0.021 *
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Flight.Distance 1.4e-06 ***
## Pre_Flight_and_WiFi_Ratings < 2e-16 ***
## In_Flight_and_Baggage_Ratings < 2e-16 ***
## Arrival.Delay.in.Minutes < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.33 on 95694 degrees of freedom
## Multiple R-squared: 0.555, Adjusted R-squared: 0.555
## F-statistic: 1.33e+04 on 9 and 95694 DF, p-value: <2e-16
##
## t test of coefficients:
##
## Estimate Std. Error t value
## (Intercept) -1.29e+00 5.75e-03 -225.32
## Gender -3.76e-04 2.14e-03 -0.18
## Customer.Type 3.57e-01 3.89e-03 91.83
## Age 1.72e-04 7.58e-05 2.27
## Type.of.Travel 4.36e-01 3.39e-03 128.45
## Class 1.24e-01 3.35e-03 37.12
## Flight.Distance 5.98e-06 1.22e-06 4.90
## Pre_Flight_and_WiFi_Ratings 9.04e-02 1.25e-03 72.10
## In_Flight_and_Baggage_Ratings 2.28e-01 1.53e-03 149.51
## Arrival.Delay.in.Minutes -4.61e-04 3.06e-05 -15.06
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Gender 0.860
## Customer.Type < 2e-16 ***
## Age 0.023 *
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Flight.Distance 9.8e-07 ***
## Pre_Flight_and_WiFi_Ratings < 2e-16 ***
## In_Flight_and_Baggage_Ratings < 2e-16 ***
## Arrival.Delay.in.Minutes < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on our linear model, all inputs apart from gender and age have statistically significant impacts on satisfaction likelihood. As mentioned earlier, one major advantage of the linear model is that coefficients can be easily interpreted. For instance, loyal customers display a 0.357 (35.7 percentage point) increase in predicted satisfaction probability relative to others. In a similar vein, the model predicts a 43.6 percentage point higher satisfaction probability for passengers traveling for business relative to others. For the non-binary aggregated ratings, a 1-point increase corresponds to predicted satisfaction probability increases of 9.04 and 22.8 percentage points for the pre-flight and in-flight groups, respectively.
However, to confirm that the linear model is indeed a practically valuable predictor, we can’t rely solely on the dataset used for training; our source provides a second testing dataset for which we can repeat cleaning/encoding steps and apply our model. Since gender and age are not significant, we elected to remove them prior to this step (marking this as a “v2” model). Using a confusion matrix, we determined that the v2 model’s “accuracy”—the proportion of correctly predicted satisfaction values out of all respondents—is over 80% for the testing dataset. Based on this information, we can conclude that the linear model is a reasonably good predictor that isn’t overfitting the training data.
##
## Call:
## lm(formula = satisfaction ~ Customer.Type + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings +
## Arrival.Delay.in.Minutes, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0819 -0.2222 0.0049 0.1976 1.4174
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.29e+00 6.24e-03 -206.66
## Customer.Type 3.59e-01 3.28e-03 109.58
## Type.of.Travel 4.36e-01 3.07e-03 142.36
## Class 1.25e-01 2.95e-03 42.27
## Flight.Distance 5.88e-06 1.24e-06 4.75
## Pre_Flight_and_WiFi_Ratings 9.05e-02 1.18e-03 76.57
## In_Flight_and_Baggage_Ratings 2.28e-01 1.46e-03 156.68
## Arrival.Delay.in.Minutes -4.61e-04 2.75e-05 -16.77
## Pr(>|t|)
## (Intercept) <2e-16 ***
## Customer.Type <2e-16 ***
## Type.of.Travel <2e-16 ***
## Class <2e-16 ***
## Flight.Distance 2e-06 ***
## Pre_Flight_and_WiFi_Ratings <2e-16 ***
## In_Flight_and_Baggage_Ratings <2e-16 ***
## Arrival.Delay.in.Minutes <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.33 on 95696 degrees of freedom
## Multiple R-squared: 0.555, Adjusted R-squared: 0.555
## F-statistic: 1.7e+04 on 7 and 95696 DF, p-value: <2e-16
##
## 0 1
## 0 11937 1653
## 1 1557 8716
- Accuracy: 0.865
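The reported accuracy follows directly from the confusion matrix counts above: correct predictions (the diagonal) divided by all predictions.

```r
# Confusion matrix counts reported above (rows: predicted, cols: actual)
conf <- matrix(c(11937, 1557, 1653, 8716), nrow = 2,
               dimnames = list(predicted = c(0, 1), actual = c(0, 1)))

# Accuracy = correctly classified / total
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.865
```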
However, it is not yet clear that a linear model would be the best predictor available. Logistic regression, which predicts the log odds of satisfaction, is the dominant approach for modeling binary variables (Allison, 2015). Logistic regression models rely on different assumptions relative to linear models, significantly altering the necessary EDA steps: rather than assuming a linear relationship between the parameters and the dependent variable, logistic regression assumes a linear relationship between the parameters and the log odds. Independence of errors and a lack of multicollinearity remain assumptions for both linear and logistic models, while homoskedasticity and normally distributed residuals are not required under logistic regression (“Assumptions of Logistic Regression”, n.d.).
Odds represent the number of favorable outcomes divided by the number of unfavorable outcomes. Put differently, if p represents the probability of a favorable outcome, Odds = p/(1-p). Log odds take the natural log of the odds, expressed as ln(p/(1-p)) (Agarwal, 2019). We used visual tests to examine whether or not this assumption holds true for continuous variables. While it is not sensible to compute log odds for individual data points, we grouped continuous variables into discrete buckets—calculating the log odds of satisfaction within each bucket—to examine whether they might satisfy this assumption.
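The bucketing approach can be sketched as follows; the data are synthetic, with an assumed logistic relationship between flight distance and satisfaction standing in for the survey.

```r
# Synthetic stand-in: 2000 flights over the observed distance range
set.seed(2)
df <- data.frame(Flight.Distance = runif(2000, 31, 4983))
p  <- plogis(-1 + 0.0005 * df$Flight.Distance)  # assumed true relationship
df$satisfaction <- rbinom(2000, 1, p)

# Group the continuous variable into 10 equal-width buckets, then
# compute the empirical log odds of satisfaction within each bucket
df$bucket <- cut(df$Flight.Distance, breaks = 10)
by_bucket <- aggregate(satisfaction ~ bucket, data = df, FUN = mean)
by_bucket$log_odds <- log(by_bucket$satisfaction / (1 - by_bucket$satisfaction))

# plot(by_bucket$log_odds) would then show whether the trend is roughly linear
```

Plotting the per-bucket log odds against the bucket midpoints is the visual test described above: an approximately straight line supports the linearity-in-log-odds assumption.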
Only flight distance, as well as the in-flight and baggage ratings, displayed roughly linear relationships with the log odds of satisfaction in our testing. Age appeared to have a parabolic relationship, peaking in the middle, indicating that an aggressive transformation (such as adding a quadratic age term) may be necessary to reach a linear relationship. Meanwhile, the log odds for both delay statistics quickly dispersed in both directions as delay durations increased (likely in part due to the limited frequency of higher durations), making it difficult to conclude with certainty that a linear relationship exists. Pre-flight and wi-fi ratings appear to have a significantly looser connection relative to the in-flight ratings, with a potential dip in log odds at average rating levels.
Building and testing a logit model
Testing linearity with log odds
Following visual testing, we generated a logit model in order to examine potential differences relative to the prior linear model. Rather than starting with a pared-down variable list, we returned to an expanded variable list to see if there were any distinctions in what the models deemed statistically significant. This proved to be informative; alongside gender and age, flight distance also failed to reach the threshold for statistical significance.
logit_model = glm(satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel + Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings, data = data, family = "binomial")
summary(logit_model)
##
## Call:
## glm(formula = satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.47e+01 9.86e-02 -148.85
## Gender 5.79e-03 2.06e-02 0.28
## Customer.Type 2.48e+00 3.19e-02 77.57
## Age 7.10e-04 7.42e-04 0.96
## Type.of.Travel 3.33e+00 3.24e-02 102.75
## Class 8.32e-01 2.56e-02 32.53
## Flight.Distance 1.45e-05 1.18e-05 1.23
## Pre_Flight_and_WiFi_Ratings 8.30e-01 1.23e-02 67.58
## In_Flight_and_Baggage_Ratings 1.96e+00 1.67e-02 116.80
## Pr(>|z|)
## (Intercept) <2e-16 ***
## Gender 0.78
## Customer.Type <2e-16 ***
## Age 0.34
## Type.of.Travel <2e-16 ***
## Class <2e-16 ***
## Flight.Distance 0.22
## Pre_Flight_and_WiFi_Ratings <2e-16 ***
## In_Flight_and_Baggage_Ratings <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 61179 on 95695 degrees of freedom
## AIC: 61197
##
## Number of Fisher Scoring iterations: 6
To compare this with the linear model, we generated another confusion matrix from the testing data. As with the linear model, we created a “v2” model that removes the statistically insignificant inputs. The accuracy results were better than those of the linear model, but only slightly; it is unclear whether this marginal improvement would hold given further testing with different survey data. The calculated McFadden pseudo-R^2 falls above 0.5.
##
## Call:
## glm(formula = satisfaction ~ Customer.Type + Type.of.Travel +
## Class + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -14.6554 0.0965 -151.9
## Customer.Type 2.4940 0.0298 83.7
## Type.of.Travel 3.3316 0.0321 103.8
## Class 0.8450 0.0236 35.9
## Pre_Flight_and_WiFi_Ratings 0.8302 0.0123 67.6
## In_Flight_and_Baggage_Ratings 1.9558 0.0167 116.9
## Pr(>|z|)
## (Intercept) <2e-16 ***
## Customer.Type <2e-16 ***
## Type.of.Travel <2e-16 ***
## Class <2e-16 ***
## Pre_Flight_and_WiFi_Ratings <2e-16 ***
## In_Flight_and_Baggage_Ratings <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 61181 on 95698 degrees of freedom
## AIC: 61193
##
## Number of Fisher Scoring iterations: 6
##
## 0 1
## 0 12635 955
## 1 2189 8084
## [1] "Accuracy: 0.868"
## [1] "McFadden R^2: 0.531"
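For reference, the accuracy and McFadden pseudo-R^2 reported above can be computed along the following lines; this is a minimal sketch on simulated data, not the project's actual code:

```r
# Sketch: accuracy and McFadden pseudo-R^2 for a logit model
# (simulated data; the real analysis uses the survey's train/test split)
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 2 * x))
train <- data.frame(x = x[1:1500], y = y[1:1500])
test  <- data.frame(x = x[1501:n], y = y[1501:n])

fit <- glm(y ~ x, data = train, family = "binomial")

# Confusion matrix on held-out data at a 0.5 probability cutoff
pred_class <- as.integer(predict(fit, test, type = "response") > 0.5)
conf <- table(actual = test$y, predicted = pred_class)
accuracy <- sum(diag(conf)) / sum(conf)

# McFadden pseudo-R^2: 1 - residual deviance / null deviance
mcfadden <- 1 - fit$deviance / fit$null.deviance
```

Applying the same deviance ratio to the v2 model's output (1 - 61181/130562) reproduces the reported 0.531.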
Constructing the initial logit model
##
## Call:
## glm(formula = satisfaction ~ Age + Type.of.Travel + Class + Inflight.wifi.service +
## Ease.of.Online.booking + Online.boarding + Seat.comfort +
## Inflight.entertainment + On.board.service + Leg.room.service +
## Baggage.handling + Checkin.service + Inflight.service + Cleanliness +
## Arrival.Delay.in.Minutes, family = binomial(), data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.34e+01 9.93e-02 -134.91
## Age 1.20e-02 7.83e-04 15.33
## Type.of.Travel 2.35e+00 3.28e-02 71.69
## Class 1.26e+00 2.66e-02 47.32
## Inflight.wifi.service 6.25e-01 1.29e-02 48.35
## Ease.of.Online.booking -4.75e-02 1.11e-02 -4.27
## Online.boarding 1.03e+00 1.22e-02 84.35
## Seat.comfort 2.83e-02 1.26e-02 2.25
## Inflight.entertainment 3.12e-01 1.48e-02 21.16
## On.board.service 3.06e-01 1.14e-02 26.74
## Leg.room.service 3.62e-01 9.86e-03 36.74
## Baggage.handling 5.58e-02 1.27e-02 4.40
## Checkin.service 2.49e-01 9.39e-03 26.54
## Inflight.service 1.88e-02 1.34e-02 1.41
## Cleanliness 1.10e-01 1.27e-02 8.67
## Arrival.Delay.in.Minutes -3.87e-03 2.83e-04 -13.68
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## Age < 2e-16 ***
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Inflight.wifi.service < 2e-16 ***
## Ease.of.Online.booking 1.9e-05 ***
## Online.boarding < 2e-16 ***
## Seat.comfort 0.024 *
## Inflight.entertainment < 2e-16 ***
## On.board.service < 2e-16 ***
## Leg.room.service < 2e-16 ***
## Baggage.handling 1.1e-05 ***
## Checkin.service < 2e-16 ***
## Inflight.service 0.159
## Cleanliness < 2e-16 ***
## Arrival.Delay.in.Minutes < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 54781 on 95688 degrees of freedom
## AIC: 54813
##
## Number of Fisher Scoring iterations: 6
Observations
Our initial model yields some interesting observations. For one, most variables have p-values less than 0.05, indicating they significantly influence the dependent variable. Positive coefficients (e.g., ‘Age’, ‘Type of Travel’) suggest a positive relationship with the outcome, whereas negative coefficients (e.g., ‘Ease of Online booking’, ‘Arrival Delay in Minutes’) indicate a negative relationship. Variables with larger coefficients and small standard errors, like ‘Online boarding’ and ‘Type of Travel’, may have a more substantial impact on the outcome. The large difference between the null and residual deviance suggests a good model fit. Some variables, like ‘Inflight service’, do not show statistical significance, implying a weaker or no influence on the dependent variable. Finally, the model seems capable of predicting the outcome effectively, given the significance and size of most coefficients.
## log_pred_class
## 0 1
## 0 12151 1439
## 1 1445 8828
## Age Type.of.Travel
## 1.06 1.42
## Class Inflight.wifi.service
## 1.47 1.85
## Ease.of.Online.booking Online.boarding
## 1.68 1.27
## Seat.comfort Inflight.entertainment
## 1.85 2.41
## On.board.service Leg.room.service
## 1.58 1.18
## Baggage.handling Checkin.service
## 1.71 1.16
## Inflight.service Cleanliness
## 1.87 2.02
## Arrival.Delay.in.Minutes
## 1.02
## Area under the curve: 0.948
The model’s performance was evaluated using various metrics:
- Accuracy: 0.879 (The proportion of true results among the total number of cases)
- Precision: 0.859 (The proportion of true positives among all positive predictions)
- Recall: 0.86 (The proportion of true positives among all actual positives)
- F1 Score: 0.86 (The harmonic mean of precision and recall)
- Specificity: 0.894 (The proportion of true negatives among all actual negatives)
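These metrics all follow from a 2x2 confusion matrix. As a quick arithmetic check, here is a sketch using the test-set counts shown above (assuming rows are actual classes and columns predictions; small rounding differences versus the reported figures are expected):

```r
# Counts from the logit model's test-set confusion matrix above
TN <- 12151; FP <- 1439   # actual 0: predicted 0 / predicted 1
FN <- 1445;  TP <- 8828   # actual 1: predicted 0 / predicted 1

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)   # also called sensitivity
f1          <- 2 * precision * recall / (precision + recall)
specificity <- TN / (TN + FP)

round(c(accuracy = accuracy, precision = precision, recall = recall,
        f1 = f1, specificity = specificity), 3)
```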
The accuracy is quite high, at 87.9%. This means that the model correctly predicts whether a customer is satisfied or not in approximately 88 out of 100 cases. It’s a good indicator of overall performance, but it’s important to consider other metrics as well, especially if the data set is imbalanced.
Both precision and recall are also high, around 86%. Precision indicates that when the model predicts customer satisfaction, it is correct 85.9% of the time. Recall tells us that the model successfully identifies 86% of actual satisfied customers. These metrics are particularly important in scenarios where the costs of false positives and false negatives are different.
The F-Measure, which balances precision and recall, is also 0.86. This suggests a good balance between precision and recall in the model, which is crucial for a well-rounded predictive performance.
The specificity is 89.4%, indicating that the model is quite good at identifying true negatives - i.e., it correctly identifies customers who are not satisfied.
The AUC value is 0.948, which is very close to 1. This high value indicates that the model has an excellent ability to discriminate between satisfied and unsatisfied customers. It implies that the model has a high true positive rate and a low false positive rate.
Overall, the model exhibits strong predictive capabilities across various metrics, indicating that it is well-tuned for this particular task. However, it’s always important to consider the context and the potential impact of misclassifications. Also, examining other aspects like model interpretability, feature importance, and the performance on different segments of the data can provide deeper insights.
VIF results are also generally good, indicating that for most of the model’s predictors, multicollinearity is not a significant issue.
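For context, the VIF for predictor j is 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors; values above roughly 5-10 are the usual warning signs. A hand-rolled sketch on a built-in dataset (the actual check relies on the car package's vif, cited below):

```r
# Sketch: computing VIFs by hand on mtcars (illustration only)
predictors <- mtcars[, c("wt", "hp", "disp")]
vif_manual <- sapply(names(predictors), function(j) {
  others <- setdiff(names(predictors), j)
  r2 <- summary(lm(reformulate(others, response = j),
                   data = predictors))$r.squared
  1 / (1 - r2)          # VIF_j = 1 / (1 - R_j^2)
})
round(vif_manual, 2)
```

By this standard, the values reported above (all below about 2.5) give little cause for concern.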
The ROC curve displayed is highly indicative of an excellent predictive model, with an AUC (Area Under the Curve) of 0.95, showing exceptional discrimination ability between the positive and negative classes. The curve stays well above the diagonal line of no-discrimination, signaling strong performance.
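For intuition on how such a curve and its AUC come about, here is a from-scratch sketch on simulated scores (the report's own curve was produced by an ROC package, not this manual sweep):

```r
# Sketch: ROC curve and AUC computed by hand (simulated scores)
set.seed(7)
labels <- rbinom(500, 1, 0.5)
scores <- labels + rnorm(500)   # informative but noisy classifier scores

# Sweep thresholds, recording TPR and FPR at each
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

# AUC via the rank (Mann-Whitney) formulation
r <- rank(scores)
n1 <- sum(labels == 1); n0 <- sum(labels == 0)
auc <- (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate")
abline(0, 1, lty = 2)   # the no-discrimination diagonal
```

The Mann-Whitney formulation also makes the probabilistic reading of AUC concrete: it is the chance that a randomly chosen positive scores above a randomly chosen negative.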
Decision Tree
Data Preparation
Data Type Conversion
- Certain columns in both training and testing datasets are converted to factors to reflect their ordinal nature.
Column Datatype Changes - Testing Data: Conversion of certain columns to factors based on their ordinal nature.
# Column datatype changes - testing data: these rating columns are ordinal,
# so they are converted to factors
ordinal_cols = c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
                 "Ease.of.Online.booking", "Gate.location", "Food.and.drink",
                 "Online.boarding", "Seat.comfort", "Inflight.entertainment",
                 "On.board.service", "Leg.room.service", "Baggage.handling",
                 "Checkin.service", "Inflight.service", "Cleanliness")
data_test[ordinal_cols] = lapply(data_test[ordinal_cols], as.factor)
Column Datatype Changes - Training Data: Similar data type conversions for training data.
# Column datatype changes - training data: the same ordinal rating columns
# are converted to factors
ordinal_cols = c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
                 "Ease.of.Online.booking", "Gate.location", "Food.and.drink",
                 "Online.boarding", "Seat.comfort", "Inflight.entertainment",
                 "On.board.service", "Leg.room.service", "Baggage.handling",
                 "Checkin.service", "Inflight.service", "Cleanliness")
data[ordinal_cols] = lapply(data[ordinal_cols], as.factor)
Decision Tree Model Building
Initial Model Building: A decision tree (tree) is constructed using various predictors such as customer demographics, service ratings, and flight details.
Variable Importance Analysis: The importance of each variable in the decision tree is evaluated to identify significant predictors.
This analysis helps in understanding which variables (predictors) are most influential in determining the target variable, in this case the ‘satisfaction’ of airline passengers.
- The class of travel and type of travel are the most influential factors in determining passenger satisfaction, indicating the importance of service level and travel purpose.
- Online and inflight services (boarding, entertainment, wifi) are also crucial, emphasizing the importance of digital experience and onboard comfort.
- Personal factors like Age have some influence but are overshadowed by service and experience-related factors.
- Several variables have no discernible impact on satisfaction in this model, suggesting that they might not be critical in the context of this specific dataset or the way the model was constructed.
This analysis provides valuable insights into what factors airlines should focus on to improve passenger satisfaction, particularly emphasizing service quality, both digital and onboard.
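For reference, here is a sketch of how a classification tree and its importance scores can be produced with rpart, using rpart's built-in kyphosis data purely for illustration (the report's table, with its ‘Overall’ column, appears to come from a caret-style varImp call on the real model):

```r
# Sketch: fitting a classification tree and reading variable importance
# with rpart (built-in kyphosis data, illustration only)
library(rpart)
tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              method = "class")

# Named vector of importance scores; predictors never used in a split
# (or as a surrogate) simply do not appear, analogous to the zero rows
# in the report's table
tree$variable.importance
```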
## Overall
## Age 168
## Arrival.Delay.in.Minutes 131
## Class 17608
## Ease.of.Online.booking 1671
## Inflight.entertainment 13115
## Inflight.wifi.service 12888
## Leg.room.service 4009
## On.board.service 2179
## Online.boarding 16997
## Type.of.Travel 17087
## Gender 0
## Customer.Type 0
## Flight.Distance 0
## Departure.Arrival.time.convenient 0
## Gate.location 0
## Food.and.drink 0
## Seat.comfort 0
## Baggage.handling 0
## Checkin.service 0
## Inflight.service 0
## Cleanliness 0
## Departure.Delay.in.Minutes 0
Refined Model: A second decision tree (tree1) is built focusing only on the significant variables identified earlier.
Decision Tree Visualization: The structure of the refined decision tree is visualized using prp.
The decision tree shows a simplified model of how different factors contribute to the outcome of passenger satisfaction, which seems to be categorized as either satisfied or neutral.
Interpretation and Implications:
- Online Boarding is a significant determinant of initial satisfaction. A better online boarding experience leads directly to a higher chance of satisfaction, bypassing other factors.
- Inflight Entertainment is the second most crucial factor; however, its impact is nuanced by the previous experience with online boarding.
- Type of Travel being personal indicates a more significant expectation or reliance on Inflight Entertainment for satisfaction.
- It’s worth noting that the tree uses a binary split for satisfied and neutral, implying that dissatisfaction is possibly grouped with neutrality in this analysis, or dissatisfaction was not an outcome in the training data.
Based on this tree, to improve overall passenger satisfaction, an airline should focus on enhancing the online boarding process and the quality of inflight entertainment, especially for those traveling for personal reasons.
The tree simplifies the prediction of satisfaction and does not account for all the nuances or interactions between different factors but provides a quick and interpretable way to understand key drivers of satisfaction.
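A minimal sketch of how such a refined tree might be fit and drawn follows; the data here is simulated, the column names merely mirror the dataset, and prp is assumed to come from the rpart.plot package:

```r
# Sketch: refined tree on the significant predictors (simulated data)
library(rpart)
set.seed(3)
n <- 3000
d <- data.frame(
  Online.boarding        = sample(1:5, n, replace = TRUE),
  Inflight.entertainment = sample(1:5, n, replace = TRUE),
  Type.of.Travel         = rbinom(n, 1, 0.6)
)
# Simulated satisfaction driven mainly by online boarding
p <- plogis(-4 + 1.2 * d$Online.boarding + 0.4 * d$Inflight.entertainment)
d$satisfaction <- factor(rbinom(n, 1, p))

tree1 <- rpart(satisfaction ~ ., data = d, method = "class")

# rpart.plot::prp(tree1)  # draws the fitted tree
printcp(tree1)            # cp table with cross-validated error
```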
Model Tuning and Evaluation
Cross-Validation Setup: A 10-fold cross-validation is defined for tuning the complexity parameter (cp).
Cross-Validation Execution: The model is trained across a range of cp values to find the optimal model.
## CART
##
## 95704 samples
## 15 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 86133, 86134, 86133, 86134, 86133, 86134, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01 0.277 0.686 0.153
## 0.02 0.304 0.622 0.185
## 0.03 0.312 0.601 0.195
## 0.04 0.312 0.601 0.195
## 0.05 0.312 0.601 0.195
## 0.06 0.334 0.544 0.223
## 0.07 0.334 0.544 0.223
## 0.08 0.334 0.544 0.223
## 0.09 0.389 0.380 0.303
## 0.10 0.389 0.380 0.303
## 0.11 0.389 0.380 0.303
## 0.12 0.389 0.380 0.303
## 0.13 0.426 0.259 0.362
## 0.14 0.426 0.259 0.362
## 0.15 0.426 0.259 0.362
## 0.16 0.426 0.259 0.362
## 0.17 0.426 0.259 0.362
## 0.18 0.426 0.259 0.362
## 0.19 0.426 0.259 0.362
## 0.20 0.426 0.259 0.362
## 0.21 0.426 0.259 0.362
## 0.22 0.426 0.259 0.362
## 0.23 0.426 0.259 0.362
## 0.24 0.426 0.259 0.362
## 0.25 0.426 0.259 0.362
## 0.26 0.494 NaN 0.489
## 0.27 0.494 NaN 0.489
## 0.28 0.494 NaN 0.489
## 0.29 0.494 NaN 0.489
## 0.30 0.494 NaN 0.489
## 0.31 0.494 NaN 0.489
## 0.32 0.494 NaN 0.489
## 0.33 0.494 NaN 0.489
## 0.34 0.494 NaN 0.489
## 0.35 0.494 NaN 0.489
## 0.36 0.494 NaN 0.489
## 0.37 0.494 NaN 0.489
## 0.38 0.494 NaN 0.489
## 0.39 0.494 NaN 0.489
## 0.40 0.494 NaN 0.489
## 0.41 0.494 NaN 0.489
## 0.42 0.494 NaN 0.489
## 0.43 0.494 NaN 0.489
## 0.44 0.494 NaN 0.489
## 0.45 0.494 NaN 0.489
## 0.46 0.494 NaN 0.489
## 0.47 0.494 NaN 0.489
## 0.48 0.494 NaN 0.489
## 0.49 0.494 NaN 0.489
## 0.50 0.494 NaN 0.489
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was cp = 0.01.
The complexity parameter is a measure of the cost of adding additional splits to the tree. A smaller cp value allows for more splits (i.e., a more complex tree), whereas a larger cp value results in fewer splits (i.e., a simpler tree). The tuning process tested cp values from 0.01 up to 0.50.
The performance of the model at each cp value is evaluated using three metrics. RMSE (Root Mean Squared Error) measures the standard deviation of the prediction errors, or residuals; lower values are better, as they indicate less deviation between predicted and actual values. Rsquared is the coefficient of determination, indicating the proportion of the variance in the dependent variable that is predictable from the independent variables; higher values (closer to 1) are better. MAE (Mean Absolute Error) measures the average magnitude of the errors in a set of predictions, without considering their direction; lower values are better. (These are regression metrics, which indicates the satisfaction outcome was treated as numeric during tuning rather than as a class label.)
According to the summary, the optimal model was chosen with a cp value of 0.01. This model has the smallest RMSE (0.277), a reasonably high Rsquared (0.686), and the lowest MAE (0.153), suggesting that it has the best predictive performance among the models tested. As cp increases, the RMSE and MAE tend to increase while Rsquared decreases, which may indicate that the model becomes too simple and starts to underfit the data. The optimal cp of 0.01 suggests that a more complex model performs better on this dataset. Beyond a cp value of 0.25, Rsquared values are not available (NaN), which likely means the pruned tree has collapsed to a single node making constant predictions, leaving Rsquared undefined.
The model was trained on a large sample of 95,704 instances and 15 predictors. The use of 10-fold cross-validation helps to ensure that the evaluation of the model’s performance is robust and not overly dependent on a particular split of the data.
In summary, the CART model performs best with a complexity parameter of 0.01, indicating that a model with more splits (thus more complexity) is better suited to this dataset. This model shows a good balance between bias and variance, with a relatively low prediction error and a decent explanation of variance, as per the given performance metrics.
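The same cp-tuning idea can be sketched with rpart's built-in cross-validation, whose cptable reports a cross-validated error (xerror) per cp value; the report itself tunes cp via caret's 10-fold resampling, so this is only an illustrative stand-in on simulated data:

```r
# Sketch: cp tuning via rpart's built-in cross-validation (xval folds)
library(rpart)
set.seed(11)
n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- factor(as.integer(d$x1 + 0.5 * d$x2 + rnorm(n, sd = 0.2) > 0.75))

# Grow a deliberately deep tree, letting rpart cross-validate each cp
fit <- rpart(y ~ ., data = d, method = "class",
             control = rpart.control(cp = 0.001, xval = 10))

# Pick the cp with the smallest cross-validated error, then prune
cptab <- as.data.frame(fit$cptable)
best_cp <- cptab$CP[which.min(cptab$xerror)]
pruned <- prune(fit, cp = best_cp)
```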
Model Performance Analysis
- ROC Curve Plotting: The Receiver Operating Characteristic (ROC) curve is plotted to evaluate the model’s true positive rate vs. false positive rate.
The Receiver Operating Characteristic (ROC) curve displayed is a graphical representation used to assess the performance of the model. The curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings; the TPR is on the y-axis and the FPR on the x-axis, capturing the trade-off between benefiting from true positives and suffering from false positives. The curve shows a relatively steep ascent toward the upper left corner and then runs close to the top left corner, which indicates a good level of discrimination between the positive and negative classes. The area under the curve (AUC) is 0.89, a value close to 1, which suggests that the model has a high ability to correctly classify positive and negative cases. The closer the AUC is to 1, the better the model is at predicting true positives while minimizing false positives. An AUC of 0.89 means that there is an 89% chance that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance, which is indicative of good model performance.
- Confusion Matrix and Accuracy: The confusion matrix is used to calculate the model’s accuracy at an optimal threshold identified from the ROC curve.
##
## FALSE TRUE
## 0 12139 1451
## 1 1720 8553
- Performance Metrics Calculation: Key metrics including Accuracy, Sensitivity (Recall), Precision, F-Measure, and Specificity are calculated.
The model’s performance was evaluated using various metrics. The results are as follows:
- Accuracy: 0.867 (The proportion of true results among the total number of cases)
- Precision: 0.833 (The proportion of true positives among all positive predictions)
- Recall: 0.855 (The proportion of true positives among all actual positives)
- F1 Score: 0.844 (The harmonic mean of precision and recall)
- Specificity: 0.893 (The proportion of true negatives among all actual negatives)
- AUC-ROC Value: The Area Under the Curve (AUC) for the ROC is computed, providing a single measure of the model’s overall performance.
# Testing-data AUC-ROC (Area Under the Curve - Receiver Operating
# Characteristic) value, extracted from the ROCR performance object
AUC = as.numeric(performance(pred, "auc")@y.values)
- AUC-ROC Value: 0.896
Conclusion
Based on accuracy metrics, we can see that multiple models are reasonably good predictors of customer satisfaction. We have been able to limit multicollinearity through correlation matrices and VIF testing, and to avoid overfitting through testing/training data splits. In addition, the model parameters span multiple variable levels (continuous, ordinal, and categorical). Together, these results address our initial research questions.
While predictive validity varies, there is also an interpretability tradeoff between models; in the linear model, for example, the parameters can be intuitively tied to changes in probability for each unit of an independent variable, whereas the decision tree model and its feature-importance results present insights in a different form. This is especially important in the context of practical outcomes; if we were to present such results to an airline executive, different models might provide different amounts of actionable information concerning inputs.
Citations
Klein, TJ (2020). Airline Passenger Satisfaction. Kaggle. https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction?select=train.csv
Lutz, A., & Lubin, G. (2012). Airlines Have An Insanely Small Profit Margin. Business Insider. https://www.businessinsider.com/airlines-have-a-small-profit-margin-2012-6
Hardee, H. (2023). Frontier reports lacklustre Q3 results as it struggles in ‘over-saturated’ core markets. FlightGlobal. https://www.flightglobal.com/strategy/frontier-reports-lacklustre-q3-results-as-it-struggles-in-over-saturated-core-markets/155561.article
vif: Variance Inflation Factors. (n.d.). R Package Documentation. https://rdrr.io/cran/car/man/vif.html
Allison, P. (2015, April 1). What’s So Special About Logit?. Statistical Horizons. https://statisticalhorizons.com/whats-so-special-about-logit/
Assumptions of Logistic Regression. (n.d.). Statistics Solutions. https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/assumptions-of-logistic-regression/
Agarwal, P. (2019, July 8). WHAT and WHY of Log Odds. Towards Data Science. https://towardsdatascience.com/https-towardsdatascience-com-what-and-why-of-log-odds-64ba988bf704