Methods

Data, Data Cleaning, and Data Visualization

We use a dataset from Kaggle: Cleaned 2018 Flights.
This dataset contains more than 9 million cleaned domestic U.S. flight observations and 13 data fields. After removing the irrelevant data fields, we settled on the following variables for our model:

Quarter (Categorical)
Airline Company (Categorical)
Origin (Categorical)
Destination (Categorical)
Number of Tickets Ordered (Numerical)

The Model: Linear Regression

Our modeling process consists of three components: constructing the preprocessor, building the model, and validating the model. The process depends heavily on the Python module Scikit-Learn.

Step 1: Construct Preprocessor
There are five predictor variables, and four of them are categorical. Since machine learning models cannot directly interpret categorical values, these variables have to be transformed into numbers through one-hot encoding. We built a preprocessor from the pre-built Scikit-Learn OneHotEncoder and ColumnTransformer. The preprocessor takes in a Pandas DataFrame and outputs a 2D array. Both the model training process and the final product use this preprocessor to prepare the data.
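As a rough illustration of the preprocessor described above, the sketch below combines OneHotEncoder and ColumnTransformer. The column names and the sample rows are hypothetical placeholders rather than the dataset's actual headers or values, and setting sparse_threshold=0 is an assumption made here so that the output is always a dense 2D array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names; the real dataset headers may differ.
CATEGORICAL_COLUMNS = ["Quarter", "AirlineCompany", "Origin", "Dest"]
NUMERICAL_COLUMNS = ["NumTicketsOrdered"]


def build_preprocessor():
    """One-hot encode the categorical columns and pass the numerical one through."""
    return ColumnTransformer(
        transformers=[
            # handle_unknown="ignore" encodes categories unseen during fitting as all zeros.
            ("categorical", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_COLUMNS),
            ("numerical", "passthrough", NUMERICAL_COLUMNS),
        ],
        # Assumption: force a dense array. On millions of rows, a sparse
        # output (the default behavior for sparse data) is likely preferable.
        sparse_threshold=0,
    )


# Made-up rows, used only to show the input and output shapes.
sample = pd.DataFrame(
    {
        "Quarter": [1, 2, 1, 3],
        "AirlineCompany": ["AA", "DL", "AA", "UA"],
        "Origin": ["JFK", "LAX", "JFK", "ORD"],
        "Dest": ["LAX", "JFK", "ORD", "JFK"],
        "NumTicketsOrdered": [1, 2, 1, 4],
    }
)

preprocessor = build_preprocessor()
encoded = preprocessor.fit_transform(sample)
print(encoded.shape)
```

Fitting the preprocessor once and reusing the fitted object at prediction time keeps the encoded column order identical between training and the final product.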

Step 2 & 3: Build Model and Validation
We used Amazon SageMaker Studio for model training and deployment. The scalability of Amazon SageMaker allows us to process the relatively large training set. After the categorical features were transformed appropriately, we trained and tested three different machine learning models: Linear Regression, Support Vector Machine Regression, and Random Forest Regression. We used the pre-built Scikit-Learn estimators to fit the training data. Then, we evaluated the training and testing accuracy of each model with the Scikit-Learn pre-built score function (the R² score, i.e. the coefficient of determination). Among the three models, the Linear Regression model had the best accuracy and the shortest runtime. Therefore, we decided to proceed with the linear regression model.
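The comparison described above could be sketched roughly as follows. This is a minimal sketch on synthetic data, not the project's actual training code: the column names, the generated target, the train/test split ratio, and the hyperparameters are all assumptions chosen for illustration, so the printed scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVR

# Synthetic stand-in for the flight data (hypothetical columns and values).
rng = np.random.default_rng(0)
n_rows = 300
features = pd.DataFrame(
    {
        "quarter": rng.integers(1, 5, n_rows),
        "airline": rng.choice(["AA", "DL", "UA"], n_rows),
        "tickets": rng.integers(1, 10, n_rows),
    }
)
# Made-up target: a linear signal plus Gaussian noise.
target = 50.0 * features["quarter"] + 10.0 * features["tickets"] + rng.normal(0, 5, n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=0
)


def make_model(regressor):
    """Chain one-hot encoding of the categorical columns with a regressor."""
    preprocessor = ColumnTransformer(
        [("categorical", OneHotEncoder(handle_unknown="ignore"), ["quarter", "airline"])],
        remainder="passthrough",  # keep the numerical column unchanged
    )
    return make_pipeline(preprocessor, regressor)


candidates = {
    "linear_regression": LinearRegression(),
    "svr": SVR(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# For regressors, score() returns R^2 (the coefficient of determination).
scores = {}
for name, regressor in candidates.items():
    model = make_model(regressor).fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Comparing the training and testing scores side by side helps spot overfitting: a model whose training score is much higher than its testing score is likely memorizing the training set.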

Model Deployment

Amazon SageMaker allows us to deploy the model to an endpoint. The endpoint stores a trained model, which can be accessed and used through AWS. This model is the main component of our project.