Model Comparisons

For this project, we aimed to predict student test scores for grades 3-8. We chose to build three different models:

  1. Linear regression
  2. Random forest
  3. Boosted trees

The final fits for each of the models are listed in the table below:

df <- tibble(model = c("linear regression", "random forest", "boosted tree"),
             rmse = c(86.98, 87.25, 85.76))
print(df)
## # A tibble: 3 x 2
##   model              rmse
##   <chr>             <dbl>
## 1 linear regression  87.0
## 2 random forest      87.2
## 3 boosted tree       85.8

The simplest model and the least computationally demanding was the linear regression, which resulted in the second best RMSE. The random forest, was computationally demanding (about 40 minutes) and had worse predictive power than the linear regression. The best fitting model was the boosted tree, with an RMSE of 85.76. Though the boosted tree was the best performing model, it was the most computationally demanding. Tuning just 4 of the many hyperparameters took over 40 minutes to complete and did not result in significant reduction in RMSE compared to the default model.

It is important to note the data set used for the random forest and boosted tree models was only 10% of the total observations, while the data set for the linear model included all observations. Sampling of the entire data set was stratified by the outcome measure to reduce model variance, however, it is possible that stratified sampling still limited the performance of the random forest and boosted tree models. If we were to fit a final boosted tree or random forest model to the entire training data, we may find that our final predictions (i.e.ย on test.csv) are considerably more accurate.

Taking all things into consideration, we decided to use the boosted tree for our final model.

Generating Final Predictions

Preparing the final test data

# Import joined data file
full_train <- import(here("data","data.csv"),
               setclass = "tbl_df") %>% 
  mutate(tst_dt = lubridate::mdy_hm(tst_dt)) %>% 
  mutate(tst_dt = lubridate::round_date(tst_dt, unit = "day"))

# Import joined test file
final_test <- import(here("data","test_joined.csv"),
                     setclass = "tbl_df") %>% 
  mutate(tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  mutate(tst_dt = lubridate::round_date(tst_dt, unit = "day"))
rec <- recipe(score ~ ., full_train) %>%  
    #step_mutate(tst_dt = lubridate::mdy(tst_dt)) %>%
    update_role(contains("id"), ncessch, sch_name, new_role = "id vars") %>%
    step_novel(all_nominal()) %>%
    step_unknown(all_nominal()) %>%
    step_zv(all_predictors()) %>%
    step_normalize(all_numeric(), -all_outcomes(), -has_role("id vars")) %>%
    step_BoxCox(all_numeric(), -all_outcomes(), -has_role("id vars")) %>%
    step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>%
    step_dummy(all_nominal(), -has_role("id vars"), one_hot = TRUE) %>%
    step_zv(all_predictors())
## Train data ##

# Dake data
full_train_baked <- rec %>%
    prep() %>%
    bake(full_train)

# Transform tst_dt to numeric
full_train_baked$tst_dt <- as.numeric(full_train_baked$tst_dt)

# Transform preprocessed data into feature matrix
full_train_feat <- full_train_baked %>% 
  select(-score, -contains("id"), -ncessch, -sch_name) %>% 
  as.matrix()

# Saving the scores as a separate vector
train_outcome <- full_train_baked$score


## Test data ##

# Bake data
final_test_baked <- rec %>%
    prep() %>%
    bake(final_test)

# Transform tst_dt to numeric
final_test_baked$tst_dt <- as.numeric(final_test_baked$tst_dt)

# Transform preprocessed data into feature matrix
final_test_feat <- final_test_baked %>% 
  select(-contains("id"), -ncessch, -sch_name) %>% 
  as.matrix()

Final model fit

fit_final_xgb <- xgboost(
   data = full_train_feat,
   label = train_outcome,
   nrounds = 1000, 
   objective = "reg:squarederror",
   early_stopping_rounds = 20, 
   params = list(eta = .15,
                 max_depth = 3,
                 colsample_bytree = .1),
   verbose = 0
 ) 
# Make predictions
predictions <- predict(fit_final_xgb, final_test_feat)

# Create dataframe with predictions
pred_frame <- tibble(Id = final_test_baked$id, Predicted = predictions)
head(pred_frame)
## # A tibble: 6 x 2
##      Id Predicted
##   <int>     <dbl>
## 1     4     2546.
## 2     6     2553.
## 3     8     2602.
## 4     9     2507.
## 5    11     2511.
## 6    15     2523.
# Check for duplicates 
sum(duplicated(pred_frame))
## [1] 5
# Remove duplicates
pred_frame <- pred_frame %>% 
  filter(!duplicated(pred_frame$Id))

# Save predictions
write_csv(pred_frame, "data/final_fits.csv")