QMAT Initial Analysis (Synthetic Data)

Author

Steven Gambino

For the following report, all data have been synthetically generated to mirror an analysis I recently conducted for QCC. The Python script used to generate this fake dataset is included in the appendix. Some original text and conclusions have been preserved for context, but they do not necessarily match the synthetic data or charts shown in this report.

1 Summary

The QMAT is a placement test administered through Pearson’s Mathlab. Students have the option to practice using Mathlab’s Study Plan and take practice tests at various levels along the STEM pathway.

This analysis explored whether engaging in practice is associated with higher QMAT scores. Students who spent time practicing tended to perform statistically significantly better than those who did not, although the effect was small. Importantly, these findings reflect correlation, not causation; it cannot be concluded that practice directly causes higher scores.

Additionally, machine learning models were developed to predict QMAT performance based on practice metrics. While they captured some relationship between the features and QMAT scores, their predictive power was modest, and most of the variance remains unexplained. It is likely that students’ prior mathematical background, general academic preparation, and overall maturity play a larger role in determining QMAT outcomes than a relatively small amount of practice time.

Finally, this analysis focused exclusively on QMAT scores and does not evaluate the effectiveness of course placement resulting from these scores. Further investigation would be needed to determine whether practicing impacts placement outcomes.

2 Methodology and Results

2.1 Data Preparation

2.1.1 Data Sources:

Two data sets were utilized in this analysis:

  1. QMAT Detailed Results. This includes scores and time spent on each attempt of every “assignment” within the Mathlab QMAT page. The assignments include the welcome module, orientation, practice tests, and the QMAT itself.
  2. Study Plan Results. This includes time spent and mastery points. However, because mastery points are updated based on students’ QMAT performance, these values were excluded from the analysis to avoid circular influence.

Both datasets include email, last name, and first name for each of the students. Email addresses were used as unique identifiers to merge the datasets.
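
As a minimal sketch of this merge step (shown in Python with pandas for illustration; the actual processing is done in R, and the records below are invented), a left join on email keeps every student from the detailed results even if they never opened the Study Plan:

```python
import pandas as pd

# Invented records; the real exports also carry first and last names
detailed = pd.DataFrame({
    "email": ["a@example.edu", "b@example.edu"],
    "qmat_score": [21, 33],
})
study_plan = pd.DataFrame({
    "email": ["a@example.edu", "c@example.edu"],
    "study_plan_total_time": [12.5, 40.0],
})

# Left join on the unique identifier; students with no Study Plan
# activity get a missing value, which is later replaced with zero
merged = detailed.merge(study_plan, on="email", how="left")
```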

2.1.2 Processing and Formatting

The raw export format used by Mathlab was not well-suited for direct analysis, so several preprocessing steps were required to reorganize and reformat the data. The full details and accompanying R code for this process are provided in the appendix (process_mathlab_export.r).

All time values were converted to minutes for consistency. In the QMAT Detailed Results, time was originally recorded in “hh:mm:ss” format, while in the Study Plan data it appeared as “xxh xxm xxs.”
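
To illustrate these conversions (a Python sketch; the report's conversion is done in R, see the appendix), the two time formats can be parsed as follows:

```python
import re

def hms_to_minutes(t):
    """Convert 'hh:mm:ss' (QMAT Detailed Results format) to minutes."""
    h, m, s = (int(p) for p in t.split(":"))
    return h * 60 + m + s / 60

def xhms_to_minutes(t):
    """Convert 'xxh xxm xxs' (Study Plan format) to minutes;
    each component is optional."""
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?\s*(?:(\d+)s)?", t.strip())
    h, m, s = (int(g) if g else 0 for g in match.groups())
    return h * 60 + m + s / 60
```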

At this stage, the dataset contained one row per student, with each assignment attempt represented by two columns: Score and Time Spent. Because a small number of students made many attempts on certain assignments, this led to some assignments having up to about 40 pairs of columns, most of which were empty for most students. To improve data usability and reduce sparsity, the dataset was summarized into key metrics for each assignment.

The following summary metrics were computed for each of the assignments:

  1. Number of Attempts = total number of attempts made.
  2. Best Score = highest score across all attempts.
  3. First Score = score of the first attempt.
  4. Average Score = average score across all attempts.
  5. Score Standard Deviation = standard deviation of scores across all attempts.
  6. Total Time = total time spent across all attempts.
  7. Improvement = best score minus first score.
  8. Improvement Rate = improvement divided by (number of attempts - 1).
  9. Efficiency = best score divided by total time.

If a student made no attempts for a given assignment, all related values were set to zero. The code for these summarization steps is provided in the appendix (summarize_processed_data.r).
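
To make the nine metrics concrete, a minimal Python sketch is shown below (the actual computation is the R code in the appendix; the attempt values here are invented):

```python
import statistics

def summarize_attempts(scores, times):
    """Summary metrics for one student's attempts on one assignment.
    scores/times are parallel lists, one entry per attempt; empty lists
    mean no attempts, in which case every metric is zero, as in the report."""
    if not scores:
        return dict.fromkeys(
            ["num_attempts", "best_score", "first_score", "avg_score",
             "score_sd", "total_time", "improvement", "improvement_rate",
             "efficiency"], 0)
    n = len(scores)
    best = max(scores)
    first = scores[0]
    total_time = sum(times)
    return {
        "num_attempts": n,
        "best_score": best,
        "first_score": first,
        "avg_score": sum(scores) / n,
        "score_sd": statistics.stdev(scores) if n > 1 else 0,
        "total_time": total_time,
        "improvement": best - first,
        "improvement_rate": (best - first) / (n - 1) if n > 1 else 0,
        "efficiency": best / total_time if total_time > 0 else 0,
    }
```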

2.1.3 Derived Variables

Several additional variables were created to support the analysis:

  • qmat_sum = Total of all QMAT scores, used for predictive modeling to add granularity to the otherwise discrete scoring scale of QMAT.
  • practice_time = Combined total of time spent across the Study Plan and practice tests.
  • practice = Indicator variable if total practice time > 0 minutes.
  • practice_30 = Indicator variable if total practice time > 30 minutes.
  • practice_60 = Indicator variable if total practice time > 60 minutes.
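
A small sketch of how these indicators can be derived (Python/pandas for illustration, with only one practice-test column and made-up values; the report's version is R code in the appendix):

```python
import pandas as pd

df = pd.DataFrame({
    "ptest_col_algebra_total_time": [0.0, 12.0, 50.0],
    "study_plan_total_time": [0.0, 25.0, 30.0],
})

# practice_time combines Study Plan time and practice-test time
df["practice_time"] = (df["ptest_col_algebra_total_time"]
                       + df["study_plan_total_time"])

# Indicator variables at the three thresholds used in the analysis
for threshold, name in [(0, "practice"), (30, "practice_30"),
                        (60, "practice_60")]:
    df[name] = (df["practice_time"] > threshold).astype(int)
```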

The following exploratory variables were only used during the initial data exploration and not included in the final statistical analysis:

  • predicted_mastery = Predicted number of mastery points based on a simple linear regression model using total Study Plan time.
  • mp_pred_diff = Difference between predicted and actual mastery points.
  • mp_pred_diff_sign = Indicator for whether a student’s actual mastery points were above or below the modeled rate.

Through these preparation and transformation steps, the datasets were standardized, merged, and reduced to a clean analytical structure suitable for statistical modeling and visualization. The resulting dataset provided a consistent basis for exploring relationships between practice activity and QMAT performance, described in the following section.

2.2 Data Exploration and Visualization

An initial review of the summary metrics revealed several expected but important patterns:

  • Non-normal distributions: Most variables, including scores and time measures, show skewed distributions rather than normal ones.
  • Limited practice engagement: The majority of students either did not use the practice tools or spent only minimal time on them.
  • Lower participation at upper levels: Very few students reached or practiced within the upper-level QMAT sections (Precalculus, Trigonometry).

Use the drop-down menu below to view histograms of the summary metrics for each assignment.

Note: Recall that these charts are generated from synthetic data.





To further explore potential relationships between practice activity and QMAT performance, pairwise correlations were examined between the summary metrics of each practice test and each section of QMAT.

A selection of these correlation plots is shown below. Overall, most relationships appear weak or non-existent, suggesting that performance on practice tests does not consistently predict outcomes on the QMAT. One of the strongest observed relationships occurs between the College Algebra practice test and the College Algebra portion of the QMAT, indicating some alignment between preparation and assessed performance in that area.

Use the dropdown menu below to view the selected correlation plots for various practice tests and QMAT sections.

Note: Recall that these charts are generated from synthetic data. Additionally, the selection was reduced from 6 to 2.





In these plots, darker blue areas represent stronger correlations between variables. Notice that the darkest blue shading appears primarily in the top-left and bottom-right quadrants. If practice test metrics were strongly correlated with QMAT performance, we would expect to see similarly dark areas in the top-right and bottom-left quadrants as well.

Instead, most plots show only a faint or minimal shading, indicating weak relationships at best. A few, such as College Algebra Practice vs. College Algebra QMAT, suggest a somewhat stronger connection, but overall correlations remain modest.
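
A correlation matrix of this kind can be computed directly from the summarized metrics. The toy sketch below (Python/pandas, invented data and only three columns; the report's plots come from R) uses Spearman correlation, a reasonable choice given the skewed, non-normal variables noted earlier:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
practice_best = rng.uniform(0, 100, 80)

# Toy stand-ins for summarized metrics; column names are illustrative
toy = pd.DataFrame({
    "ptest_col_algebra_best_score": practice_best,
    "ptest_col_algebra_total_time": rng.exponential(15, 80),
    "qmat_col_algebra_best_score": practice_best * 0.5
                                   + rng.uniform(0, 50, 80),
})

# Pairwise rank correlations; darker cells in the report's plots
# correspond to larger absolute values here
corr = toy.corr(method="spearman")
```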


The weak correlations observed earlier can be further illustrated through scatter plots. For instance, plotting QMAT Sum (the total of scores across all QMAT sections) against total practice time shows no clear or consistent pattern. This suggests that increased practice time alone does not strongly predict overall QMAT performance.

Note: Recall that these charts are generated from synthetic data.

2.3 Statistical Analysis

The statistical analysis aimed to evaluate the following question: Do students who engage in practice perform better on the QMAT than those who do not?

To examine this, three thresholds of total practice time were tested:

  • More than 0 minutes
  • More than 30 minutes
  • More than 60 minutes

The analysis was conducted for two groups:

  1. All students enrolled in the QMAT Mathlab course, and
  2. A subset consisting only of students who made at least one attempt on the first section of the QMAT.

Group sizes for each category are reported below:

Note: Recall that these tables are generated from synthetic data.

All Students Enrolled

                           No     Yes
  Practice Time > 0 min    685    1343
  Practice Time > 30 min   1526   502
  Practice Time > 60 min   1635   393

Students who Attempted QMAT Beg Algebra

                           No     Yes
  Practice Time > 0 min    342    649
  Practice Time > 30 min   752    239
  Practice Time > 60 min   808    183


A Mann-Whitney U test (also known as a Wilcoxon rank-sum test) was used for this comparison. This nonparametric test was selected because QMAT scores are discrete, the variables are not normally distributed, and the goal was to determine whether the practice group generally ranked higher in performance than the non-practice group.
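
For reference, a minimal Python version of this test on toy data is shown below (the report uses R's wilcox.test and rstatix::wilcox_effsize; the rank-biserial effect size here is a different, though related, convention from rstatix's r = Z/sqrt(N)):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Toy discrete, skewed scores for the two groups (not the report's data)
no_practice = rng.integers(0, 11, size=200)
practice = rng.integers(0, 11, size=200) + rng.integers(0, 2, size=200)

# Two-sided rank-sum test; no normality assumption is needed
u_stat, p_value = mannwhitneyu(practice, no_practice,
                               alternative="two-sided")

# Rank-biserial correlation as a simple effect size
n1, n2 = len(practice), len(no_practice)
rank_biserial = 2 * u_stat / (n1 * n2) - 1
```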

The results are summarized in the table below:

Note: Recall that these tables are generated from synthetic data.

All Students Enrolled

  Comparison           W          p_value    Effect_Size   Significance
  Practice > 0 min     454978.5   6.61e-01   0.0097427     ns
  Practice > 30 min    367708.5   1.41e-01   0.0327143     ns
  Practice > 60 min    304891.5   8.53e-02   0.0382118     ns

Students who Attempted QMAT Beg Algebra

  Comparison           W          p_value    Effect_Size   Significance
  Practice > 0 min     106503.0   2.55e-01   0.0361704     ns
  Practice > 30 min    87004.5    4.19e-01   0.0256791     ns
  Practice > 60 min    69359.0    1.54e-01   0.0452760     ns


When examining all students enrolled in the QMAT Mathlab course, statistical significance was observed for the 30-minute and 60-minute practice thresholds, but not for the 0-minute threshold. This suggests that students who practiced for at least 30 minutes tended to perform better on the QMAT than those who practiced minimally or did not practice at all.

Among the subset of students who attempted the QMAT, statistically significant differences were found across all three thresholds. This indicates that, within the group of active test-takers, those who engaged in practice activities generally achieved higher QMAT scores.

Correlation ≠ Causation

These findings do not imply that practice caused higher scores. It is equally plausible that students who were already more motivated or better prepared were more likely to spend time practicing. This analysis identifies an association, not a causal relationship.

While the differences between groups are statistically significant, the effect sizes are modest. In practical terms, the observed performance gap – though measurable – may not represent a substantial difference in student outcomes.

Note: Recall that these charts are generated from synthetic data.

The following bar graphs illustrate the proportional distribution of QMAT scores for each group:

All Students Enrolled

Students with QMAT Attempts

2.4 Predictive Models

Feature and Predictor Selection

For each assignment (Welcome, Orientation, and each Practice Test), the nine summary metrics described previously, along with the total time spent in the Study Plan, were used as features to train three machine learning models.

Instead of using the discrete QMAT scores, the QMAT Sum variable (the total of scores across all QMAT sections) was used as the target. This continuous, more granular variable is better suited for predictive modeling while still effectively capturing overall performance. Using QMAT Sum also helps distinguish between students who might have otherwise tied scores, providing additional nuance for the models.

Training Procedure and Evaluation Measures

The dataset was randomly split into a training set (70% of students) and a testing set (the remaining 30%). Models were trained using ten-fold cross-validation on the training set and subsequently evaluated on the testing set using the following metrics:

  • Mean Absolute Error (MAE)
  • Root Mean Square Error (RMSE)
  • Adjusted R-Squared (Adj. R-Sq)
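
The same procedure can be sketched in Python with scikit-learn (toy features and target; the report's models are fit in R with caret, and the variable names here are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(42)
# Toy stand-ins: 5 practice-metric features and a continuous target
X = rng.normal(size=(300, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=300)

# 70/30 train/test split, mirroring the report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123)

model = RandomForestRegressor(n_estimators=50, random_state=123)

# Ten-fold cross-validation on the training set only
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="r2")

# Fit on the full training set, then evaluate once on the held-out test set
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
```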

Model fitting was performed twice: once using all available data and a second time using only students who had a recorded attempt on the QMAT Beginning Algebra section, mirroring the analysis conducted in the statistical section above.

All Students Enrolled

  model       mae         rmse        rsq
  gbm test    0.5174626   0.8166200   0.1796506
  gbm train   0.4853880   0.7259209   0.2031423
  lm test     0.5167863   0.8347267   0.1561633
  lm train    0.4672163   0.7072809   0.2368030
  rf test     0.4868968   0.7686207   0.2953535
  rf train    0.4409694   0.6556008   0.3827244

Students who Attempted QMAT Beg Algebra

  model       mae         rmse        rsq
  gbm test    0.6338104   0.9176171   0.2603417
  gbm train   0.5937731   0.8040633   0.3199446
  lm test     0.6588326   1.0022770   0.1628147
  lm train    0.5519571   0.7543600   0.3985241
  rf test     0.6158388   0.8912548   0.3176298
  rf train    0.5568578   0.7538794   0.4214947

These metrics help assess how well the models predict QMAT scores:

MAE (Mean Absolute Error): The average difference between the predicted and actual values.

RMSE (Root Mean Square Error): Similar to MAE, but penalizes larger errors more heavily, making it more sensitive to outliers.

RSQ (Adjusted R-Squared): The proportion of variance in QMAT scores explained by the model, typically ranging from 0 to 1 (higher is better).

Large differences between training and testing metrics for a given model may indicate overfitting, meaning the model may not generalize well to new data.
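
These three quantities are straightforward to compute by hand. The sketch below uses plain (unadjusted) R-squared for simplicity; the adjusted version additionally penalizes the number of predictors:

```python
import numpy as np

def evaluate(actual, predicted):
    """Return (MAE, RMSE, R-squared) for paired prediction vectors."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    mae = np.mean(np.abs(err))             # average absolute error
    rmse = np.sqrt(np.mean(err ** 2))      # penalizes large errors more
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    rsq = 1 - ss_res / ss_tot              # proportion of variance explained
    return mae, rmse, rsq

mae, rmse, rsq = evaluate([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
```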

Model Evaluation

Overall, the predictive models yielded modest results. The Random Forest models performed best, achieving test-set R^2 values of 0.295 and 0.217 respectively. However, these values drop substantially from the corresponding training-set values, suggesting some degree of overfitting. While there is a detectable relationship between the features and QMAT scores, the majority of the variance in QMAT performance is driven by factors not captured in this dataset.

The high MAE further highlights the limitations of these models. An average prediction error of 60% or more on exam sections scored on a 0-100% scale indicates that these models are not accurate enough to reliably predict individual QMAT results.

3 Conclusions

The analysis provides some evidence that students who practice for the QMAT tend to perform slightly better than those who do not. These differences are statistically significant, indicating a measurable association between practice and QMAT scores. However, the effect size is small: the available practice metrics explain only a small portion of the variance in QMAT performance and, on their own, do not serve as strong predictors of outcomes. Additionally, because this study is observational, it cannot establish that practice causes higher scores; students who practice more may already be more prepared or motivated.

It is also important to note that this study addresses only whether practice correlates with higher QMAT scores. It does not evaluate the impact on course placement, which is the primary purpose of QMAT. Whether the small performance gains from practicing lead to more effective course placement remains an open question and will require further investigation.

4 Appendix

4.1 process_mathlab_export.r

Show the code
library(tidyverse)
library(stringr)
library(lubridate)

# Set data directory
data_dir <- "/Users/stevengambino/Desktop/Fake QMAT Project II copy/Data"

# Find files from directory
file <- paste0(data_dir, "/qmat data.csv")

print(paste("Reading file:", file))

# Read in CSV as characters
df <- read_csv(file)
               # ,col_types = cols(.default = "c"))

#---------------------------------------------
# Step 1: Grab date info
#---------------------------------------------
date <- colnames(df)[2]
date <- as.character(date)
date <- unlist(strsplit(date, " "))[1]
print(paste("Extracted date:", date))

#---------------------------------------------
# Step 2: Find header row (where "Last name" starts)
#---------------------------------------------
header_row <- which(df[[1]] == "Last name")[1]
if (is.na(header_row)) stop("No 'Last name' header found in file!")
print(paste("Header row found at:", header_row))

# Keep only relevant rows (the first metadata row + actual data)
df <- df[c(1, (header_row-1):nrow(df)), , drop = FALSE]

#---------------------------------------------
# Step 3: Fill values across the first row
#---------------------------------------------
first_row <- as.vector(unlist(df[1, ]))
first_row[first_row == "NA"] <- NA  # convert string "NA" to real NA

print("Original first row (with NAs):")
print(first_row)

# Fill in missing values left-to-right
for (i in seq_along(first_row)) {
  if (is.na(first_row[i]) && i > 1) {
    first_row[i] <- first_row[i - 1]
  }
}

print("Cleaned first row (after filling NAs):")
print(first_row)

# Replace in df
df <- as.data.frame(df, stringsAsFactors = FALSE)
df[1, ] <- first_row

#---------------------------------------------
# Step 3: Fill values across the second row
#---------------------------------------------
second_row <- as.vector(unlist(df[2, ]))
second_row[second_row == "NA"] <- NA  # convert string "NA" to real NA

print("Original second row (with NAs):")
print(second_row)

# Fill in missing values left-to-right
for (i in seq_along(second_row)) {
  if (is.na(second_row[i]) && i > 2) {
    second_row[i] <- second_row[i - 1]
  }
}

print("Cleaned second row (after filling NAs):")
print(second_row)

# Replace in df
df <- as.data.frame(df, stringsAsFactors = FALSE)
df[2, ] <- second_row

#---------------------------------------------
# Step 4: Combine Assignment, Attempt, and Score/Time for proper headings. 
#---------------------------------------------
df[3, ] <- apply(df[c(1, 3, 2), ], 2, function(x) paste(na.omit(x), collapse = " "))

#---------------------------------------------
# Step 5: Remove 1st 2 rows
#---------------------------------------------
df <- df[3:nrow(df), ]

#---------------------------------------------
# Step 6: Clean up header names
#---------------------------------------------
# Check unique names
header_names <- unique(df[1, ])
# Make list of replacements to make to clean up header names
replacements <- c(
  "Step 1: Welcome to QMAT!" = 'welcome', 
  "Step 2: Practice Tests Orientation" = "orientation", 
  "PRACTICE TEST - Pre-Algebra Skills" = "ptest_pre_algebra", 
  "PRACTICE TEST - Beginning Algebra (Possible QMAT Score 11 - 21)" = "ptest_beg_algebra", 
  "PRACTICE TEST - Intermediate Algebra (Possible QMAT Score 22 - 32)" = "ptest_int_algebra", 
  "PRACTICE TEST - College Algebra (Possible QMAT Score 33 - 43)" = "ptest_col_algebra", 
  "PRACTICE TEST - PreCalculus (Possible QMAT Score 44 - 52)" = "ptest_precalc", 
  "PRACTICE TEST - Trigonometry (Possible QMAT Score greater than 52)" = "ptest_trig", 
  "Step 2: Start the QMAT Placement Test" = "qmat_intro",
  "TEST - QMAT - Beginning Algebra" = "qmat_beg_algebra", 
  "TEST - QMAT - Intermediate Algebra" = "qmat_int_algebra", 
  "TEST - QMAT - College Algebra" = "qmat_col_algebra", 
  "TEST - QMAT - PreCalculus" = "qmat_precalc", 
  "TEST - QMAT - Trigonometry" = "qmat_trig", 
  "Time Spent" = "time"
)
# Lowercase both patterns (names) and replacements (values)
replacements_lower <- setNames(tolower(replacements), tolower(names(replacements)))
# apply replacements
df[1, ] <- tolower(df[1, ])
# loop and replace using fixed() so parentheses, +, etc. are treated literally
for (pat in names(replacements)) {
  df[1, ] <- str_replace_all(df[1, ], fixed(tolower(pat)), tolower(replacements[[pat]]))
}
df[1, ] <- str_replace_all(df[1, ], c(" " = "_"))


#---------------------------------------------
# Step 7: Rename columns from what’s now the first row
#---------------------------------------------
colnames(df) <- as.character(unlist(df[1, ]))

#---------------------------------------------
# Step 8: Replace missing or blank column names
#---------------------------------------------
names(df)[names(df) == "" | is.na(names(df))] <- paste0("V", seq_len(sum(names(df) == "" | is.na(names(df)))))
# Maybe switch to this instead if there are issues:
# missing <- which(names(df) == "" | is.na(names(df)))
# names(df)[missing] <- paste0("V", seq_along(missing))

#---------------------------------------------
# Step 9: Remove the now-header row
#---------------------------------------------
df <- df[-1, ]

#---------------------------------------------
# Step 10: Filter out unnecessary “dull rows”
#---------------------------------------------
dull_rows <- c("Course:", "Course ID:", "Book:", "Instructor Name:", "Enrollment:", "Date of export:")

df <- df %>%
  filter(!(last_name %in% dull_rows))

#---------------------------------------------
# Step 11: Clean up placement scores
#---------------------------------------------
df <- df %>%
  mutate(assignment_placement_attempt = str_replace_all(assignment_placement_attempt, c("QMAT Score is " = "")), 
         assignment_placement_attempt = ifelse(is.na(assignment_placement_attempt), 0, assignment_placement_attempt))

#---------------------------------------------
# Step 12: Create function for converting times to minutes
#---------------------------------------------
time_to_minutes <- function(x) {
  sapply(x, function(t) {
    if (is.na(t) || t == "") return(0)
    parts <- strsplit(t, ":")[[1]]
    # Handle HH:MM:SS or MM:SS formats
    parts <- as.numeric(parts)
    if (length(parts) == 3) {
      return(parts[1] * 60 + parts[2] + parts[3] / 60)
    } else if (length(parts) == 2) {
      return(parts[1] + parts[2] / 60)
    } else {
      return(0)
    }
  })
}

#---------------------------------------------
# Step 13: convert all times to minutes using above function
#---------------------------------------------
df <- df %>%
  mutate(across(contains("time", ignore.case = TRUE), as.character)) %>%
  mutate(across(contains("time", ignore.case = TRUE), time_to_minutes))

#---------------------------------------------
# Step 14: rename some columns
#---------------------------------------------
df <- df %>%
  rename(
    total_time = assignment_total_time_attempt, 
    qmat_score = assignment_placement_attempt
  )

write_csv(df, '/Users/stevengambino/Desktop/Fake QMAT Project II copy/Data/processed_qmat_data.csv')

4.2 summarize_processed_data.r

Show the code
library(tidyverse)

df <- read_csv('/Users/stevengambino/Desktop/Fake QMAT Project II copy/Data/processed_qmat_data.csv')

# Grab emails, total_time, and qmat_score to join back in later
df_totals <- df %>%
  select(email, total_time, qmat_score)

# Items Lists
assignments_list <- c('welcome', 
                      'orientation', 
                      'ptest_pre_algebra', 
                      'ptest_beg_algebra', 
                      'ptest_int_algebra', 
                      'ptest_col_algebra', 
                      'ptest_precalc', 
                      'ptest_trig', 
                      'qmat_beg_algebra', 
                      'qmat_int_algebra', 
                      'qmat_col_algebra', 
                      'qmat_precalc', 
                      'qmat_trig')

#######################
# Summarize Each Item
#######################
# ---- Function to summarize one assignment ----
summarize_assignment <- function(df, assignment_prefix) {
  
  # Identify columns for this assignment
  score_cols <- df %>%
    select(matches(paste0("^", assignment_prefix, "_score_\\d+$"))) %>%
    names()
  
  time_cols <- df %>%
    select(matches(paste0("^", assignment_prefix, "_time_\\d+$"))) %>%
    names()
  
  # Skip if there are no matching columns
  if (length(score_cols) == 0 | length(time_cols) == 0) {
    message("Skipping: ", assignment_prefix, " (no matching columns found)")
    return(NULL)
  }
  
  df %>%
    rowwise() %>%
    mutate(
      num_attempts = sum(!is.na(c_across(all_of(score_cols)))),
      best_score = {
        val <- max(c_across(all_of(score_cols)), na.rm = TRUE)
        ifelse(is.infinite(val), NA_real_, val)
      },
      first_score = first(c_across(all_of(score_cols))[!is.na(c_across(all_of(score_cols)))]),
      avg_score = mean(c_across(all_of(score_cols)), na.rm = TRUE),
      score_sd = sd(c_across(all_of(score_cols)), na.rm = TRUE),
      total_time = sum(c_across(all_of(time_cols)), na.rm = TRUE),
      improvement = best_score - first_score,
      improvement_rate = improvement / ifelse(num_attempts > 1, num_attempts - 1, 1),
      efficiency = best_score / ifelse(total_time > 0, total_time, NA_real_)
    ) %>%
    ungroup() %>%
    select(email, 
           num_attempts, 
           best_score,
           first_score,
           avg_score,
           score_sd, 
           total_time,
           improvement,
           improvement_rate,
           efficiency) %>%
    rename_with(~ paste0(assignment_prefix, "_", .x), -email)
}

# ---- Apply to all assignments ----
summary_list <- map(assignments_list, ~ summarize_assignment(df, .x))

# Remove any NULLs (from skipped assignments)
summary_list <- compact(summary_list)

# Merge all summaries by email
summary_df <- reduce(summary_list, left_join, by = "email") %>%
  left_join(df_totals, by='email')

# Final summarized dataset
summary_df

# Write to csv
write_csv(summary_df, '/Users/stevengambino/Desktop/Fake QMAT Project II copy/Data/summarized_qmat_data.csv')

4.3 Mann-Whitney U Tests Code

Show the code
library(rstatix)

# Run Wilcoxon tests
tests <- tibble(
  Comparison = c("Practice > 0 min", "Practice > 30 min", "Practice > 60 min"),
  W = c(
    wilcox.test(qmat_score ~ practice, data = df)$statistic,
    wilcox.test(qmat_score ~ practice_30, data = df)$statistic,
    wilcox.test(qmat_score ~ practice_60, data = df)$statistic
  ),
  p_value = c(
    wilcox.test(qmat_score ~ practice, data = df)$p.value,
    wilcox.test(qmat_score ~ practice_30, data = df)$p.value,
    wilcox.test(qmat_score ~ practice_60, data = df)$p.value
  ),
  Effect_Size = c(
    wilcox_effsize(df, qmat_score ~ practice)$effsize,
    wilcox_effsize(df, qmat_score ~ practice_30)$effsize,
    wilcox_effsize(df, qmat_score ~ practice_60)$effsize
  )
) %>%
  mutate(Significance = case_when(
    p_value < 0.001 ~ "***",
    p_value < 0.01  ~ "**",
    p_value < 0.05  ~ "*",
    TRUE ~ "ns"
  ))

# Format p-values in scientific notation
tests$p_value <- formatC(tests$p_value, format = "e", digits = 2)

4.4 initial_analysis.r

Show the code
library(tidyverse)
library(stringr)
library(rstatix)
library(caret)
library(yardstick)


#------------
# Load Data
#------------
df <- read_csv('/Users/stevengambino/Desktop/Fake QMAT Project II copy/Data/summarized_qmat_data.csv')
study_plan_df <- read_csv('/Users/stevengambino/Desktop/Fake QMAT Project II copy/Data/processed_study_plan_data.csv')

#------------
# Prep Data
#------------
# Replace NA scores and times with 0
df[is.na(df)] <- 0

# Create a Qmat Sum for continuous QMAT Score
qmats <- c('qmat_beg_algebra', 
            'qmat_int_algebra', 
            'qmat_col_algebra', 
            'qmat_precalc', 
            'qmat_trig')

suffix = '_best_score'

sum_across <- paste0(qmats, suffix)

df <- df %>% 
  mutate(qmat_sum = rowSums(select(., all_of(sum_across)), na.rm = TRUE))

# Join Study Plan Data
df <- df %>% 
  left_join(study_plan_df, by='email')

# Compute Total Practice Time
practice_time_list <- c('ptest_pre_algebra_total_time', 
                        'ptest_beg_algebra_total_time', 
                        'ptest_int_algebra_total_time', 
                        'ptest_col_algebra_total_time', 
                        'ptest_precalc_total_time', 
                        'ptest_trig_total_time', 
                        'study_plan_total_time')

# 1 = Practice, 0 = No Practice
df <- df %>% 
  rowwise() %>%
  mutate(practice_time = sum(c_across(all_of(practice_time_list)))) %>%
  ungroup() %>%
  mutate(practice = ifelse(practice_time > 0, 1, 0)) %>%
  mutate(practice_30 = ifelse(practice_time > 30, 1, 0)) %>%
  mutate(practice_60 = ifelse(practice_time > 60, 1, 0))

write_csv(df, '/Users/stevengambino/Desktop/Fake QMAT Project II copy/Data/final_for_analysis.csv')

#-----------------------------------
# Mann-Whitney U non-parametric test
#-----------------------------------
# For practice time > 0
mnu <- wilcox.test(qmat_score ~ practice, data = df)
print(mnu)

# For practice time > 30
mnu_30 <- wilcox.test(qmat_score ~ practice_30, data = df)
print(mnu_30)

# For practice time > 60
mnu_60 <- wilcox.test(qmat_score ~ practice_60, data = df)
print(mnu_60)

# Effect Size
wilcox_effsize(df, qmat_score ~ practice_60)


# Using QMAT Sum instead of raw score
# For practice time > 0
mnu <- wilcox.test(qmat_sum ~ practice, data = df)
print(mnu)

# For practice time > 30
mnu_30 <- wilcox.test(qmat_sum ~ practice_30, data = df)
print(mnu_30)

# For practice time > 60
mnu_60 <- wilcox.test(qmat_sum ~ practice_60, data = df)
print(mnu_60)

# Effect Size
wilcox_effsize(df, qmat_sum ~ practice_60)



#---------------------------------------
# Machine Learning to Predict QMAT Sum?
#---------------------------------------
# Select columns
prefixes <- c(
  "qmat_beg_algebra", 
  "qmat_int_algebra", 
  "qmat_col_algebra", 
  "qmat_precalc", 
  "qmat_trig"
)
df_ml <- df %>%
  select(-email, -qmat_score, -practice, -practice_30, -practice_60, -mastery_points, -total_time) %>%
  select(-starts_with(prefixes))

# Split 70/30 train/test
set.seed(123)
# Sample row indices directly; tibbles reset row names, so indexing by
# rownames(train) would not recover the held-out rows
train_idx <- sample(nrow(df_ml), round(0.70 * nrow(df_ml)))
train <- df_ml[train_idx, ]
test <- df_ml[-train_idx, ]

# set cv protocol
control <- trainControl(method = "repeatedcv", # repeated cv for better accuracy metrics
                        number = 10,
                        repeats = 3,
                        savePredictions = TRUE)

form <- qmat_sum ~ .

# Train a linear regression model
fit_lm <- train(form,
                 method = "lm", data = train,
                 trControl = control)

# Train a random forest model
fit_rf <- train(form,
                 method = "rf", data = train,
                 trControl = control)

# Train a gradient boosting machine
fit_gbm <- train(form,
                  method = "gbm", data = train,
                  trControl = control)


#---------------------
# Evaluate ML Models
#---------------------
# Function to compute metrics
compute_metrics <- function(model, data, model_name) {
  # Add predictions to data frame
  data <- data %>%
    mutate(prediction = predict(model, data))
  
  # Compute metrics and add model name
  metrics(data, truth = qmat_sum, estimate = prediction) %>%
    mutate(model = model_name)
}

# Compute metrics for all models
models <- list(fit_lm, fit_rf, fit_gbm)
model_names <- c("lm train", "rf train", 'gbm train')

metrics_df <- purrr::map2_df(models, model_names, compute_metrics, data = train)

# View the results
metrics_df

# add predictions of model back into training set
train$`Linear Regression` <- predict(fit_lm, train)
train$`Random Forest` <- predict(fit_rf, train)
train$`Gradient Boost` <- predict(fit_gbm, train)

# Evaluate the performance on training set
train_metrics <- bind_rows(
  metrics(train, truth = qmat_sum, estimate = `Linear Regression`) %>% mutate(model="lm train"),
  metrics(train, truth = qmat_sum, estimate = `Random Forest`) %>% mutate(model = "rf train"),
  metrics(train, truth = qmat_sum, estimate = `Gradient Boost`) %>% mutate(model = 'gbm train')
)

# Create the new columns
test$`Linear Regression` <- predict(fit_lm, test)
test$`Random Forest` <- predict(fit_rf, test)
test$`Gradient Boost` <- predict(fit_gbm, test)


# Evaluate the performance on test set
test_metrics <- bind_rows(
  metrics(test, truth = qmat_sum, estimate = `Linear Regression`) %>% mutate(model = "lm test"),
  metrics(test, truth = qmat_sum, estimate = `Random Forest`) %>% mutate(model = "rf test"),
  metrics(test, truth = qmat_sum, estimate = `Gradient Boost`) %>% mutate(model = 'gbm test')
)

train_metrics

# Combine and format the results
result_df <- bind_rows(train_metrics, test_metrics) %>%
  dplyr::select(-.estimator) %>%
  arrange(.metric, model) %>%
  spread(.metric, .estimate)

result_df

write_csv(result_df, '/Users/stevengambino/Desktop/Fake QMAT Project II copy/Data/model_results_all.csv')

4.5 generate_fake_qmat_data.ipynb

Show the code
import pandas as pd
import numpy as np

directory = '/Users/stevengambino/Desktop/Fake Fake QMAT Project II copy copy/Data/'

# Load data
df = pd.read_csv(f"{directory}final_for_analysis.csv")

df = df.drop(columns=['email', 'practice', 'practice_30', 'practice_60'])

df_random = pd.DataFrame(index=df.index)  # initialize before filling columns

for col in df.columns:
    # Detect integer columns
    if pd.api.types.is_integer_dtype(df[col]):
        df_random[col] = np.random.choice(df[col], size=len(df), replace=True)
    else:  # float column
        df_random[col] = np.random.choice(df[col], size=len(df), replace=True)
        df_random[col] += np.random.normal(0, 0.1, size=len(df))

    # Ensure non-negativity
    df_random[col] = df_random[col].clip(lower=0)
    
# df_random = df_random.drop(columns='first_name')

# Generate a column of fake email addresses in case other R code written 
# relies on the email column implicitly
num_rows = len(df_random)
df_random['email'] = [f"pseudo{i+1}@fakecollege.edu" for i in range(num_rows)]

df_random.head()

# Add back in practice, practice_30, practice_60
# each computed based on practice_time
df_random['practice'] = np.where(df_random['practice_time'] > 0, 1, 0)
df_random['practice_30'] = np.where(df_random['practice_time'] > 30, 1, 0)
df_random['practice_60'] = np.where(df_random['practice_time'] > 60, 1, 0)


# Write to csv
df_random.to_csv(f"{directory}fake_final_data.csv", index = False)