T-Test: Spotting Real Differences, Not Just Coincidences
A T-test is a statistical test used to compare the means of two groups to check if the difference is statistically significant.
In simple terms, it answers:
Are these two groups really different, or is it just random chance?
Why is the T-Test Important in ML & DL?
- Comparing two models' performance (e.g., accuracy of Model A vs. Model B).
- Evaluating before-and-after effects (e.g., performance with and without data augmentation).
- Checking if batch normalization improves training.
Types of T-Tests
- Independent (Unpaired) T-Test → compares two separate groups (e.g., two different models).
- Paired T-Test → compares the same group before and after a change (e.g., the same model before vs. after fine-tuning).
T-Test Formula
The T-score is calculated as:
t = (mean1 - mean2) / sqrt( (var1/n1) + (var2/n2) )
Where:
- mean1, mean2 = Means of the two groups
- var1, var2 = Variances of the two groups
- n1, n2 = Sample sizes
A higher T-score → a bigger difference between the groups.
A lower T-score → the groups are similar.
We also calculate a p-value (probability value):
- p < 0.05 → significant difference!
- p > 0.05 → no strong evidence of a difference.
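To see the formula in action, here is a minimal sketch (with made-up accuracy numbers) that computes the T-score by hand and checks it against SciPy. This version of the formula corresponds to Welch's T-test, which SciPy runs when equal_var=False.

import numpy as np
from scipy import stats

# Hypothetical accuracy scores for two groups (made-up numbers)
group1 = np.array([0.85, 0.87, 0.86, 0.88, 0.85])
group2 = np.array([0.80, 0.82, 0.81, 0.79, 0.83])

mean1, mean2 = group1.mean(), group2.mean()
var1, var2 = group1.var(ddof=1), group2.var(ddof=1)  # sample variances
n1, n2 = len(group1), len(group2)

# T-score straight from the formula above
t_manual = (mean1 - mean2) / np.sqrt(var1 / n1 + var2 / n2)

# SciPy's Welch T-test uses the same formula
t_scipy, p_value = stats.ttest_ind(group1, group2, equal_var=False)

print(f"Manual T: {t_manual:.4f}, SciPy T: {t_scipy:.4f}, P-Value: {p_value:.4f}")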
Practical Examples
1. Independent T-Test (Comparing Two Models' Accuracy)
Imagine we trained two models and recorded their accuracies.
NumPy & SciPy Example
import numpy as np
from scipy import stats
# Model A accuracies (5 runs)
model_a = np.array([0.85, 0.87, 0.86, 0.88, 0.85])
# Model B accuracies (5 runs)
model_b = np.array([0.80, 0.82, 0.81, 0.79, 0.83])
# Perform Independent T-Test
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
# Interpretation
if p_value < 0.05:
    print("Significant difference! One model is better.")
else:
    print("No significant difference. Models perform similarly.")
If p < 0.05, one model significantly outperforms the other.
2. Paired T-Test (Before vs. After Fine-Tuning)
Let's check if fine-tuning a model improves accuracy.
# Model accuracy BEFORE fine-tuning
before_finetune = np.array([0.75, 0.77, 0.76, 0.78, 0.76])
# Model accuracy AFTER fine-tuning
after_finetune = np.array([0.82, 0.84, 0.83, 0.85, 0.83])
# Perform Paired T-Test
t_stat, p_value = stats.ttest_rel(before_finetune, after_finetune)
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
# Interpretation (a negative T-statistic here means the "after" accuracies are higher)
if p_value < 0.05:
    print("Significant improvement after fine-tuning!")
else:
    print("No significant improvement.")
If p < 0.05, fine-tuning actually improved the model!
T-Test in PyTorch
If your model's performance scores are in PyTorch tensors, convert them to NumPy before running a T-Test.
import torch
from scipy import stats

# Convert PyTorch tensors to NumPy
model_a_torch = torch.tensor([0.85, 0.87, 0.86, 0.88, 0.85])
model_b_torch = torch.tensor([0.80, 0.82, 0.81, 0.79, 0.83])
t_stat, p_value = stats.ttest_ind(model_a_torch.numpy(), model_b_torch.numpy())
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}")
Works the same way, just using PyTorch tensors!
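One caveat (an assumption about your setup, not something the example above requires): if the tensors live on a GPU or carry gradients, calling .numpy() on them directly raises an error, so detach them and move them to the CPU first. A minimal sketch:

import torch
from scipy import stats

# Hypothetical scores that require gradients (and might sit on a GPU)
scores_a = torch.tensor([0.85, 0.87, 0.86, 0.88, 0.85], requires_grad=True)
scores_b = torch.tensor([0.80, 0.82, 0.81, 0.79, 0.83], requires_grad=True)

# Detach from the autograd graph and move to CPU before converting to NumPy
a_np = scores_a.detach().cpu().numpy()
b_np = scores_b.detach().cpu().numpy()

t_stat, p_value = stats.ttest_ind(a_np, b_np)
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}")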
When NOT to Use a T-Test
- When data isn't normally distributed → use a non-parametric test such as the Mann-Whitney U test (see the sketch after this list).
- When sample sizes are very small (fewer than ~5 per group) → the T-test can be unreliable.
- For comparing more than two groups → use ANOVA instead.
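For reference, here is a minimal sketch (with made-up accuracy numbers) of the two alternatives mentioned above: the Mann-Whitney U test for non-normal data, and one-way ANOVA for more than two groups.

import numpy as np
from scipy import stats

# Hypothetical accuracy scores (made-up numbers)
model_a = np.array([0.85, 0.87, 0.86, 0.88, 0.85])
model_b = np.array([0.80, 0.82, 0.81, 0.79, 0.83])
model_c = np.array([0.83, 0.84, 0.82, 0.85, 0.84])

# Mann-Whitney U test: non-parametric alternative to the independent T-test
u_stat, p_mw = stats.mannwhitneyu(model_a, model_b, alternative='two-sided')
print(f"Mann-Whitney U: {u_stat:.4f}, P-Value: {p_mw:.4f}")

# One-way ANOVA: compares the means of three (or more) groups at once
f_stat, p_anova = stats.f_oneway(model_a, model_b, model_c)
print(f"ANOVA F: {f_stat:.4f}, P-Value: {p_anova:.4f}")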
Conclusion
- T-Tests help compare model performance in a statistically valid way.
- Independent T-Test → compare two different models.
- Paired T-Test → compare the same model before vs. after fine-tuning.
- p < 0.05 → a significant difference in model performance.