β€” 11 Feb 2025

T-Test: Spotting Real Differences, Not Just Coincidences

A T-test is a statistical test that compares the means of two groups to determine whether the difference between them is statistically significant.

In simple terms, it answers:

πŸ‘‰ Are these two groups really different, or is it just random chance?

Why is the T-Test Important in ML & DL?

  • Comparing two models' performance (e.g., accuracy of Model A vs. Model B).
  • Evaluating before-and-after effects (e.g., performance with and without data augmentation).
  • Checking if batch normalization improves training.

Types of T-Tests

  • Independent (Unpaired) T-Test β†’ Compares two separate groups (e.g., two different models).
  • Paired T-Test β†’ Compares before and after (e.g., same model before vs. after fine-tuning).

T-Test Formula

The T-score is calculated as follows (this is the Welch form, which does not assume the two groups have equal variances):

t = (mean1 - mean2) / sqrt( (var1/n1) + (var2/n2) )

Where:

  • mean1, mean2 = Means of the two groups
  • var1, var2 = Variances of the two groups
  • n1, n2 = Sample sizes

A higher absolute T-score β†’ stronger evidence that the groups differ.
A T-score near zero β†’ the group means are similar.

We also calculate a p-value (probability value):
βœ… p < 0.05 β†’ Significant difference!
❌ p > 0.05 β†’ No strong evidence of difference.
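To make the formula concrete, here's a minimal sketch that applies it by hand and checks the result against SciPy. The two arrays are made-up accuracy scores, and ddof=1 gives the sample (rather than population) variance:

import numpy as np
from scipy import stats

# Two illustrative samples (hypothetical accuracy scores)
group1 = np.array([0.85, 0.87, 0.86, 0.88, 0.85])
group2 = np.array([0.80, 0.82, 0.81, 0.79, 0.83])

# Apply the formula directly
mean1, mean2 = group1.mean(), group2.mean()
var1, var2 = group1.var(ddof=1), group2.var(ddof=1)
n1, n2 = len(group1), len(group2)

t_manual = (mean1 - mean2) / np.sqrt(var1 / n1 + var2 / n2)

# Same T-score from SciPy (equal_var=False selects the Welch form above)
t_scipy, _ = stats.ttest_ind(group1, group2, equal_var=False)

print(f"Manual T-score: {t_manual:.4f}")
print(f"SciPy  T-score: {t_scipy:.4f}")  # the two should match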

πŸ”Ή Practical Examples

1️⃣ Independent T-Test (Comparing Two Models' Accuracy)

Imagine we trained two models and recorded their accuracies.

βœ… NumPy & SciPy Example

import numpy as np
from scipy import stats

# Model A accuracies (5 runs)
model_a = np.array([0.85, 0.87, 0.86, 0.88, 0.85])

# Model B accuracies (5 runs)
model_b = np.array([0.80, 0.82, 0.81, 0.79, 0.83])

# Perform Independent T-Test (equal_var=False selects the Welch form
# used in the formula above; SciPy's default pools the variances instead)
t_stat, p_value = stats.ttest_ind(model_a, model_b, equal_var=False)

print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Interpretation
if p_value < 0.05:
    print("Significant difference! One model is better.")
else:
    print("No significant difference. Models perform similarly.")

βœ… If p < 0.05, the two models' accuracies differ significantly (the sign of the T-statistic tells you which model is ahead).


2️⃣ Paired T-Test (Before vs. After Fine-Tuning)

Let's check if fine-tuning a model improves accuracy.

# Model accuracy BEFORE fine-tuning
before_finetune = np.array([0.75, 0.77, 0.76, 0.78, 0.76])

# Model accuracy AFTER fine-tuning
after_finetune = np.array([0.82, 0.84, 0.83, 0.85, 0.83])

# Perform Paired T-Test
t_stat, p_value = stats.ttest_rel(before_finetune, after_finetune)

print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Interpretation: significance alone doesn't give direction, so also check
# the sign (t < 0 here means the "after" mean is higher than "before")
if p_value < 0.05 and t_stat < 0:
    print("Significant improvement after fine-tuning!")
else:
    print("No significant improvement.")

βœ… If p < 0.05 and the T-statistic is negative, fine-tuning actually improved the model!
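Since we specifically expect fine-tuning to help (not just change things), a one-sided paired test asks the sharper question. Here's a minimal sketch, assuming SciPy β‰₯ 1.6, where ttest_rel accepts an alternative argument:

# One-sided paired test: is the "before" mean LESS than the "after" mean?
t_stat, p_value = stats.ttest_rel(before_finetune, after_finetune,
                                  alternative="less")

print(f"One-sided P-Value: {p_value:.4f}")

if p_value < 0.05:
    print("Fine-tuning significantly increased accuracy.")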


πŸ”Ή T-Test in PyTorch

If your model's performance scores are stored in PyTorch tensors, convert them to NumPy arrays before running a T-test, since SciPy's test functions operate on NumPy data.

import torch

# Convert PyTorch tensors to NumPy
model_a_torch = torch.tensor([0.85, 0.87, 0.86, 0.88, 0.85])
model_b_torch = torch.tensor([0.80, 0.82, 0.81, 0.79, 0.83])

t_stat, p_value = stats.ttest_ind(model_a_torch.numpy(), model_b_torch.numpy())

print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}")

βœ… Works the same way, just using PyTorch tensors!
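One caveat: .numpy() only works on CPU tensors that are outside the autograd graph. If your scores came off the GPU or still track gradients (the scores tensor below is a made-up example), detach and move them first:

# Hypothetical scores that live on the GPU and track gradients
scores = torch.rand(5, requires_grad=True)
if torch.cuda.is_available():
    scores = scores.to("cuda")

# .numpy() would raise an error here; detach from autograd
# and copy back to the CPU first
scores_np = scores.detach().cpu().numpy()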


πŸ”Ή When NOT to Use a T-Test

🚫 When data isn't normally distributed β†’ use a non-parametric test like the Mann-Whitney U test (sketched below).
🚫 When sample sizes are very small (fewer than ~5 per group) β†’ the T-test becomes unreliable.
🚫 For comparing more than two groups β†’ use ANOVA instead (also sketched below).
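For completeness, here's a minimal sketch of both alternatives, reusing the model_a and model_b arrays from above plus a made-up third model for the ANOVA case:

# Non-parametric alternative: Mann-Whitney U test (no normality assumption)
u_stat, p_value = stats.mannwhitneyu(model_a, model_b, alternative="two-sided")
print(f"Mann-Whitney U: {u_stat:.4f}, P-Value: {p_value:.4f}")

# More than two groups: one-way ANOVA (hypothetical third model's scores)
model_c = np.array([0.83, 0.85, 0.84, 0.86, 0.84])
f_stat, p_value = stats.f_oneway(model_a, model_b, model_c)
print(f"ANOVA F-Statistic: {f_stat:.4f}, P-Value: {p_value:.4f}")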


πŸ”Ή Conclusion

  • T-Tests help compare model performance in a statistically valid way.
  • Independent T-Test β†’ Compare two different models.
  • Paired T-Test β†’ Compare before vs. after fine-tuning.
  • p < 0.05 means significant difference in model performance.
© Ahmad Mayahi. All rights reserved.