β€” 11 Feb 2025

T-Test: Spotting Real Differences, Not Just Coincidences

A T-test is a statistical test that compares the means of two groups to determine whether the difference between them is statistically significant.

In simple terms, it answers:

πŸ‘‰ Are these two groups really different, or is it just random chance?

Why is the T-Test Important in ML & DL?

  • Comparing two models' performance (e.g., accuracy of Model A vs. Model B).
  • Evaluating before-and-after effects (e.g., performance with and without data augmentation).
  • Checking if batch normalization improves training.

Types of T-Tests

  • Independent (Unpaired) T-Test β†’ Compares two separate groups (e.g., two different models).
  • Paired T-Test β†’ Compares before and after (e.g., same model before vs. after fine-tuning).

T-Test Formula

The T-score is calculated as follows (this is the Welch form, which does not assume the two groups have equal variances):

t = (mean1 - mean2) / sqrt( (var1/n1) + (var2/n2) )

Where:

  • mean1, mean2 = Means of the two groups
  • var1, var2 = Variances of the two groups
  • n1, n2 = Sample sizes

A higher absolute T-score β†’ stronger evidence that the groups differ.
A T-score near zero β†’ the group means are similar.

We also calculate a p-value (probability value):
βœ… p < 0.05 β†’ Significant difference!
❌ p > 0.05 β†’ No strong evidence of difference.
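To make the formula concrete, here's a minimal sketch that applies it by hand and checks the result against SciPy. The two arrays are made-up accuracy scores, and ddof=1 gives the sample (rather than population) variance:

import numpy as np
from scipy import stats

# Two illustrative samples (hypothetical accuracy scores)
group1 = np.array([0.85, 0.87, 0.86, 0.88, 0.85])
group2 = np.array([0.80, 0.82, 0.81, 0.79, 0.83])

# Apply the formula directly
mean1, mean2 = group1.mean(), group2.mean()
var1, var2 = group1.var(ddof=1), group2.var(ddof=1)
n1, n2 = len(group1), len(group2)

t_manual = (mean1 - mean2) / np.sqrt(var1 / n1 + var2 / n2)

# Same T-score from SciPy (equal_var=False selects the Welch form above)
t_scipy, _ = stats.ttest_ind(group1, group2, equal_var=False)

print(f"Manual T-score: {t_manual:.4f}")
print(f"SciPy  T-score: {t_scipy:.4f}")  # the two should match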

πŸ”Ή Practical Examples

1️⃣ Independent T-Test (Comparing Two Models' Accuracy)

Imagine we trained two models and recorded their accuracies.

βœ… NumPy & SciPy Example

import numpy as np
from scipy import stats

# Model A accuracies (5 runs)
model_a = np.array([0.85, 0.87, 0.86, 0.88, 0.85])

# Model B accuracies (5 runs)
model_b = np.array([0.80, 0.82, 0.81, 0.79, 0.83])

# Perform Independent T-Test (equal_var=False selects the Welch form
# used in the formula above; SciPy's default pools the variances instead)
t_stat, p_value = stats.ttest_ind(model_a, model_b, equal_var=False)

print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Interpretation
if p_value < 0.05:
    print("Significant difference! One model is better.")
else:
    print("No significant difference. Models perform similarly.")

βœ… If p < 0.05, the two models' accuracies differ significantly (the sign of the T-statistic tells you which model is ahead).


2️⃣ Paired T-Test (Before vs. After Fine-Tuning)

Let's check if fine-tuning a model improves accuracy.

# Model accuracy BEFORE fine-tuning
before_finetune = np.array([0.75, 0.77, 0.76, 0.78, 0.76])

# Model accuracy AFTER fine-tuning
after_finetune = np.array([0.82, 0.84, 0.83, 0.85, 0.83])

# Perform Paired T-Test
t_stat, p_value = stats.ttest_rel(before_finetune, after_finetune)

print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Interpretation: significance alone doesn't give direction, so also check
# the sign (t < 0 here means the "after" mean is higher than "before")
if p_value < 0.05 and t_stat < 0:
    print("Significant improvement after fine-tuning!")
else:
    print("No significant improvement.")

βœ… If p < 0.05 and the T-statistic is negative, fine-tuning actually improved the model!
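Since we specifically expect fine-tuning to help (not just change things), a one-sided paired test asks the sharper question. Here's a minimal sketch, assuming SciPy β‰₯ 1.6, where ttest_rel accepts an alternative argument:

# One-sided paired test: is the "before" mean LESS than the "after" mean?
t_stat, p_value = stats.ttest_rel(before_finetune, after_finetune,
                                  alternative="less")

print(f"One-sided P-Value: {p_value:.4f}")

if p_value < 0.05:
    print("Fine-tuning significantly increased accuracy.")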


πŸ”Ή T-Test in PyTorch

If your model's performance scores are stored in PyTorch tensors, convert them to NumPy arrays before running a T-test, since SciPy's test functions operate on NumPy data.

import torch

# Convert PyTorch tensors to NumPy
model_a_torch = torch.tensor([0.85, 0.87, 0.86, 0.88, 0.85])
model_b_torch = torch.tensor([0.80, 0.82, 0.81, 0.79, 0.83])

t_stat, p_value = stats.ttest_ind(model_a_torch.numpy(), model_b_torch.numpy())

print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}")

βœ… Works the same way, just using PyTorch tensors!
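One caveat: .numpy() only works on CPU tensors that are outside the autograd graph. If your scores came off the GPU or still track gradients (the scores tensor below is a made-up example), detach and move them first:

# Hypothetical scores that live on the GPU and track gradients
scores = torch.rand(5, requires_grad=True)
if torch.cuda.is_available():
    scores = scores.to("cuda")

# .numpy() would raise an error here; detach from autograd
# and copy back to the CPU first
scores_np = scores.detach().cpu().numpy()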


πŸ”Ή When NOT to Use a T-Test

🚫 When data isn't normally distributed β†’ use a non-parametric test like the Mann-Whitney U test (sketched below).
🚫 When sample sizes are very small (fewer than ~5 per group) β†’ the T-test becomes unreliable.
🚫 For comparing more than two groups β†’ use ANOVA instead (also sketched below).
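For completeness, here's a minimal sketch of both alternatives, reusing the model_a and model_b arrays from above plus a made-up third model for the ANOVA case:

# Non-parametric alternative: Mann-Whitney U test (no normality assumption)
u_stat, p_value = stats.mannwhitneyu(model_a, model_b, alternative="two-sided")
print(f"Mann-Whitney U: {u_stat:.4f}, P-Value: {p_value:.4f}")

# More than two groups: one-way ANOVA (hypothetical third model's scores)
model_c = np.array([0.83, 0.85, 0.84, 0.86, 0.84])
f_stat, p_value = stats.f_oneway(model_a, model_b, model_c)
print(f"ANOVA F-Statistic: {f_stat:.4f}, P-Value: {p_value:.4f}")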


πŸ”Ή Conclusion

  • T-Tests help compare model performance in a statistically valid way.
  • Independent T-Test β†’ Compare two different models.
  • Paired T-Test β†’ Compare before vs. after fine-tuning.
  • p < 0.05 means significant difference in model performance.
© Ahmad Mayahi. All rights reserved.