9 Feb 2025

Understanding Sampling & Sampling Variability: A Practical Guide with NumPy & PyTorch

In deep learning, we rarely train on the entire population of data. Instead, we train on samples drawn from it. But no two samples are identical, and that difference is what we call sampling variability.

What is Sampling?

Sampling means selecting a subset of data from a larger dataset.

Example: If we have 1 million images, we might randomly select 10,000 for training.
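
A minimal sketch of that example, assuming the 1 million images are referenced by integer indices rather than loaded into memory:

import numpy as np

# Hypothetical setup: 1,000,000 images referenced by index (0 .. 999,999)
num_images = 1_000_000

# Pick 10,000 distinct indices at random to form the training subset
train_indices = np.random.choice(num_images, size=10_000, replace=False)

print("Number of selected images:", len(train_indices))
print("First few indices:", train_indices[:5])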

Why Do We Use Sampling in Deep Learning?

  • Training Efficiency → Training on billions of data points at once is impractical, so we work with subsets.
  • Mini-Batches in SGD → Instead of the full dataset, each optimization step uses a small random batch.
  • Data Augmentation → We can sample the data in different ways to improve generalization (see the short sketch after this list).
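
As a small illustration of sampling the data in different ways, here is a sketch (using a toy 10-element array, not part of the later examples) of the two basic modes NumPy offers, sampling without and with replacement:

import numpy as np

data = np.arange(1, 11)  # a tiny dataset: [1, 2, ..., 10]

# Without replacement: each element appears at most once (as in mini-batches)
no_replacement = np.random.choice(data, size=5, replace=False)

# With replacement: the same element may be drawn more than once
with_replacement = np.random.choice(data, size=5, replace=True)

print("Without replacement:", no_replacement)
print("With replacement:   ", with_replacement)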

What is Sampling Variability?

Even when we sample repeatedly from the same dataset, each sample comes out slightly different. This run-to-run difference is what we call sampling variability.

Why Does It Matter in Deep Learning?

  • Model Performance Fluctuations → Training on different samples might lead to slightly different results.
  • Overfitting Risk → If a sample isn't representative, the model may not generalize well.
  • Cross-Validation Differences → Training on different folds of the data can lead to variations in accuracy (sketched below).
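
To make the cross-validation point concrete, here is a minimal sketch (NumPy only, with a hypothetical dataset of 100 samples split into 5 folds) showing that each fold ends up holding a different subset of the data:

import numpy as np

# Shuffle 100 sample indices and split them into 5 folds of 20
shuffled = np.random.permutation(100)
folds = np.array_split(shuffled, 5)

# Each fold holds a different subset of the data, so per-fold accuracy will vary
for i, fold in enumerate(folds):
    print(f"Fold {i+1} (first 5 indices): {fold[:5]}")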

Practical Examples

Let's see how sampling and sampling variability work using NumPy & PyTorch.

Random Sampling (NumPy)

Let's randomly select a subset from a dataset.

import numpy as np

# Full dataset (100 data points)
full_data = np.arange(1, 101)  # [1, 2, 3, ..., 100]

# Randomly sample 10 elements
sample = np.random.choice(full_data, size=10, replace=False)

print("Random Sample:", sample)

✅ Each time you run this, you’ll get a different random sample.
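
If you need the sample to be repeatable, for example when debugging or comparing experiments, you can seed the random number generator first. A minimal sketch (the seed value 42 is arbitrary):

# Seeding the random number generator makes the "random" sample repeatable
np.random.seed(42)  # 42 is an arbitrary seed value
reproducible_sample = np.random.choice(full_data, size=10, replace=False)

print("Reproducible Sample:", reproducible_sample)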

Sampling Variability (NumPy)

If we sample multiple times, we get different samples, showing sampling variability.

# Take 3 different random samples
sample_1 = np.random.choice(full_data, size=10, replace=False)
sample_2 = np.random.choice(full_data, size=10, replace=False)
sample_3 = np.random.choice(full_data, size=10, replace=False)

print("Sample 1:", sample_1)
print("Sample 2:", sample_2)
print("Sample 3:", sample_3)

✅ Notice how the samples aren’t the same.
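
We can also see the variability as a number rather than a list of elements by tracking a statistic of each sample. The sketch below reuses full_data, draws 1,000 samples of size 10, and shows how their means spread around the population mean of 50.5:

# Draw 1,000 samples of size 10 and record each sample's mean
sample_means = np.array([
    np.random.choice(full_data, size=10, replace=False).mean()
    for _ in range(1000)
])

print("Population mean:", full_data.mean())             # 50.5
print("Average of sample means:", sample_means.mean())  # close to 50.5
print("Std of sample means:", sample_means.std())       # spread caused by sampling variability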

Mini-Batch Sampling in PyTorch

Deep learning models use mini-batch training instead of full dataset training.

import torch

# Create a dataset with 100 points
dataset = torch.arange(1, 101)

# Sample a mini-batch of size 10
batch = torch.randperm(len(dataset))[:10]  # Random indices

print("Mini-Batch Indices:", batch)
print("Mini-Batch Data:", dataset[batch])

At each training step, we pick a new mini-batch, which is itself a source of sampling variability.
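
In practice, PyTorch's DataLoader handles the shuffling and batching for us. A minimal sketch that wraps the same 100-point tensor from above in a TensorDataset:

from torch.utils.data import TensorDataset, DataLoader

# shuffle=True draws a fresh random mini-batch order every pass over the data
loader = DataLoader(TensorDataset(dataset), batch_size=10, shuffle=True)

for step, (batch_data,) in enumerate(loader):
    print(f"Step {step+1}, Mini-Batch: {batch_data.tolist()}")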

Effect of Sampling Variability on Model Training

Each time we train a model, we sample different mini-batches, which leads to slightly different models.

Let's simulate two training runs with different samples.

import torch.nn as nn
import torch.optim as optim

# Simulated dataset
X = torch.randn(100, 2)  # 100 samples, 2 features
y = torch.randint(0, 2, (100,))  # 100 binary labels (0 or 1)

# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc = nn.Linear(2, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

# Train model with different mini-batches
def train_model():
    model = SimpleNN()
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.BCELoss()

    for step in range(5):  # 5 training steps (one random mini-batch each) for the demo
        batch_idx = torch.randperm(len(X))[:10]  # Sample a new mini-batch of 10 points
        batch_X, batch_y = X[batch_idx], y[batch_idx].float()

        optimizer.zero_grad()
        preds = model(batch_X).squeeze()
        loss = loss_fn(preds, batch_y)
        loss.backward()
        optimizer.step()

        print(f"Step {step+1}, Loss: {loss.item():.4f}")

print("Training Run 1:")
train_model()

print("\nTraining Run 2:")
train_model()

✅ Notice how the loss values differ between runs: each run starts from randomly initialized weights and draws different mini-batches, so sampling variability (together with random initialization) makes every run slightly different.
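
To confirm that this randomness is the driver, you can fix the seed before each run: with the same seed, both the initial weights and the sampled mini-batches repeat, so the two runs produce identical losses. A minimal sketch (seed value 0 is arbitrary):

# Fixing the seed makes initialization and mini-batch sampling repeatable
torch.manual_seed(0)  # 0 is an arbitrary seed value
print("Seeded Run 1:")
train_model()

torch.manual_seed(0)
print("\nSeeded Run 2:")
train_model()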

Key Takeaways

Concept              | Meaning                                                | Example in Deep Learning
---------------------|--------------------------------------------------------|-----------------------------------------------------
Sampling             | Selecting a subset of data                             | Choosing 10K images from a dataset of 1M
Sampling Variability | Different samples give slightly different results      | Different mini-batches in SGD affect model training
Mini-Batch Training  | Training on random small subsets instead of full data  | Used in Stochastic Gradient Descent (SGD)

Conclusion

  • Sampling is essential in deep learning to handle large datasets.
  • Sampling variability affects model performance and training behavior.
  • Every training run is slightly different due to random mini-batches.

All rights reserved to Ahmad Mayahi