
Multiclass Logistic Regression: One-vs-One, One-vs-All, and Softmax Explained

A concise guide to multiclass logistic regression: this article compares the one-vs-all, one-vs-one, and softmax approaches to extending logistic regression beyond two classes, with practical worked examples.

Introduction

In many machine learning tasks, we want to predict a categorical outcome.
Binary Logistic Regression is one of the simplest and most fundamental models for such tasks, but it is designed specifically for two-class (binary) problems.

Examples:

  • Email spam detection
  • Disease diagnosis (sick vs. healthy)
  • Sentiment analysis (positive vs. negative)

However, real-world problems often involve more than two categories. For example:

  • Classifying emails into spam, personal, or work
  • Predicting the species of a flower from multiple types

In such scenarios, we extend logistic regression to multiclass classification.

[Figure: Binary vs. Multiclass Classification]

Extending Logistic Regression to Multiclass Classification

We extend logistic regression to handle multiple classes using two key strategies built from binary classifiers:

One-vs-One (OvO)

[Figure: One-vs-One Classification]

Strategy: Train $\frac{K(K-1)}{2}$ binary classifiers for $K$ classes, one for each pair.

Mechanics:

  • Each classifier uses the sigmoid to produce a probability:
    $h_{\theta^{(i,j)}}(x) = \frac{1}{1 + e^{-\theta^{(i,j)\top} x}}$
  • Prediction threshold: if $h_{\theta^{(i,j)}}(x) > 0.5$, class $i$ gets the vote; otherwise class $j$ does

**Example Calculation (3 classes):**

| Classifier | Probability | Vote |
| --- | --- | --- |
| Class 1 vs 2 | $h_{\theta^{(1,2)}}(x) = 0.7$ | Class 1 |
| Class 1 vs 3 | $h_{\theta^{(1,3)}}(x) = 0.6$ | Class 1 |
| Class 2 vs 3 | $h_{\theta^{(2,3)}}(x) = 0.6$ | Class 2 |

Final Prediction: Class 1 (2 votes), Class 2 (1 vote), Class 3 (0 votes) → Class 1 wins!
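
The vote counting is simple enough to sketch directly. Below is a minimal Python sketch that tallies the votes from the worked example above; the pairwise probabilities come from the table, and the variable names are illustrative.

```python
# One-vs-One voting with the example probabilities above.
# Each key is a (class_i, class_j) pair; the value is the probability that class_i wins.
pairwise_probs = {
    (1, 2): 0.7,
    (1, 3): 0.6,
    (2, 3): 0.6,
}

votes = {1: 0, 2: 0, 3: 0}
for (i, j), p in pairwise_probs.items():
    winner = i if p > 0.5 else j   # threshold at 0.5
    votes[winner] += 1

prediction = max(votes, key=votes.get)
print(votes)       # {1: 2, 2: 1, 3: 0}
print(prediction)  # 1
```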

One-vs-All (OvA)

[Figure: One-vs-All Classification]

Strategy: Train $K$ classifiers, each distinguishing one class from all the others.

Mechanics:

  • Each classifier uses the sigmoid to estimate its class probability:
    $h_{\theta^{(i)}}(x) = \frac{1}{1 + e^{-\theta^{(i)\top} x}}$
  • Final prediction: the class with the highest probability score

**Example Calculation (3 classes):**

| Classifier | Probability |
| --- | --- |
| Class 1 vs Rest | $h_{\theta^{(1)}}(x) = 0.85$ |
| Class 2 vs Rest | $h_{\theta^{(2)}}(x) = 0.55$ |
| Class 3 vs Rest | $h_{\theta^{(3)}}(x) = 0.25$ |

Final Prediction: Class 1 (max probability 0.85)
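
In code, the OvA decision is just an argmax over the per-class scores. A minimal sketch using the example probabilities above (variable names are illustrative):

```python
# One-vs-All prediction with the example probabilities above.
ova_probs = {
    1: 0.85,  # Class 1 vs Rest
    2: 0.55,  # Class 2 vs Rest
    3: 0.25,  # Class 3 vs Rest
}

# Pick the class whose classifier reports the highest probability.
prediction = max(ova_probs, key=ova_probs.get)
print(prediction)  # 1
```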

Sigmoid Function: The Probability Engine

[Figure: Sigmoid Curve for Logistic Regression]

Both strategies rely on the sigmoid function to convert linear combinations into probabilities:

$h_\theta(z) = \frac{1}{1 + e^{-z}} \quad \text{where } z = \theta^\top x$

Key Properties:

  • Squashes values between 0 and 1
  • $z = 0$ gives $h = 0.5$ (the decision boundary)
  • Threshold at 0.5 for binary decisions
  • Smooth gradient for optimization
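
A minimal NumPy sketch of the sigmoid, just to make the properties above concrete:

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 -> the decision boundary
print(sigmoid(3.0))   # ~0.95
print(sigmoid(-3.0))  # ~0.05
```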

Strategy Comparison

| Feature | One-vs-One | One-vs-All |
| --- | --- | --- |
| # Classifiers | $K(K-1)/2$ (quadratic growth) | $K$ (linear growth) |
| Training Data Per Model | Balanced pairs | Imbalanced (1 vs. many) |
| Best For | Small $K$, balanced classes | Large $K$, imbalanced data |
| Computation | More parallelizable | Less resource-intensive |

Multinomial Logistic Regression (Softmax)

While OvO and OvA work by combining binary classifiers, they have a fundamental limitation: their probability scores don't form a proper probability distribution across all classes.

The Probability Sum Problem

Consider our OvA example with 3 classes:

| Classifier | Probability |
| --- | --- |
| Class 1 vs Rest | 0.85 |
| Class 2 vs Rest | 0.55 |
| Class 3 vs Rest | 0.25 |

Total "probability": 0.85 + 0.55 + 0.25 = 1.65
(Not a valid probability distribution - should sum to 1)

This happens because each classifier operates independently without considering other classes' scores.

Softmax Solution

Multinomial Logistic Regression solves this by using the softmax function, which guarantees that the class probabilities:

  • lie between 0 and 1
  • sum to 1 across all classes

Softmax Formula:

$P(y = i \mid x) = \frac{e^{\theta_i^\top x}}{\sum_{j=1}^{K} e^{\theta_j^\top x}}$

**Example Calculation (3 classes):**

Given raw scores (logits):

  • Class 1: $z_1 = \theta_1^\top x = 2.0$
  • Class 2: $z_2 = \theta_2^\top x = 1.0$
  • Class 3: $z_3 = \theta_3^\top x = -1.0$

  1. Compute the exponents:
    • $e^{2.0} \approx 7.389$
    • $e^{1.0} \approx 2.718$
    • $e^{-1.0} \approx 0.368$
  2. Sum the exponents:
    $7.389 + 2.718 + 0.368 \approx 10.475$
  3. Normalize:
    • $P(y=1) = 7.389 / 10.475 \approx 0.705$
    • $P(y=2) = 2.718 / 10.475 \approx 0.259$
    • $P(y=3) = 0.368 / 10.475 \approx 0.035$

Total Probability: 0.705 + 0.259 + 0.035 = 1.0
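
The same calculation in NumPy, reproducing the numbers above (a minimal sketch; variable names are illustrative):

```python
import numpy as np

# Raw scores (logits) from the worked example above.
z = np.array([2.0, 1.0, -1.0])

exp_z = np.exp(z)              # [7.389, 2.718, 0.368]
probs = exp_z / exp_z.sum()    # normalize so the probabilities sum to 1

print(np.round(probs, 3))      # [0.705 0.259 0.035]
print(probs.sum())             # 1.0 (up to floating-point rounding)
```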

Softmax vs Sigmoid Strategies

| Feature | OvO/OvA + Sigmoid | Multinomial + Softmax |
| --- | --- | --- |
| Probability Guarantee | No sum-to-1 constraint | Valid probability distribution |
| Class Interdependence | Treats classes independently | Models class relationships |
| Computation | Multiple binary models | Single unified model |
| Decision Boundary | Piecewise linear | Smooth global boundary |
| Scalability | Better for small $K$ | More efficient for large $K$ |

**Key Insight:** Softmax creates competition between classes: increasing one class's probability automatically decreases the others'. This mimics how real-world categories often relate to each other.

Shared Machinery: Cost, Learning & Evaluation

While multiclass strategies differ in their modeling approaches, they share common components in optimization and evaluation:

Cost Functions

1. OvO & OvA (Binary Cross-Entropy)
Each binary classifier uses:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$

2. Softmax (Categorical Cross-Entropy)

Single unified cost function:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{y}_k^{(i)}\right)$

Where:

  • $y_k^{(i)}$ is 1 if example $i$ belongs to class $k$, and 0 otherwise (one-hot encoding), so it is 1 only for the correct class.
  • $\hat{y}_k^{(i)} = P(y = k \mid x^{(i)}; \theta)$ is the predicted probability that example $i$ belongs to class $k$ under the current model parameters, computed via softmax.

Softmax ensures the outputs are valid class probabilities that sum to 1:

$\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$
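
Both cost functions are straightforward to write with NumPy. The sketch below assumes one-hot label matrices and probability arrays as inputs; the function names are my own, and the toy check reuses the probabilities from the worked softmax example.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Binary cross-entropy used by each OvO/OvA classifier (y in {0, 1})."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(Y, P):
    """Categorical cross-entropy for softmax; Y is one-hot (m x K), P holds predicted probabilities."""
    return -np.mean(np.sum(Y * np.log(P), axis=1))

# Toy check with the softmax probabilities from the worked example,
# assuming the single example truly belongs to class 1.
Y = np.array([[1, 0, 0]])
P = np.array([[0.705, 0.259, 0.035]])
print(categorical_cross_entropy(Y, P))  # ~0.35, i.e. -log(0.705)
```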

Learning: Gradient Descent Variations

| Method | Parameters Updated | Computational Complexity |
| --- | --- | --- |
| OvO | $\frac{K(K-1)}{2}$ independent $\theta$ vectors | High (requires training $\mathcal{O}(K^2)$ models) |
| OvA | $K$ independent $\theta$ vectors | Moderate ($K$ separate models) |
| Softmax | Single $K \times d$ parameter matrix | Efficient (single model) |

Example: Softmax Gradient Derivation
For the parameters $\theta_j$ of class $j$:

$\nabla_{\theta_j} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left( p_j^{(i)} - y_j^{(i)} \right)$

Where:

  • $p_j^{(i)} = P(y^{(i)} = j \mid x^{(i)})$ (softmax probability)
  • $y_j^{(i)} = \mathbb{I}[\text{true class} = j]$ (indicator function)

Update Rule for Softmax:

$\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\Theta) \quad \Rightarrow \quad \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} x^{(i)} \left( p_j^{(i)} - y_j^{(i)} \right)$
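
As a sketch of how this update looks when the parameters are stored as a single matrix (the shapes and variable names here are assumptions, not from the article):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax with a max-shift for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def softmax_gradient_step(Theta, X, Y, alpha=0.1):
    """One gradient-descent step on the softmax cost.

    Theta: (d, K) parameter matrix, X: (m, d) features, Y: (m, K) one-hot labels.
    """
    m = X.shape[0]
    P = softmax(X @ Theta)            # predicted probabilities, shape (m, K)
    grad = X.T @ (P - Y) / m          # gradient of J with respect to Theta
    return Theta - alpha * grad
```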

Key Differences:

  1. OvO/OvA: Requires parallel gradient updates across multiple independent models
  2. Softmax: Single coherent update across all classes using matrix operations
  3. Numerical Stability: Softmax implementations typically use the log-sum-exp trick (sketched below)
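
As point 3 notes, exponentiating large logits overflows, so practical implementations work in log space. A minimal sketch of the log-sum-exp trick (the overflow example values are illustrative):

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-probabilities via the log-sum-exp trick."""
    z = np.asarray(z, dtype=float)
    m = z.max()                                   # subtract the max before exponentiating
    return (z - m) - np.log(np.exp(z - m).sum())

# Naive exp(1000.0) overflows to inf; the shifted version stays finite.
z = np.array([1000.0, 1001.0, 1002.0])
print(np.exp(log_softmax(z)))   # approximately [0.09, 0.245, 0.665]
```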

Evaluation: Cross-Validation & Metrics

Common Practices Across All Methods:

  1. Basic Train/Test Split
  • Split the dataset into two parts: a training set (e.g., 80%) and a test set (e.g., 20%).
  • Train the model on the training set.
  • Evaluate performance (accuracy, precision, recall, F1-score, etc.) on the test set.
  • Simple and fast, but results can vary depending on how the data is split.
  2. k-Fold Cross-Validation
  • Divide the dataset into k equal-sized folds (e.g., k=5 or 10).
  • For each fold:
    • Use that fold as the validation set, and the remaining k-1 folds as the training set.
    • Train and evaluate the model.
  • Average the results across all k runs for a more robust estimate of model performance.
  • OvO/OvA require refitting all pairwise/one-vs-rest models per fold.
  • Helps reduce variance due to random data splits and gives a better sense of generalization.
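
The procedure described in this list maps directly onto scikit-learn's cross_val_score. A minimal sketch, using the Iris dataset purely as an illustrative choice (it is not part of the article):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load a small multiclass dataset (3 flower species).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, validate on the 5th, repeat, average.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```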

End-to-End Process Comparison

| Step | OvO/OvA | Softmax |
| --- | --- | --- |
| 1. Training | Train multiple binary models | Train a single multiclass model |
| 2. Prediction | Aggregate votes/scores | Direct probability computation |
| 3. Cross-Validation | Validate each model separately | Validate the unified model |
| 4. Hyperparameter Tuning | Tune each model or global params | Tune a single parameter space |

Practical Tip: Use LogisticRegression(multi_class='multinomial') in scikit-learn to access the softmax implementation directly. For OvO/OvA, use the OneVsOneClassifier or OneVsRestClassifier wrappers.
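
Putting the tip above into code, a minimal sketch that fits all three variants on the Iris dataset (an illustrative dataset choice; recent scikit-learn versions may warn that the multi_class argument is deprecated):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# Softmax (multinomial) logistic regression as a single model.
softmax_clf = LogisticRegression(multi_class="multinomial", max_iter=1000).fit(X, y)

# The same base estimator wrapped into OvO and OvA ensembles.
ovo_clf = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ova_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(softmax_clf.predict(X[:3]), ovo_clf.predict(X[:3]), ova_clf.predict(X[:3]))
```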

🏁 When to Use Which?

  • Binary Classification: Sigmoid (logistic regression)
  • Small K, Fast Prototyping: OvO/OvA
  • Theoretical Correctness: Softmax
  • Large-Scale Production: Softmax
  • Class Imbalance: Softmax (handles relative probabilities better)