Multiclass Logistic Regression: One-vs-One, One-vs-All, and Softmax Explained

A concise guide to extending logistic regression to multiclass problems: this article compares the one-vs-all, one-vs-one, and softmax approaches with practical examples.

Introduction

In many machine learning tasks, we are interested in predicting a categorical outcome.
Binary Logistic Regression is one of the simplest and most fundamental models for such tasks; it is designed specifically for two-class (binary) problems.

Examples:

  • Email spam detection
  • Disease diagnosis (sick vs. healthy)
  • Sentiment analysis (positive vs. negative)

However, real-world problems often involve more than two categories. For example:

  • Classifying emails into spam, personal, or work
  • Predicting the species of a flower from multiple types

In such scenarios, we extend logistic regression to multiclass classification.

(Figure: Binary vs Multiclass Classification)

Extending Logistic Regression to Multiclass Classification

We extend logistic regression to handle multiple classes using these two key strategies:

🧠 One-vs-One (OvO)

(Figure: One-vs-One Classification)

Strategy: Train $\frac{K(K-1)}{2}$ binary classifiers for $K$ classes - one for each pair.

Mechanics:

  • Each classifier uses sigmoid to produce probabilities:
$$h_{\theta^{(i,j)}}(x) = \frac{1}{1 + e^{-\theta^{(i,j)\top} x}}$$
  • Prediction threshold: If $h_{\theta^{(i,j)}}(x) > 0.5$, class $i$ gets a vote; else class $j$

🔢 Example Calculation (3 classes):

| Classifiers | Probabilities | Votes |
|---|---|---|
| Class 1 vs 2 | $h_{\theta^{(1,2)}}(x) = 0.7$ | Class 1 |
| Class 1 vs 3 | $h_{\theta^{(1,3)}}(x) = 0.6$ | Class 1 |
| Class 2 vs 3 | $h_{\theta^{(2,3)}}(x) = 0.6$ | Class 2 |

Final Prediction: Class 1 (2 votes), Class 2 (1 vote), Class 3 (0 votes) → Class 1 wins!
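
To make the voting concrete, here is a minimal NumPy sketch of OvO prediction. It assumes the pairwise weight vectors have already been fit and live in a hypothetical dictionary theta keyed by class pair:

```python
import numpy as np
from itertools import combinations

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ovo_predict(x, theta, n_classes):
    """Predict a class by majority vote over all pairwise classifiers.

    theta: dict mapping a class pair (i, j) to its fitted weight vector.
    """
    votes = np.zeros(n_classes, dtype=int)
    for i, j in combinations(range(n_classes), 2):
        p = sigmoid(theta[(i, j)] @ x)      # probability that class i beats class j
        votes[i if p > 0.5 else j] += 1     # winner of the pair gets a vote
    return int(np.argmax(votes)), votes
```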

🌐 One-vs-All (OvA)

(Figure: One-vs-All Classification)

Strategy: Train $K$ classifiers - each distinguishes one class from all others.

Mechanics:

  • Each classifier uses sigmoid for class probability:
$$h_{\theta^{(i)}}(x) = \frac{1}{1 + e^{-\theta^{(i)\top} x}}$$
  • Final prediction: Class with highest probability score

🔢 Example Calculation (3 classes):

| Classifier | Probability |
|---|---|
| Class 1 vs Rest | $h_{\theta^{(1)}}(x) = 0.85$ |
| Class 2 vs Rest | $h_{\theta^{(2)}}(x) = 0.55$ |
| Class 3 vs Rest | $h_{\theta^{(3)}}(x) = 0.25$ |

Final Prediction: Class 1 (max probability 0.85)
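
The OvA decision rule is even simpler. A small sketch, again assuming pre-fitted weights, this time stacked as rows of a hypothetical matrix Theta (one row per class):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ova_predict(x, Theta):
    """Theta has shape (K, d): row k is the 'class k vs rest' weight vector."""
    scores = sigmoid(Theta @ x)          # one independent sigmoid score per class
    return int(np.argmax(scores)), scores
```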

📈 Sigmoid Function: The Probability Engine

(Figure: Sigmoid Curve for Logistic Regression)

Both strategies rely on the sigmoid function to convert linear combinations into probabilities:

$$h_\theta(z) = \frac{1}{1 + e^{-z}} \quad \text{where } z = \theta^\top x$$

Key Properties:

  • Squashes values between 0 and 1
  • $z = 0$ gives $h = 0.5$ (decision boundary)
  • Threshold at 0.5 for binary decisions
  • Smooth gradient for optimization
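
A tiny sketch of the sigmoid that checks the decision-boundary property above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                         # 0.5, exactly the decision boundary
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # all values squashed into (0, 1)
```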

๐Ÿ† Strategy Comparison

| Feature | One-vs-One | One-vs-All |
|---|---|---|
| # Classifiers | $K(K-1)/2$ (quadratic growth) | $K$ (linear growth) |
| Training data per model | Balanced pairs | Imbalanced (1 vs many) |
| Best for | Small $K$, balanced classes | Large $K$, imbalanced data |
| Computation | More parallelizable | Less resource-intensive |

🔥 Multinomial Logistic Regression (Softmax)

While OvO and OvA work by combining binary classifiers, they have a fundamental limitation: their probability scores don't form a proper probability distribution across all classes.

โŒ The Probability Sum Problem

Consider our OvA example with 3 classes:

| Classifier | Probability |
|---|---|
| Class 1 vs Rest | 0.85 |
| Class 2 vs Rest | 0.55 |
| Class 3 vs Rest | 0.25 |

Total "probability": 0.85 + 0.55 + 0.25 = 1.65
(Not a valid probability distribution - should sum to 1)

This happens because each classifier operates independently without considering other classes' scores.

🎯 Softmax Solution

Multinomial Logistic Regression solves this using the softmax function to ensure probabilities are:

  • Between 0 and 1
  • Sum to 1 across all classes

Softmax Formula:

$$P(y=i \mid x) = \frac{e^{\theta_i^\top x}}{\sum_{j=1}^{K} e^{\theta_j^\top x}}$$

🔢 Example Calculation (3 classes):

Given raw scores (logits):

  • Class 1: $z_1 = \theta_1^\top x = 2.0$
  • Class 2: $z_2 = \theta_2^\top x = 1.0$
  • Class 3: $z_3 = \theta_3^\top x = -1.0$

  1. Compute exponents:
     • $e^{2.0} \approx 7.389$
     • $e^{1.0} \approx 2.718$
     • $e^{-1.0} \approx 0.368$
  2. Sum exponents:
     $7.389 + 2.718 + 0.368 \approx 10.475$
  3. Normalize:
     • $P(y=1) = 7.389 / 10.475 \approx 0.705$
     • $P(y=2) = 2.718 / 10.475 \approx 0.259$
     • $P(y=3) = 0.368 / 10.475 \approx 0.035$

Total Probability: 0.705 + 0.259 + 0.035 = 1.0 ✅
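
The same computation in a few lines of NumPy (the logits 2.0, 1.0, and -1.0 are the example values above; the max-shift is a common stability detail, not part of the formula):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift logits for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, -1.0])
probs = softmax(logits)
print(probs)        # approximately [0.705, 0.259, 0.035]
print(probs.sum())  # 1.0
```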

🆚 Softmax vs Sigmoid Strategies

| Feature | OvO/OvA + Sigmoid | Multinomial + Softmax |
|---|---|---|
| Probability guarantee | No sum-to-1 constraint | Valid probability distribution |
| Class interdependence | Treats classes independently | Models class relationships |
| Computation | Multiple binary models | Single unified model |
| Decision boundary | Piecewise linear | Smooth global boundary |
| Scalability | Better for small $K$ | More efficient for large $K$ |

💡 Key Insight: Softmax creates competition between classes - increasing one class's probability automatically decreases the others'. This mimics how real-world categories often relate to each other.

⚙️ Shared Machinery: Cost, Learning & Evaluation

While multiclass strategies differ in their modeling approaches, they share common components in optimization and evaluation:

📉 Cost Functions

1. OvO & OvA (Binary Cross-Entropy)
Each binary classifier uses:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + (1-y^{(i)}) \log\left(1-h_\theta(x^{(i)})\right) \right]$$
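
A minimal NumPy sketch of this cost; the small epsilon clamp is an implementation guard against log(0), not part of the formula:

```python
import numpy as np

def binary_cross_entropy(y, h, eps=1e-12):
    """y: true labels in {0, 1}; h: predicted sigmoid probabilities."""
    h = np.clip(h, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```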

2. Softmax (Categorical Cross-Entropy)

Single unified cost function:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\left( \hat{y}_k^{(i)} \right)$$

Where:

  • $y_k^{(i)}$ is 1 if example $i$ belongs to class $k$, and 0 otherwise (a one-hot label).
  • $\hat{y}_k^{(i)} = P(y = k \mid x^{(i)}; \theta)$ is the predicted probability that example $i$ belongs to class $k$ under the current model parameters, computed via softmax.

Softmax ensures the outputs are valid class probabilities that sum to 1:

$$\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
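
A matching sketch of the categorical cross-entropy, assuming Y holds one-hot labels and Y_hat holds softmax outputs, both of shape m × K:

```python
import numpy as np

def categorical_cross_entropy(Y, Y_hat, eps=1e-12):
    """Y: one-hot true labels, shape (m, K); Y_hat: softmax probabilities, shape (m, K)."""
    Y_hat = np.clip(Y_hat, eps, 1.0)                  # guard against log(0)
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1))
```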

🧠 Learning: Gradient Descent Variations

| Method | Parameters Updated | Computational Complexity |
|---|---|---|
| OvO | $\frac{K(K-1)}{2}$ independent $\theta$s | High (requires training $\mathcal{O}(K^2)$ models) |
| OvA | $K$ independent $\theta$s | Moderate ($K$ separate models) |
| Softmax | Single $K \times d$ parameter matrix | Efficient (single model) |

Example: Softmax Gradient Derivation
For class $j$ parameters $\theta_j$:

$$\nabla_{\theta_j} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left( p_j^{(i)} - y_j^{(i)} \right)$$

Where:

  • $p_j^{(i)} = P(y^{(i)} = j \mid x^{(i)})$ (softmax probability)
  • $y_j^{(i)} = \mathbb{I}[\text{true class} = j]$ (indicator function)

Update Rule for Softmax:

$$\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\Theta) \quad \Rightarrow \quad \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} x^{(i)} \left( p_j^{(i)} - y_j^{(i)} \right)$$
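
A batch gradient-descent sketch of this update. It assumes a design matrix X of shape m × d, one-hot labels Y of shape m × K, and a K × d parameter matrix Theta (all hypothetical names):

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # stabilize each row before exponentiating
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def gradient_step(Theta, X, Y, alpha=0.1):
    """One batch update: Theta (K, d), X (m, d), Y one-hot (m, K)."""
    m = X.shape[0]
    P = softmax_rows(X @ Theta.T)          # predicted probabilities, shape (m, K)
    grad = (P - Y).T @ X / m               # shape (K, d); matches the formula above
    return Theta - alpha * grad
```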

Key Differences:

  1. OvO/OvA: Requires parallel gradient updates across multiple independent models
  2. Softmax: Single coherent update across all classes using matrix operations
  3. Numerical Stability: Softmax implementations typically use the log-sum-exp trick (illustrated below)
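
For point 3, a brief illustration of why the trick matters: subtracting the maximum logit before exponentiating leaves the softmax output unchanged but prevents overflow:

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax overflows: exp(1000) is inf, giving nan probabilities.
naive = np.exp(z) / np.exp(z).sum()

# Log-sum-exp shift: subtract max(z) first; the result is mathematically identical.
shifted = np.exp(z - z.max())
stable = shifted / shifted.sum()

print(naive)   # [nan nan nan] plus overflow warnings
print(stable)  # [0.090 0.245 0.665]
```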

📊 Evaluation: Cross-Validation & Metrics

Common Practices Across All Methods:

  1. Basic Train/Test Split
  • Split the dataset into two parts: a training set (e.g., 80%) and a test set (e.g., 20%).
  • Train the model on the training set.
  • Evaluate performance (accuracy, precision, recall, F1-score, etc.) on the test set.
  • Simple and fast, but results can vary depending on how the data is split.
  2. k-Fold Cross-Validation
  • Divide the dataset into k equal-sized folds (e.g., k = 5 or 10).
  • For each fold:
    • Use that fold as the validation set and the remaining k-1 folds as the training set.
    • Train and evaluate the model.
  • Average the results across all k runs for a more robust estimate of model performance.
  • OvO/OvA require refitting all pairwise/one-vs-rest models per fold.
  • Helps reduce variance due to random data splits and gives a better sense of generalization.
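
A short scikit-learn sketch of k-fold cross-validation for multiclass logistic regression; the iris dataset is used purely as a stand-in example:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation of a multiclass logistic regression.
# With the default lbfgs solver, recent sklearn fits a multinomial (softmax) model here.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```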

🔄 End-to-End Process Comparison

| Step | OvO/OvA | Softmax |
|---|---|---|
| 1. Training | Train multiple binary models | Train single multiclass model |
| 2. Prediction | Aggregate votes/scores | Direct probability computation |
| 3. Cross-Validation | Validate each model separately | Validate unified model |
| 4. Hyperparameter Tuning | Tune each model or global params | Single parameter space tuning |

Practical Tip: Use LogisticRegression(multi_class='multinomial') in sklearn to access the softmax implementation directly. For OvO/OvA, use the OneVsOneClassifier or OneVsRestClassifier wrappers.
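
A compact sketch of all three setups in scikit-learn (iris again as a stand-in dataset; note that recent scikit-learn versions deprecate the multi_class argument because the multinomial formulation is already the default for the lbfgs solver with more than two classes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# Softmax (multinomial) logistic regression: one unified model.
softmax_clf = LogisticRegression(max_iter=1000).fit(X, y)

# One-vs-Rest: K binary logistic regressions, one per class.
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-One: K(K-1)/2 pairwise binary logistic regressions.
ovo_clf = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(softmax_clf.predict(X[:3]), ovr_clf.predict(X[:3]), ovo_clf.predict(X[:3]))
```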

🏁 When to Use Which?

  • Binary Classification: Sigmoid (logistic regression)
  • Small K, Fast Prototyping: OvO/OvA
  • Theoretical Correctness: Softmax
  • Large-Scale Production: Softmax
  • Class Imbalance: Softmax (handles relative probabilities better)