Logistic Regression Demystified: Classification Made Simple

Juan Luis Ramirez · 10 min read
Machine Learning · Logistic Regression · Classification · Python

If you've ever wondered how an email filter decides if a message is spam, or how a medical system identifies a tumor as malignant, you've seen classification in action. One of the most elegant and widely used solutions for these problems is Logistic Regression. Despite its name, it is a classification powerhouse—simple to implement, highly interpretable, and a staple in every data scientist's toolkit.

In this guide, we’ll demystify logistic regression from the ground up. You’ll learn how to transform linear predictions into probabilities, build a custom implementation in NumPy, and leverage Scikit-learn to solve real-world problems.

From Regression to Classification

Linear regression predicts continuous values. But what if we want to predict categories? A raw linear output can be any number, so we need a way to map our predictions into a probability between 0 and 1.

That's where the sigmoid function comes in.

The Sigmoid Function: The Heart of Logistic Regression

The sigmoid function (also called the logistic function) is the mathematical magic that transforms any real number into a probability:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Let's visualize it:

import numpy as np
import matplotlib.pyplot as plt
 
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
 
# Generate values
z = np.linspace(-10, 10, 100)
y = sigmoid(z)
 
# Plot
plt.figure(figsize=(10, 6))
plt.plot(z, y, 'b-', linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.7, label='Decision boundary (0.5)')
plt.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
plt.axhline(y=1, color='gray', linestyle='-', alpha=0.3)
plt.axvline(x=0, color='gray', linestyle='-', alpha=0.3)
plt.xlabel('z (linear combination)')
plt.ylabel('σ(z) (probability)')
plt.title('The Sigmoid Function')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Notice how:

  • When z is very negative, σ(z) approaches 0
  • When z is very positive, σ(z) approaches 1
  • When z = 0, σ(z) = 0.5

This S-shaped curve is perfect for representing probabilities.
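
We can confirm those limits numerically with the same sigmoid helper defined in the plotting snippet above:

# Large negative inputs land near 0, zero maps to exactly 0.5, large positive inputs land near 1
for value in [-10, -2, 0, 2, 10]:
    print(f"sigmoid({value:>3}) = {sigmoid(value):.5f}")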

Log-Odds and Probability Interpretation

In logistic regression, we model the log-odds (or logit) of the probability as a linear function of our features:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n$$

The term p/(1-p) is called the odds. If you've ever bet on sports, you're familiar with odds. Taking the log of the odds gives a value that can range from negative infinity to positive infinity, which is exactly the range a linear model can produce.

Rearranging, we get the probability:

$$p = \sigma(\beta_0 + \beta_1 x_1 + ... + \beta_n x_n)$$

Interpreting coefficients: A coefficient β tells us how much the log-odds change for a one-unit increase in that feature. For example, if β = 0.5, a one-unit increase in x multiplies the odds by e^0.5 ≈ 1.65.
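
As a quick worked example (the intercept, coefficient, and feature value below are made up for illustration):

import numpy as np

beta_0, beta_1 = -1.0, 0.5   # hypothetical intercept and coefficient
x = 2.0                      # hypothetical feature value

log_odds = beta_0 + beta_1 * x      # the linear part of the model
odds = np.exp(log_odds)             # odds = p / (1 - p)
p = odds / (1 + odds)               # same as sigmoid(log_odds)

print(f"log-odds = {log_odds:.2f}, odds = {odds:.2f}, probability = {p:.2f}")
print(f"Odds multiplier per unit increase in x: {np.exp(beta_1):.2f}")  # e^0.5 ≈ 1.65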

Decision Boundaries Explained

The decision boundary is where the model is uncertain—where P(y=1) = 0.5. This happens when:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... = 0$$

For two features, this is a straight line. Let's visualize:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
 
# Generate sample data
np.random.seed(42)
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           class_sep=1.5)
 
# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', marker='o', label='Class 0', alpha=0.7)
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', marker='s', label='Class 1', alpha=0.7)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Binary Classification Problem')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Logistic regression will fit a line (in two dimensions) or a hyperplane (in higher dimensions) that separates these classes as well as possible under its log-loss objective.
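
Here is a minimal sketch of that step, jumping ahead slightly to scikit-learn's LogisticRegression (covered in detail below): fit the model on the X and y arrays generated above and draw the line where the predicted probability equals 0.5.

from sklearn.linear_model import LogisticRegression

# Fit a logistic regression on the 2D toy data
clf = LogisticRegression()
clf.fit(X, y)

# The boundary is where w0*x1 + w1*x2 + b = 0, i.e. x2 = -(w0*x1 + b) / w1
w = clf.coef_[0]
b = clf.intercept_[0]
x1_vals = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2_vals = -(w[0] * x1_vals + b) / w[1]

plt.figure(figsize=(10, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', marker='o', label='Class 0', alpha=0.7)
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', marker='s', label='Class 1', alpha=0.7)
plt.plot(x1_vals, x2_vals, 'k--', linewidth=2, label='Decision boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Learned Decision Boundary')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()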

Implementation from Scratch with NumPy

Let's build logistic regression from the ground up to truly understand how it works:

import numpy as np
 
class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.losses = []
 
    def sigmoid(self, z):
        # Clip to avoid overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
 
    def compute_loss(self, y_true, y_pred):
        # Binary cross-entropy loss
        epsilon = 1e-15  # To avoid log(0)
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss
 
    def fit(self, X, y):
        n_samples, n_features = X.shape
 
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
 
        # Gradient descent
        for i in range(self.n_iterations):
            # Forward pass
            linear_model = np.dot(X, self.weights) + self.bias
            y_pred = self.sigmoid(linear_model)
 
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)
 
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
 
            # Track loss
            loss = self.compute_loss(y, y_pred)
            self.losses.append(loss)
 
        return self
 
    def predict_proba(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        return self.sigmoid(linear_model)
 
    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int)
 
# Test our implementation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
 
# Generate data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Standardize features (important for gradient descent)
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
 
# Train model
model = LogisticRegressionScratch(learning_rate=0.1, n_iterations=1000)
model.fit(X_train_scaled, y_train)
 
# Evaluate
train_accuracy = np.mean(model.predict(X_train_scaled) == y_train)
test_accuracy = np.mean(model.predict(X_test_scaled) == y_test)
 
print(f"Train accuracy: {train_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")

Let's also visualize the training process:

plt.figure(figsize=(10, 6))
plt.plot(model.losses)
plt.xlabel('Iteration')
plt.ylabel('Loss (Binary Cross-Entropy)')
plt.title('Training Loss Over Time')
plt.grid(True, alpha=0.3)
plt.show()

Using sklearn's LogisticRegression

In practice, you'll use scikit-learn's optimized implementation:

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
 
# Create a pipeline with scaling and logistic regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))
])
 
# Fit and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
 
# Get probabilities
y_proba = pipeline.predict_proba(X_test)
print(f"Sample probabilities (first 5):\n{y_proba[:5]}")
 
# Accuracy
accuracy = pipeline.score(X_test, y_test)
print(f"\nTest accuracy: {accuracy:.4f}")

Key parameters to know (a short example of varying C follows this list):

  • C: Inverse of regularization strength (smaller = stronger regularization)
  • penalty: Type of regularization ('l1', 'l2', 'elasticnet', or None in recent scikit-learn versions)
  • solver: Optimization algorithm ('lbfgs', 'liblinear', 'saga', etc.)
  • max_iter: Maximum number of iterations
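
To see regularization in action, here is a short sketch (reusing the pipeline imports and the train/test split from above) that varies C; the exact accuracies will depend on your data:

for C in [0.01, 1.0, 100.0]:
    reg_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(C=C, random_state=42, max_iter=1000))
    ])
    reg_pipeline.fit(X_train, y_train)
    print(f"C = {C:>6}: test accuracy = {reg_pipeline.score(X_test, y_test):.4f}")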

Multiclass Classification: One-vs-Rest

Logistic regression is natively a binary classifier, so how do we handle multiple classes? One common approach is One-vs-Rest (OvR), which trains one binary classifier per class and predicts the class whose classifier reports the highest probability:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
 
# Load iris dataset (3 classes)
iris = load_iris()
X, y = iris.data, iris.target
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Logistic Regression with an explicit One-vs-Rest strategy
# (note: recent scikit-learn versions default to multinomial for solvers such as lbfgs
#  and are deprecating the multi_class parameter)
model = LogisticRegression(multi_class='ovr', max_iter=1000, random_state=42)
model.fit(X_train, y_train)
 
# Predictions
y_pred = model.predict(X_test)
 
# Detailed report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
 
# Probabilities for each class
print("\nProbabilities for first 3 samples:")
print(model.predict_proba(X_test[:3]))

Scikit-learn also supports multinomial (softmax) classification:

model_multinomial = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                                       max_iter=1000, random_state=42)
model_multinomial.fit(X_train, y_train)
print(f"Multinomial accuracy: {model_multinomial.score(X_test, y_test):.4f}")

Evaluation Metrics: Beyond Accuracy

Accuracy isn't always enough. Here's a complete evaluation toolkit:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score,
                             roc_curve, precision_recall_curve)
import matplotlib.pyplot as plt
 
# Binary classification example
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.7, 0.3], random_state=42)  # Imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
 
# Metrics
print("Evaluation Metrics:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_proba):.4f}")
 
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:\n{cm}")

Understanding the metrics (a quick sanity check against the confusion matrix follows this list):

  • Accuracy: Overall correctness (can be misleading with imbalanced data)
  • Precision: Of all positive predictions, how many were correct? (important when false positives are costly)
  • Recall: Of all actual positives, how many did we catch? (important when false negatives are costly)
  • F1 Score: Harmonic mean of precision and recall
  • ROC AUC: Area under the ROC curve (overall model quality)
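
As a sanity check tied to the definitions above, precision and recall can be recomputed directly from the confusion matrix entries:

tn, fp, fn, tp = cm.ravel()  # scikit-learn's layout for binary labels: [[tn, fp], [fn, tp]]

manual_precision = tp / (tp + fp)  # of all predicted positives, the fraction that are correct
manual_recall = tp / (tp + fn)     # of all actual positives, the fraction we caught

print(f"Precision from confusion matrix: {manual_precision:.4f}")
print(f"Recall from confusion matrix:    {manual_recall:.4f}")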

Let's visualize the ROC curve:

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
 
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
axes[0].plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {roc_auc_score(y_test, y_proba):.3f})')
axes[0].plot([0, 1], [0, 1], 'r--', linewidth=1, label='Random Classifier')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
 
# Precision-Recall Curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
axes[1].plot(recall, precision, 'g-', linewidth=2)
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].grid(True, alpha=0.3)
 
plt.tight_layout()
plt.show()
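
The precision-recall trade-off also shows that 0.5 is not always the best cut-off. As a rough sketch (a heuristic, and one you would normally tune on a validation set rather than the test set), you can pick the threshold that maximizes F1 from the curve computed above:

# precision_recall_curve returns one more precision/recall value than thresholds
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
print(f"Threshold that maximizes F1: {best_threshold:.3f} (F1 = {f1_scores[best_idx]:.3f})")

# Apply the tuned threshold to the predicted probabilities
y_pred_tuned = (y_proba >= best_threshold).astype(int)
print(f"F1 at the default 0.5 threshold: {f1_score(y_test, y_pred):.4f}")
print(f"F1 at the tuned threshold:       {f1_score(y_test, y_pred_tuned):.4f}")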

Real-World Example: Customer Churn Prediction

Let's put everything together with a realistic example:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import matplotlib.pyplot as plt
 
# Simulate customer churn data
np.random.seed(42)
n_customers = 2000
 
data = {
    'tenure_months': np.random.exponential(24, n_customers),
    'monthly_charges': np.random.normal(70, 20, n_customers),
    'total_charges': np.random.exponential(1500, n_customers),
    'num_support_tickets': np.random.poisson(2, n_customers),
    'contract_type': np.random.choice([0, 1, 2], n_customers),  # Month-to-month, 1-year, 2-year
    'has_online_backup': np.random.choice([0, 1], n_customers),
    'has_tech_support': np.random.choice([0, 1], n_customers),
}
 
df = pd.DataFrame(data)
 
# Create churn target (more likely with short tenure, high charges, many tickets)
churn_probability = (
    -0.05 * df['tenure_months'] +
    0.02 * df['monthly_charges'] +
    0.15 * df['num_support_tickets'] -
    0.5 * df['contract_type'] -
    0.3 * df['has_tech_support']
)
churn_probability = 1 / (1 + np.exp(-churn_probability))
df['churned'] = (np.random.random(n_customers) < churn_probability).astype(int)
 
print(f"Churn rate: {df['churned'].mean():.2%}")
print(f"\nDataset shape: {df.shape}")
print(df.head())
 
# Prepare features and target
X = df.drop('churned', axis=1)
y = df['churned']
 
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                     stratify=y, random_state=42)
 
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
 
# Train logistic regression
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)
 
# Cross-validation
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print(f"\nCross-validation ROC AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
 
# Test set performance
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]
 
print(f"\nTest ROC AUC: {roc_auc_score(y_test, y_proba):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
 
# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': model.coef_[0],
    'odds_ratio': np.exp(model.coef_[0])
}).sort_values('coefficient', key=abs, ascending=False)
 
print("\nFeature Importance:")
print(feature_importance)
 
# Visualize feature importance
plt.figure(figsize=(10, 6))
colors = ['red' if c > 0 else 'green' for c in feature_importance['coefficient']]
plt.barh(feature_importance['feature'], feature_importance['coefficient'], color=colors)
plt.xlabel('Coefficient (Log-Odds)')
plt.title('Feature Importance in Churn Prediction')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.show()
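
Finally, a short usage sketch: to score a new customer (the values below are hypothetical), apply the same fitted scaler before calling the model:

# Hypothetical new customer: short tenure, high charges, several support tickets,
# month-to-month contract, no backup, no tech support
new_customer = pd.DataFrame([{
    'tenure_months': 3,
    'monthly_charges': 95.0,
    'total_charges': 285.0,
    'num_support_tickets': 4,
    'contract_type': 0,
    'has_online_backup': 0,
    'has_tech_support': 0,
}])

new_customer_scaled = scaler.transform(new_customer)
churn_probability_new = model.predict_proba(new_customer_scaled)[0, 1]
print(f"Predicted churn probability: {churn_probability_new:.2%}")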

Conclusion

Logistic regression is a powerful, interpretable algorithm that should be in every data scientist's toolkit. Here's what we covered:

  1. The sigmoid function transforms linear combinations into probabilities
  2. Log-odds provide a linear relationship we can model
  3. Decision boundaries are where P(y=1) = 0.5
  4. Implementation is straightforward with gradient descent
  5. Scikit-learn provides production-ready implementations
  6. Multiclass problems use one-vs-rest or multinomial approaches
  7. Evaluation requires multiple metrics, especially for imbalanced data

When to use logistic regression:

  • When you need interpretable predictions
  • As a baseline before trying complex models
  • When the relationship between features and log-odds is approximately linear
  • When you have limited training data

When to consider alternatives:

  • Highly non-linear decision boundaries (try tree-based models or neural networks)
  • High-dimensional data with many features (consider regularized variants or dimensionality reduction)

Logistic regression may seem simple, but its elegance lies in that simplicity. It's often the first model you should try, and sometimes it's all you need.

Happy classifying!