Linear Regression from Scratch: Math, Code, and Intuition
Linear regression is often called the "Hello World" of Machine Learning. It’s simple, mathematically elegant, and remains incredibly useful in production environments today. More importantly, mastering linear regression provides the essential foundation for understanding nearly every complex algorithm that follows. In this guide, we’ll rebuild it from scratch to demystify the mechanics under the hood.
What is Linear Regression?
At its core, linear regression tries to find the best straight line that fits your data. Imagine you have data about house sizes and their prices. Linear regression helps you draw a line through those points so you can predict the price of a house given its size.
But here's the key question: what makes a line "the best"? That's where the math comes in.
The Math Behind It
The Equation: y = mx + b
Every straight line can be described by this equation:
$$y = mx + b$$
Where:
- y is the predicted value (e.g., house price)
- x is the input feature (e.g., house size)
- m is the slope (how much y changes when x increases by 1)
- b is the y-intercept (the value of y when x is 0)
In machine learning, we often use different notation:
$$\hat{y} = w \cdot x + b$$
Where w (weight) is the slope and b (bias) is the intercept. The hat over y indicates it's a prediction, not the actual value.
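For example, with illustrative values w = 3 and b = 4, an input of x = 1.5 gives a prediction of ŷ = 3(1.5) + 4 = 8.5.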
The Cost Function: Mean Squared Error (MSE)
How do we measure how "good" our line is? We look at the errors—the differences between our predictions and the actual values. The Mean Squared Error (MSE) is the most common choice:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Why squared? Two reasons:
- It makes all errors positive (a prediction that is 10 too high counts the same as one that is 10 too low)
- It penalizes large errors more than small ones
Our goal is to find the values of w and b that minimize this cost function.
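As a sanity check, here is a minimal NumPy sketch of the MSE computation on a handful of made-up values:

```python
import numpy as np

# Made-up actual values and predictions for illustration
y_true = np.array([10.0, 12.0, 15.0])
y_pred = np.array([11.0, 11.5, 16.0])

# Mean of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (1.0 + 0.25 + 1.0) / 3 = 0.75
```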
Gradient Descent Optimization
Gradient descent is like being blindfolded on a hill and trying to reach the lowest point. You feel the slope under your feet and take a step in the direction that goes downhill. Repeat until you reach the bottom.
Mathematically, we compute the gradients (partial derivatives) of the cost function with respect to our parameters:
$$\frac{\partial MSE}{\partial w} = \frac{-2}{n} \sum_{i=1}^{n} x_i(y_i - \hat{y}_i)$$
$$\frac{\partial MSE}{\partial b} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$
Then we update our parameters:
$$w = w - \alpha \cdot \frac{\partial MSE}{\partial w}$$
$$b = b - \alpha \cdot \frac{\partial MSE}{\partial b}$$
Where alpha is the learning rate—how big of a step we take each time.
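To make one iteration of this concrete, here is a minimal sketch of a single update step on a tiny made-up dataset (the numbers and the learning rate are purely illustrative):

```python
import numpy as np

# Tiny made-up dataset, roughly following y = 2x + 3
x = np.array([1.0, 2.0, 3.0])
y = np.array([5.0, 7.0, 9.0])

w, b = 0.0, 0.0     # start from zero
alpha = 0.05        # illustrative learning rate

y_hat = w * x + b                              # current predictions
dw = (-2 / len(x)) * np.sum(x * (y - y_hat))   # dMSE/dw
db = (-2 / len(x)) * np.sum(y - y_hat)         # dMSE/db

w -= alpha * dw
b -= alpha * db
print(w, b)  # one step gives w ≈ 1.53, b = 0.70, moving toward the true 2 and 3
```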
Implementation from Scratch with NumPy
Let's put this all together. Here's a complete, runnable implementation:
```python
import numpy as np
import matplotlib.pyplot as plt


class LinearRegressionScratch:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []

    def fit(self, X, y):
        # Initialize parameters
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        # Gradient descent
        for i in range(self.n_iterations):
            # Forward pass: compute predictions
            y_predicted = np.dot(X, self.weights) + self.bias

            # Compute cost (MSE)
            cost = np.mean((y - y_predicted) ** 2)
            self.cost_history.append(cost)

            # Compute gradients
            dw = (-2 / n_samples) * np.dot(X.T, (y - y_predicted))
            db = (-2 / n_samples) * np.sum(y - y_predicted)

            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

        return self

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

    def score(self, X, y):
        """Calculate R-squared score"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)


# Generate sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.flatten() + np.random.randn(100)  # y = 4 + 3x + noise

# Train our model
model = LinearRegressionScratch(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)

print(f"Learned weight: {model.weights[0]:.4f} (true value: 3)")
print(f"Learned bias: {model.bias:.4f} (true value: 4)")
print(f"R-squared: {model.score(X, y):.4f}")
```

When you run this, you'll see our model learns values very close to the true parameters (w=3, b=4).
Visualizing the Results
Let's create some visualizations to understand what's happening:
```python
# Create a figure with multiple subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Data and fitted line
axes[0].scatter(X, y, alpha=0.6, label='Data points')
X_line = np.linspace(0, 2, 100).reshape(-1, 1)
y_line = model.predict(X_line)
axes[0].plot(X_line, y_line, color='red', linewidth=2, label='Fitted line')
axes[0].set_xlabel('X')
axes[0].set_ylabel('y')
axes[0].set_title('Linear Regression Fit')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Cost function over iterations
axes[1].plot(model.cost_history)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('MSE Cost')
axes[1].set_title('Cost Function Convergence')
axes[1].grid(True, alpha=0.3)

# Plot 3: Residuals
y_pred = model.predict(X)
residuals = y - y_pred
axes[2].scatter(y_pred, residuals, alpha=0.6)
axes[2].axhline(y=0, color='red', linestyle='--')
axes[2].set_xlabel('Predicted values')
axes[2].set_ylabel('Residuals')
axes[2].set_title('Residual Plot')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('linear_regression_analysis.png', dpi=150)
plt.show()
```

The cost function plot shows how our model improves over iterations—the cost should decrease and eventually plateau. The residual plot helps us verify that our model assumptions are reasonable (residuals should be randomly scattered around zero).
Using Scikit-learn for Comparison
Now let's see how scikit-learn does the same task:
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Train sklearn model
sklearn_model = LinearRegression()
sklearn_model.fit(X, y)

# Compare results
print("Comparison of Results:")
print("-" * 40)
print(f"{'Parameter':<15} {'Scratch':<12} {'Sklearn':<12}")
print("-" * 40)
print(f"{'Weight':<15} {model.weights[0]:<12.4f} {sklearn_model.coef_[0]:<12.4f}")
print(f"{'Bias':<15} {model.bias:<12.4f} {sklearn_model.intercept_:<12.4f}")
print(f"{'R-squared':<15} {model.score(X, y):<12.4f} {sklearn_model.score(X, y):<12.4f}")

# Predictions comparison
y_pred_scratch = model.predict(X)
y_pred_sklearn = sklearn_model.predict(X)
print(f"\nMSE (Scratch): {mean_squared_error(y, y_pred_scratch):.4f}")
print(f"MSE (Sklearn): {mean_squared_error(y, y_pred_sklearn):.4f}")
```

You'll notice both models produce nearly identical results. The small differences come from:
- Scikit-learn solves the least-squares problem directly with a closed-form solver rather than running gradient descent (illustrated below with the normal equation)
- Our gradient descent might not have fully converged
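To see the closed-form route in action, here is a minimal sketch that solves the same least-squares problem via the normal equation, reusing the X and y defined earlier (this illustrates the direct approach in general; scikit-learn's internal solver may differ in detail):

```python
# Normal equation: theta = (X^T X)^{-1} X^T y
# Prepend a column of ones so the bias is estimated alongside the weight.
X_b = np.c_[np.ones((X.shape[0], 1)), X]          # shape (100, 2)
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)   # [bias, weight]
print(f"Closed-form bias:   {theta[0]:.4f}")
print(f"Closed-form weight: {theta[1]:.4f}")
```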
When to Use Linear Regression
Linear regression is an excellent choice when:
- The relationship is approximately linear: If plotting your data shows a roughly straight-line pattern, linear regression is a good starting point.
- Interpretability matters: The coefficients directly tell you how much each feature affects the prediction. In many business contexts, this is invaluable.
- You have limited data: Linear regression has few parameters, so it doesn't need massive datasets to train reliably.
- Speed is critical: Training is extremely fast, and predictions are essentially instantaneous.
- You need a baseline: Always start with linear regression before trying more complex models (see the sketch after this list). If it works well enough, why complicate things?
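For the baseline point, a quick sanity check is to compare against scikit-learn's DummyRegressor, which simply predicts the mean of y (a minimal sketch, reusing the X and y from earlier):

```python
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

# Mean-only baseline vs. a plain linear fit on the earlier X and y
baseline = DummyRegressor(strategy="mean").fit(X, y)
linear = LinearRegression().fit(X, y)

print(f"Baseline R-squared: {baseline.score(X, y):.4f}")  # 0.0 on the training data
print(f"Linear R-squared:   {linear.score(X, y):.4f}")
```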
Limitations and Assumptions
Linear regression makes several assumptions that you should verify:
1. Linearity
The relationship between features and target must be linear. If you plot residuals vs. predictions and see a curve, you might need polynomial features or a different model.
2. Independence
Each observation should be independent. Time series data often violates this assumption.
3. Homoscedasticity
The variance of residuals should be constant across all predictions. A "funnel" shape in the residual plot indicates heteroscedasticity.
4. No Multicollinearity
Features shouldn't be highly correlated with each other. This makes coefficients unstable and hard to interpret.
5. Normal Residuals
For statistical inference (confidence intervals, p-values), residuals should be approximately normally distributed.
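One quick, informal check is to histogram the residuals computed in the visualization section; a roughly bell-shaped distribution centered on zero is reassuring. A minimal sketch:

```python
# Histogram of residuals from the earlier fit
plt.figure(figsize=(5, 4))
plt.hist(residuals, bins=20, edgecolor="black", alpha=0.7)
plt.axvline(0, color="red", linestyle="--")
plt.xlabel("Residual")
plt.ylabel("Count")
plt.title("Distribution of Residuals")
plt.tight_layout()
plt.show()
```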
What to Do When Assumptions Are Violated
- Non-linearity: Try polynomial features, splines, or tree-based models
- Heteroscedasticity: Use weighted least squares or transform the target variable
- Multicollinearity: Remove correlated features or use regularization (Ridge, Lasso)
- Outliers: Consider robust regression methods
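As a brief illustration of two of these remedies, here is a minimal sketch that adds polynomial features and applies Ridge regularization to the earlier X and y (degree=2 and alpha=1.0 are arbitrary, untuned choices):

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Polynomial features address mild non-linearity; Ridge shrinks coefficients,
# which also helps when features are correlated.
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),
)
poly_ridge.fit(X, y)
print(f"R-squared with polynomial features + Ridge: {poly_ridge.score(X, y):.4f}")
```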
Conclusion
Linear regression might be simple, but that simplicity is its strength. By building it from scratch, you've learned:
- How the cost function quantifies model performance
- How gradient descent finds optimal parameters
- How to implement and visualize the entire process
- When linear regression is (and isn't) appropriate
This foundation will serve you well as you explore more complex algorithms. Ridge regression, Lasso, and even neural networks all build on these same principles of cost functions and gradient descent.
The next time you reach for scikit-learn's LinearRegression, you'll know exactly what's happening under the hood. And that understanding makes you a better data scientist.
Happy learning!