Linear Regression from Scratch: Math, Code, and Intuition
Linear regression is often called the "Hello World" of Machine Learning. It’s simple, mathematically elegant, and remains incredibly useful in production environments today. More importantly, mastering linear regression provides the essential foundation for understanding nearly every complex algorithm that follows. In this guide, we’ll rebuild it from scratch to demystify the mechanics under the hood.
What is Linear Regression?
At its core, linear regression tries to find the best straight line that fits your data. Imagine you have data about house sizes and their prices. Linear regression helps you draw a line through those points so you can predict the price of a house given its size.
But here's the key question: what makes a line "the best"? That's where the math comes in.
The Math Behind It
The Equation: y = mx + b
Every straight line can be described by this equation:
$$y = mx + b$$
Where:
- y is the predicted value (e.g., house price)
- x is the input feature (e.g., house size)
- m is the slope (how much y changes when x increases by 1)
- b is the y-intercept (the value of y when x is 0)
In machine learning, we often use different notation:
$$\hat{y} = w \cdot x + b$$
Where w (weight) is the slope and b (bias) is the intercept. The hat over y indicates it's a prediction, not the actual value.
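For example, with illustrative values w = 3 and b = 4, an input of x = 1.5 gives a prediction of ŷ = 3(1.5) + 4 = 8.5.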
The Cost Function: Mean Squared Error (MSE)
How do we measure how "good" our line is? We look at the errors—the differences between our predictions and the actual values. The Mean Squared Error (MSE) is the most common choice:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Why squared? Two reasons:
- It makes all errors positive (a prediction that is 10 too high counts the same as one that is 10 too low)
- It penalizes large errors more than small ones
Our goal is to find the values of w and b that minimize this cost function.
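As a sanity check, here is a minimal NumPy sketch of the MSE computation on a handful of made-up values:

```python
import numpy as np

# Made-up actual values and predictions for illustration
y_true = np.array([10.0, 12.0, 15.0])
y_pred = np.array([11.0, 11.5, 16.0])

# Mean of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (1.0 + 0.25 + 1.0) / 3 = 0.75
```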
Gradient Descent Optimization
Gradient descent is like being blindfolded on a hill and trying to reach the lowest point. You feel the slope under your feet and take a step in the direction that goes downhill. Repeat until you reach the bottom.
Mathematically, we compute the gradients (partial derivatives) of the cost function with respect to our parameters:
$$\frac{\partial MSE}{\partial w} = \frac{-2}{n} \sum_{i=1}^{n} x_i(y_i - \hat{y}_i)$$
$$\frac{\partial MSE}{\partial b} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$
Then we update our parameters:
$$w = w - \alpha \cdot \frac{\partial MSE}{\partial w}$$
$$b = b - \alpha \cdot \frac{\partial MSE}{\partial b}$$
Where alpha is the learning rate—how big of a step we take each time.
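To make one iteration of this concrete, here is a minimal sketch of a single update step on a tiny made-up dataset (the numbers and the learning rate are purely illustrative):

```python
import numpy as np

# Tiny made-up dataset, roughly following y = 2x + 3
x = np.array([1.0, 2.0, 3.0])
y = np.array([5.0, 7.0, 9.0])

w, b = 0.0, 0.0     # start from zero
alpha = 0.05        # illustrative learning rate

y_hat = w * x + b                              # current predictions
dw = (-2 / len(x)) * np.sum(x * (y - y_hat))   # dMSE/dw
db = (-2 / len(x)) * np.sum(y - y_hat)         # dMSE/db

w -= alpha * dw
b -= alpha * db
print(w, b)  # one step gives w ≈ 1.53, b = 0.70, moving toward the true 2 and 3
```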
Implementation from Scratch with NumPy
Let's put this all together. Here's a complete, runnable implementation:
```python
import numpy as np
import matplotlib.pyplot as plt


class LinearRegressionScratch:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []

    def fit(self, X, y):
        # Initialize parameters
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        # Gradient descent
        for i in range(self.n_iterations):
            # Forward pass: compute predictions
            y_predicted = np.dot(X, self.weights) + self.bias

            # Compute cost (MSE)
            cost = np.mean((y - y_predicted) ** 2)
            self.cost_history.append(cost)

            # Compute gradients
            dw = (-2 / n_samples) * np.dot(X.T, (y - y_predicted))
            db = (-2 / n_samples) * np.sum(y - y_predicted)

            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

        return self

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

    def score(self, X, y):
        """Calculate R-squared score"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)


# Generate sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.flatten() + np.random.randn(100)  # y = 4 + 3x + noise

# Train our model
model = LinearRegressionScratch(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)

print(f"Learned weight: {model.weights[0]:.4f} (true value: 3)")
print(f"Learned bias: {model.bias:.4f} (true value: 4)")
print(f"R-squared: {model.score(X, y):.4f}")
```

When you run this, you'll see our model learns values very close to the true parameters (w=3, b=4).
Visualizing the Results
Let's create some visualizations to understand what's happening:
```python
# Create a figure with multiple subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Data and fitted line
axes[0].scatter(X, y, alpha=0.6, label='Data points')
X_line = np.linspace(0, 2, 100).reshape(-1, 1)
y_line = model.predict(X_line)
axes[0].plot(X_line, y_line, color='red', linewidth=2, label='Fitted line')
axes[0].set_xlabel('X')
axes[0].set_ylabel('y')
axes[0].set_title('Linear Regression Fit')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Cost function over iterations
axes[1].plot(model.cost_history)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('MSE Cost')
axes[1].set_title('Cost Function Convergence')
axes[1].grid(True, alpha=0.3)

# Plot 3: Residuals
y_pred = model.predict(X)
residuals = y - y_pred
axes[2].scatter(y_pred, residuals, alpha=0.6)
axes[2].axhline(y=0, color='red', linestyle='--')
axes[2].set_xlabel('Predicted values')
axes[2].set_ylabel('Residuals')
axes[2].set_title('Residual Plot')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('linear_regression_analysis.png', dpi=150)
plt.show()
```

The cost function plot shows how our model improves over iterations—the cost should decrease and eventually plateau. The residual plot helps us verify that our model assumptions are reasonable (residuals should be randomly scattered around zero).
Using Scikit-learn for Comparison
Now let's see how scikit-learn does the same task:
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Train sklearn model
sklearn_model = LinearRegression()
sklearn_model.fit(X, y)

# Compare results
print("Comparison of Results:")
print("-" * 40)
print(f"{'Parameter':<15} {'Scratch':<12} {'Sklearn':<12}")
print("-" * 40)
print(f"{'Weight':<15} {model.weights[0]:<12.4f} {sklearn_model.coef_[0]:<12.4f}")
print(f"{'Bias':<15} {model.bias:<12.4f} {sklearn_model.intercept_:<12.4f}")
print(f"{'R-squared':<15} {model.score(X, y):<12.4f} {sklearn_model.score(X, y):<12.4f}")

# Predictions comparison
y_pred_scratch = model.predict(X)
y_pred_sklearn = sklearn_model.predict(X)
print(f"\nMSE (Scratch): {mean_squared_error(y, y_pred_scratch):.4f}")
print(f"MSE (Sklearn): {mean_squared_error(y, y_pred_sklearn):.4f}")
```

You'll notice both models produce nearly identical results. The small differences come from:
- Scikit-learn solves the least-squares problem directly with a closed-form solver rather than running gradient descent (illustrated below with the normal equation)
- Our gradient descent might not have fully converged
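To see the closed-form route in action, here is a minimal sketch that solves the same least-squares problem via the normal equation, reusing the X and y defined earlier (this illustrates the direct approach in general; scikit-learn's internal solver may differ in detail):

```python
# Normal equation: theta = (X^T X)^{-1} X^T y
# Prepend a column of ones so the bias is estimated alongside the weight.
X_b = np.c_[np.ones((X.shape[0], 1)), X]          # shape (100, 2)
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)   # [bias, weight]
print(f"Closed-form bias:   {theta[0]:.4f}")
print(f"Closed-form weight: {theta[1]:.4f}")
```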
When to Use Linear Regression
Linear regression is an excellent choice when:
- The relationship is approximately linear: If plotting your data shows a roughly straight-line pattern, linear regression is a good starting point.
- Interpretability matters: The coefficients directly tell you how much each feature affects the prediction. In many business contexts, this is invaluable.
- You have limited data: Linear regression has few parameters, so it doesn't need massive datasets to train reliably.
- Speed is critical: Training is extremely fast, and predictions are essentially instantaneous.
- You need a baseline: Always start with linear regression before trying more complex models (see the sketch after this list). If it works well enough, why complicate things?
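For the baseline point, a quick sanity check is to compare against scikit-learn's DummyRegressor, which simply predicts the mean of y (a minimal sketch, reusing the X and y from earlier):

```python
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

# Mean-only baseline vs. a plain linear fit on the earlier X and y
baseline = DummyRegressor(strategy="mean").fit(X, y)
linear = LinearRegression().fit(X, y)

print(f"Baseline R-squared: {baseline.score(X, y):.4f}")  # 0.0 on the training data
print(f"Linear R-squared:   {linear.score(X, y):.4f}")
```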
Limitations and Assumptions
Linear regression makes several assumptions that you should verify:
1. Linearity
The relationship between features and target must be linear. If you plot residuals vs. predictions and see a curve, you might need polynomial features or a different model.
2. Independence
Each observation should be independent. Time series data often violates this assumption.
3. Homoscedasticity
The variance of residuals should be constant across all predictions. A "funnel" shape in the residual plot indicates heteroscedasticity.
4. No Multicollinearity
Features shouldn't be highly correlated with each other. This makes coefficients unstable and hard to interpret.
5. Normal Residuals
For statistical inference (confidence intervals, p-values), residuals should be approximately normally distributed.
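One quick, informal check is to histogram the residuals computed in the visualization section; a roughly bell-shaped distribution centered on zero is reassuring. A minimal sketch:

```python
# Histogram of residuals from the earlier fit
plt.figure(figsize=(5, 4))
plt.hist(residuals, bins=20, edgecolor="black", alpha=0.7)
plt.axvline(0, color="red", linestyle="--")
plt.xlabel("Residual")
plt.ylabel("Count")
plt.title("Distribution of Residuals")
plt.tight_layout()
plt.show()
```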
What to Do When Assumptions Are Violated
- Non-linearity: Try polynomial features, splines, or tree-based models
- Heteroscedasticity: Use weighted least squares or transform the target variable
- Multicollinearity: Remove correlated features or use regularization (Ridge, Lasso)
- Outliers: Consider robust regression methods
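As a brief illustration of two of these remedies, here is a minimal sketch that adds polynomial features and applies Ridge regularization to the earlier X and y (degree=2 and alpha=1.0 are arbitrary, untuned choices):

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Polynomial features address mild non-linearity; Ridge shrinks coefficients,
# which also helps when features are correlated.
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),
)
poly_ridge.fit(X, y)
print(f"R-squared with polynomial features + Ridge: {poly_ridge.score(X, y):.4f}")
```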
Conclusion
Linear regression might be simple, but that simplicity is its strength. By building it from scratch, you've learned:
- How the cost function quantifies model performance
- How gradient descent finds optimal parameters
- How to implement and visualize the entire process
- When linear regression is (and isn't) appropriate
This foundation will serve you well as you explore more complex algorithms. Ridge regression, Lasso, and even neural networks all build on these same principles of cost functions and gradient descent.
The next time you reach for scikit-learn's LinearRegression, you'll know exactly what's happening under the hood. And that understanding makes you a better data scientist.
Happy learning!