Understanding Ridge and Lasso Regression: A Comprehensive Guide
Introduction to Regularization Techniques
If you're venturing into machine learning, it's crucial to consider alternatives to linear regression.
Challenges with Linear Regression
Linear Regression, also known as Ordinary Least Squares (OLS), is one of the simplest and most commonly utilized machine learning algorithms. However, it has a significant drawback: it is particularly susceptible to overfitting the training data.
In the most straightforward case—2D data—the line of best fit is visually represented as the line that minimizes the sum of squared residuals (SSR). The mathematical representation is also relatively simple.
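In standard notation, the 2D model and the quantity being minimized are:

\[ \hat{y}_i = \beta_0 + \beta_1 x_i, \qquad \text{SSR} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \]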
Nonetheless, as the number of predictor variables (or dimensions) grows, nothing stops the coefficients \( \beta_i \) from becoming very large. With such inflated coefficients the model can reproduce the training data almost exactly: just multiply each feature by its slope \( \beta \) and add everything up. As a result, overfitting becomes a common issue with linear regression models.
Another shortcoming of linear regression is that it pays no attention to how relevant each feature actually is. As long as a feature has some linear relationship with the target, the model will fold it into the OLS-minimizing fit. In practice, though, it is vital for a model to account for the significance of each feature. For instance, if you aim to predict the number of newborns in a town, a key factor would be the number of fertile women; the number of storks in that town, by contrast, is entirely irrelevant. Yet if the two variables happen to be linearly correlated and both are included in the regression, the resulting model may be subpar even though it still achieves the OLS minimum.
In reality, datasets today often consist of numerous features, both pertinent and irrelevant, impacting the target variable you want to predict.
You can explore the notebook for this article on Kaggle.
Understanding Bias and Variance
To grasp how Ridge Regression addresses the aforementioned challenges, we must first delve into the concepts of bias and variance.
Variance describes how much a model's predictions change from one dataset to another, and it shows up as poor performance on a new (test) dataset. High variance is the hallmark of overfitting, because an overfit model does well on its training data but yields wildly different results on other datasets. Bias, on the flip side, is a model's inability to capture the true underlying relationship in the training data. A model with excessive bias underfits and performs poorly on both the training and test sets.
Ideally, a perfect model would possess low bias and low variance, but achieving this balance is often easier said than done. Bias and variance exist in a trade-off relationship concerning model complexity.
Model complexity is driven by the number of dimensions, that is, the number of features, fed into the model. Plain linear regression is essentially unbiased (nothing constrains its coefficients or weighs predictor significance), which lets it fit the training data very closely. That tailored fit, however, comes with very high variance, positioning Linear Regression at the extreme right of the model complexity plot.
Regularization Techniques: Ridge and Lasso
Both of these problems, overfitting and the inability to down-weight irrelevant features, can be addressed with Ridge and Lasso Regression. The equation for the line of best fit stays the same for these regressions; what changes is the cost function, which introduces a new hyperparameter, \( \lambda \).
In Ridge Regression, the cost function adds a term that squares each coefficient of the feature variables and scales the sum by a factor \( \lambda \). This term is referred to as the Ridge Regression penalty. Minimizing the penalized cost shrinks all of the coefficients (slopes). This has a dual effect: smaller coefficients mitigate overfitting, and the shrinkage introduces some bias into the otherwise unbiased linear regression model.
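Written out, the Ridge cost is the usual sum of squared residuals plus the penalty term (here \( p \) is the number of features):

\[ \text{Cost}_{\text{Ridge}} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \]

With \( \lambda = 0 \) this reduces to ordinary least squares; the larger \( \lambda \), the stronger the shrinkage.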
To illustrate, if you select a low value for \( \lambda \) (such as 0.1), all coefficients, both large and small, are scaled down somewhat; larger values of \( \lambda \) shrink them further. This shrinkage lets Ridge Regression keep the emphasis on significant features while diminishing the influence of less relevant ones, leading to a more streamlined model.
You may wonder whether adding this sum of scaled, squared slopes to the cost would produce a worse fit than traditional OLS. The fit on the training data is indeed slightly worse, but Ridge and Lasso ultimately give better, more reliable predictions on new data: by accepting a small degree of bias, we achieve a notable reduction in variance.
Let's put Ridge Regression into practice using Scikit-learn. Ridge follows the same API as other models offered by sklearn. We will utilize the Ames Housing Dataset from Kaggle, predicting house prices using a selection of features.
First, we will fit a Linear Regression model and compare its performance against Ridge using the Mean Absolute Error (MAE). Before proceeding, some preprocessing steps, such as scaling features and addressing missing values, are necessary. Thus, we will create a straightforward Pipeline instance to handle these tasks.
For further insights into using pipelines with sklearn, check out this article or the associated notebook on Kaggle.
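The exact preprocessing code is not essential to the comparison, but a minimal sketch might look like the following. The file path and the selected feature columns are placeholder choices for the Ames data, not necessarily the exact selection used in the notebook:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Load the Ames data and pick a handful of numeric features (illustrative choice)
ames = pd.read_csv('train.csv')
features = ['GrLivArea', 'OverallQual', 'GarageArea', 'TotalBsmtSF', 'YearBuilt']
X, y = ames[features], ames['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Impute missing values, then bring all features onto the same scale
preprocess = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])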
Now, let's fit a Linear Regressor and observe the results.
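A sketch of this step, reusing the preprocess pipeline and the train/test split from above:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Chain the preprocessing steps with the regressor
lr = Pipeline([('prep', preprocess), ('model', LinearRegression())])
lr.fit(X_train, y_train)

print('Train MAE:', mean_absolute_error(y_train, lr.predict(X_train)))
print('Test MAE:', mean_absolute_error(y_test, lr.predict(X_test)))
print('Train R^2:', lr.score(X_train, y_train))
print('Test R^2:', lr.score(X_test, y_test))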
The testing score is significantly lower than the training score, indicating overfitting. Next, let's explore Ridge:
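A sketch of that step; the alpha value here is just an arbitrary small choice:

from sklearn.linear_model import Ridge

# Same pipeline as before, with Ridge and a small alpha
ridge = Pipeline([('prep', preprocess), ('model', Ridge(alpha=0.1))])
ridge.fit(X_train, y_train)
print('Train R^2:', ridge.score(X_train, y_train))
print('Test R^2:', ridge.score(X_test, y_test))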
We observe nearly identical results for Ridge, as the alpha value we selected is too small. Notably, in the sklearn API, the hyperparameter \( \lambda \) is represented as alpha, so do not confuse the two.
Rather than testing alpha values by hand, we can use RidgeCV, which evaluates a grid of alphas with cross-validation, much like GridSearchCV.
We will pass a range of alpha values from 1 to 100, stepping by 5, under a 10-fold cross-validation. After fitting, we can identify the optimal alpha using the .alpha_ attribute:
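A sketch of that search, fitting RidgeCV on the preprocessed training features (names reused from above):

import numpy as np
from sklearn.linear_model import RidgeCV

# Try alphas 1, 6, 11, ..., 96 with 10-fold cross-validation
ridge = RidgeCV(alphas=np.arange(1, 100, 5), cv=10)
ridge.fit(preprocess.fit_transform(X_train), y_train)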
>>> ridge.alpha_
86
Finally, we will evaluate Ridge with this hyperparameter and compare it to Linear Regression.
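A sketch of the comparison, plugging the cross-validated alpha back into the pipeline:

# Refit Ridge with the cross-validated alpha and compare against Linear Regression
best_ridge = Pipeline([('prep', preprocess), ('model', Ridge(alpha=ridge.alpha_))])
best_ridge.fit(X_train, y_train)
print('Ridge test MAE:', mean_absolute_error(y_test, best_ridge.predict(X_test)))
print('Linear test MAE:', mean_absolute_error(y_test, lr.predict(X_test)))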
Unfortunately, even with the best alpha value, we see nearly identical results to Linear Regression. (It later became apparent that this dataset was not ideal for demonstrating Ridge and Lasso.)
Regularization with Lasso Regression
Lasso Regression shares many similarities with Ridge, with a minor distinction in the cost function: instead of squaring each coefficient, Lasso employs their absolute values. The rest of the process remains largely unchanged. Let's examine the Lasso regressor using a new dataset, the built-in diamonds dataset from Seaborn.
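In the same notation as the Ridge cost above, only the penalty term changes:

\[ \text{Cost}_{\text{Lasso}} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \left|\beta_j\right| \]

With that in place, we load the data: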
import seaborn as sns

# Load the diamonds dataset that ships with Seaborn and inspect the first rows
diamonds = sns.load_dataset('diamonds')
diamonds.head()
Using all features, we will predict the price with Lasso. We will import Lasso and LassoCV similarly and determine the best alpha value through cross-validation.
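The exact preprocessing is not shown here, so the following is a minimal sketch under the assumption that the categorical columns (cut, color, clarity) are one-hot encoded and the split mirrors the earlier one:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LassoCV

# One-hot encode the categorical columns so every feature is numeric
X = pd.get_dummies(diamonds.drop('price', axis=1), drop_first=True)
y = diamonds['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# LassoCV picks alpha along its own regularization path via 10-fold cross-validation
lasso_cv = LassoCV(cv=10)
lasso_cv.fit(X_train, y_train)
print('Best alpha:', lasso_cv.alpha_)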
It seems that selecting a very low alpha yields satisfactory results. Let's proceed to fit the model with this value and assess its performance.
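A sketch of that step, reusing lasso_cv and the split from above:

# Fit Lasso with the alpha chosen by cross-validation
lasso = Lasso(alpha=lasso_cv.alpha_)
lasso.fit(X_train, y_train)
print('Train R^2:', lasso.score(X_train, y_train))
print('Test R^2:', lasso.score(X_test, y_test))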
Lasso regression performed admirably. A notable advantage of Lasso is its capability for feature selection: because the absolute-value penalty can shrink coefficients all the way down to exactly zero, insignificant parameters are effectively dropped from the model. We can illustrate this by plotting the fitted Lasso coefficients using its coef_ attribute:
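One possible way to draw that plot (a sketch; the exact picture depends on the preprocessing above):

import matplotlib.pyplot as plt

# Pair each coefficient with its feature name and show them on a horizontal bar chart
coefs = pd.Series(lasso.coef_, index=X_train.columns).sort_values()
coefs.plot(kind='barh', figsize=(8, 6), title='Fitted Lasso coefficients')
plt.tight_layout()
plt.show()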
As demonstrated, only the 'carat' feature appears to significantly influence diamond pricing, with all other coefficients being reduced to nearly zero.
This highlights the primary practical difference between Ridge and Lasso: Lasso can shrink coefficients exactly to zero, whereas Ridge only shrinks them toward zero.
Conclusion
In summary, Ridge and Lasso are two of the most important regularization techniques available. Both address the shortcomings of standard linear regression by introducing a small amount of bias in exchange for a large reduction in variance, thus preventing overfitting. The hyperparameter \( \lambda \) controls how strongly the coefficients are penalized, and therefore how severe the regularization is.
While we have covered a lot, we have not delved deeply into the complex mathematics that underpin these elegant algorithms. Therefore, I am leaving a few references to resources that can provide further insight:
- StatQuests on Ridge and Lasso
- The Mathematics Behind Linear, Ridge, and Lasso Regression
- Understanding the Math Behind Ridge and Lasso Regularization
- Lasso Regression and its Sparsity vs. Ridge Regression — Exploring the Math
This video tutorial covers Ridge and Lasso Regression using Python and SciKit Learn, focusing on regularization methods and their applications.
In this second video, you'll find a detailed walkthrough of Ridge and Lasso Regression using Python and Sklearn, which is beneficial for practical implementation.