Linear and Logistic regressions are usually the first modeling algorithms that people learn for Machine Learning and Data Science. Both are great since they’re easy to use and interpret. However, their inherent simplicity also comes with a few drawbacks and in many cases they’re not really the best choice of regression model. There are in fact several different types of regressions, each with their own strengths and weaknesses.
Regression is a technique used to model and analyze the relationships between variables, and often how they jointly contribute to producing a particular outcome. A linear regression is a regression model composed entirely of linear terms. Beginning with the simple case, Single Variable Linear Regression is a technique used to model the relationship between a single input independent variable (feature variable) and an output dependent variable using a linear model, i.e., a line.
The more general case is Multi Variable Linear Regression where a model is created for the relationship between multiple independent input variables (feature variables) and an output dependent variable. The model remains linear in that the output is a linear combination of the input variables. We can model a multi-variable linear regression as the following:
Y = a_1*X_1 + a_2*X_2 + a_3*X_3 + … + a_n*X_n + b
Where the a_n are the coefficients, the X_n are the variables, and b is the bias. As we can see, this function does not include any non-linearities, so it is only suited to modeling data where the output depends linearly on the inputs. It is quite easy to understand, as we are simply weighing the importance of each feature variable X_n using the coefficient weights a_n. We determine these weights a_n and the bias b using Stochastic Gradient Descent (SGD).
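As a minimal sketch of what this fitting looks like in practice, here is how one might fit a multi-variable linear model with scikit-learn's SGDRegressor. The synthetic data, hyperparameter values, and variable names below are illustrative assumptions, not part of the original text:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Illustrative synthetic data: y = 3*x1 - 2*x2 + 1, plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(scale=0.1, size=200)

# Fit the weights a_n and the bias b with Stochastic Gradient Descent
model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
model.fit(X, y)

print(model.coef_)       # learned coefficients a_1, a_2 (roughly [3, -2])
print(model.intercept_)  # learned bias b (roughly 1)
```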
When we want to create a model that can handle non-linear data, we need to use polynomial regression. In this regression technique, the best fit line is not a straight line; it is a curve that fits the data points. In a polynomial regression, the power of some independent variables is greater than 1. For example, we can have something like:
Y = a_1*X_1 + a_2*(X_2)² + a_3*(X_3)⁴ + … + a_n*X_n + b
We can have some variables with exponents, others without, and we can also select the exact exponent we want for each variable. However, selecting the exact exponent of each variable naturally requires some knowledge of how the data relates to the output.
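As a rough sketch of how this is commonly done in practice, one can expand the inputs into polynomial terms and then fit an ordinary linear regression on top of them. The degree and the synthetic data below are assumptions chosen purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data following a quadratic relationship: y = 2*x² - x + 3
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2 * X[:, 0] ** 2 - X[:, 0] + 3 + rng.normal(scale=0.2, size=100)

# Expand X into polynomial terms (x, x²), then fit a linear model on them
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(X, y)
print(model.named_steps["linearregression"].coef_)  # roughly [-1, 2]
```

Note that choosing the degree here plays the same role as choosing the exponents above: it encodes an assumption about how the data relates to the output.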
A standard linear or polynomial regression will fail in the case of high collinearity among the feature variables. Collinearity is the existence of near-linear relationships among the independent variables. To see how Ridge Regression helps, it is useful to first look at the optimization objective of a standard linear regression:
min || Xw - y ||²
Where X represents the feature variables, w represents the weights, and y represents the ground truth. Ridge Regression is a remedial measure taken to alleviate collinearity amongst regression predictor variables in a model. Collinearity is a phenomenon in which one feature variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. When the feature variables are correlated in this way, the final regression model is quite restricted and rigid in its approximation, i.e., it has high variance.
To alleviate this issue, Ridge Regression adds a small squared bias factor (an L2 penalty on the weights) to the optimization objective:
min || Xw - y ||² + z|| w ||²
Such a squared bias factor pulls the feature variable coefficients away from this rigidness, introducing a small amount of bias into the model in exchange for a large reduction in variance.
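A minimal sketch of this with scikit-learn's Ridge follows. The regularization strength alpha plays the role of z above; its value and the synthetic near-collinear data are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative collinear features: x2 is nearly a linear function of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # near-collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 + 3 * x2 + rng.normal(scale=0.1, size=200)

# The L2 penalty (alpha ~ z) shrinks the weights and tames the variance
model = Ridge(alpha=1.0)
model.fit(X, y)
print(model.coef_, model.intercept_)
```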
Lasso Regression is quite similar to Ridge Regression in that both techniques have the same premise: we are again adding a biasing term to the regression optimization function in order to reduce the effect of collinearity, and thus the model variance. However, instead of using a squared bias like Ridge Regression, Lasso uses an absolute value bias:
min || Xw - y ||² + z|| w ||
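A corresponding sketch with scikit-learn's Lasso; again, alpha corresponds to z and its value, along with the synthetic data, is just an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Same style of illustrative near-collinear data as in the ridge sketch
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2])
y = 2 * x1 + 3 * x2 + rng.normal(scale=0.1, size=200)

# The L1 penalty can drive some coefficients exactly to zero,
# effectively selecting one of the two redundant features
model = Lasso(alpha=0.1)
model.fit(X, y)
print(model.coef_, model.intercept_)
```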
ElasticNet is a hybrid of Lasso and Ridge Regression. It uses both L1 and L2 regularization, taking on the effects of both techniques:
min || Xw - y ||² + z_1|| w || + z_2|| w ||²
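A final sketch with scikit-learn's ElasticNet. In this API a single alpha sets the overall penalty strength and l1_ratio splits it between the L1 (z_1) and L2 (z_2) terms; the values and data below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Same style of illustrative near-collinear data as in the previous sketches
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2])
y = 2 * x1 + 3 * x2 + rng.normal(scale=0.1, size=200)

# alpha sets the total penalty strength; l1_ratio balances the
# L1 (z_1) and L2 (z_2) terms against each other
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_, model.intercept_)
```

Tuning l1_ratio toward 1 makes the model behave more like Lasso, while tuning it toward 0 makes it behave more like Ridge.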