If you’ve read some of my other posts, like the ones on lasso or ridge regression, you might be able to skip this one. I’m going to review the basics and then focus on how to do multiple linear regression.

## Ordinary Least Squares

Similar to a Pearson correlation, linear regression gives us a tool for interpreting relationships between variables, and a more powerful one. The basics of linear regression start with $Y = mx + b$. Remember this formula that you’ve seen a million times? If you can understand this formula then you understand linear regression; you just might not know it yet. When drawing the regression line, we can define the model as:

$$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$$

It’s the same formula as earlier - the i’s stand for individual observations, $\beta_0$ is our intercept (the $b$), $\beta_1$ is our slope (the $m$), and $\epsilon_i$ is our error term. Once we have estimates for the betas, the estimated value of the model for each observation is:

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1X_i$$

Don’t worry about the hat - it means estimate. If you don’t have it in a stats class, you’ll get marked down for it. I might forget it occasionally, but the distinction is simple: $Y_i$ is what we actually observed and $\hat{Y}_i$ is what the model predicts. The residual, our estimate of the error term, is the difference between the actual observation and what our model predicts:

$$\hat{\epsilon}_i = Y_i - \hat{Y}_i$$

We’re ready, right? Just one sec… how do we get the $\beta$ values? Just like everything else in statistics, we have to estimate them. The goal in estimating the betas is to minimize the sum of squared residuals, $\sum_i(Y_i-\hat{Y}_i)^2$. There’s more than one way to skin a cat, or rather a statistic in this case. The way I’ll discuss here is the one taught religiously in classes across the nation: ordinary least squares (OLS). The estimated slope is equal to the covariance between X and Y divided by the variance of X:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n(X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^n(X_i - \overline{X})^2}$$

We also have to calculate the intercept, which is simply $\hat{\beta}_0 = \overline{Y} - \hat{\beta}_1\overline{X}$: the mean of the dependent variable minus the estimated slope multiplied by the mean of the independent variable.
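
If you’re wondering where those two formulas come from, here’s a quick sketch of the derivation (my addition, using the same symbols as above). Write the sum of squared residuals as a function of the two betas, take the derivative with respect to each one, and set both to zero:

$$S(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1X_i)^2$$

$$\frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^n (Y_i - \beta_0 - \beta_1X_i) = 0 \quad\Rightarrow\quad \hat{\beta}_0 = \overline{Y} - \hat{\beta}_1\overline{X}$$

$$\frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^n X_i(Y_i - \beta_0 - \beta_1X_i) = 0 \quad\Rightarrow\quad \hat{\beta}_1 = \frac{\sum_{i=1}^n(X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^n(X_i - \overline{X})^2}$$

The last step plugs the intercept result into the second equation and uses the fact that deviations from a mean sum to zero, so $\sum_i \overline{X}(Y_i - \overline{Y}) = 0$ and $\sum_i \overline{X}(X_i - \overline{X}) = 0$.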

So now we can write out the full equation!

$$Y = \beta_0 + \beta_1X_1 + \epsilon$$

Let’s try this in R - let’s say we’re interested in seeing if sepal length and width are associated.

```
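# Load the built-in iris data set
data(iris)
# Regress sepal length (dependent) on sepal width (independent)
lm.model <- lm(Sepal.Length ~ Sepal.Width, data = iris)
summary(lm.model)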
```

```
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.5561 -0.6333 -0.1120  0.5579  2.2226
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   6.5262     0.4789   13.63   <2e-16 ***
## Sepal.Width  -0.2234     0.1551   -1.44    0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8251 on 148 degrees of freedom
## Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
## F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
```

A little about the function: you put the y variable (the dependent variable) first, then a ~, followed by your independent variable. So, is there a relationship between sepal length and width? Not a significant one: the p-value for the slope is 0.152, well above the usual 0.05 cutoff. The $R^2$ is 0.0138, meaning width explains only about 1% of the variance in length. $R^2$ tells us how well our model fits the data, and can be interpreted as the % of variance explained by the model.
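
To tie the formulas back to the output, here’s a minimal sketch that computes the slope, intercept, and $R^2$ by hand and compares them with `lm()`. The variable names (`beta0`, `beta1`, `lm_check`, and so on) are my own, and the $R^2$ formula used is the standard one minus residual sum of squares over total sum of squares.

```r
# Hand-computed OLS for Sepal.Length ~ Sepal.Width (iris ships with base R)
data(iris)
x <- iris$Sepal.Width   # independent variable
y <- iris$Sepal.Length  # dependent variable

# Slope: sum of cross-products over sum of squares (covariance / variance)
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: mean of y minus the slope times the mean of x
beta0 <- mean(y) - beta1 * mean(x)

# Fitted values and residuals from the hand-computed line
y_hat <- beta0 + beta1 * x
resid_by_hand <- y - y_hat

# R^2 = 1 - (residual sum of squares / total sum of squares)
r_squared <- 1 - sum(resid_by_hand^2) / sum((y - mean(y))^2)

# Compare against what lm() reports
lm_check <- lm(Sepal.Length ~ Sepal.Width, data = iris)
print(c(intercept = beta0, slope = beta1, r.squared = r_squared))
print(coef(lm_check))
print(summary(lm_check)$r.squared)
```

The hand-computed numbers should line up with the Estimate column and the Multiple R-squared line in the summary above.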

## Multiple Linear Regression

Let’s make things a little more complicated. What if we want to include more things in our model, like petal width? The equation is now:

$$Y = \beta_0 + \beta_1X_{1} + \beta_2X_{2} + \epsilon$$

Where $Y$ is still the dependent variable, each $X_j$ is an explanatory variable in the model, and each $\beta_j$ is the slope coefficient for that variable. $\beta_0$ is still the intercept and $\epsilon$ is still the error term.

If you wanted to add more variables, a general version looks like:

$$Y = \beta_0 + \beta_1X_{1} + \beta_2X_{2} + \dots + \beta_pX_p + \epsilon$$

Where p is the number of explanatory variables.

In R, it’s simple enough:

```
# Add petal width as a second explanatory variable
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
summary(fit)
```

```
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.2076 -0.2288 -0.0450  0.2266  1.1810
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.45733    0.30919   11.18  < 2e-16 ***
## Sepal.Width  0.39907    0.09111    4.38 2.24e-05 ***
## Petal.Width  0.97213    0.05210   18.66  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4511 on 147 degrees of freedom
## Multiple R-squared: 0.7072, Adjusted R-squared: 0.7033
## F-statistic: 177.6 on 2 and 147 DF, p-value: < 2.2e-16
```

Here we can see that petal width added a lot in terms of model fit: the $R^2$ is now 0.7072, meaning our model explains about 70% of the variance in sepal length. We also see that each independent variable is significantly associated with the dependent variable.
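
With more than one explanatory variable, the covariance-over-variance shortcut no longer applies directly. The standard textbook generalization (not derived in this post) is the matrix form $\hat{\beta} = (X^TX)^{-1}X^TY$. Here’s a rough sketch, with my own variable names, that solves it for the same model and compares the result with `lm()`; the `new_flower` measurements are made up purely for illustration.

```r
data(iris)
y <- iris$Sepal.Length

# Design matrix: a column of 1s for the intercept plus one column per predictor
X <- model.matrix(~ Sepal.Width + Petal.Width, data = iris)

# Solve the normal equations (X'X) beta = X'y for beta
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Same model fitted the usual way, for comparison
fit_check <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
print(cbind(by_hand = drop(beta_hat), lm = coef(fit_check)))

# Predicted sepal length for a hypothetical flower (made-up measurements)
new_flower <- data.frame(Sepal.Width = 3.0, Petal.Width = 1.5)
predicted <- predict(fit_check, newdata = new_flower)
print(predicted)
```

In practice `lm()` uses a QR decomposition instead of solving the normal equations directly, which is numerically safer, so treat this as an illustration of the idea and not as how R does it internally.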

Assumptions:

- Linear relationship between independent and dependent variables
- The independent variables are not highly correlated
- Observations are randomly sampled
- Residuals are normally distributed
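
A couple of these can be checked roughly in R. The sketch below is one way to do it, not the only one: the choice of a Pearson correlation for the predictors and a Shapiro-Wilk test for the residuals is my own, and the variable names are for illustration.

```r
data(iris)
fit_check <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)

# Independent variables not highly correlated:
# Pearson correlation between the two predictors (closer to 0 is better)
predictor_cor <- cor(iris$Sepal.Width, iris$Petal.Width)
print(predictor_cor)

# Residuals normally distributed:
# Shapiro-Wilk test, where a small p-value suggests non-normal residuals
normality <- shapiro.test(residuals(fit_check))
print(normality)
```

For linearity, the usual eyeball check is a plot of residuals against fitted values, which `plot(fit_check, which = 1)` draws. Random sampling is a property of how the data were collected, so no test on the data itself can confirm it.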

*In the next post, I’ll discuss MANOVAs.*