Frequentist Statistics 1

Means, variance, and distributions

Props

This document draws from outside sources because reinventing the wheel isn’t what I’m about. The sources are listed below:

Danielle Navarro (d.nvarro@unsw.edu.au): https://learningstatisticswithr.com/lsr-0.6.pdf

I’m making a series of posts summarizing basics in statistics for myself and other people who are looking for a quick refresher. I’m filtering out a lot of things that I don’t think practically matter to how we do science. Not to say those things don’t matter, more to say they don’t affect my day-to-day.

Intro

Why do we need statistics? We need to make objective conclusions about the world. The human brain is rife with biases that can lead to incorrect conclusions. Statistics helps us come to a consensus about what the “true” answer to a question is.

In 1973, the University of California, Berkeley was worried about possible gender bias in admissions to its graduate programs. Overall, 44% of male applicants were admitted versus 35% of female applicants, and a 9-percentage-point gap is definitely a risk for getting sued. At a surface level, it looks like systematic discrimination is occurring.

             Number of applicants   Percent admitted
 Males              8442                  44%
 Females            4321                  35%

Table: Admission figures for the six largest departments by gender

 Department   Male Applicants   Male Percent Admitted   Female Applicants   Female Percent Admitted

     A              825                  62%                   108                    82%
     B              560                  63%                    25                    68%
     C              325                  37%                   593                    34%
     D              417                  33%                   375                    35%
     E              191                  28%                   393                    24%
     F              373                   6%                   341                     7%

When we actually break it down by department, we can see that some departments are much more competitive than others, and that more females than males applied to the competitive ones. Within most departments, females have admission rates equal to or higher than males, so if anything it looks like there’s a slight bias in favor of females. This legally clears the university, but it doesn’t answer the interesting question of why females were applying to the more competitive programs.

This effect is known as Simpson’s paradox (no relation to the show).
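R actually ships this dataset as UCBAdmissions (covering only the six largest departments, so the aggregate rates come out slightly different from the campus-wide figures above). A quick sketch of the paradox in code:

```r
data(UCBAdmissions)  # built-in 1973 Berkeley admissions table: Admit x Gender x Dept

# Aggregated over departments, males look favored
overall <- apply(UCBAdmissions, c(1, 2), sum)
round(overall["Admitted", ] / colSums(overall), 3)
##   Male Female 
##  0.445  0.304

# Broken down by department, the gap mostly disappears or reverses
admit_rate <- UCBAdmissions["Admitted", , ] / apply(UCBAdmissions, c(2, 3), sum)
round(admit_rate, 2)
```

The per-department matrix matches the table above: in departments A, B, D, and F the female admission rate is the higher one.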

Mean

You’ve definitely seen this before… The mean is simply every observation summed, divided by the number of observations. This can be mathematically represented like this:

$$\overline{X} = \frac{1}{N}\sum_{i=1}^N X_i$$

In R:

data(iris) # load Ronald Fisher's iris dataset
mean(iris$Sepal.Length) #take mean of sepal length
## [1] 5.843333
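The formula maps directly onto code; computing it by hand gives the same answer as mean():

```r
data(iris)
x <- iris$Sepal.Length
sum(x) / length(x)  # sum of observations divided by N, identical to mean(x)
## [1] 5.843333
```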

Median

It’s the middle number. If there isn’t a single middle number, you take the mean of the two middle numbers.

In R:

median(iris$Sepal.Length)
## [1] 5.8
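For an even number of observations, the "mean of the two middle numbers" rule kicks in:

```r
median(c(2, 4, 6, 8))  # even count: mean of the middle values 4 and 6
## [1] 5
```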

Median v. Mean

The mean is sensitive to outliers in your data, while the median is not. Taking the median doesn’t actually use much information about the numbers, it just finds the middle one, while the mean uses every value in the dataset. If your median is very different from your mean, this can be an indication that there are outliers in your dataset. You should then graph your data to check and apply the appropriate filters. In the iris dataset, the mean and median are comparable, so there probably aren’t any extreme outliers.
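A toy example (made-up numbers) showing how a single outlier drags the mean but not the median:

```r
x <- c(2, 3, 3, 4, 4, 5)
c(mean = mean(x), median = median(x))
##   mean median 
##    3.5    3.5 

x_out <- c(x, 100)  # add one extreme outlier
c(mean = mean(x_out), median = median(x_out))
##      mean    median 
## 17.285714  4.000000
```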

Variability

There are several metrics for this, like the range (largest value minus smallest value) and the interquartile range (75th percentile minus 25th percentile). They can be useful, but to be honest I rarely use them, so I’m going to skip some of the build-up.

In R:

range(iris$Sepal.Length) 
## [1] 4.3 7.9
quantile(iris$Sepal.Length)
##   0%  25%  50%  75% 100% 
##  4.3  5.1  5.8  6.4  7.9

What range() is doing is returning the minimum and maximum, and that’s why you see two numbers; the range itself is 7.9 − 4.3 = 3.6. quantile() returns the percentiles, so the interquartile range is the 75th percentile minus the 25th: 6.4 − 5.1 = 1.3.
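R can also collapse these into single numbers, with diff() on the range and the built-in IQR() function:

```r
data(iris)
diff(range(iris$Sepal.Length))  # the range as one number: 7.9 - 4.3
## [1] 3.6
IQR(iris$Sepal.Length)          # interquartile range: 6.4 - 5.1
## [1] 1.3
```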

Variance

Variance is simply the sum of squares $\sum_{i=1}^N (X_{i} - \overline{X})^2$ divided by the number of observations. What’s all the rage about variance? It has some nifty statistical properties that prove useful. One of them is that variances of independent variables are additive: if you have two independent variables with Var(A) = 125.67 and Var(B) = 340.33, and we create a variable C = A + B, the variance of C would be 466.
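A quick simulation (made-up parameters) showing the additivity property for independent variables:

```r
set.seed(42)                # for reproducibility
A <- rnorm(1e5, sd = 3)     # Var(A) should be near 9
B <- rnorm(1e5, sd = 5)     # Var(B) should be near 25
var(A) + var(B)             # sum of the individual variances, ~34
var(A + B)                  # nearly identical, since A and B are independent
```

If A and B were correlated, var(A + B) would pick up an extra 2·Cov(A, B) term and the two numbers would no longer match.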

We can write the whole equation of variance as:

$$Var(X) = \frac{1}{N-1}\sum_{i=1}^N (X_{i} - \overline{X})^2$$

Where did that −1 come from? It’s a correction we apply to our estimate of the population variance. Dividing by N describes the sample itself, but as an estimate of the population variance it comes out systematically too small, because the sample mean sits closer to the sample values than the true population mean does. Since it’s nearly impossible to ever know the true population variance, we divide by N − 1 to correct for this when estimating it from a sample.
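We can see the underestimation directly with a simulation (made-up parameters): draw many small samples from a population whose variance we know is 4, and compare the two divisors.

```r
set.seed(1)
samp <- replicate(10000, rnorm(5, sd = 2))  # 10,000 samples of size 5; true variance is 4

# dividing by N: systematically underestimates the population variance
biased <- apply(samp, 2, function(x) sum((x - mean(x))^2) / length(x))
# var() divides by N - 1 and lands near the truth on average
unbiased <- apply(samp, 2, var)
c(divide_by_N = mean(biased), divide_by_N_minus_1 = mean(unbiased))
```

The divide-by-N average comes out around 3.2 (too small), while the divide-by-(N − 1) average sits near the true value of 4.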

Variance is all fine and dandy, but what does it mean anyway? It’s hard to interpret directly because it’s in squared units of the original data. Obviously large variances can be bad, but not always. It does have very good uses for other metrics we’re about to cover.

In R:

var(iris$Sepal.Length)
## [1] 0.6856935

Standard Deviation

Standard deviation is basically an interpretable version of variance: it’s in the same units as the original data.

$$\hat{\sigma} = \sqrt{\frac{1}{N-1}\sum_{i=1}^N (X_{i} - \overline{X})^2}$$

It’s simply the square root of the variance equation.

In R:

sd(iris$Sepal.Length)
## [1] 0.8280661

Which is the same as:

sqrt(var(iris$Sepal.Length))
## [1] 0.8280661

Now I’m assuming that if you’ve done a little bit of research, you have a mild understanding of and ability to interpret the standard deviation. The usual interpretation relies on the assumption that our data is normally distributed (I’ll come back to this). This allows us to say that roughly 68% of observations will lie within one standard deviation of the mean, in either direction.
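We can sanity-check the 68% figure by simulating standard normal draws and counting how many fall within one standard deviation:

```r
set.seed(7)
x <- rnorm(1e5)   # draws from a standard normal (mean 0, sd 1)
mean(abs(x) < 1)  # proportion within one sd of the mean, roughly 0.68
```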

We would assume our distribution to look something like this:

set.seed(123)                        # for reproducibility
y <- rnorm(800, mean = 0, sd = 5.5)  # simulate normally distributed data
hist(y, prob = TRUE, breaks = 20)
curve(dnorm(x, mean(y), sd(y)), add = TRUE, col = "darkblue", lwd = 2)

In the next post, I’ll discuss normality, measures of normality, and sampling.

Mohan Gupta
Psychology PhD Student

My research interests include the testing effect, statistical modelling, and artificial intelligence.
