P-Values: Part II

I mentioned in the previous p-values post that sample size is an important factor in whether or not a statistical test yields a significant p-value.

In this post I want to demonstrate that even when we know two groups come from distributions with different means (averages), we can still get a non-significant p-value from the statistical test.

The Statistical Test

We will look at the results from what is called a t-test. This statistical test allows us to test whether two means are statistically equal or different. I will not go into the details of this test. I will just say that five factors go into the calculation for the test:

  1. group 1 mean
  2. group 2 mean
  3. variance (the variability of the data or how close it is to the mean)
  4. number of samples in group 1
  5. number of samples in group 2

Formally, this is called the two-tailed independent two-sample t-test for equal means, with equal variances and sample sizes. This kind of t-test is also known as the Student’s t-test. There are several variations of the t-test, but we will only look at this one.
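For the curious, the five factors above combine into a single t statistic. Here is a minimal sketch of that calculation for the equal-variance, equal-sample-size case (the helper name pooled_t is mine, not a standard function), checked against R’s built-in t.test:

```r
# Sketch of the equal-variance two-sample t statistic,
# assuming both groups have the same number of samples.
pooled_t <- function(a, b) {
  n  <- length(a)                    # samples per group (length(a) == length(b))
  sp <- sqrt((var(a) + var(b)) / 2)  # pooled standard deviation
  (mean(a) - mean(b)) / (sp * sqrt(2 / n))
}

a <- c(52, 48, 55, 47, 50)
b <- c(41, 44, 38, 42, 40)
pooled_t(a, b)  # same value as t.test(a, b, var.equal = TRUE)$statistic
```

The bigger the mean difference and the bigger n, the bigger the t statistic, and the smaller the p-value.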

Let’s imagine we have 3 cases, each with 2 groups, group A and group B. We assume the variance is equal. What we need to find out is whether or not the means are equal for each of these 3 cases.

We have a null hypothesis: Group A and B means are equal.

And an alternative hypothesis: Group A and B means are not equal.

Case 1:

Group A mean: 46.47646

Group B mean: 35.86039

Are the means different?

Case 2:

Group A mean: 47.01035

Group B mean: 37.97637

Are the means different?

Case 3:

Group A mean: 46.56544

Group B mean: 35.15722

Are the means different?

In each case, the group means look different by about 10. Yeah, I’d guess they are different. All group A means seem to be about 46 or 47 and Group B means are 35-37.

Let’s run the statistical test, the t-test. Remember, using a p-value of 0.05 as the reference: a p-value greater than 0.05 is not statistically significant (we cannot conclude the means are different) and a p-value less than 0.05 is statistically significant (we conclude the means are different). Here are the results:

Case 1 t-test:

p-value = 0.1505

Case 2 t-test:

p-value = 0.06799

Case 3 t-test:

p-value = 0.0003652

Hypothesis conclusions:

Case 1: With a p-value of 0.1505, we fail to reject the null hypothesis that the means are equal at a 0.05 significance level.

Case 2: With a p-value of 0.06799, we fail to reject the null hypothesis that the means are equal at a 0.05 significance level.

Case 3: With a p-value of 0.0003652, we reject the null hypothesis that the means are equal at a 0.05 significance level.

So only in case 3 are the means statistically different, at a 0.05 significance level. Case 2 was close, with a p-value of 0.06799 but it is still above the cutoff point. Case 1 wasn’t even close.

What’s happening?

In each case the means for each group are about the same (the group A mean is similar in all cases, same for group B) and we assumed the variance was equal. How do we get different results from the t-test?

Sample size!

I did not tell you the number of data points in each case.

The difference is that case 1 has 5 data points in each group, case 2 has 10 in each group, and case 3 has 20 in each group. I stated that the sample size (number of data points) in each group is a factor in the calculation of the t-test. This is a demonstration of what happens with different sample sizes.

For group A, case 1, I simulated 5 data points from a normal (bell curve) distribution with a mean of 50, and a standard deviation of 10 (variance of 100). For case 2, I added 5 more (10 total) data points from the same distribution. For case 3, I added 10 more (20 total) data points.

I did the exact same thing for group B, however with a mean of 40, and the same variance.

Note that this isn’t a very rigorous comparison; with different data points simulated from the same distributions, I would get different results for the t-test. I could even get a statistically significant result (p-value < 0.05) with only 5 samples in each group, or a non-significant result (p-value > 0.05) with 20 samples in each group. I am sure this demonstration would annoy most statisticians… actually, come to think of it, I’m a bit annoyed at myself. But it was only meant to demonstrate the effect of sample size on the result of the t-test, and it is kept as simple as possible in order to understand the concept. The concept that:

As the sample sizes increase, it becomes easier to detect differences in group means.
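One rough way to see this concept is to repeat the simulation many times and count how often the t-test detects the true mean difference. This uses the same setup as in this post (means 50 and 40, standard deviation 10); detect_rate is a hypothetical helper of mine, and the exact proportions will vary with the seed:

```r
set.seed(42)

# Proportion of simulated experiments (n per group) where the t-test
# returns p < 0.05, given true means 50 and 40 and sd 10.
detect_rate <- function(n, reps = 2000) {
  hits <- replicate(reps, {
    p <- t.test(rnorm(n, 50, 10), rnorm(n, 40, 10), var.equal = TRUE)$p.value
    p < 0.05
  })
  mean(hits)
}

detect_rate(5)   # with 5 per group, the difference is often missed
detect_rate(20)  # with 20 per group, it is detected far more often
```

Statisticians call this proportion the power of the test, and it grows with sample size.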

That is why we always prefer to have larger sample sizes. Researchers want enough samples to be able to obtain statistically significant results, but more samples mean higher costs for the study. Sometimes it is not possible to obtain enough funding for large sample sizes. A statistical test comparing two group means could end up with a non-significant difference just because the sample size was too small. Likewise, researchers don’t want to waste resources with unnecessarily large samples. There can be other problems with large sample sizes. Often, they can produce statistically significant (p-value < 0.05) results even when the effect size is biologically insignificant. Yes, another post in the future!

There are ways to estimate what sample size you need for your study, based on parameters such as level of desired significance, power, expected effect size, and standard deviation. This will be discussed in more detail in a future post.
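As a quick taste, base R’s power.t.test function (in the stats package) does exactly this kind of estimate. For the setup used in this post (a true difference of 10 with a standard deviation of 10), asking for 80% power at a 0.05 significance level:

```r
# Estimate the sample size needed per group to detect a mean
# difference of 10 (sd = 10) with 80% power at the 0.05 level.
power.t.test(delta = 10, sd = 10, sig.level = 0.05, power = 0.8)
# the 'n' in the output is the required sample size per group (about 17)
```

Note that case 3 above, with 20 samples per group, is just past this estimate, which fits with it being the only case that reached significance.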

R-code

To generate values from the normal distribution there is an R function, rnorm, which does just that. We enter the number of samples, the mean, and the standard deviation. Note that I have ‘set the seed’ for the random number generation. This allows us to repeat the simulation with the same ‘random’ numbers. I know, then they aren’t really random… they are pseudo-random. Don’t worry about that. The rest of the code is pretty self-explanatory.

set.seed(1234)

# Case 1: 5 data points per group
small_a <- rnorm(5, 50, 10)  # group A: mean 50, sd 10
small_b <- rnorm(5, 40, 10)  # group B: mean 40, sd 10

# Case 2: add 5 more (10 total per group)
med_a <- c(small_a, rnorm(5, 50, 10))
med_b <- c(small_b, rnorm(5, 40, 10))

# Case 3: add 10 more (20 total per group)
large_a <- c(med_a, rnorm(10, 50, 10))
large_b <- c(med_b, rnorm(10, 40, 10))

mean(small_a)
mean(small_b)
mean(med_a)
mean(med_b)
mean(large_a)
mean(large_b)

t.test(small_a, small_b, var.equal = TRUE)
t.test(med_a, med_b, var.equal = TRUE)
t.test(large_a, large_b, var.equal = TRUE)

Conclusion

Sample size is an important factor in statistical tests.

As the sample sizes increase, it becomes easier to detect differences in group means.

P-values are just a part of the statistical analysis.

I’ll finish with a quote by the statistician Gene Glass:

Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude – not just, does a treatment affect people, but how much does it affect them.
