P-values: Part I

P-value: just the thought of it can elicit hope and fear from researchers everywhere. P-values are all over the scientific literature. To researchers, a p-value greater than 0.05 can seem to show that their experiment was a waste of time, whereas a p-value less than 0.05 can seem to mean that their hypothesis was correct. They proved their theory!

Careers are built on p-values.

But what does a p-value mean? Why 0.05? Does it prove anything? Why is it everywhere?

The p-value has a rather technical definition. And in order to understand the definition you have to understand many other concepts that lie behind the p-value. Because of this, p-values are often misinterpreted or interpreted in a sloppy way, even in scientific literature.

In this blog post I hope that you will come to an understanding of what the p-value is and isn’t, what it shows and doesn’t show, some problems with the p-value, and how it can be interpreted. I am not aiming for a highly technical understanding, and I am not going to tell you how to calculate it. I wish to convey an understanding which will allow you to read studies and interpret the p-values in the proper way.

P-value: a non-technical definition

A value which, in a standardized way, quantifies the level of significance of a statistical test based on the data.

I think that definition may be as simple as it can get, but we need to clear up three things.

Standardized: By this I mean that the p-value is always a value from 0 to 1. It is a probability. A p-value from one statistical test can (usually) be interpreted in the same general way as one from another statistical test.

Level of significance: For p-values, one concept that you must understand is that smaller p-values represent higher levels of significance and higher p-values represent lower levels of significance.

Read that a couple times.

A p-value of 0.001 is highly significant whereas a p-value of 0.42 is not significant at all.

Statistical Test: In a statistical test, we need a hypothesis and data which should give us evidence for or against that hypothesis. We have a sample of data, i.e. we are not collecting data on the entire population of interest, just a part of it. The statistical test determines whether our evidence (data) supports rejecting the null hypothesis. The null hypothesis is always the ‘conservative’ or ‘default’ hypothesis: that there is no relationship or no difference between the groups or variables we are measuring. We assume the null hypothesis is true. If the result of the statistical test is a small p-value, usually less than 0.05, then we have significant evidence to reject the null hypothesis. If the p-value is greater than 0.05, then we say that we fail to reject the null hypothesis.
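If that feels abstract, here is a minimal sketch in Python of the reject / fail-to-reject decision. The data, group sizes, and means are invented purely for illustration; only the 0.05 cutoff comes from the convention described above.

```python
# A minimal sketch of a statistical test (the data and groups here
# are invented for illustration, not from any real study).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
control = rng.normal(loc=10.0, scale=2.0, size=30)    # sampled "no effect" group
treatment = rng.normal(loc=11.5, scale=2.0, size=30)  # group with a shifted mean

# Null hypothesis: the two groups have the same mean.
t_stat, p_value = stats.ttest_ind(control, treatment)

if p_value < 0.05:
    print(f"p = {p_value:.3f}: significant, reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: not significant, fail to reject the null hypothesis")
```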

From the level of significance concept stated above, a higher level of significance (smaller p-value) means more support for rejecting the null hypothesis, and similarly, a lower level of significance (larger p-value) means less support for rejecting it.

There are a great many statistical tests, so it is definitely a good thing that we have the p-value as a standard way to interpret them.

Now let’s repeat the definition:

A value which, in a standardized way, quantifies the level of significance of a statistical test based on the data.

Hopefully it makes more sense now.

P-value: a technical definition

To confuse you, here are two technical definitions of the p-value.

The p-value is defined as the probability, under the assumption of the null hypothesis, of obtaining a result equal to or more extreme than what was actually observed. (Wikipedia)

P-value: The conditional probability that an observed effect or one larger is due to chance given that the null hypothesis is true. (from Clinical Trials: A Methodological Perspective)

Got it?
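If not, one way to make the definition concrete is to compute a p-value directly from it. The sketch below is a toy simulation of my own invention: the null hypothesis is that a coin is fair, and the (made-up) observation is 16 heads in 20 flips. The estimated p-value is simply the fraction of results, generated under the null hypothesis, that are at least as extreme as the observed one.

```python
# Estimating a p-value straight from its definition, by simulation.
# Null hypothesis: the coin is fair (p = 0.5).
# Invented observation: 16 heads out of 20 flips.
import numpy as np

rng = np.random.default_rng(seed=42)
observed_heads = 16
n_flips = 20
n_sims = 100_000

# Generate head counts under the null hypothesis.
null_heads = rng.binomial(n=n_flips, p=0.5, size=n_sims)

# Two-sided "equal to or more extreme": at least as far from the
# expected 10 heads as the observed count is.
extreme = np.abs(null_heads - 10) >= abs(observed_heads - 10)
p_value = extreme.mean()
print(f"estimated p-value: {p_value:.4f}")  # roughly 0.012 for 16/20
```

Note that nothing here tells us how likely the null hypothesis itself is. We only asked how surprising the data would be if the null hypothesis were true.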

Why 0.05?

It is commonly accepted that p-values of less than 0.05 are ‘statistically significant’ and p-values of more than 0.05 are ‘not statistically significant.’ Why 0.05?

It is just the convention. There is nothing magic about the 0.05 level. It could be 0.035. And in fact, some disciplines prefer smaller p-values. Apparently one of the reasons is that Fisher, the originator of many statistical tests and concepts, thought that a 1-in-20 (0.05) level seemed sufficient.

Whatever the case, the 0.05 significance level is now commonly used as the cutoff point for significance/non-significance.

What the p-value isn’t, and what it doesn’t do

The p-value isn’t a lot of things, but commonly gets mistaken for them. Some of these are technical issues, but some are important concepts to understand for anyone wanting to interpret p-values. Here’s a list, which will probably be added to:

  • It is not the probability that the null or the alternative hypothesis is true or false.
    • A small p-value does not prove that the alternative hypothesis is true.
    • A large p-value does not prove that the null hypothesis is true.

In traditional (frequentist) statistics, we cannot get a direct probability of whether or not a hypothesis is true. Sorry.

  • It is not the probability that the results are due to chance or not due to chance.
  • It is not the probability that the same results would occur if we repeated the experiment.
  • It is not the probability that we will incorrectly reject the null hypothesis.
  • It is not a measure of effect size.
  • It does not indicate the direction of any association.
  • It is not the be-all and end-all of the experiment.

OK, so what is it?

Restating the second definition above, the p-value is:

Given that the null hypothesis is true, it is the conditional probability that an observed effect or one larger is due to chance.

Yes, it is confusing.

An Example of P-value Interpretation

Let’s look at an example from a clinical trial. The study is called “High-Dose N-Acetylcysteine in Stable COPD” (COPD stands for chronic obstructive pulmonary disease).

Here’s a quote from the results section:

“At 1 year, there was a significant improvement in forced expiratory flow 25% to 75% (P=.037).”

The P=.037 is a reported p-value.

In the materials and methods section they state that “eligible patients with COPD were randomly allocated to NAC 600 mg bid or placebo.” So those are the two groups we are comparing: the placebo (no treatment) group and the NAC treatment group. NAC is N-acetylcysteine, an amino acid with antioxidant/anti-inflammatory properties. The researchers state that it “may be beneficial in COPD.”

For their statistical test, what is the null hypothesis?

There is no difference in the outcome (forced expiratory flow) between the placebo and treatment groups.

And the alternative hypothesis?

There is a difference in the outcome between the placebo and treatment groups.

According to their p-value of 0.037, we can reject the null hypothesis at a significance level of 0.05. That is, since the observed p-value of 0.037 is less than the 0.05 cutoff, we can reject the null hypothesis.

Note that we haven’t proven the null hypothesis is false and we haven’t proven that the alternative is true. Nor have we calculated the probability that either one is true or false.

You may think, “but 0.037 is pretty close to 0.05. Doesn’t that show that NAC doesn’t really help much?”

Maybe, maybe not. A lot goes into calculating p-values. One important factor is the sample size, or the number of subjects in the experiment. Even if the treatment had a large beneficial effect, we might still get a large p-value (a non-significant result) if we had a small sample size.
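Here is a made-up simulation (not based on the NAC study) showing this: the true treatment effect is held fixed in every run, and only the number of subjects per group changes. With small samples, the same real effect often fails to reach p < 0.05; with large samples it usually does.

```python
# A made-up simulation (not from the NAC trial): the true effect is the
# same in every run; only the number of subjects per group changes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

def fraction_significant(n_per_group, n_trials=2000):
    """Fraction of simulated experiments that reach p < 0.05."""
    hits = 0
    for _ in range(n_trials):
        control = rng.normal(loc=0.0, scale=2.0, size=n_per_group)
        treated = rng.normal(loc=1.0, scale=2.0, size=n_per_group)  # real effect
        _, p = stats.ttest_ind(control, treated)
        if p < 0.05:
            hits += 1
    return hits / n_trials

for n in (5, 20, 100):
    print(f"n = {n:>3} per group: significant in "
          f"{fraction_significant(n):.0%} of simulated experiments")
```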

We’ll look more into what goes into calculating p-values in part II of the p-values series.

Conclusion

P-values:

A value which, in a standardized way, quantifies the level of significance of a statistical test based on the data.

or if you prefer the more technical version:

Given that the null hypothesis is true, it is the conditional probability that an observed effect or one larger is due to chance.

Smaller p-values represent higher levels of significance, and larger p-values represent lower levels of significance.

The convention is that p-values less than 0.05 are statistically significant and p-values larger than 0.05 are not statistically significant.

There is nothing magic about the 0.05 significance level.

They are not the be-all and end-all of research. There are arguably much more important metrics. Research should not be judged solely by the p-value.

They are commonly misinterpreted. Try not to.

I think the main takeaway should be that p-values are just a part of the statistical analysis and should not constitute the whole reason for deeming an experiment a success or failure. They should be used in concert with other metrics to determine the level of evidence for or against a hypothesis.


P.S.

It may come as a surprise, but the p-value, what it means and what it shows, is a very controversial topic, especially (well…probably mostly only) among statisticians. Statisticians seem to enjoy arguing about all the highly technical problems with p-values. I am sure some would argue that the definitions I’ve given aren’t exactly correct.

There is even at least one scientific journal that has banned p-value reporting. And there’s a whole branch of statistics, Bayesian statistics, which doesn’t even use p-values and which has become more widespread recently.

Oh, and if you ever want to get a statistician’s ears to perk up, start a sentence with: This p-value proves that…
