Nov 19, 2011

p-value truths and fallacies, according to Rothman and Greenland


Upper/Lower one-tailed p-value
The probability that the test statistic (such as a t-statistic or a chi-square statistic) will be greater than or equal to (upper) or less than or equal to (lower) its observed value, assuming that (a) the test hypothesis is correct and (b) there is no source of bias in the data collection or analysis processes. (A computational sketch follows the bullet below.)
  • must fall between 0 and 1
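As a minimal sketch (my own illustration, not from the book), here is how the two one-tailed P-values fall out of a test statistic's distribution, assuming a z-statistic that is standard normal under the test hypothesis:

    # One-tailed P-values for an observed z-statistic (illustrative sketch;
    # assumes the statistic is standard normal under the test hypothesis).
    from scipy.stats import norm

    z_obs = 1.7  # hypothetical observed z-statistic

    p_upper = norm.sf(z_obs)   # P(Z >= z_obs): upper one-tailed P-value (~0.0446)
    p_lower = norm.cdf(z_obs)  # P(Z <= z_obs): lower one-tailed P-value (~0.9554)

    # Each is a cumulative probability, so each falls between 0 and 1.
    print(p_upper, p_lower)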
The two-tailed p-value
Twice the smaller of the upper and lower P-values (see the sketch after the bullet below).
  • may exceed 1
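That doubling can push the two-tailed P-value past 1 when the test statistic is discrete, because both one-tailed values then include the probability of the observed outcome itself. A small binomial example (mine, not the book's):

    # Two-tailed P-value as twice the smaller one-tailed value (illustrative).
    # With a discrete statistic, both tails include P(X = x_obs), so the
    # doubled value can exceed 1.
    from scipy.stats import binom

    n, p0, x_obs = 4, 0.5, 2  # hypothetical tiny binomial experiment

    p_upper = binom.sf(x_obs - 1, n, p0)  # P(X >= 2) = 11/16
    p_lower = binom.cdf(x_obs, n, p0)     # P(X <= 2) = 11/16

    p_two_tailed = 2 * min(p_upper, p_lower)
    print(p_two_tailed)  # 1.375, i.e. greater than 1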
Properties of the p-value
A small P-value represents a low probability of getting a test statistic as extreme as, or more extreme than, the observed statistic, assuming that (a) the test hypothesis is correct and (b) there is no source of bias in the data collection or analysis processes. So a small P-value means that either an improbable event has occurred, or one of the assumptions used to derive it is incorrect, that is, the test hypothesis (assumption a), the statistical model (assumption b), or both.

Cautious interpretation of a small p-value, from Fisher, 1943
There is a problem with the test hypothesis, with the study, or with both.

What can we do with p-values?

Berger and Delampady, 1987; Berger and Sellke, 1987
Compute a Bayesian probability or credibility for the test hypothesis. It will almost always be far from the two-tailed p-value, typically much larger, so the P-value overstates the evidence against the test hypothesis.
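One way to see how far apart the two quantities sit (my illustration, using the -e*p*ln(p) calibration from later work by Sellke, Bayarri, and Berger, 2001, not a result in the sources above): with equal prior odds on the test hypothesis, the two-tailed P-value alone yields a lower bound on the posterior probability of that hypothesis.

    # Lower bound on the posterior probability of the test hypothesis given a
    # two-tailed P-value, via the -e*p*ln(p) calibration (Sellke, Bayarri &
    # Berger 2001). Illustrative; assumes equal prior odds and p < 1/e.
    import math

    def posterior_lower_bound(p):
        bf = -math.e * p * math.log(p)  # lower bound on the Bayes factor for H0
        return bf / (1.0 + bf)          # P(H0 | data) lower bound, prior odds 1

    for p in (0.05, 0.01, 0.001):
        print(p, round(posterior_lower_bound(p), 3))
    # p = 0.05 gives a bound near 0.29: far from 0.05 itself.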

Casella and Berger, 1987
A one-tailed P-value can be used to put a lower bound on the Bayesian probability of certain compound hypotheses, and under certain conditions it will approximate the Bayesian probability that the true association is in the direction opposite to the one observed.
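A sketch of one such condition (my illustration, assuming a normally distributed estimate and an improper flat prior on the effect): the posterior probability that the true effect has the opposite sign then equals the one-tailed P-value exactly.

    # Under an improper flat prior, with estimate ~ Normal(theta, se^2), the
    # posterior for theta is Normal(estimate, se^2), so P(theta <= 0 | data)
    # equals the upper one-tailed P-value. Hypothetical numbers throughout.
    from scipy.stats import norm

    estimate, se = 0.40, 0.25  # hypothetical log risk ratio and standard error
    z_obs = estimate / se

    p_one_tailed = norm.sf(z_obs)                             # P(Z >= z_obs)
    posterior_opposite = norm.cdf(0, loc=estimate, scale=se)  # P(theta <= 0 | data)

    print(p_one_tailed, posterior_opposite)  # both ~0.0548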

What p-values are NOT.
Berger and Sellke, 1987; Goodman and Royall, 1988; Royall, 1997; Edwards, 1992
The P-value for a simple test hypothesis (for example, that exposure and disease are unassociated) is not a probability of that hypothesis: that P-value is usually much smaller than such a Bayesian probability and so can easily mislead one into inappropriately rejecting the test hypothesis. Nor is it the likelihood of the hypothesis: the likelihood of a hypothesis is usually much smaller than its P-value, because the P-value includes not only the probability of the observed data under the test hypothesis but also the probabilities of all other possible data configurations in which the test statistic would be more extreme than that observed. The P-value is thus inflated, relative to the likelihood, by the cumulative probability of more extreme data.
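A numeric sketch of that inflation (hypothetical counts, my example): in a binomial test, the likelihood of the test hypothesis at the observed count is just one term of the tail sum that constitutes the P-value.

    # Likelihood of the test hypothesis vs. the one-tailed P-value
    # (illustrative). The P-value sums the observed outcome's probability
    # together with all more extreme outcomes, so it exceeds the likelihood.
    from scipy.stats import binom

    n, p0, x_obs = 20, 0.5, 15  # hypothetical experiment under H0: p = 0.5

    likelihood = binom.pmf(x_obs, n, p0)  # P(X = 15 | H0) ~ 0.0148
    p_upper = binom.sf(x_obs - 1, n, p0)  # P(X >= 15 | H0) ~ 0.0207

    print(likelihood, p_upper)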

The P-value refers to the size of the test statistic (which could be the estimate divided by its estimated standard deviation, i.e., a Wald statistic), not to the strength or size of the estimated association.
{This is not how a Pearson chi-square statistic is computed from cell counts, but the square of such a Wald statistic is itself a 1-df chi-square statistic and is asymptotically equivalent to the Pearson version, so the parenthetical is just another route to a chi-square test. @resolved}
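A sketch of that equivalence on a hypothetical 2x2 exposure-disease table (my numbers): the squared Wald statistic for the log odds ratio lands close to, but not exactly on, the Pearson chi-square computed the usual way from the cell counts.

    # Wald statistic (estimate / standard error) vs. Pearson chi-square on a
    # hypothetical 2x2 exposure-disease table (illustrative sketch).
    import math
    from scipy.stats import chi2_contingency

    a, b, c, d = 30, 70, 15, 85  # hypothetical cell counts

    log_or = math.log((a * d) / (b * c))   # estimated log odds ratio
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # its estimated standard error
    z_wald = log_or / se

    pearson_x2 = chi2_contingency([[a, b], [c, d]], correction=False)[0]

    print(z_wald**2, pearson_x2)  # ~6.25 vs ~6.45: close, not identical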

A major problem with the P-values and tests in common use (including all commercial software) is that the assumed models make no allowance for sources of bias, apart from confounding by controlled covariates.

If using the Neyman-Pearson model:
When a single study forms the sole basis for a choice between two alternative actions, as in industrial quality-control activities, a decision-making mode of analysis may be justifiable. In a public health setting, such decisions are inevitably based on results from a collection of studies, and proper combination of the information from the studies requires more than just a classification of each study into "significant" or "not significant". Thus, degradation of information about an effect into a simple dichotomy is counterproductive, even for decision making, and can be misleading.

Type I and Type II errors arise:
because the investigator has attempted to dichotomize the results of a study into the categories "significant" or "not significant". Because this degradation of the study information is unnecessary, an error that results from an incorrect classification of the study result is also unnecessary.

Freedman et al., 2007
The origin of the nearly universal acceptance of the 5% cutoff point for significant findings is tied to the abridged form in which the chi-square table was originally published.

For epidemiologic effect measures, a two-sided alternative hypothesis ranges from absurdly large preventive effects to absurdly large causal effects, and includes everything in between except the test hypothesis. This alternative is compatible with any observed data. The test hypothesis, by contrast, corresponds to a single value of effect and is therefore consistent with a much narrower range of possible outcomes for the data. Statistical hypothesis testing thus amounts to an attempt to falsify the test hypothesis.

So the best approach
Treat a small P-value as indicating only that either the null is wrong or the statistical model is wrong, or both (Fisher, 1943).

References:
Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd ed. Lippincott Williams & Wilkins; 2008. Available from: http://www.worldcat.org/isbn/0781755646.
