But why this definition of severity?
(Since this question is connected with one of the essential features of Mayo's Error Statistics, I will try to spell out the problems involved, starting from a very elementary level, although that will make my exposition longer. I have corrected some of my misunderstandings.)
Again, Mayo is not kind to the reader. Her definition of severity appears rather abruptly, with no explicit reference to the arguments or preparations of the previous five chapters. What she says immediately before introducing the definition of severity is essentially this much:
The cornerstone of an experiment is to do something to make the data say something beyond what they would say if one passively came across them. The goal of this active intervention is to ensure that, with high probability, erroneous attributions of experimental results are avoided. The error of concern in passing H is that one will do so while H is not true. Passing a severe test, in the sense I have been advocating, counts for hypothesis H because it corresponds to having good reasons for ruling out specific versions and degrees of this mistake. (178)
Then comes her first statement of a severe test, as quoted in the main page; that is, a passing result is a severe test of hypothesis H just to the extent that it is very improbable for such a passing result to occur, were H false. But our immediate response is: how is this notion of severity related to the statistical tests explained so far? We know what a "significance level" is, or what an "experimental distribution" is (see pp. 158-9), but these are essentially defined in terms of the probability of an outcome given that hypothesis H is true, not given that H is false! And so far, Mayo has given no hint as to how we should compute probabilities given that hypothesis H is false. This produces a strong uneasiness in the reader's mind; Mayo does not begin to discuss the question until page 195, and the discussion there is disappointing.
Mayo is not good at using simple examples effectively in order to make her arguments more intelligible (and, consequently, more susceptible to criticism). Since Mayo (and the reader) already has a couple of examples in the repertoire, why not use them immediately? Let us look at the Binomial Experiment (Lady tasting tea). In this example we had two hypotheses, H0 and H':
H0: the lady is guessing, p = 0.5
H': the lady does better than guessing, p > 0.5
And given the null hypothesis, we could obtain the experimental distribution for 100 trials. All right: this time, let us take H' as our test (null) hypothesis; then Mayo's definition of severity can be easily illustrated. Let f signify the observed relative frequency of "success" (i.e., the lady's judgment is correct) in 100 trials; then we already know that
P(f ≥ 0.6 | H0) = 0.03.
Since in this context ¬H' = H0, we have an indication of the severity of the test for our hypothesis H', according to Mayo's definition. Suppose we obtained the result that f ≥ 0.6, which may be abbreviated as e. Then the test of H' by means of this result e has a severity of 0.97 (97 percent).
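As a quick check, the 0.03 figure and the resulting severity can be computed directly. Here is a minimal sketch in Python (scipy is assumed; the variable names are mine, not Mayo's):

```python
from scipy.stats import binom

n = 100        # number of trials in the Binomial Experiment
p0 = 0.5       # H0: the lady is merely guessing

# P(f >= 0.6 | H0) = P(X >= 60) for X ~ Binomial(100, 0.5).
# binom.sf(k, n, p) gives P(X > k), so we ask for P(X > 59).
tail = binom.sf(59, n, p0)
print(tail)         # about 0.028, i.e. roughly 0.03

# Severity of the test of H' by the passing result e (f >= 0.6),
# on Mayo's definition: 1 - P(e | not-H') = 1 - P(e | H0).
print(1 - tail)     # about 0.97
```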
However, it must be noticed that if we take H0 as our test hypothesis, the situation is not so easy, since we then have to be able to calculate the probability of some result conditional on the falsity of H0. For instance,
P(f ≥ 0.6 | ¬H0) = P(f ≥ 0.6 | H').
Since H' does not specify the value of p, how can we obtain this probability? Notice that H' in this case becomes quite similar to the "Bayesian catchall," as Mayo puts it (a disjunction of an indefinite number of hypotheses). Literally, the negation of H0 is nothing but a disjunction of all hypotheses assigning to p any value between 0 and 1 except 0.5 (which amounts to saying, informally, that the lady does better or worse than mere guessing); so how should Mayo obtain the probability of e on such a disjunction? I do not see any difference between the Bayesian's difficulty and Mayo's difficulty in this regard. Suppose, imitating Laplace (the principle of indifference), we assume that any value for p is equally possible; then it should be the case (!) that
P(f ≥ 0.6 | ¬H0) = P(f ≥ 0.6 | H') = P(f ≥ 0.6 | H0) = 0.03.
This means that the test by e is as severe for H0 as for ¬H0. This seems quite disastrous for Mayo's definition; for the same result e counts as good evidence for both H0 and ¬H0! Notice that this suggests a stronger worry than Earman's worry, treated and answered by Mayo in 6.3. Earman doubted whether we can obtain a low probability in case the test hypothesis is false, and he presented a case in terms of higher-level alternatives. But my example suggests a far stronger worry: that both the null hypothesis and its rival hypothesis may give the same probability to the same evidence, the two hypotheses being low-level alternatives to each other! If she wants to stick to her definition, she has to show that, on her account of error statistics, this sort of counterexample never appears.
Beginners may have some difficulty understanding the preceding result, so let me give a simpler version (a classical example from the history of probability theory), in terms of finitely many alternative hypotheses. Let H0 be the same as before, and suppose there are 4 other alternatives (given the background information):
- H1: p = 0.00
- H2: p = 0.25
- H3: p = 0.75
- H4: p = 1.00
On our assumption, ¬H0 is equivalent to the disjunction of these four. Then, given that each alternative hypothesis is equally probable (according to the Laplacean principle of indifference), the probability of any result e on H0 is the same as the probability of e on ¬H0 (i.e., on the disjunction of the four). It suffices to show this for
e = the lady's judgment is correct.
Then, clearly,
- P(e | H1) = 0.00
- P(e | H2) = 0.25
- P(e | H3) = 0.75
- P(e | H4) = 1.00.
Since each hypothesis is equiprobable, the probability of e on the disjunction of the four is simply the mean of these predictions, i.e., P(e | ¬H0) = 0.50, which is exactly the same as P(e | H0). Given this, it is easy to see that the same result holds for any evidence statement e. And the case where there are infinitely many hypotheses (as regards the value of p) is not, in principle, any different from this simpler version.
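The finite computation is trivial, but for concreteness here is a sketch in Python (the setup simply encodes the five hypotheses above):

```python
# Likelihoods of e ("the lady's judgment is correct on a single trial")
# under each of the four alternatives to H0.
alternatives = {"H1": 0.00, "H2": 0.25, "H3": 0.75, "H4": 1.00}

p_e_given_H0 = 0.50

# With the four alternatives equiprobable (Laplacean indifference),
# P(e | not-H0) is just the mean of the four likelihoods.
p_e_given_not_H0 = sum(alternatives.values()) / len(alternatives)

print(p_e_given_not_H0)                  # 0.50
print(p_e_given_not_H0 == p_e_given_H0)  # True: exactly P(e | H0)
```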
Moreover, this example is not as unrealistic as it may seem at first sight. For what is crucial is only that the alternative hypotheses are distributed symmetrically around the null hypothesis, and we can find many such cases when we wish to ascertain the correct value of a parameter.
Finally, it may be pointed out that Mayo's defence of the error statistical approach against this sort of counterexample, in terms of the piecemeal character of experimental learning, does not help much. She says,
Within an experimental testing model, the falsity of a primary hypothesis H takes on a specific meaning. If H states that a parameter is greater than some value c, not-H states that it is less than c; if H states that factor x is responsible for at least p percent of an effect, not-H states that it is responsible for less than p percent; if H states that an effect is caused by factor f, for example, neutral currents, not-H may say that it is caused by some other factor possibly operative in the experimental context ...; if H states that the effect is systematic--of the sort brought about more often than by chance--then not-H states that it is due to chance. How specific the question is depends upon what is required to ensure a good chance of learning something of interest ... (190-1)
I agree with this, and most other Bayesians will join me. But it does not help in the least in solving the difficulty posed by my counterexample. Thus, all Mayo suggests later (on p. 195) is that:
(1) the probability of an outcome conditional on a disjunction of alternatives is not generally a legitimate quantity for a frequentist; and
(2) the severity criterion (SC) requires that the severity be high against each single alternative (not against a disjunction of them).
(1) amounts to saying that her definition of severity is not appropriate for many of her canonical models of experimental inquiry, and (2) is nothing but a substantial revision of her definition. If (2) is what she really wants (and indeed this seems to be the case, judging from her subsequent discussion, much later, in chapter 11, p. 397), she should have changed the definition in the first place. In short, it seems to me that she has chosen the wrong way to state her crucial definition, and this may easily give the impression that she simply wishes to evade the whole question by the maneuver of (1) and (2). It looks quite strange to demand that you not ask the severity of the test for H0 (though it is fine to ask it for H'), when you are trying to test H0 against H' (¬H0) in one of her canonical models.
Moreover, we can easily construct a test situation in which it is legitimate for the frequentist to obtain P(e | ¬H0). Suppose there are a small number of coins (say, 5), biased or unbiased (as our five hypotheses say, respectively), and one is chosen at random; you are asked to determine by experiment which coin has in fact been chosen. In this case it is legitimate, even for the frequentist, to ask for the probability P(e | ¬H0), and our counterexample is fully alive. And our intuition is that, if the observed frequency of heads is close to 0.5 and the number of trials is large enough, this is good evidence for H0 but not for ¬H0. Unfortunately, Mayo's definition of severity does not work for this case.
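A small Monte Carlo sketch bears this intuition out (Python; the number of repetitions and the tolerance for "close to 0.5" are my own illustrative choices):

```python
import random

def relative_freq(p, n=100):
    """Observed relative frequency of heads in n tosses of a coin with bias p."""
    return sum(random.random() < p for _ in range(n)) / n

biased = [0.00, 0.25, 0.75, 1.00]   # the four alternatives to H0 (p = 0.5)
reps = 20_000

# P(observed frequency within 0.05 of 0.5 | not-H0): a biased coin is
# drawn at random, as in the coin-choosing situation described above.
near_half_not_H0 = sum(
    abs(relative_freq(random.choice(biased)) - 0.5) <= 0.05
    for _ in range(reps)
) / reps

# The same probability under H0 (the fair coin).
near_half_H0 = sum(
    abs(relative_freq(0.5) - 0.5) <= 0.05 for _ in range(reps)
) / reps

print(near_half_not_H0)   # essentially 0: a biased coin rarely mimics fairness
print(near_half_H0)       # about 0.73: the fair coin usually stays near 0.5
```

Here the random choice of a coin gives ¬H0 a perfectly respectable frequentist probability, and a frequency near 0.5 is far more probable under H0 than under ¬H0, exactly as intuition says.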
See also Mayo vs. Earman
MUD in relation to Severity, Likelihood, and the Length of Experiment
Now, suppose what she really wants to say is (2), not her official definition of SC. Then how can we be sure that severity saves us from methodological underdetermination (MUD)? As Mayo says on p. 203, I may agree to exclude "practically indistinguishable alternatives" from our consideration. But there can still be many alternative hypotheses which give a value close enough to the one the null hypothesis gives, so that the severities are also close; yet given a long enough experiment, they can be distinguished. In other words, given any (predetermined) length of experiment, there are infinitely many alternative hypotheses indistinguishable in terms of the severity of the test by the obtained data; or, to put the same point the other way around, given two alternative hypotheses close to each other, you can distinguish them in terms of severity of test if you continue the experiment long enough. The two versions simply depend on how the condition of the experiment is specified: which is given first, the length of the experiment or the set of alternative hypotheses?

Thus one obvious consequence of her revised definition is that you have got to refer to the length of the experiment (in other words, the length becomes relevant to the error probabilities) if you want enough severity to distinguish two given alternative hypotheses that assign similar values to the parameter in question (say, 0.50 and 0.52), since such hypotheses give similar experimental distributions (and therefore only a long enough experiment can decide which to reject and which to accept).
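A rough power calculation shows how long "long enough" is for 0.50 versus 0.52 (a sketch under the usual normal approximation; the 0.05 significance level and 0.95 power are illustrative choices of mine):

```python
from math import sqrt
from scipy.stats import norm

p0, p1 = 0.50, 0.52           # the two nearby hypotheses
alpha, power = 0.05, 0.95     # illustrative error probabilities

# Smallest n at which a one-sided test of p0 against p1 reaches the
# desired power, by the standard normal approximation to the binomial.
z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2
print(round(n))               # about 6,760 trials -- far more than 100
```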
To make our point more specific, take the Binomial Experiment again, and let the null hypothesis (H0) be "p = 0.50" and the alternative hypothesis (H') be "p = 0.52". Then the severity of the test (H0 vis-a-vis H') with the data e (f ≥ 0.6 in 100 trials) becomes 97 percent for the alternative hypothesis H'; that is,
(a) P(e | H0) = 0.03.
However, we now have to notice that Mayo gives several different versions in stating the condition for severity. One version is this (p. 193):
(b) P(test T passes H', given that H' is false [H0 is true]) = 0.03.
Although this is in effect equivalent to (a), provided that the cutoff point (significance level) is 0.03, she seems to have intentionally moved up to the metalanguage level. Thus, when you calculate the severity of the test for the null hypothesis H0, you have got to choose one alternative hypothesis, the one with the lowest value (in this case) which falsifies H0 (see p. 195); so let us suppose "p = 0.52" is this alternative H'. Then the severity for H0 should be calculated, according to Mayo, as follows (see p. 397 as well as p. 195): consider the probability
(c) P(test T passes H0 [fails to reject H0], given H' is true).
And notice that, in the context of testing H0 vis-a-vis H', "failing to reject H0" is equivalent to obtaining a result that is not so far from 0.5 as to reach the cutoff point (in our example, the 2-standard-deviation point). So the probability in (c) is the probability that the result falls below the cutoff point, given that H' is true; i.e.,
(d) P(test T passes H0 [fails to reject H0], given H' is true) = P(¬e | H') = 1 - P(e | H'),
which is approximately 0.95 (see Normal Approximation). This may look tricky, but it is what Mayo intends. Do not take, instead of (d),
(e) P(e | H').
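The figure in (d) can be checked directly (a sketch in Python; the variable `cutoff` encodes the 2-standard-deviation point under H0, i.e. f = 0.6):

```python
from math import sqrt
from scipy.stats import binom, norm

n, p_alt = 100, 0.52     # H': p = 0.52
cutoff = 60              # the 2-standard-deviation point under H0 (f = 0.6)

# (d): P(test T fails to reject H0 | H') = P(X < 60 | p = 0.52)
exact = binom.cdf(cutoff - 1, n, p_alt)
approx = norm.cdf((cutoff - n * p_alt) / sqrt(n * p_alt * (1 - p_alt)))
print(exact)     # about 0.93 (exact binomial)
print(approx)    # about 0.95 (normal approximation, the figure in the text)
```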
Although this maneuver may seem to work, it does not invalidate our previous counterexample in terms of five alternative hypotheses. For the hypotheses implying ¬H0 are located on both sides of H0; so how can Mayo obtain a formula such as (d) in this case?
Moreover, she seems to admit that when two or more alternative hypotheses are close to each other, a case may arise where we cannot distinguish them in terms of severity.
if there are alternatives to H that are substantive rivals--one differing merely by a thousandth of a decimal is unlikely to create a substantive rival--and yet they cannot be distinguished on the grounds of severity, then that is grounds for criticizing the experimental test specifications (the test was insufficiently sensitive). It is not grounds for methodological underdetermination. (203)
But what does she mean by "sensitive" here? I can imagine only this: two alternatives give such similar experimental distributions that we can distinguish them only by a long experiment and with a lower (smaller) significance level. This may well apply to our example of "p = 0.50" and "p = 0.52", since an experiment with a small number of trials may fail to distinguish them; but if we make another experiment with more trials, we can choose one hypothesis. This leads directly to the problem of the likelihood principle, discussed later in chapters 10 and 11. Notice that even if we make a longer experiment, the likelihood of H0 and that of H' do not change at all. The error statistical approach tries to get out of underdetermination by means of severity, and its way of doing so is to choose a longer (and more sensitive) experiment. When she said that "the test was insufficiently sensitive" in parentheses, and claimed that "it is not grounds for methodological underdetermination", this sort of reasoning was implicitly going on. In a word, "significance", "the dependence on the length of experiment", and "severity" all go hand in hand. (For "likelihood", see Neyman-Pearson, Fisher, Bayes.)
Thus, it should be clear that the rejection of the likelihood principle (which says that all the evidential information is contained in the ratio of the likelihoods of two hypotheses; see p. 339) comes as a rather direct consequence of SC (that is, if Mayo spelled out her real intention!). Unfortunately, Mayo does not allow us to see this easily, and keeps us waiting until chapters 10 and 11! Why does she have to be so roundabout? But more on this later (including the Bayesian's possible responses).
Corrected June 1, 2001. (c) Soshichi Uchii