No. Only Once In 4,900 Elections Would Chance Alone Produce Such Failures.
Dept. of Psychological Sciences
Abstract: Most of the final 1996 presidential polls predicted that Clinton's margin over Dole would be greater than it actually was. This document presents a meta-analysis of that pattern; the calculation indicates that chance alone would produce such a pattern of failures only once in 4,900 elections.
This critique evoked blunt rejoinders from representatives of the Harris Poll and the Gallup Poll. Humphrey Taylor, Chair and CEO of Louis Harris and Associates, declared that: "The final forecasts of seven of the eight polls Mr. Ladd lists were pretty good. Harris's maximum error on any candidate was 1.8 percentage points. And several polls were more accurate."
Frank Newport, Editor in Chief of the Gallup Poll, offered general remarks along the same lines as Taylor. But Newport went on to make the case even more specific by saying "...the average trial heat percentage for Clinton across all eight polls was 49.88%, the average for Dole was 38%, and the average for Perot was 7.88%. All three of these average sample estimates are within the typical 3% margin of error associated with national polls, and two of the average sample estimates are within less than one point of the actual Election Day population parameter." After further reviewing the numerical details, Newport decided that "...these results represent a striking validation of the accuracy and precision of election polling and the survey research industry."
Clashing viewpoints are not unusual in politics, and the discordant opinions quoted above could therefore be taken as a natural part of the give and take of our public life. I write to suggest that that would be a mistake; it would be incorrect to treat the evaluation of campaign polls as merely a matter of opinion. Quite the contrary -- an objective assessment of the performance of polling enterprises can be secured because survey research technology derives from and is firmly grounded in the principles of statistical inference. The statistical question can be answered by calculating the odds that bad luck produced the prediction failures cited by Ladd. Such an objective assessment would simply be an appropriate extension of the kind of inference we make when we declare that a coin which comes up heads ten times in ten tosses is suspect. We infer this because the odds are 1,023 to 1 against such an outcome.
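The coin figure can be checked with a few lines of arithmetic. The sketch below is only an illustration of the reasoning; the function name is hypothetical and is not taken from the article's spreadsheet.

```python
from fractions import Fraction

def odds_against_all_heads(n_tosses: int) -> Fraction:
    """Odds against a fair coin landing heads on every one of n_tosses tosses."""
    p = Fraction(1, 2) ** n_tosses   # probability of an unbroken run of heads
    return (1 - p) / p               # odds against = P(event fails) / P(event)

print(odds_against_all_heads(10))    # 1023
```

Exact fractions are used so that the result is free of floating-point rounding.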
However, extending such statistical logic to the data produced by campaign polls requires dealing with the heterogeneity of the techniques used by different pollsters. Polls need to be evaluated using a statistically appropriate measure which would be equally meaningful in any poll regardless of its procedural idiosyncrasies. As will be seen below, merely averaging selected poll data is not appropriate. For such an evaluation to be informative as well, the measure should also have undoubted relevance to poll validity.
Fortunately, the news media's principal use of the polls directly points the way to a measure which satisfies both criteria: the media preoccupation with the horse race aspect of campaigns has always emphasized two questions -- who's ahead and by how much? And only the difference between the two principal party candidates in the polls has generally been of much interest; whether the minor parties' collective share was one or two percent was not. The relevance of a measure based on a given poll's prediction of the difference between Clinton and Dole is therefore obvious. Happily, it is also statistically appropriate because the sampling variance of a difference between preferences is roughly proportional to their sum. Therefore, a given difference's relative error would be little affected by the absolute size of the preferences on which it was based.
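For readers who want to see how the error of a difference behaves, here is a minimal sketch. It assumes simple random sampling, under which the variance of the difference between two shares from the same sample is (p1 + p2 - (p1 - p2)^2) / n. The shares and sample size in the example are hypothetical and are not taken from any of the eight polls.

```python
import math

def se_of_difference(p1: float, p2: float, n: int) -> float:
    """Standard error of (p1 - p2) for two shares measured in one
    simple random sample of n respondents (shares given as fractions)."""
    variance = (p1 + p2 - (p1 - p2) ** 2) / n
    return math.sqrt(variance)

# Hypothetical poll: 50% versus 40% among 1,500 respondents.
se = se_of_difference(0.50, 0.40, 1500)
print(f"standard error of the difference: {100 * se:.1f} points")
```

With these assumed inputs the standard error comes out near 2.4 percentage points, and the sum of the two shares dominates the variance whenever the gap between them is modest.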
The first step in an objective analysis is to compute the odds against for each of the eight polls. These odds are called "odds against" because they express how heavily the odds run against chance alone accounting for the discrepancy between a poll's final pre-election prediction and the actual vote outcome. The higher the odds against, the less likely it is that bad luck will furnish an acceptable explanation for polling failure. A brief summary of the results of such computations can be viewed by hyperlinking to a POLL TABLE. Those interested in full and complete computational detail can find that by hyperlinking to a POLL SPREADSHEET. The data entered into both the table and the spreadsheet were derived from Ladd's previously published Roper Center table mentioned above. However, later and more complete final certified actual vote results are used here instead of Ladd's preliminary numbers.
The essence of the odds computation is simple: one subtracts the actual result of 8.5% from the difference predicted by any given poll and then one compares the value given by this subtraction with the given poll's statistical error.
To illustrate, consider a case of complete poll failure: The CBS/New York Times poll predicted a difference of 18%. This is 9.5% greater than the actual outcome and it is about four times this poll's statistical error of only 2.4%. The odds against such a result are therefore quite large, being about 15,000 to 1. By contrast, the Hotline/Battleground Poll, whose 9% prediction missed by only 0.5%, was quite successful because 0.5% is a small fraction of this poll's error of 2.8% and so the odds are actually in its favor rather than against it -- specifically, the odds are 7 to 1 in favor of the Hotline/Battleground poll's small miss being due to chance.
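The two cases just described can be approximated in a few lines. This is a sketch under stated assumptions, not a reproduction of the spreadsheet: it assumes a normal sampling distribution and a two-tailed test, and it uses the rounded margins and errors quoted in the text, so its output will only be of the same order as the figures given above.

```python
import math

ACTUAL_MARGIN = 8.5  # Clinton minus Dole, in percentage points (from the text)

def two_tailed_p(predicted_margin: float, std_error: float) -> float:
    """Probability of a miss at least this large in either direction,
    assuming normally distributed sampling error."""
    z = abs(predicted_margin - ACTUAL_MARGIN) / std_error
    return math.erfc(z / math.sqrt(2))

def odds_against_chance(p: float) -> float:
    """Odds against chance alone; values below 1 mean the odds favor chance."""
    return (1 - p) / p

polls = [("CBS/New York Times", 18.0, 2.4), ("Hotline/Battleground", 9.0, 2.8)]
for name, margin, se in polls:
    p = two_tailed_p(margin, se)
    print(f"{name}: p = {p:.2g}, odds against chance = {odds_against_chance(p):.3g} to 1")
```

With these rounded inputs the CBS/New York Times odds against come out on the order of ten thousand to one, while the Hotline/Battleground odds run several to one in favor of chance; the exact values depend on the unrounded figures in the spreadsheet.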
If the analysis stopped here, these odds might well be read by Gallup's Newport and Harris's Taylor as providing support for their views. After all, the conventional criterion for statistical significance requires odds greater than 19 to 1 and the odds for seven of the eight tabulated polls fall below that criterion. Moreover, the odds are quite modest for six of the polls, at 5 to 1 or less.
But a different assessment is required when one calculates the odds against obtaining the overall pattern of results. The errors made by these polls, with one minor exception, all go in one direction: In Clinton's favor. How likely is it that such a pattern would occur by chance alone? Obviously, intuition suggests this is unlikely. But one can go further than intuition and give an exact answer because methods of quantitating such overall patterns have been developed by a number of scholars over a period of years and are now a generally accepted statistical practice. In particular, it has long been understood by statisticians that repetition increases precision. Hence, despite the comments of Gallup's Newport quoted above, it is well understood that one should not conduct an overall assessment by comparing the average prediction of eight polls with the sampling error of a single poll.
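The point about repetition can be made concrete. The sketch below assumes, purely for illustration, that the eight polls are independent and each carry the "typical 3%" margin of error mentioned by Newport; the function name is hypothetical.

```python
import math

def margin_of_error_of_average(single_poll_moe: float, n_polls: int) -> float:
    """Margin of error of the mean of n_polls independent polls that
    each have the same single-poll margin of error."""
    return single_poll_moe / math.sqrt(n_polls)

print(round(margin_of_error_of_average(3.0, 8), 2))  # 1.06
```

Under those assumptions the proper yardstick for an eight-poll average is roughly one percentage point, not three.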
Instead, the proper method of calculating the overall competence of the eight polls taken as a group makes use of what is now commonly called a meta-analysis: a family of methods that make it possible to treat the eight polls much as pollsters treat their respondents. In both cases, data are collected and statistically analyzed. However, the pollster's datum of interest is the preference of a single person while the datum of interest for a meta-analysis is the report of a single polling organization.
The meta-analytic technique of present interest is one which, taking proper account of the direction of a given poll's results, combines the probabilities obtained in all eight polls to calculate the overall odds against obtaining eight polls which possess the pattern displayed in the table. When this is done, as documented in the POLL SPREADSHEET, the combined data give odds against of about 4,900 to 1. Given the four year interval between presidential campaigns, this meta-analytic quantitation suggests that one would have to run final presidential polls for about 20,000 years in order for chance alone just once to produce a poll failure pattern that is as extreme as the one that occurred in 1996. Written human history, by contrast, only covers about 5,000 years.
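One widely described way of combining signed results is Stouffer's method, in which the signed z scores are summed and divided by the square root of their number. Whether the spreadsheet uses exactly this procedure is an assumption on my part. The z scores below are hypothetical values chosen only to show the mechanism; they are not the eight polls' actual results.

```python
import math

def stouffer_combined_z(signed_z_scores: list[float]) -> float:
    """Stouffer's method: sum of signed z scores divided by sqrt(k)."""
    return sum(signed_z_scores) / math.sqrt(len(signed_z_scores))

def two_tailed_p(z: float) -> float:
    """Two-tailed normal probability for a z score."""
    return math.erfc(abs(z) / math.sqrt(2))

# HYPOTHETICAL signed z scores; positive means the poll overstated
# Clinton's margin, negative means it understated it.
hypothetical_z = [1.0, 1.2, 0.8, 1.5, -0.2, 1.1, 0.9, 1.3]

combined = stouffer_combined_z(hypothetical_z)
p = two_tailed_p(combined)
print(f"combined z = {combined:.2f}, odds against chance = {(1 - p) / p:.0f} to 1")
```

In this made-up example no single poll reaches the conventional 19 to 1 criterion, yet the combined result does, because the errors nearly all point the same way.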
There can then be little doubt that calculation has confirmed the intuition that the lopsided outcomes of the 1996 presidential preference polls made them a collective failure. It is equally true, however, that no conclusion whatsoever can be drawn from the present quantitative analysis about the exact nature of the bias which excessively favored Clinton over Dole.
It should be noted that earlier drafts of the present analysis were criticized because random sampling was assumed in calculating the statistical errors. This criticism correctly noted that the sampling techniques commonly used by pollsters are not purely random, and it further noted that departures from random sampling may affect the size of the statistical error. A good faith criticism would also have noted that certain deviations from random sampling will decrease sampling errors, which would actually strengthen the conclusions offered above. Instead, the criticism rigidly focused on the fact that certain other deviations from random sampling would increase statistical sampling errors, which would weaken the present conclusions. Exactly correct sampling error values could be calculated if polling designs were fully disclosed in a timely fashion; unfortunately, they are not. Nevertheless, the random sampling assumption used here is quite justified because polling organizations actually publish their own estimates of the statistical errors which affect their own polls. Any fair-minded observer will readily determine that these published values are about the same as the values one would calculate assuming random sampling. Even so, it may well be that the odds calculated here are either somewhat too small or somewhat too large. Hence, they should be viewed as tentative and subject to future revisions which could go in either direction. But there is no reason to expect that these revisions will be substantial enough to alter the conclusion presented here.
Even though this quantitation may not be final, some perspective may come from calculating that the odds against a fair coin producing twelve heads in a row are comparable to the present combined polling odds estimate, being 4,095 to 1. To continue to use current polling technology without calling for a change would put our political processes in the position of a gambler who continues to play after the roulette wheel comes up red twelve times in a row. In both cases, it might be just chance, but any sensible person would stop and check the apparatus before going on.
So 1996 really was a terrible year for the polls. The matter is critical because no one can doubt that the polls influenced last year's press coverage. Further, there can be little doubt of the important impact of the polls on the several campaigns' strategies. Of gravest concern is the possibility that the electoral result itself, and with it the basic processes of our democracy, were distorted last year by a technology which failed.
Version 8.4 - 5 August 1997
© 1997 Gerald S. Wasserman
[1] The Pollster's Waterloo, Wall Street Journal, 19 November 1996.
[2] Polls: By Our Reckoning, We Did Fine. Wall Street Journal, 11 December 1996.
[3] See: L. Kish, Survey Sampling, 1965, Wiley, pp. 497-501.
[4] See the Politics Now website at: http://www.politicsnow.com/campaign/wh_house/finalvote/
[5] See R. Rosenthal, Meta-Analytic Procedures for Social Research, Rev. Ed., 1991, Sage Publications, Ch. 5.
[6] A preliminary draft of this article was sent to Everett Carll Ladd. He circulated copies of it within the polling community. As a result, a fairly accurate and complete summary of these ideas appeared in a newspaper story (see R. Morin, The Election Post-Mortem: The Experts Debate The Accuracy Of Their Surveys, The Washington Post National Weekly Edition, 13 January 1997). It may be noted that the odds reported in that newspaper story differed a bit from the ones presented here because the preliminary draft used preliminary actual vote counts.
Both constructive and destructive criticisms of the draft were offered as a result. I am particularly indebted to Nick Panagakis of Chicago's Market Shares Corporation for collegial instruction about potential sources of bias in pre-election polls. Also helpful was another commentator's rigid and unnecessarily intemperate critique of the assumption of random sampling.
[7] See: L. Kish, Survey Sampling, 1965, Wiley, passim.
[8] If this criticism is made in a principled fashion, one soon finds that it creates a dilemma because any attempt to impeach the present meta-analysis by postulating an inflation of sampling error will also reduce the putative utility of opinion polls themselves. An interesting exercise is to use the spreadsheet given here to calculate how much the sampling error would have to be inflated in order to bring the present meta-analytically combined odds below the conventional cutoff for statistical significance. A principled critic would then also have to estimate what the exact same degree of inflation would do to the confidence intervals of commercial polls. The answer is that these intervals would be well into the double digit range of percentages. And who would pay for a poll that could only predict landslides?