Task Force on Statistical Inference Identifies Charge and Produces Report


Task Force on Statistical Inference
Robert Abelson, PhD (Co-Chair)*
Robert Rosenthal, PhD (Co-Chair)
Jacob Cohen, PhD (Co-Chair)
Leona S. Aiken, PhD
Mark Appelbaum, PhD
Gwyneth M. Boodoo, PhD
David A. Kenny, PhD
Helena C. Kraemer, PhD
Donald B. Rubin, PhD
Bruce Thompson, PhD*
Howard Wainer, PhD
Leland Wilkinson, PhD

APA Staff
Christine R. Hartel, PhD
Sangeeta Panicker, Liaison

* Task force members not present at meeting.

At its first meeting, held December 14-15, 1996, the Task Force on Statistical Inference (TFSI), which was formed by the Board of Scientific Affairs in 1996, identified its charge as broadly focused on assessing current practices in the analysis of psychological data.

The following is a preliminary report, produced at the task force's December meeting. TFSI welcomes comments on the report from all interested parties. Additional copies can be obtained by contacting the Science Directorate or by going to the Science Directorate Web page (http://www.apa.org/science/tfsi.html). The deadline for receiving comments is May 30, 1997. Please send comments to:

Sangy Panicker
Staff Liaison to TFSI
Science Directorate
750 First Street, NE
Washington, DC 20002-4242


Report of the Task Force on Statistical Inference

This draft reflects the initial deliberations of the task force and the first of its recommendations to the Board of Scientific Affairs. The task force addresses two issues. First, it considers the issue that brought the task force into existence, namely the role of null hypothesis significance testing in psychological research. Second, it considers modifications to current practice in the quantitative treatment of data in the science of psychology.

Null Hypothesis Significance Testing

Many have assumed that the charge of this task force is narrowly focused on the issue of null hypothesis significance testing and, particularly, on the use of the p value. The charge this task force has accepted, however, is broader. It is the view of the task force that there are many ways of using statistical methods to help us understand the phenomena we are studying (e.g., Bayesian methods, graphical and exploratory data analysis methods, and hypothesis testing strategies). We endorse a policy of inclusiveness that allows any procedure that appropriately sheds light on the phenomenon of interest to be included in the arsenal of the research scientist. In this spirit, the task force does not support any action that could be interpreted as banning the use of null hypothesis significance testing or p values in psychological research and publications.

Broader Topics of Recommendations

At this meeting, the task force identified four broad topics in the quantitative treatment of research data in which it believes major improvements in current practice could and should be made. These topics are (1) approaches to enhance the quality of data usage and to protect against potential misrepresentation of quantitative results, (2) the need for theory-generating studies, (3) the use of minimally sufficient designs and analytic strategies, and (4) issues with computerized data analysis.

(1) Approaches to enhance the quality of data usage and to protect against potential misrepresentation of quantitative results.

Of these four topics, the first has so far received the greatest attention from the task force. With respect to this topic, the task force has identified three issues that are particularly germane to current practice.

* More extensive descriptions of the data should be provided to reviewers and readers. These should include means, standard deviations, sample sizes, five-number summaries, box-and-whisker plots, other graphics, and descriptions of missing data, as appropriate.

* Enhanced characterizations of the results of analyses, going beyond simple p value statements to include both the direction and the size of the effect (e.g., mean differences, regression and correlation coefficients, odds ratios, and more complex effect size indicators) and their confidence intervals, should be provided routinely as part of the presentation. These characterizations should be reported in the most interpretable metric (e.g., the expected change in the criterion for a unit change in the predictor, or Cohen's d).

* The use of techniques to ensure that the reported results are not produced by anomalies in the data (e.g., outliers, points of high influence, nonrandom missing data, selection, and attrition problems) should be a standard component of all analyses (see the sketch following this list).
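
To make these three recommendations concrete, the following is a minimal sketch in Python of a fuller presentation: a descriptive summary of each group, a mean difference with Cohen's d and a confidence interval, and a simple screen for outlying points. The data, the group labels, the pooled-standard-deviation form of Cohen's d, and the 1.5 x IQR outlier rule are all illustrative assumptions, not prescriptions of the task force.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for two groups (illustrative data only).
treatment = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.5, 6.1])
control = np.array([4.2, 5.0, 3.9, 5.6, 4.7, 5.2, 4.4, 4.9])

def describe(x):
    """Descriptive summary: n, mean, SD, and the five-number summary."""
    q0, q1, q2, q3, q4 = np.percentile(x, [0, 25, 50, 75, 100])
    return {"n": len(x), "mean": x.mean(), "sd": x.std(ddof=1),
            "min": q0, "q1": q1, "median": q2, "q3": q3, "max": q4}

for name, grp in [("treatment", treatment), ("control", control)]:
    print(name, describe(grp))

# Direction and size of effect, not just a p value.
diff = treatment.mean() - control.mean()  # raw mean difference
sp = np.sqrt(((len(treatment) - 1) * treatment.var(ddof=1)
              + (len(control) - 1) * control.var(ddof=1))
             / (len(treatment) + len(control) - 2))  # pooled SD
d = diff / sp  # Cohen's d

# 95% confidence interval for the mean difference (pooled-variance t).
df = len(treatment) + len(control) - 2
se = sp * np.sqrt(1 / len(treatment) + 1 / len(control))
t_crit = stats.t.ppf(0.975, df)
ci = (diff - t_crit * se, diff + t_crit * se)
print(f"mean difference = {diff:.2f}, d = {d:.2f}, "
      f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")

# A simple anomaly screen: flag points outside 1.5 x IQR of the quartiles.
def flag_outliers(x, k=1.5):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x < q1 - k * iqr) | (x > q3 + k * iqr)]

print("suspect points:", flag_outliers(np.concatenate([treatment, control])))
```

Reporting the mean difference with its confidence interval alongside any p value conveys both the direction and the size of the effect in the original metric.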

(2) The need for theory-generating studies.

In its recent history, psychology has been dominated by the hypothetico-deductive approach. It is the view of the task force that researchers have too often been forced into the premature formulation of theoretical models in order to have their work funded or published. Premature formulation of theoretical models has often led to the worst problems seen in the use of null hypothesis testing, such as the misrepresentation of exploratory results as confirmatory studies or the poor design of confirmatory studies in the absence of necessary exploratory results. We propose that the field become more open to well-formulated and well-conducted exploratory studies, with appropriate quantitative treatment of their results, thereby enhancing the quality and utility of future theory generation and assessment.

(3) The use of minimally sufficient designs and analytic strategies.

The wide array of quantitative techniques and the vast number of designs available to address research questions leave the researcher with the nontrivial task of matching analysis and design to the research question. Many forces (including reviewers of grants and papers, journal editors, and dissertation advisors) compel researchers to select increasingly complex (e.g., state-of-the-art, cutting-edge) analytic and design strategies. Sometimes such complex designs and analytic strategies are necessary to address research questions effectively, but it is also true that simpler approaches can provide elegant answers to important questions. It is the recommendation of the task force that the principle of parsimony be applied to the selection of designs and analyses. The minimally sufficient design and analysis is typically to be preferred for the following reasons:

* It is often based on the fewest and least restrictive assumptions.

* Its use is less prone to errors of application, and errors are more easily recognized.

* Its results are easier to communicate to both the scientific and lay communities.

This is not to say that new advances in both design and analysis are not needed, but simply that newer is not necessarily better and that more complex is not necessarily preferable.

(4) Issues with computerized data analysis.

Elegant and sophisticated computer programs allow us to analyze data with far greater sophistication than was possible only a short time ago. The ease of access to state-of-the-art statistical analysis packages, however, has not universally advanced our science. Common misuses of computerized data analysis include the following:

* Reporting statistics without understanding how they are computed or what they mean.

* Relying on results without regard to their reasonableness or without verification by independent computation (see the sketch following this list).

* Reporting results to greater precision than the data support, simply because that is the precision printed by the program.

The task force encourages efforts to avoid the sanctification of computerized data analysis. Computer programs have placed a much greater demand on researchers to understand and control their analysis and design choices.
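
One safeguard against these misuses is to verify a packaged result by independent computation and to report it to a defensible precision. The following is a minimal sketch in Python, reusing the hypothetical two-group data from the earlier example; the variable names, the comparison tolerance, and the number of reported digits are illustrative assumptions, not a standard endorsed by the task force.

```python
import numpy as np
from scipy import stats

# Hypothetical data (illustrative only).
a = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.5, 6.1])
b = np.array([4.2, 5.0, 3.9, 5.6, 4.7, 5.2, 4.4, 4.9])

# Result as reported by the package (pooled-variance t test by default).
t_pkg, p_pkg = stats.ttest_ind(a, b)

# Independent computation of the same pooled-variance t statistic.
n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_hand = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Verify: the two routes should agree to within rounding error.
assert abs(t_pkg - t_hand) < 1e-8

# Report to a precision the data can support, not the printout's 16 digits.
print(f"t({n1 + n2 - 2}) = {t_hand:.2f}, p = {p_pkg:.3f}")
```

If the two routes disagree, the discrepancy itself is informative: it usually signals a misunderstanding of what the package actually computes.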

