Publication: Journal of Instructional Psychology
Date published:
Language: English
PMID: 18133
ISSN: 00941956
Journal code: JISP

Achievement tests attempt to measure what an individual has learned, that is, his or her present level of performance. They are used in diagnosing strengths and weaknesses and as a basis for awarding prizes, scholarships, or degrees. Many of the achievement tests used in schools are nonstandardized, teacher-designed tests (Best & Kahn, 2006, p. 301).

Multiple-choice and true-false tests have a valuable role in higher education (Burton, 2005, p. 65). When we think of measuring achievement, the natural tendency is to think of published standardized multiple-choice, true-false, or fill-in tests, all of which rely on written test items that are read and answered correctly or incorrectly by the examinee (Stiggins, 1992, p. 211). It is a much-discussed fact that marks in multiple-choice and true-false tests may be obtained by guessing. However, the actual extent to which chance thereby affects scores is too little appreciated (Burton & Miller, 1999, p. 399). The assessment of student learning is an important issue for educators (Madaus & O'Dwyer, 1999). Since the early 1900s, traditional multiple-choice (MC) item formats have achieved a position of dominance in learning assessment, mainly due to the prima facie objectivity and the efficiency of administration this format represents. However, the popularity of the MC format has come under scrutiny for some applications where accuracy of assessment, particularly for complex knowledge domains, has greater importance than efficiency (Becker & Johnston, 1999; Bennett, Rock, & Wang, 1991). Traditional MC testing formats offer efficiency, objectivity, simplicity, and ease of use for the assessment of student knowledge, but are subject to many sources of interpretation error (Swartz, 2006, p. 215).

According to Traub (1991), no measurement is perfect. Chance may affect scores in multiple-choice and true-false tests in two ways. First, if the questions sample only part of the examinable subject matter, then a particular examinee may be lucky or unlucky in the examiner's choice of questions (Posey, 1932). Second, marks may be obtained by guessing. Test reliability is discussed in many textbooks, but few give the reader a quantitative feeling for the inherent unreliability of particular tests that is due to chance (Burton, 2001, p. 42). Although number-right scores can undoubtedly be raised considerably by guessing, it is a common belief that guessing, or at least completely random ('blind') guessing, is unimportant. When multiple-choice tests began to be widely used, they were criticized because examinees could answer correctly by guessing. Many educators viewed any score gain from guessing as ill-gotten (Fray, 1988; Burton, 2004). In the general opinion of educators, true-false tests are limited to testing factual recall. However, Ebel (1979) demonstrates clearly that true-false items can be made to present quite difficult and complex problems. This applies to multiple-choice items too (Burton, 2005, p. 66).
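
To give a rough quantitative feel for the effect of blind guessing, the following minimal sketch computes the expected number-right score from guessing alone, the binomial probability of reaching a given score by guessing, and the classical correction-for-guessing (formula) score. The function names, the 50-item test length, and the threshold of 30 are illustrative assumptions, not values taken from the studies cited above.

```python
from math import comb


def expected_guess_score(n_items: int, n_options: int) -> float:
    """Expected number-right score when every answer is a blind guess."""
    return n_items / n_options


def prob_at_least(n_items: int, n_options: int, threshold: int) -> float:
    """Binomial probability of scoring at least `threshold` by blind guessing alone."""
    p = 1.0 / n_options
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(threshold, n_items + 1)
    )


def formula_score(right: int, wrong: int, n_options: int) -> float:
    """Classical correction for guessing: R - W / (k - 1); omitted items are ignored."""
    return right - wrong / (n_options - 1)


if __name__ == "__main__":
    # Hypothetical 50-item tests: true-false (2 options) vs. four-option multiple choice.
    print(expected_guess_score(50, 2))   # 25.0
    print(expected_guess_score(50, 4))   # 12.5
    print(prob_at_least(50, 2, 30))      # chance of 30+ right on true-false by guessing
    print(formula_score(right=35, wrong=15, n_options=2))  # 20.0
```

Under these assumptions, a blind guesser is expected to get half of the true-false items right but only a quarter of the four-option items, which is why guessing is usually considered a larger concern for true-false tests.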

The reliability or stability of a test is usually expressed as a correlation coefficient. There are a number of types of reliability: stability over time, stability over item samples, stability of items, stability over scores, stability over testers, and standard error of measurement. The reliability of a test may be raised by increasing the number of items of equal quality to the other items (Best & Kahn, 2006; Traub, 1991; Harvill, 1991; Burton, 2002; Burton & Miller, 1999). The reliability coefficient for a set of scores from a group of examinees is the coefficient of correlation between that set of scores and another set of scores on an equivalent test obtained independently from the members of the same group (Ebel & Frisbie, 1991, p. 71).
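
The claim that reliability rises with test length is usually quantified with the Spearman-Brown prophecy formula. The sketch below is a minimal illustration, assuming the added items are of the same quality as the existing ones; the function name and the example reliability of 0.60 are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by `length_factor`.

    Spearman-Brown prophecy formula: r_new = k * r / (1 + (k - 1) * r),
    where r is the current reliability and k is the ratio of new to old length.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


if __name__ == "__main__":
    # Hypothetical example: doubling a test whose reliability is 0.60.
    print(spearman_brown(0.60, 2))
```

With these illustrative numbers, doubling the test length is predicted to raise the reliability from 0.60 to about 0.75.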

Toppino and Brochin's (1989) findings indicated that exposure to a statement on a true-false test increased students' tendency to believe that the statement was true, regardless of whether it actually was true or false. Jandaghi and Shateria (2008) found that the teacher-designed exams they examined had suitable coefficients of validity and reliability and that the level of difficulty of the exams was high. No significant difference was found between male and female teachers in terms of the coefficients of validity and reliability, but a significant difference was found between male and female teachers in the difficulty level.

As discussed above, an ideal test must not only measure what it is supposed to measure but also yield consistent results at different times. This characteristic is called reliability. Other measures of an ideal test are the difficulty level and the discriminant index. Bruce (1974) proposed a difficulty coefficient based on the assumption that increasing test difficulty results in increased variability of test scores and a lower mean; the coefficient is based on the variability of test scores at or above the mean divided by the mean test score, and a higher difficulty coefficient indicates a more difficult test (Stowell, 2003). The percentage of individuals who answer a question correctly is known as the difficulty coefficient, denoted by P (Seif, 2004). The discriminant index is a measure of discrimination between strong and weak groups (Jandaghi & Shateria, 2008).

Mehrens and Lehmann (1984) claimed that true-false items "are good for young children and/or pupils who are poor readers". Because true-false items require relatively few words, they are "useful in special situations", such as testing primary-grade children and poor readers (Hopkins & Antes, 1985, p. 135). As to how effective true-false items are, relative to multiple-choice items, for measuring learning outcomes that require higher-order thinking, it is said that true-false items are "not especially useful beyond the knowledge area" (Gronlund & Linn, 1990, p. 153). Ahmann and Glock (1981) concurred: "The true-false item serves best as a means of measuring the student's ability to recall information". On the other hand, another study drew the opposite conclusion: "Although writing true-false items to measure higher mental processes can be both difficult and time consuming, it is less difficult to write a good true-false item than to write a good four-response multiple-choice item" (Mehrens & Lehmann, 1984, p. 146).

The arithmetic mean, standard deviation, average difficulty, reliability and validity of a test occupy an important place in classical test theory (Gelbal, 1994, p. 85). This research is restricted to the question of whether there is a statistically significant difference between multiple-choice tests and true-false tests in terms of means, standard deviations, item difficulty indices, and item discrimination; it also considers students' answers to the tests and the behaviors that the test questions measure.

Multiple-choice tests are mostly preferred in educational systems, especially in general-purpose exams. This practice stems from the belief that multiple-choice tests usually have greater measuring ability than true-false tests. Gelbal (1994) suggests that multiple-choice tests are superior to other techniques in validity and reliability. For this reason, true-false tests are not seen among the measurement techniques used in general-purpose exams.

This study aims to compare the difficulty levels, discriminating power and capacity to test achievement of multiple-choice and true-false tests, and thus to reveal the rightness or wrongness of the commonly believed hypothesis that multiple-choice tests do not have the same properties as true-false tests. The research findings are important in that they demonstrate the consequences of a comparison of multiple-choice and true-false tests, and in that they will help educators and practitioners make more realistic choices in selecting and using these techniques.


Research Design

This is a descriptive study, which describes existing conditions without analyzing relationships among variables (Frankel & Wallen, 2000, p. 581). A descriptive study describes and interprets what is. It is concerned with conditions or relationships that exist, opinions that are held, processes that are going on, effects that are evident, or trends that are developing. It is primarily concerned with the present, although it often considers past events and influences as they relate to current conditions (Best & Kahn, 2006, p. 118). Descriptive research describes the characteristics of an existing phenomenon (Salkind, 2000, p. 11). Not only can descriptive research stand on its own, it can also serve as a basis for other types of research, in that a group's characteristics often need to be described before the meaningfulness of any differences can be addressed (Salkind, 2000, p. 11).

Data Collection Instrument

The research data were obtained through a 100-item test of 50 multiple-choice and 50 true-false questions, which was given at the end of the semesters, in which identical objectives, strategies, methods and techniques of teaching were employed. In addition, since the data were used as the students' final exam results, the students displayed their maximum performance in the process of measurement. In developing the data collection instrument, expert opinions were sought on the scope of the questions and on whether they formed parallel structures and processes, in order to ensure the content-related validity of the test items (Turgut, 1993; Tekin, 2007; Karasar, 2007; Frankel & Wallen, 2000; Wiersma & Jurs, 2005).

There are basically two approaches to determining the validity of an instrument. One is through a logical analysis of content or a logical analysis of what makes up an educational trait, construct, or feature. This is a judgmental analysis. The other approach, through an empirical analysis, uses criterion measurement, the criterion being some sort of standard or desired outcome (Turgut, 1993; Tekin, 2007; Karasar, 2007; Frankel & Wallen, 2000; Wiersma & Jurs, 2005). In this study, the logical analysis approach was selected and applied to determine validity.

Data Analysis

SPSS was employed for the item analysis and reliability analysis of the Academic Achievement Test, and the analyses were performed as follows:

1. The internal consistency coefficient for the overall test was estimated with both the K-R 20 and the Spearman-Brown procedures. The Kuder-Richardson procedures (K-R 20 and K-R 21) are applicable to binary data (Wiersma & Jurs, 2005, p. 325). Generally, items with a reliability below 0.5 cannot be said to be reliable (Glass & Hopkins, 1996; Best & Kahn, 2006; Karasar, 2007). On the other hand, a low reliability coefficient does not necessarily mean that the test is of poor quality, and in some circumstances a poor test might have high reliability (Traub, 1991, p. 178).

2. Each correct answer was assigned 1 point in scoring the Academic Achievement Test during item analysis. No points were assigned to incorrect answers or to skipped questions. Thus, the score an individual attains on the test is the number of items he or she has answered correctly. In the literature on test item analysis, a group of 100-200 examinees, with upper and lower groups each made up of 27% of the total, is generally regarded as ideal (Turgut, 1993; Salkind, 2000; Tekin, 2007). The group in this study has these features (n = 252).

a. The formula $d = (NC_h - NC_l) / (0.5\,T)$ was used in calculating the discrimination power (d) of the test applied. Item analysis was carried out on the basis of correlation, and items with values above .19 were classified as items with good discriminating power.

b. In the calculation of the item difficulty indices (D) of the test applied, the formula $D = (NC_h + NC_l) / T$ was used. The obtained value (P) was then used as the criterion to evaluate the items, and items with an item difficulty of .50 were considered appropriate as vehicles of measurement (Salkind, 2000, p. 129).

3. The independent-samples t test was used in the comparison of achievement on the basis of sex. The test of the significance of the difference between two means is known as a t test (Best & Kahn, 2006, pp. 406-407).

4. The paired t test was used to determine whether there was a statistically significant difference between the answer averages of the corresponding multiple-choice and true-false questions constituting the test.
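
The reliability and item statistics listed above can be sketched as follows. This is a minimal illustration and not the SPSS procedure used in the study. It assumes 0/1 item scores, the population variance of total scores in K-R 20, and the usual reading of the item formulas in which NC_h and NC_l are the numbers of correct answers in the high- and low-scoring groups and T is the combined size of the two groups; the function names and the 27% default are assumptions made for the example.

```python
from statistics import pvariance


def kr20(responses):
    """K-R 20 for a list of examinee rows, each a list of 0/1 item scores."""
    n_items = len(responses[0])
    n_people = len(responses)
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)  # assumption: population variance of total scores
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people  # proportion correct on item j
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)


def item_indices(responses, item, fraction=0.27):
    """Return (difficulty D, discrimination d) for one item.

    Examinees are ranked by total score; the top and bottom `fraction`
    form the high and low groups.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    group_size = max(1, round(len(ranked) * fraction))
    high, low = ranked[:group_size], ranked[-group_size:]
    nc_h = sum(row[item] for row in high)  # correct answers in the high group
    nc_l = sum(row[item] for row in low)   # correct answers in the low group
    t = 2 * group_size                     # examinees in both groups combined
    difficulty = (nc_h + nc_l) / t
    discrimination = (nc_h - nc_l) / (0.5 * t)
    return difficulty, discrimination


if __name__ == "__main__":
    # Hypothetical data: 5 examinees, 4 items.
    data = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(kr20(data))
    print(item_indices(data, item=0, fraction=0.4))
```

In practice the group size would come from a much larger sample, such as the 252 examinees reported here; the tiny matrix only shows how the quantities are formed.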


The findings of this study are explained with reference to Tables 1 to 4.

Table 1 indicates that the values calculated for the Academic Achievement Test show a high level of reliability.

According to Table 2, there is no statistically significant difference on the basis of sex between the averages of the multiple-choice and true-false tests given in the academic year. Because the scores are not statistically different, we can say that the MC and TF tests have sufficient predictive-related validity as instruments.

According to Table 3, there is a statistically significant difference between the answer averages of 28 items in terms of the t values calculated from the results of the parallel-structure multiple-choice and true-false tests that were applied in the academic year. Whereas the average of 13 items differed in favor of the multiple-choice items, the average of 15 items differed in favor of the true-false items. On the other hand, the difference between the averages of 22 items was not statistically significant.

In Table 3, 23 items had a greater average in favor of multiple-choice items, whereas 27 items had a greater average in favor of true-false items.
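
The t values behind item-by-item comparisons of this kind can be sketched as a paired t statistic computed on each examinee's two scores. The example below is a minimal illustration with made-up numbers; it is not the study's data or its SPSS output, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two matched score lists."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # t = mean difference divided by its standard error (sample SD / sqrt(n)).
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Hypothetical matched scores for four examinees on two test forms.
    mc_scores = [3, 5, 7, 9]
    tf_scores = [1, 4, 4, 7]
    print(paired_t(mc_scores, tf_scores))
```

The resulting t value is then compared with the critical value for the given degrees of freedom to decide whether the difference is statistically significant.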

According to Table 4, the discriminating indices and difficulty levels of parallel structure multiple-choice and true-false test items were at the same level, and the t values found for each pair of the items were not statistically significant.

On examining the difficulty features of the test items intended to measure the same properties (1-P > 0.50: difficult item), it was found that 10 out of 50 multiple-choice items were difficult and 40 were easy. On the other hand, 5 of the true-false items were difficult; 45 were easy.

On examining the discriminating properties of the test items intended to measure the same features, it was found that 10 of the 50 multiple-choice items were weak, 40 were normal; 14 of the true-false items were weak and 36 were normal.

Conclusion and Discussion

The calculations following the application and analysis of the multiple-choice and true-false tests, which had been prepared as equivalent forms, showed that reliability was high (Table 1). As is pointed out in Burton and Miller (1999), the fact that the research scale included 50 items of each test type affected test reliability in a positive way, a result that supports Jandaghi and Shateria's (2008) findings. Both MC and TF test scores show that the tests have the same level of predictive-related validity. To understand and use the standard error of measurement requires understanding the basic concepts of test reliability, such as true scores and measurement error (Harvill, 1991, p. 181).

There were no statistically significant differences between the multiple-choice and true-false test averages on the basis of sex (Table 2). This result is supported by the findings of Jandaghi and Shateria (2008), Burton and Miller (1999), Burton (2000, 2001), and Frisbie and Becker (1991).

The difference between the averages of answers to 28 test items was statistically significant. Whereas the average of 13 items differed in favor of multiple-choice items, the average of 15 items differed in favor of true-false items. Yet the difference between the averages of 22 items had no statistical significance (Table 3). In other words, those items were able to measure the intended properties at the same level in both the multiple-choice and the true-false type of question.

These results demonstrate that multiple-choice tests are not superior to true-false tests in uncovering students' achievement, and that true-false tests are not superior to multiple-choice tests either.

Twenty-four of the parallel-structure multiple-choice and true-false items that were applied in the academic year were at the same level in terms of discriminating indices and difficulty levels, and the t values calculated for each pair of items demonstrated no statistical significance (Table 4).

On examining the difficulty features of the test items intended to measure the same properties, it was found that neither type of test was more difficult or easier than the other. In the terms of Stowell (2003) and Seif (2004), these two types of tests were not composed of high-difficulty items. These results showed that the students' answers were similar in both tests, and that the test technique employed did not make the testing of students' achievement more difficult or easier.

An examination of the discriminating properties of the test items intended to measure the same features showed that neither test displayed a distribution different from the other in terms of discriminating indices. How much easier is the typical classroom true-false test than the typical classroom multiple-choice test of comparable content? Frisbie and Becker (1991) claimed that no research in the measurement literature supports the notion that true-false tests are easy, in the absolute sense, or easier than comparable multiple-choice tests. They suggested that there may simply be a need to re-examine the lore that causes many test developers to avoid true-false items. A survey of classroom testing practices would provide answers to the question of popularity. The results support Jandaghi and Shateria's (2008) findings.

Generally speaking, the research results are significant in that they show that true-false tests are not easier than multiple-choice ones, and that probable success stemming from the test structure does not contribute significantly to the difference in general.


Contrary to general practice and belief, true-false tests can be employed as an important alternative to multiple-choice tests in schools and in exams for description, education or level measurement, in schools or on other occasions. In general, MC and TF test scores may include chance points because of their structural features, and TF tests are more open to this because of their structure. However, in this study, when TF and MC scores were compared, the significant difference was not in favor of TF.

This result suggests that, if used in accordance with the principles, TF tests may be an important alternative to MC tests.


Ahmann, J. S., & Glock, M. D. (1981). Evaluating student progress: Principles of tests and measurement (6th ed.). Boston: Allyn & Bacon.

Becker, W. E., & Johnston, C. (1999). The relationship between multiple choice and essay response questions in assessing economics understanding. The Economic Record, 75, 348-357.

Bennett, R. E., Rock, D. A., & Wang, M. (1991). Equivalence of free-response and multiple choice items. Journal of Educational Measurement, 28(1), 77-92.

Best, W. J., & Kahn, V. J. (2006). Research in education (11th ed.). Pearson Education Inc.

Bruce, P. H. (1974). An index of test difficulty, with applications. NZJ Educ Stud., 9, 31-41.

Burton, F. B., & Miller, D. J. (1999). Statistical modeling of multiple-choice and true-false tests: Ways of considering, and of reducing, the uncertainties attributable to guessing. Assessment & Evaluation in Higher Education, 24(4), 399-411.

Burton, F. R. (2001). Quantifying the effects of chance in multiple choice and true-false tests: Question selection and guessing of answers. Assessment & Evaluation in Higher Education, 26(1).

Burton, F. R. (2002). Misinformation, partial knowledge and guessing in true-false tests. Medical Education, 36, 805-811.

Burton, F. R. (2004). Multiple choice and true-false tests: reliability measures and some implications of negative marking. Assessment & Evaluation in Higher Education, 29(5).

Burton, F. R. (2005). Multiple-choice and true-false tests: myths and misapprehensions. Assessment & Evaluation in Higher Education, 30(1), 65-72.

Ebel, K. L., & Frisbie, D. A. (1991). Essentials of educational measurement (5th ed.). Englewood Cliffs, NJ: Prentice-Hall.

Frankel, J. R., & Wallen, N. E. (2000). Exploring Research (4th ed.), Prentice Hall.

Fray, B. R. (1988). An NCME instructional module on formula scoring of multiple choice tests (correction for guessing). Instructional Topic in Educational Measurement, Module 4, summer, 75-81.

Frisbie, D. A., & Becker, D. F. (1991). An analysis of textbook advice about true-false tests. Applied Measurement in Education, 4(1), 67-83.

Frisbie, D. A. (1988). An NCME instructional module on reliability of scores from teacher-made tests. Instructional Topic in Educational Measurement, Module 3, spring, 55-65.

Gelbal, S. (1994). p madde güçlük indeksi ile Rasch modelinin b parametresi ve bunlara dayalı yetenek ölçütleri üzerine bir karşılaştırma [A comparison of p item difficulty indices with Rasch model parameter]. Hacettepe Üniversitesi Eğitim Fakültesi Dergisi [Journal of Hacettepe University Faculty of Education], 10, 85-94.

Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in education and psychology (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.

Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan.

Harvill, L. M. (1991). An NCME instructional module on standard error of measurement. Instructional Topic in Educational Measurement, module 9, summer, 181-190.

Hopkins, C. D., & Antes, R. L. (1985). Classroom measurement & evaluation (2nd ed.). Itasca, IL: Peacock.

Jandaghi, G., & Shateria, F. (2008). Rate of validity, reliability and difficulty indices for teacher-designed exam questions in first year high school. International Journal of Human Sciences, 5(2).

Karasar, N. (2007). Bilimsel araştırma yöntemleri [Scientific research methods] (17th ed.). Ankara: Nobel Yayıncılık.

Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2(3), 151-160.

Madaus, G. F., & O'Dwyer, L. M. (1999). A short history of performance assessment: Lessons learned. Phi Delta Kappan, 80, 688-695.

Mehrens, W. A., & Lehmann, I. J. (1984). Measurement and evaluation in education and psychology (3rd ed.). New York: Holt, Rinehart & Winston.

Posey, C. (1932). Luck and examination grades. Journal of Engineering Education, 60, 292-296.

Salkind, N. J. (2000). Exploring research (4th ed.). Prentice-Hall, Inc.

Stiggins, R. J. (1992). An NCME instructional module on design and development of performance assessments. Instructional Topic in Educational Measurement, Module 12, spring, 211-215.

Swartz, S. M. (2006). Acceptance and accuracy of multiple choice, confidence-level, and essay question formats for graduate students. Journal of Education for Business, March/April, 215-220.

Tekin, H. (2007). Eğitimde ölçme değerlendirme [Measurement and evaluation in teaching] (18th ed.). Ankara: Yargı Yayınevi.

Thorndike, R. M. (1997). Measurement and evaluation in psychology and education (6th ed.). Columbus, OH: Merrill.

Toppino, T. C., & Brochin, H. A. (1989). Learning from tests: the case of true-false examinations. Journal of Educational Research, November/December, 83(2), 119-124.

Traub, R. E., & Rowley, G. L. (1991). An NCME instructional module on understanding reliability. Instructional Topic in Educational Measurement, Module 3, spring, 171-175.

Turgut, F. (1993). Eğitimde ölçme değerlendirme [Measurement and evaluation in teaching] (9th ed.). Ankara: Saydam Matbaacılık.

Wang, J. (1995). Critical values of guessing on true-false and multiple-choice tests. Education, 116(1), 153-158.

Wiersma, W., & Jurs, S. G. (2005). Research methods in education (8th ed.). Pearson Education Inc.

Author affiliation:

Mehmet Tasdemir, PhD. Ahi Evran University, Faculty of Education, Department of Education, 40100 Kirsehir, Turkey.

Correspondence concerning this article should be addressed to Dr. Mehmet Tasdemir.
