Abstract: Computer technologies have opened up new possibilities for optimizing test administration, test development and assessment. Computer Aided/Assisted Testing, as well as Computer Assisted Language Testing, has brought many positive aspects that can be applied in order to create a more positive attitude toward test assessment, to reduce item exposure and the associated security risks, and to provide a valid and reliable measurement of students' competence. Nowadays, teachers can choose between supervised and unsupervised e-tests, and quite often they opt for unsupervised tests, as these allow frequent testing of many students with fewer teaching staff and give each student considerable freedom in choosing the time, place and manner in which to take the test. Teachers who prefer classical methods of teaching insist on paper-and-pen tests. The authors describe their experience with all of the above-mentioned types of tests, and then focus on their research dealing with long-term observation of students' results and statistical assessment of their performance in English. In this article, students' results on e-tests and paper-and-pen tests (supervised and unsupervised) are compared in order to find any relationships among them and to find an optimal proportion among the various types of tests. Statistical methods have been applied to obtain valid data when analysing the different types of tests and comparing their results. Parametric and non-parametric statistical tests of the hypotheses regarding these relationships are described in the last part of this article.
Keywords: computer assisted testing, supervised and unsupervised e-tests, paper-and-pencil tests, applied statistics, statistical tests
1. Introduction
Information and communication technologies have become widespread all around the world. Moreover, in the field of language testing they have been instrumental in driving innovation. As a result, students can take computerized tests such as CBT (Computer Based Test), CBELT (Computer Based English Language Test) and CAT (Computer Aided/Assisted Test or Computer Adaptive Test), all within the broader framework of CALT (Computer Assisted Language Testing) (Bennett, 2002; Pommerich, 2004; Cechova, 2014; Rezaie and Golshan, 2015).
Today, computer technologies in the field of language learning, teaching, assessment, and testing have become so widespread that they are regarded as an inseparable part of the education system at basic and secondary schools and universities. José Noijons defines CALT as "an integrated procedure in which language performance is elicited and assessed with the help of a computer". In its purest form, three integrated procedures can be distinguished, relating to the following processes: generating the test, interaction with the candidate, and evaluation of responses (Noijons, 1994). Laghos adopts the same definition of CALT (Laghos, 2009). Chapelle writes about three main motives for using technology in language testing: efficiency, equivalence, and innovation. Efficiency is achieved through computer-adaptive testing and analysis-based assessment that utilizes Automated Writing Evaluation (AWE) or Automated Speech Evaluation (ASE) systems. Equivalence refers to research on making computerized tests equivalent to paper-and-pencil tests, which are considered to be "the gold standard" in language testing. Innovation, where technology can create a true transformation of language testing, is revealed in the reconceptualization of the L2 ability construct in CALT as "the ability to select and deploy appropriate language through the technologies that are appropriate for a situation" (Chapelle & Douglas, 2006).
The nature of CALT has many positive aspects. According to Pathan, the use of computer technology in the field of language assessment and testing falls under three major domains, depending on the way the technology is used. These include:
* use of computer for generating tests automatically;
* interaction of computer with the candidate (in the form of online interaction);
* use of computer for the evaluation of test taker's responses (Pathan, 2012).
The advantages of CALT are well known and apparent. Pathan has mentioned the following: overcoming administrative and logistic burdens, consistency and uniformity, better authenticity and greater interaction, self-pacing, a more positive attitude toward tests, and individualized testing environments (Pathan, 2012). CALT also helps test developers to set the same test conditions for all participants, as well as to improve all aspects of test security by storing questions and responses in databases, and it enables testers to create randomized questions and answers from vast question databanks. The use of CALT also decreases the amount of time needed for test preparation and marking, and enables teachers to track a student's behaviour (e.g. time spent on one test item or section, corrections made, etc.).
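As a simple illustration of this last point, the following Python sketch shows how a test could be assembled by drawing random items from a question bank and shuffling the answer options. It is not the authors' system; the item bank, the `build_test` helper and the example items are hypothetical.

```python
# Minimal sketch (not the authors' system): drawing a randomized test from a
# question bank and shuffling answer options, as CALT platforms typically do
# to reduce item exposure.
import random

# Hypothetical item bank: each item has a stem, one key and several distractors.
ITEM_BANK = [
    {"stem": "She ___ to work every day.", "key": "goes", "distractors": ["go", "going", "gone"]},
    {"stem": "They have lived here ___ 2010.", "key": "since", "distractors": ["for", "from", "at"]},
    {"stem": "If I ___ you, I would accept the offer.", "key": "were", "distractors": ["am", "was", "be"]},
    {"stem": "The report must ___ by Friday.", "key": "be finished", "distractors": ["finish", "finished", "be finish"]},
]

def build_test(bank, n_items, seed=None):
    """Draw n_items at random and shuffle the options of each drawn item."""
    rng = random.Random(seed)
    items = rng.sample(bank, n_items)
    test = []
    for item in items:
        options = [item["key"]] + item["distractors"]
        rng.shuffle(options)
        test.append({"stem": item["stem"], "options": options, "key": item["key"]})
    return test

if __name__ == "__main__":
    for i, item in enumerate(build_test(ITEM_BANK, 3, seed=42), start=1):
        print(f"{i}. {item['stem']}")
        for letter, option in zip("abcd", item["options"]):
            print(f"   {letter}) {option}")
```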
On the other hand, many researchers (Canale, 1986; Lange, 1990; Tung, 1986; Alderson, 2000) have shown the limitations and pitfalls of the use of CALT in the field of language assessment/testing. Alderson emphasizes that examinees need computer literacy in order to eliminate the mode effect on computer-based testing (Alderson, 2000). According to other language test specialists, CALT requires, for example, equipped computer labs and a large bank of test items, and all test items must measure the same single trait. Moreover, open-ended questions are not usually presented in computerized formats because these kinds of questions are usually scored by teachers.
Students especially appreciate the lack of time limits and being able to work without the supervision of testers or teachers, which some find stressful. They can be given as much time as they need to finish a test, and unsupervised tests allow students to work at their own pace and to choose the time and place (i.e. a less formal environment) within the test deadline. Teachers emphasize easy test distribution, without the necessity to copy hundreds of papers, and especially easy assessment, without hours spent on marking tests.
2. Computer assisted testing versus traditional testing
One of the first debates concerning the equivalence of CALT (e-testing) and traditional testing (pen-and-paper-based tests) was published in 1992 by Dillon, who wrote a critical review of the empirical literature on reading from paper versus screen. According to Dillon's findings, it is not possible "to achieve total equivalence" between them (Dillon, 1992). Dillon wrote that reading was some 20 to 30% slower (in terms of proof-reading performance) from a computer screen than from paper (Dillon, 1994). Noyes and Garland confirm in their study that early studies comparing computer- and paper-based tasks generally favoured paper for better performance according to the metrics of speed, accuracy and comprehension (Noyes, Garland, 2008). How students perform on computer-delivered tests depends, in part, on how familiar they are with the technology, concludes a set of studies conducted in Princeton, N.J. Results showed that average scores on computer-based writing tests generally were not significantly different from average scores on paper-based exams. But, as with math tests, individual students with better hands-on computer skills tended to achieve higher online scores, after adjusting for their level of paper writing skills (Olson). Karadeniz studied the impact of paper-based, web-based, and mobile-based assessment on students' achievement. In his study he shows that students had positive attitudes towards web-based and mobile-based assessment due to ease of use and comprehensive, instant feedback. Moreover, the most favoured tests were web-based, and the least favoured were paper-based (Karadeniz, 2009).
3. English language at the University of Defence
Knowledge of foreign languages helps reduce language barriers and is essential for increasing individuals' mobility both in their personal and professional lives. This is one of the reasons the University of Defence (UoD) emphasizes the study of foreign languages, especially English. Another reason is that the Czech Army is part of the NATO structures. English is undoubtedly a priority, as it is a mandatory subject for all students at the UoD. However, all students have to study two foreign languages; in addition to the obligatory English, they can choose German as a second language. Language education at the UoD respects the NATO STANAG 6001 standard.
NATO STANAG (Standardisation Agreement) is an international military standard created by the North Atlantic Treaty Organisation (NATO) for regulating equipment, procedures, tactics, training and just about everything affecting the way armed forces from different countries work together on operations and exercises. STANAG 6001 is a language proficiency scale designed to allow comparisons of language ability across different countries. The scale consists of a set of descriptors with proficiency skills broken down into six levels/SLP (Standardized Language Profile). They are defined as follows:
* Level 0: No proficiency
* Level 1: Survival
* Level 2: Functional
* Level 3: Professional
* Level 4: Expert
* Level 5: Highly-articulate native.
Each level category contains tests of all four skills, always in the same order as they appear on STANAG language certificates: Listening Comprehension, Speaking, Reading Comprehension, and Writing.
UoD students are obliged to pass NATO STANAG 6001 SLP 2222 by the end of the fifth semester, but the number of English lessons is limited by the accredited study programmes, and this number is not very high. This is one reason why teachers are trying to replace pen-and-paper tests with e-tests, and in some cases even unsupervised e-tests. Unsupervised tests were chosen in order to save time for regular face-to-face lessons. They enable teachers easier, but nevertheless safe, distribution of tests, and immediate assessment and feedback.
At the beginning of the first semester all students are tested in order to create homogeneous groups, so that students of similar levels are placed together and work on materials suited to their particular level. Placement tests are carried out in the form of unsupervised e-tests, because it is necessary to assess a large number of students and to provide accurate and fast results to both the examinees and the teachers. However, in some groups there were still students with different levels of English, so the authors decided:
* to repeat the unsupervised placement test (an e-test with the same structure but different items);
* to verify the test results with a supervised pen-and-paper test.
The authors' objective was to find out whether all test results are comparable and whether the tests measure what they are supposed to measure. The authors compared the test results of all military students studying at the Faculty of Military Leadership (FML) and the Faculty of Military Technologies (FMT) in a new study programme. So far only data for the first and second years are available, as this study programme started in September 2014. A further objective was to look at how students performed when given English language tests on paper versus on a computer.
4. Placement analysis
The data have been collected from September 2014 to January 2016. Exploratory statistics formed the basis of the survey. The authors used data gathered over two academic years, 2014/2015 and 2015/2016. Altogether, the test results of 314 students were analysed:
* academic year 2014/2015: 146 students altogether, 84 from the FML and 62 from the FMT;
* academic year 2015/2016: 168 students altogether, 104 from the FML and 64 from the FMT.
4.1 Basic features of statistical data
Basic features of the variables are presented in the following table (Table 1), which describes the numeric characteristics of the data file. Test T1 was an unsupervised e-test taken at the beginning of the first semester, and test T2 was an unsupervised e-test taken at the end of the first semester. Test T/S was a supervised pen-and-paper test whose aim was to confirm or reject the relevance of tests T1 and T2; the students took it at the beginning of the third semester. The first number in the table indicates the year of study (1 is the first year, 2 the second year).
If we look at the data characteristics in detail, it is obvious that the basic indicators (number of observations N, mean, median, minimum, maximum, lower and upper quartile, standard deviation, skewness, kurtosis; see Table 1) show statistical differences between the FML and the FMT. These differences are more apparent from the box-and-whisker plot (Figure 1), which expresses the basic features of position and variability of the analysed variables.
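As an illustration of how such descriptive characteristics and a box-and-whisker plot can be produced, the following Python sketch uses pandas and SciPy on a small synthetic sample. The variable names mirror the paper's naming convention, but the scores themselves are invented.

```python
# Illustrative sketch only: computing Table 1-style characteristics (N, mean,
# median, quartiles, standard deviation, skewness, kurtosis) and drawing a
# box-and-whisker plot as in Figure 1. The data below are hypothetical.
import pandas as pd
from scipy import stats

scores = pd.DataFrame({
    "1FML_T1": [62, 71, 55, 80, 67, 73, 59, 64],
    "1FMT_T1": [65, 70, 58, 77, 69, 75, 61, 66],
})

def describe(series):
    """Return the numeric characteristics reported in Table 1 for one variable."""
    return pd.Series({
        "N": series.count(),
        "mean": series.mean(),
        "median": series.median(),
        "min": series.min(),
        "max": series.max(),
        "lower quartile": series.quantile(0.25),
        "upper quartile": series.quantile(0.75),
        "std. dev.": series.std(ddof=1),
        "skewness": stats.skew(series, bias=False),
        "kurtosis": stats.kurtosis(series, bias=False),
    })

print(scores.apply(describe).round(2))
scores.plot(kind="box")   # box-and-whisker plot (requires matplotlib), cf. Figure 1
```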
This Figure (Figure 1) shows that there is no significant difference between the test results of the first-year students of the Faculty of Military Leadership and the Faculty of Military Technologies who started their studies in 2015. More differences are apparent in the test results of students who started their studies in 2014. The first test of the FML students (2FML_T1) shows a worse level of English in comparison with the FMT students (2FMT_T1). Interestingly, although the results of this first test were worse, the results of the second unsupervised test and the third supervised test do not show any significant differences. The FMT test results show a tendency towards improvement, as the results of the last, supervised test are the best.
The next step was to test the normality of the discussed characteristics (Johnson, 2006) for all years and tests. The following Figure (Figure 2) is an example of the histograms and the expected normal distribution for the test results of the FML first-year students.
According to the test results, the normality of all described variables is not rejected at the 5% significance level.
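A minimal sketch of such a normality check at the 5% level is shown below. The article does not state which normality test was applied; the Shapiro-Wilk test from SciPy is used here as one common choice, and the scores are synthetic.

```python
# Sketch of a normality check at the 5% significance level (the authors follow
# Johnson, 2006; the specific test used here is an assumption).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
test_scores = rng.normal(loc=68, scale=9, size=84)   # hypothetical T1 scores

stat, p_value = stats.shapiro(test_scores)
alpha = 0.05
print(f"W = {stat:.3f}, p = {p_value:.3f}")
if p_value > alpha:
    print("Normality is not rejected at the 5% level.")
else:
    print("Normality is rejected at the 5% level.")
```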
4.2 Internal analysis of categories
After obtaining the above-mentioned test results, the authors wondered whether the students achieved the same, different, or better results on test T2 than on test T1. Similarly, they wanted to find out whether students achieved the same, different, or better results on test T/S than on test T2. Via a t-test for dependent samples (Johnson, 2006), the null hypothesis H: μ1 = μ2 has been statistically tested against the alternative hypothesis A: μ1 ≠ μ2 (or A: μ1 < μ2) at the 5% level of significance. The relations between tests T/S and T2 have been tested in the same way (Table 2).
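The following sketch illustrates this paired comparison with SciPy's dependent-samples t-test. The score vectors are invented and merely stand in for a pair such as T1 and T2 for one group of students.

```python
# Minimal sketch of the dependent-samples (paired) t-test of H0: mu1 = mu2
# against a two-sided or one-sided alternative at the 5% level.
import numpy as np
from scipy import stats

t1 = np.array([58, 64, 71, 55, 69, 73, 60, 66])   # unsupervised e-test, start of semester (hypothetical)
t2 = np.array([61, 66, 74, 57, 72, 75, 63, 70])   # unsupervised e-test, end of semester (hypothetical)

# Two-sided alternative A: mu1 != mu2
t_stat, p_two_sided = stats.ttest_rel(t1, t2)
# One-sided alternative A: mu1 < mu2
_, p_less = stats.ttest_rel(t1, t2, alternative="less")

alpha = 0.05
print(f"t = {t_stat:.3f}, two-sided p = {p_two_sided:.4f}, one-sided p = {p_less:.4f}")
print("Reject H0" if p_two_sided < alpha else "Do not reject H0")
```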
According to the test results, the null hypothesis H: μ1 = μ2 has not been rejected at the 5% significance level in the case of the first-year students of both the FML and the FMT. These students achieved statistically the same results on tests T1 and T2, and also the same results on tests T2 and T/S.
Statistically significant results have been found for the second-year students. In the case of the FML students, the null hypothesis H: μ1 = μ2 has been rejected at the 5% significance level; their test T2 results were better than their test T1 results. On the other hand, for the second-year FMT students the null hypothesis has not been rejected at the 5% significance level; their test T1 and T2 results did not differ.
The null hypotheses that students achieve the same results on test T/S as on test T2 have been rejected at the 5% significance level in the case of the second-year students of both the FML and the FMT. Students of both faculties achieved statistically significantly better results on test T/S.
The authors' further endeavour has been to investigate the relationships, correlations and especially the outliers, an outlier being an observation point that is distant from other observations (Johnson, 2006). An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. Outliers may include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low.
Linear regression and correlation for the variables 1FML_T1 and 1FML_T2 are shown in Figures 3 and 4. From these Figures it is obvious that some results have been extremely low or high. In Figures 3 and 4 the blue colour expresses the majority of the test results, and the outliers are shown outside the blue area. These Figures (3 and 4) are an example for one pair of variables only; during the analysis the other variables have been tested in the same way.
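The sketch below illustrates the regression and correlation step together with a simple residual-based outlier screen. The data are synthetic (the last observation mimics a student with a suspiciously high unsupervised score), and the two-standard-deviation rule is one common convention, not necessarily the authors' criterion.

```python
# Sketch of the regression/correlation analysis for a pair of test variables
# such as 1FML_T1 and 1FML_T2, plus a simple outlier screen on the residuals.
import numpy as np
from scipy import stats

t1 = np.array([55, 58, 62, 65, 68, 70, 73, 76, 80, 85])   # hypothetical unsupervised T1 scores
t2 = np.array([57, 60, 65, 67, 70, 72, 76, 78, 83, 45])   # hypothetical T2 scores; last student drops sharply

slope, intercept, r, p, stderr = stats.linregress(t1, t2)
print(f"T2 = {intercept:.2f} + {slope:.2f}*T1, r = {r:.3f}, p = {p:.4f}")

# Flag observations whose residual exceeds two residual standard deviations.
residuals = t2 - (intercept + slope * t1)
threshold = 2 * residuals.std(ddof=2)
outliers = np.where(np.abs(residuals) > threshold)[0]
print("Potential outliers at indices:", outliers)
```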
The next step was to find the reasons for the extreme outliers via interviews with students. Students admitted that they had cheated in the tests. Some of them wanted to be placed in better study groups in order to have a better chance of achieving good study results and, especially, of fulfilling a fundamental prerequisite for obtaining the credit in the fifth semester, which is passing NATO STANAG 6001 SLP 2222. On the other hand, some students said that a weaker group meant for them a slower pace of study, revision of grammar, and more concentration on the NATO STANAG 6001 SLP 2222 skills. Both groups of students wanted to achieve their goals by different means. However, after the last testing all students were placed in homogeneous groups according to their level of knowledge. In homogeneous groups the teachers can concentrate on students' weaknesses in order to achieve the desired results.
4.3 Comparison of results at the Faculty of Military Leadership and the Faculty of Military Technologies
The authors then compared the results of the two faculties. In this case they used the t-test for independent samples (Johnson, 2006) at the 5% significance level.
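A minimal sketch of such an independent-samples comparison is given below. The two score vectors are synthetic, and Welch's variant is used here as a safe default when the group variances may differ; the article does not state which variant was applied.

```python
# Sketch of the independent-samples t-test used to compare FML and FMT results
# at the 5% significance level for one test/year combination (hypothetical data).
import numpy as np
from scipy import stats

fml = np.array([62, 66, 70, 58, 73, 68, 64, 71, 75, 60])
fmt = np.array([67, 72, 74, 63, 78, 71, 69, 76, 80, 66])

# Welch's variant (equal_var=False) does not assume equal group variances.
t_stat, p_value = stats.ttest_ind(fml, fmt, equal_var=False)
alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("Reject H0: groups differ" if p_value < alpha else "Do not reject H0: no significant difference")
```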
It is possible to confirm that the FML and FMT students achieved statistically identical results for all monitored tests and years, with one exception: the second-year FMT students achieved better results than the FML students when the T/S tests were compared at the 5% significance level.
5. Conclusion
This article is aimed at measuring the comparability of pen-and-paper tests and two computer-based tests. Thus far, the authors have found that there is no significant difference between the computer-based tests and the pen-and-paper tests. This indicates that the testing mode has no effect on the scores of the computer-based tests, and consequently there is no impact on the overall validity of the tests. To sum up, there was no significant effect of the testing mode on the overall reliability and validity of the tests.
CALT expands the availability of computer-based testing with all its advantages, and will undoubtedly become a major medium of test delivery in the future. Throughout all our computer-related language teaching and testing efforts, however, considerations of quality, reliability and meeting the standards must be of primary importance. Nonetheless, it is obvious that computer-assisted language testing alone does not make a good language test without sophisticated expert knowledge of test writing and validation.
References
Alderson, J. C. (2000) Technology in Testing: the Present and the Future. System, 28(4), 593 - 603.
Bañados, E. (2006) A Blended-learning Pedagogical Model for Teaching and Learning EFL Successfully through a Networked Interactive Multimedia Environment. CALICO Journal, 23(3), 533-550. Special Issue: What does it take to teach online? Towards a Pedagogy of Online Teaching and Learning.
Bennett, R. E. (2002) Inexorable and inevitable: the continuing story of technology and assessment. The Journal of Technology, Learning, and Assessment, 1(1), 1-23.
Chapelle, C. A., & Chung, Y. R. (2010) The promise of NLP and speech processing technologies in language assessment. Language Testing 27 (3), 301-15.
Chapelle, C. A., & Douglas, D. (2006) Assessing language through computer technology. Cambridge, England: Cambridge University Press.
Cechova, I., Neubauer, J., Sedlacik, M. (2014) Computer-Adaptive Testing: Item Analysis and Statistics for Effective Testing. Proceedings of the 13th European Conference on e-Learning ECEL-2014. Aalborg University Copenhagen, Denmark, 2, p. 106-112.
Cerna, M. (2014) Trends in Acceptance of Social Software Applications in Higher Education from the Perspective of University Students - Case Study. Proceedings of the 13th European Conference on e-Learning. Copenhagen, ISBN 978-1-910309-67-4.
Dillon, A., (1992) Reading from paper versus screens: A critical review of the empirical literature. Ergonomics, 35, 1297-1326.
Graham, D. (2004) A survey of assessment methods employed in UK higher education programmes for HCI courses, Proceedings of the 7th HCI Educators Workshop (Preston, LTSN), 66-69.
Jamila, M., Tariqb, R. H., Shami, P. A. (2012) Computerized vs paper-based examinations: perception of university teachers. TOJET: The Turkish Online Journal of Educational Technology - October 2012, volume 11 Issue 4.
Johnson, R. A., Bhattacharyya, G. K. (2006) Statistics: Principles and Methods. 5th ed. Hoboken: Wiley.
Karadeniz, S. (2009) The impacts of paper, web and mobile based assessment on students' achievement and perceptions. Scientific Research and Essay, 4(10), 984-991. Available at: http://www.academicjournals.org/sre, accessed 21 May 2016.
Laghos, A., Zaphiris, P. (2009) Computer-Aided Language Learning. Available at: http://www.igi-global.com/dictionary/computer-assisted-language-testing-calt/5113, accessed 17 May 2016.
Noijons, J. (1994). Testing computer assisted language tests: Towards a checklist for CALT. CALICO Journal, 12(1), 37-58.
Noyes, J. M., Garland, K. J. (2008) Computer- vs. paper-based tasks: Are they equivalent? Ergonomics, 51(9), 1352-1375.
Olson, L. (2005) Impact of Paper-and-Pencil, Online Testing Is Compared. Available at: http://www.edweek.org/ew/articles/2005/08/31/01online.h25.html, accessed 21 May 2016.
Pathan, M. (2012) Computer Assisted Language Testing [CALT]: Advantages, Implications and Limitations. Available at: http://www1.udel.edu/fllt/main/FLMediaCenter/Computer Assisted Language Testing CALT Advantages Implications and Limitations-libre.pdf, accessed 11 May 2016.
Pommerich, M. (2004) Developing computerized versions of paper-and-pencil tests: Mode effects for passage-based tests. The Journal of Technology, Learning, and Assessment, 2(6), 3-44.
Rezaie, M., Golshan, M. (2015) Computer Adaptive Test (CAT): Advantages and Limitations. International Journal of Educational Investigations, 2(5), 128-137.
Sim, G., Holifield, P., Brown, M. (2004) Implementation of Computer Assisted Assessment: Lessons from the Literature. ALT-J, Research in Learning Technology, 12(3), 217-233.
Marek Sedlacik and Ivana Cechova
University of Defence, Czech Republic