Items Quality Analysis Using Rasch Model To Measure Elementary School Students’ Critical Thinking Skill On Stem Learning

Critical thinking, one of the 21st century competences required of students, needs to be developed and assessed with a qualified assessment instrument. A test is one kind of critical thinking assessment instrument, and its quality must be developed and analysed to support meaningful learning. A total of 10 multiple-choice items were developed based on critical thinking indicators. The items were then given to forty-two 4th grade students in one of the elementary schools in Tasikmalaya, West Java, after they had received STEM learning. Focus group discussions were conducted to construct and validate the instrument. The test results were analyzed using the Rasch model with the assistance of Winsteps software version 3.75. The results indicated that the Rasch model analysis could explain the quality of the critical thinking items based on their level of difficulty and suitability, and could categorize students’ abilities and their suitability for the STEM learning conducted.


Introduction
Assessment is an important means of measuring students' skills. The 21st century skills are skills that students must acquire in line with the development of science and technology (Kivunja, 2015; Hugerat & Kortam, 2014). Critical thinking is one of these skills (Kay & Greenhill, 2011; Sahidah Lisdiani et al., 2019). It is a fundamental skill in assessing and making decisions (Fisher, 2011; Kay & Greenhill, 2011). Moreover, it is a competency that students need for their future personal and professional lives (Bezanilla et al., 2019). Critical thinking is taught by approaching and solving problems on the basis of persuasive argument, logic, and rationality, which involves verification, evaluation, choosing the right answer for the given task, and reasoned rejection of alternative solutions (Barnhart & van Es, 2015). The indicators of critical thinking skills that can be developed based on "Assessing 21st Century Skills for Teachers and Students" are analyzing arguments, claiming, or proving; drawing conclusions or reasoning; judging or evaluating; and making decisions or solving problems (P21 Partnership for 21st Century Skills: http://www.p21.org/our-work/p21-framework). These indicators can be used as a reference for assessing students.
In general, qualified instruments for measuring students' critical thinking skills are limited, especially at the elementary school level. The items provided tend to be conceptual and at the remembering level, which reflects neither authentic learning nor an authentic assessment process. Authentic assessment can be carried out only if the learning process is authentic as well (Swaffield, 2011). Moreover, teachers need to master critical thinking skills to understand what must be taught and evaluated. However, teachers often set aside the importance of presenting critical-thinking-based learning (Kek & Huijser, 2011). Elementary school teachers focus on activities that develop only low-level thinking skills, without considering other activities that demand students' critical thinking (Assaraf & Orion, 2010). The remembering activities commonly implemented by teachers are not relevant to conceptual and meaningful learning (Živković, 2016). Many concepts are taught in elementary school science, including force and the motion of objects. This is one of the most important concepts in teaching Science because it is the basic concept required to understand more advanced material. Force and motion are the basic concepts for studying mechanics at a higher level, especially Newton's laws of motion (Panprueksa et al., 2012). Teaching the concepts of force and object motion requires an integrative approach of Science, Mathematics, and Engineering so that the learning is meaningful for students.
Test results in Indonesia show a discrepancy between national and international test instruments (Winarti & Patahuddin, 2017). Indonesia has generally obtained the lowest scores on international tests such as TIMSS, PIRLS, and PISA compared to other countries. These weak learning outcomes indicate some distressing trends in Mathematics and Science. Only a small number of students performed well in TIMSS and PIRLS, while in PISA none of the sampled students performed at level 6 (the highest) in Mathematics or Science from 2009 (Pedro et al., 2013). PISA data from 2018 showed that Indonesia reached only level 1 in Mathematics, Science, and reading (OECD, 2019). The tests in those international instruments generally explore higher order thinking skills, and critical thinking in particular. Therefore, thinking skills need to be trained through an appropriate learning process so that more authentic results are obtained. One of the recommendations from PISA was to emphasize integrated learning: integrating different subjects, integrating diverse students, and integrating various learning contexts, such as real-life contexts, with a variety of resources from the community. These learning processes need to be designed so that students can succeed authentically.
In this study, learning that explores authentic abilities was implemented through STEM (Science, Technology, Engineering, and Mathematics) learning. An integrated approach to STEM education is expected to help students solve real-world problems in the future by applying cross-disciplinary concepts together with critical thinking, collaboration, and creativity (Burrows & Slater, 2015). The skills resulting from STEM learning overlap with the skills needed in 21st century education. Therefore, STEM learning can bridge the gap between education and the skills needed in the 21st century, especially critical thinking skill (Putra & Kumano, 2018).
The success of STEM learning in schools is influenced by the curriculum structure and by teachers' skill level and readiness to teach it (Blackley & Howell, 2015). The STEM learning approach is increasingly popular, but it remains challenging and difficult for teachers to understand (Wahono & Chang, 2019; Shernoff et al., 2017). For teachers, content knowledge is the basis for applying the STEM approach in the classroom (Putra & Kumano, 2018). However, most teachers have been trained in only one subject (Honey et al., 2014), and most schools and classes at all levels treat the STEM disciplines as separate subjects. This is a significant challenge for teachers who are interested in implementing integrated STEM. In the context of elementary education in Indonesia, only science and mathematics are included in the 2013 curriculum, while technology and engineering are only a minor part of the curriculum or are not included at all. Although STEM education in elementary schools places more emphasis on Science and Mathematics, STEM can help students develop their critical thinking skill. Measuring critical thinking skill demands a good test instrument. A good test instrument must meet several criteria, including a suitable range of item difficulty levels and items that fit the indicators being measured. Such an instrument can be used to assess elementary school students' critical thinking skill. Determining the quality of the instrument is therefore essential, and this requires a sound instrument analysis.
Learning assessment provides useful information for teachers to help students learn better. Besides the tools from the Classical Test Theory (CTT) approach that teachers commonly use, another approach called objective measurement, which is based on probability, is an alternative that can provide more precise measurement. The Rasch model, which provides a psychometric analysis technique, can be used by teachers to develop test items and to present relevant information about students' learning assessments (Sumintono, 2018). The analysis of a test instrument using the Rasch model belongs to item response measurement theory. This measurement can explain the interaction between the subjects and the test items, which makes the measurement results more precise and objective (Sumintono & Widhiarso, 2014). Furthermore, the Rasch model is a well-studied measurement approach that models the relationships between item difficulties, people's abilities, and the probability of the responses given (Andrich, 1981). The advantage of the Rasch model compared to classical theory is that it can identify wrong answers from experts, identify improper judgments, and predict missing data based on systematic response patterns (Goodwin & Leech, 2003; Ratna et al., 2017; Fahmina et al., 2019). Using the Rasch model, this study aims to determine the quality of the items in measuring the critical thinking skill of elementary school students and to measure the level of students' critical thinking skill after they learn using STEM.
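The relationship between a person's ability, an item's difficulty, and the probability of a correct response can be illustrated with the dichotomous Rasch model. The following is a minimal sketch rather than the computation performed by Winsteps; the function name and the logit values are hypothetical and chosen only for illustration.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the dichotomous Rasch model.

    Both arguments are expressed in logits. The probability depends only
    on the difference between person ability and item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical logit values, for illustration only (not taken from the study).
p_matched = rasch_probability(ability=0.0, difficulty=0.0)  # item matches the person
p_hard = rasch_probability(ability=0.0, difficulty=2.0)     # item is 2 logits harder
p_easy = rasch_probability(ability=0.0, difficulty=-2.0)    # item is 2 logits easier

print(round(p_matched, 2), round(p_hard, 2), round(p_easy, 2))
```

When ability equals difficulty the probability is 0.5, and it falls as the item becomes harder relative to the person, which is the property that allows persons and items to be placed on one logit scale.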

Method
The data were collected by administering a written test of 10 critical-thinking-based multiple-choice items, which referred to "Assessing 21st Century Skills for Teachers and Students", to 4th grade students in one of the elementary schools in Tasikmalaya, West Java. The test was constructed by adapting the teaching materials and learning processes carried out previously. The written test items were not constructed in isolation but were always accompanied by the construction of other teaching devices such as lesson plans, media, students' worksheets, and teaching materials. The written test and the other teaching devices were reviewed in focus group discussions (FGD) with the STEM learning development team to compile and validate the instruments. The agreed suitability of the teaching material and the scope of the learning process resulting from the FGD is the final result of the instrument validation process. The test was given to students on the day after they took part in the STEM-based learning. The test results were then analyzed using the Rasch model with the Winsteps 3.75 software. The process of analyzing the data is illustrated in Figure 1. The result of the written test development is described in Table 1. The Rasch model was used to analyze the written test results in relation to the indicators of students' critical thinking skill after the implementation of STEM-based learning and to identify the usefulness of the constructed written test items. The description of the STEM learning process was also used to help explain the Winsteps 3.75 output. The analysis was carried out by describing the Wright map and the tables generated by Winsteps 3.75. One of the main components of Rasch analysis is the Wright map, which visually depicts the relationship between persons and items (Wilson, 2008).
The person-infit and person-outfit statistics from the Rasch model were used as person-fit statistics to detect deviant responses (Widhiarso & Sumintono, 2016). Based on the person-fit scores and the skill analysis, participants were classified into three categories: high, medium, and low. The fit of the items, in turn, was judged from the answers that students chose on the written test items. The critical thinking indicators were then analyzed in relation to the answers students chose in the written test.

Result and Discussion
The developed written test indicators of critical thinking are listed in Table 2. The test items were developed through an intensive FGD process that started from the curriculum. In this way, the materials to be discussed during the learning were identified and became the basis for writing the test items. The selected critical thinking indicators were then used to write a multiple-choice test in which the number of items differed for each indicator. The unequal number of items per indicator resulted from several considerations, including the time needed to construct an item, the similarity of the item models developed within the scope, and the depth of the material presented as problems. The time allotted was the same for every item, so students were expected to answer the questions within a more measured time. Items intended to measure similar things were selected based on their comprehensiveness, which was related to the depth of the materials. Items that lacked depth and were derived from low-level thinking skills were eliminated by integrating them into higher-level thinking items. The items are illustrated in the form of a Wright map in Figure 2, which places students' critical thinking skill and the items' difficulty level on the same scale. The left side of the Wright map illustrates the distribution of students' abilities. Student P05 had the highest ability, with a logit value above the standard deviation limit (T), which indicates exceptionally high ability (an outlier), while student P15, with a logit value below the T limit, showed very low ability. The right side of the Wright map illustrates the distribution of the items' difficulty.
Item I8 was categorized at the highest difficulty level, with a logit value beyond the T limit. This indicated that the probability of answering the item correctly was very small. Item I8 and two other items belonged to the claiming indicator. However, based on the Wright map, the difficulty of the items within this indicator varied. Item I8 was categorized as difficult, item I5 as medium, and item I4 as an item that most students could answer. Nevertheless, 9 students could not answer item I4. This means that approximately 21% of the 42 students found item I4 (an item of low difficulty on the Wright map) challenging to answer correctly. Based on their position in the Wright map, these students generally found several of the questions difficult. The Wright map also shows that the items in the evaluating indicator (I6, I9, and I10) had almost the same difficulty level. On the other hand, only one student (P05), who had good ability, was able to answer all the questions. The distribution in the Wright map is a general description, but it provides quite clear interpretations of the items' difficulty levels. The Wright map is analyzed further using more analytical tables based on the distribution of the items' difficulty and the distribution of students' abilities from the written test results.

Items' Difficulty Level Analysis (Item Measure)
Table 3 presents several columns that provide information about each item's difficulty level. The classification of the items' difficulty levels was based on the combination of the standard deviation (SD) value and the average logit value (Sumintono & Widhiarso, 2015).
The categories were: tough items, with a logit value greater than +1 SD; difficult items, with a logit value between 0.0 and +1 SD; easy items, with a logit value between 0.0 and -1 SD; and very easy items, with a logit value smaller than -1 SD. Based on the data in Table 3, the items can be grouped as follows: 1) tough items: item no. 8 (I8) and item no. 2 (I2); 2) difficult items: item no. 5 (I5), item no. 10 (I10), item no. 6 (I6), and item no. 9 (I9); 3) easy items: item no. 7 (I7) and item no. 1 (I1); 4) very easy items: item no. 3 (I3) and item no. 4 (I4).
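The classification rule described above can be written as a short function. This is a sketch under stated assumptions: the SD and logit values below are hypothetical (the values of Table 3 are not reproduced here), the item names are placeholders, and the treatment of values lying exactly on a boundary is a choice made for this example.

```python
def classify_item(logit: float, sd: float, mean: float = 0.0) -> str:
    """Assign a difficulty category from an item logit and the item SD.

    The mean item logit is taken as 0.0, as in the rule described in the text.
    Boundary handling (>= versus >) is an assumption of this sketch.
    """
    if logit > mean + sd:
        return "tough"
    if logit >= mean:
        return "difficult"
    if logit >= mean - sd:
        return "easy"
    return "very easy"


# Hypothetical SD and logits, for illustration only.
sd_items = 1.0
example_logits = {"item_a": 1.6, "item_b": 0.4, "item_c": -0.3, "item_d": -1.8}

labels = {name: classify_item(value, sd_items) for name, value in example_logits.items()}
print(labels)
```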
From the description above (data in Table 1 and Table 2), it can be inferred that items within the same indicator had different difficulty levels. The items with a similar difficulty level for students were those in the evaluating indicator. This result indicates that the item types in this indicator carried equal weight for the students. In contrast, the items in the claiming indicator had different difficulty levels within the indicator. This shows that the written test items carried different difficulty weights even though they belonged to the same critical thinking indicator, so the developed items were not consistent in difficulty within an indicator. Moreover, items may be perceived differently by students once they are tested. The difficulty level categories can be obtained only after field trials. Therefore, it was quite possible for the field trial to yield either different or similar levels of difficulty within a common indicator.
The determination of the items' difficulty level using Rasch analysis is not based on a fixed percentage distribution, as it is in conventional analysis. In conventional analysis, items are categorized by difficulty using upper, middle, and lower percentage limits: 25% of the items are categorized as difficult, 25% as easy, and 50% as medium. This division is usually made by arranging students directly along a normal curve. The normal curve represents the ideal condition for item quality, in which the number of items must be balanced according to these percentages (Arikunto, 2012). The Rasch calculation, however, is largely determined by students' responses to the items. It is therefore to be expected that this study shows no standard percentage of items for each difficulty category. The results of the Rasch analysis reflect the students' actual responses. The closer the items are to a normal curve distribution, the more proportional the distribution of the items across difficulty levels. Expressed as percentages, the data in Table 4 show that 20% of the items were tough, 40% were difficult, 20% were easy, and 20% were very easy. This distribution nearly resembles the normal curve, assuming a normal proportion of item difficulties. In a normal proportion, 20% of the items would be tough, 30% difficult, 30% easy, and 20% very easy. Further Rasch analysis to test the suitability of each item is presented in the next section.
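The percentages reported for the four difficulty categories follow directly from the item grouping given earlier (two tough, four difficult, two easy, and two very easy items). A minimal sketch of that arithmetic, using only the grouping stated in the text:

```python
from collections import Counter

# Item grouping as reported in the text.
item_categories = {
    "I8": "tough", "I2": "tough",
    "I5": "difficult", "I10": "difficult", "I6": "difficult", "I9": "difficult",
    "I7": "easy", "I1": "easy",
    "I3": "very easy", "I4": "very easy",
}

category_counts = Counter(item_categories.values())
n_items = len(item_categories)

# Share of items in each category, in percent.
percentages = {cat: 100 * n / n_items for cat, n in category_counts.items()}
print(percentages)
```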

The Items' Suitability (Item Fit Order)
The item fit level can be examined using three criteria, namely the outfit mean-square (MNSQ) value, the outfit Z-standard (ZSTD) value, and the point-measure correlation (Pt-Measure Corr) value. In Table 5, the MNSQ values of all items were accepted and the ZSTD values of all items were also accepted, but only items 1, 2, 4, and 7 were accepted on the Pt-Measure Corr values. Thus, items 1, 2, 4, and 7 met the MNSQ, ZSTD, and Pt-Measure Corr criteria, whereas items 3, 5, 6, 8, 9, and 10 met only the MNSQ and ZSTD criteria. If an item fulfils none of the three criteria (MNSQ, ZSTD, and Pt-Measure Corr), it can be concluded that the item is not good enough and needs to be repaired or replaced (Boone et al., 2014; Bond & Fox, 2015). Following that rule, all the analyzed items had an acceptable fit and were worth retaining, because no item failed all three criteria. The result of this Rasch analysis showed that there were various levels of difficulty according to Table 4. The constructed items had appropriate difficulty levels based on the data in Table 4, which distinguishes four levels, namely tough, difficult, easy, and very easy. Good test items can identify students' various abilities through diverse levels of difficulty. If the difficulty level of a test is high, it can be confirmed authentically that students cannot answer correctly or do not understand the questions given. If the difficulty level is low, it can be confirmed authentically that many students can answer correctly or easily. The analysis of the Rasch model based on the above result can determine the validity of a test well (Baghaei & Amrahi, 2011).
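The decision rule above, under which an item is flagged only when it fails all three criteria, can be sketched as follows. The acceptance ranges used here (0.5 < MNSQ < 1.5, -2.0 < ZSTD < +2.0, 0.4 < Pt-Measure Corr < 0.85) are ranges commonly cited in the Rasch literature and are an assumption of this sketch, since the thresholds are not stated in the text. The fit statistics are hypothetical and are not the values of Table 5.

```python
from dataclasses import dataclass

# Assumed acceptance ranges (commonly cited; not stated in the text).
MNSQ_RANGE = (0.5, 1.5)
ZSTD_RANGE = (-2.0, 2.0)
PT_CORR_RANGE = (0.4, 0.85)


@dataclass
class FitStats:
    mnsq: float     # outfit mean-square
    zstd: float     # outfit Z-standard
    pt_corr: float  # point-measure correlation


def criteria_met(stats: FitStats) -> dict:
    """Report, per criterion, whether the value lies inside its assumed range."""
    return {
        "MNSQ": MNSQ_RANGE[0] < stats.mnsq < MNSQ_RANGE[1],
        "ZSTD": ZSTD_RANGE[0] < stats.zstd < ZSTD_RANGE[1],
        "PT_CORR": PT_CORR_RANGE[0] < stats.pt_corr < PT_CORR_RANGE[1],
    }


def needs_revision(stats: FitStats) -> bool:
    """Flag an item only when none of the three criteria is met."""
    return not any(criteria_met(stats).values())


# Hypothetical fit statistics, for illustration only.
partly_fitting = FitStats(mnsq=1.10, zstd=0.6, pt_corr=0.25)
misfitting = FitStats(mnsq=1.90, zstd=2.7, pt_corr=0.10)

print(needs_revision(partly_fitting), needs_revision(misfitting))
```

An item such as the first hypothetical one, which meets MNSQ and ZSTD but not the correlation criterion, is retained under this rule, which mirrors the treatment of items 3, 5, 6, 8, 9, and 10 in the text.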
The results presented were based on the teaching material related to the STEM learning that was carried out. According to the data, the test items were appropriate for identifying students' critical thinking skills in the learning process carried out previously. However, the difficulty level of the written test items would differ if the test were given to other students in different schools and administered by different teachers. The results would be similar if the test were given to students and teachers with similar characteristics, for example a similar school cluster and a similar level of teacher understanding of STEM learning. Therefore, it is important to conduct trials in other studies that apply different methods, both to obtain more reliable results and to see whether the implementation of STEM learning and the critical-thinking-based written test items produce different or similar results.

Students' Critical Thinking Level Analysis (Person Measure)
Besides analyzing the critical thinking items and their indicators, the students' abilities in working on the written test questions were analyzed. The students' critical thinking skill level can be identified from their work on the written test items, which provides information about the effectiveness of the STEM-based learning carried out previously. Therefore, the result of this analysis can provide more effective recommendations for helping students during the learning process, especially the STEM learning process carried out previously. The data on the students' skill levels are presented in Table 6. The students' skill level categories are determined from the standard deviation (SD) value and the average person logit value (Sumintono & Widhiarso, 2015). Based on the results of the Rasch model obtained with Winsteps 3.75, there were three skill level categories, namely high, moderate, and low. These categories were based on the SD value (0.93) and the MEAN value (-0.43). The ranges of the categories are as follows: students whose skill was greater than the SD (0.93) had high skill; students whose skill lay between the MEAN (-0.43) and the SD (0.93) were categorized as moderate; and students whose skill was below the MEAN (-0.43) had low skill.
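The three ranges can be expressed as a small function that uses the reported cut-offs (SD = 0.93 and MEAN = -0.43). This is an illustrative sketch: the person logits below are hypothetical, and assigning a value that falls exactly on a cut-off to the lower category is an assumption of this example.

```python
PERSON_SD = 0.93     # upper cut-off reported in the text
PERSON_MEAN = -0.43  # lower cut-off reported in the text


def classify_person(logit: float) -> str:
    """Assign a skill category from a person logit using the reported cut-offs."""
    if logit > PERSON_SD:
        return "high"
    if logit > PERSON_MEAN:
        return "moderate"
    return "low"


# Hypothetical person logits, for illustration only.
example_persons = {"student_a": 1.50, "student_b": 0.20, "student_c": -1.00}
skill_levels = {name: classify_person(value) for name, value in example_persons.items()}
print(skill_levels)
```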
Rasch model analysis was able to identify students' abilities so that they could be categorized as high, moderate, or low. The categories are specific to the results of the written test given. The test results indicated that students' critical thinking skill after the implementation of STEM-based learning was mostly at the moderate and low levels; only seven students were in the high category. Therefore, it can be concluded that the critical-thinking-based written test items provided optimum information when given to students with moderate and low abilities. As with the item categories, conventional analysis determines student categories by ranking students from highest to lowest, for example 25% high, 50% moderate, and 25% low, so that a normal curve is obtained. Rasch analysis provides an authentic picture of students' levels without imposing percentages as in a normal curve. However, if the percentages of the student categories are close or equal to the normal curve, it can be concluded that the students' abilities in the class are varied. This is also largely determined by the construction of the items.
The results of the Rasch analysis showed that, after the STEM-based learning, only 7 of the 42 students (17%) were in the high category, while 19 students (45%) were in the moderate category and 16 students (38%) were in the low category. Thus, the students with high and low abilities did not follow the normal curve: the number of students with high skill was not balanced against the number with low skill. Therefore, the students' skills, especially critical thinking skill, need to be improved.
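The reported shares can be checked from the category counts given above. A minimal sketch of that arithmetic, rounding to whole percentages as in the text:

```python
# Number of students per skill category, as reported in the text.
student_counts = {"high": 7, "moderate": 19, "low": 16}

n_students = sum(student_counts.values())

# Share of students per category, rounded to whole percent.
student_shares = {level: round(100 * n / n_students) for level, n in student_counts.items()}
print(n_students, student_shares)
```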
The Rasch analysis also supports a recommendation: based on the results of the critical-thinking-based written test, STEM learning needs to optimize activities that support and explore students' critical thinking skill. However, this is only a recommendation, and it applies only to STEM-based learning assessed with critical-thinking-based written tests under a comparable situation, time, and conditions, and with respondents who meet the STEM learning criteria.

The Analysis of Students' Critical Thinking Skill Suitability Level (Person Fit Order)
After the students' skills were mapped into high, moderate, and low levels, an analysis was carried out to determine the suitability level of the students' critical thinking skill by detecting their response patterns in the written test. This analysis can reveal response patterns that do not fit the skill level estimated previously. Table 7 shows the data on the suitability level of the students' skill in working on the critical-thinking-based written test.
Based on the data in Table 7, the MNSQ scores of students P22, P37, and P05 were not accepted, the ZSTD values of all students were accepted, and the Pt-Measure Corr values of students P03, P06, P07, P08, P09, P10, P15, P17, P18, P20, P22, P23, P27, P28, P29, P30, P31, P32, P34, P35, P37, P38, and P41 were not accepted. If a student fails all three criteria (MNSQ, ZSTD, and Pt-Measure Corr), it can be concluded that the student's response pattern does not fit and needs to be reviewed, or that the measured skill is biased (Boone et al., 2014; Bond & Fox, 2015). Since no student failed all three criteria, the students in all categories (high, moderate, and low) had an acceptable suitability level that was within the reasonable limits of the expected patterns.

Table 7. Person Statistics: Misfit Order
An example of the students' responses can be explained with the scalogram (Guttman scale) in Table 7. Student P22 had a response pattern that was unusual for the student's skill level: P22 had a low skill level but was able to answer question no. 8, which had a high level of difficulty (tough). In addition, on questions with a similar level of difficulty (the very easy items no. 03 and no. 04), student P22 answered question no. 04 incorrectly but question no. 03 correctly. This pattern suggests that student P22 was guessing and answered the questions carelessly. Another case was student P05, who had a high skill level and could answer the questions in order from the tough level (item no. 02) through the difficult level (items no. 05, 06, 09, and 10) and the easy level (items no. 01 and 07) to the very easy level (items no. 03 and 04). This indicated that the responses of student P05 fitted the critical-thinking-based written test.
The Rasch model analysis in Table 7 was used to determine the suitability of students' responses to the critical-thinking-based written test, whereas the result presented in Table 8 was used to identify the direct causes of response patterns that did or did not fit. For example, student P05 had the best response suitability. This student was able to work on problems ranging from the easiest item (I4) to the toughest item (I2) in the correct sequence, which signifies that this student's skill was measured more authentically. Student P15 was different. This student answered correctly only item I10, which was more difficult than items I4 to I5. The correct answer to I10 was therefore probably accidental: student P15 only guessed the answer and did not understand the concept. Another commonly found example is student P22. This student did not show a consistent skill, and the questions were answered randomly. Student P22 correctly answered item I3 at the very easy level, item I5 at the difficult level, and item I8 at the tough level, while the other items were answered incorrectly. It can be assumed that a student with this pattern did not fully understand the concepts that had been learned, and that the correct answers on the difficult and tough items were only a coincidence.
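One way to quantify how far a response pattern departs from the ideal Guttman pattern is to count Guttman errors, that is, pairs of items in which the easier item was answered incorrectly and the harder item correctly. The sketch below is an illustration and is not the statistic reported by Winsteps. The item order is an assumption: it follows the difficulty groups reported earlier, with the order inside each group taken from the order in which the items are listed. The response patterns are those described in the text (P22 correct on I3, I5, and I8; P15 correct only on I10; P05 correct on all items).

```python
# Assumed order from easiest to hardest; the order within each
# difficulty group is an assumption of this sketch.
ITEMS_EASY_TO_HARD = ["I4", "I3", "I1", "I7", "I9", "I6", "I10", "I5", "I2", "I8"]


def response_vector(correct_items: set) -> list:
    """Build a 0/1 response vector ordered from the easiest to the hardest item."""
    return [1 if item in correct_items else 0 for item in ITEMS_EASY_TO_HARD]


def guttman_errors(responses: list) -> int:
    """Count pairs where an easier item is wrong and a harder item is right."""
    errors = 0
    wrong_so_far = 0
    for response in responses:
        if response == 0:
            wrong_so_far += 1
        else:
            # Every easier item answered wrongly forms one error pair
            # with this correctly answered harder item.
            errors += wrong_so_far
    return errors


p22 = response_vector({"I3", "I5", "I8"})
p15 = response_vector({"I10"})
p05 = response_vector(set(ITEMS_EASY_TO_HARD))

print(guttman_errors(p22), guttman_errors(p15), guttman_errors(p05))
```

Under these assumptions the pattern of P05 contains no Guttman errors, while the patterns of P15 and P22 contain several, which is consistent with the interpretation that their correct answers on harder items were unexpected.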
These data give teachers information for identifying students' skill and suitability in developing critical thinking skill during the STEM-based learning process. The result thus recommends that those who implement STEM-based learning do so effectively by paying closer attention to students whose responses do not fit their skill and by improving the students who still have low critical thinking skill. The scalogram analysis can further be used by teachers, as the implementers of STEM-based learning, to describe what the students have acquired on the basis of the test results.
The Rasch model analysis provided comprehensive information from data processing based on students' responses to the critical-thinking-based written test. Furthermore, the difficulty level of the test items and their indicators for STEM learning could be identified. These results show that the constructed items were able to describe the students' skill patterns and their suitability. However, the result of the Rasch model analysis is specific to the STEM learning conducted at that time, of which it gives a comprehensive picture. The result could be different or similar depending on the conditions and learning situations, such as the students' characteristics and the implementation of STEM-based learning in particular classrooms or schools. Nevertheless, the Rasch model analysis process can be used by teachers in schools to make a comprehensive identification of the learning process in connection with students' responses in the written test.
Determining the reliability of tests in classical analysis usually uses raw score intervals whose comparison is not clear, as in calculations with the KR-20 formula. As a result, extreme scores are often included in the reliability test. However, extreme scores do not have an error variance, which makes it possible to question the reliability of the test (Boone & Scantlebury, 2006). Moreover, classical test theory provides only a single standard error of measurement. In Rasch measurement, by contrast, each item and each person has its own error estimate. If a very large or very small percentage of students answer an item correctly, the item has a greater error than items targeted at students of average level (Baghaei & Amrahi, 2011).
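The effect of targeting on measurement error can be illustrated with the model-based standard error of an item difficulty estimate, which is the reciprocal square root of the statistical information summed over persons. This is a sketch under simplifying assumptions: all 42 students are given the same hypothetical ability of 0 logits, and the two item difficulties are invented for illustration rather than taken from the study.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_standard_error(difficulty: float, abilities: list) -> float:
    """Model standard error of an item difficulty: 1 / sqrt(sum of p * (1 - p))."""
    information = 0.0
    for ability in abilities:
        p = rasch_probability(ability, difficulty)
        information += p * (1.0 - p)
    return 1.0 / math.sqrt(information)


# Hypothetical sample: 42 students, all at 0 logits (illustration only).
abilities = [0.0] * 42

se_targeted = item_standard_error(0.0, abilities)    # about half answer correctly
se_off_target = item_standard_error(3.0, abilities)  # very few answer correctly

print(round(se_targeted, 2), round(se_off_target, 2))
```

The well-targeted item yields the smaller standard error, because each response carries the most information when the probability of success is close to 0.5.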

Conclusion and Suggestions
The Rasch model analysis has presented a comprehensive picture of the items' difficulty and of the students' skill as measured by critical thinking aspects in STEM-based learning. The distribution of the items from the Rasch model analysis based on critical thinking indicators resulted in four categories, namely tough, difficult, easy, and very easy. The test items had various degrees of difficulty suited to students of diverse abilities (high, moderate, and low). Therefore, the test items were suitable for use in STEM-based learning. On the other hand, the students' critical thinking skill levels in STEM-based learning were generally in the moderate and low categories. The Rasch model analysis also provided the students' level of suitability, which implied that students with low abilities should be given more assistance. Moreover, the result of the Rasch model analysis can be used as a reflection and recommendation for teachers, as the implementers of the learning, with the aim of improving the learning process. Finally, the Rasch model analysis of test results can be used by teachers to identify the quality of the constructed items and the students' abilities resulting from the learning process in school.