General TCE Information

 


 

 

 

Part 1. The TCE System

 

 

Part 2. Understanding TCE Results

 

Part 3. Using TCE Results

 


About This Guide

 

Student Ratings at the University of Arizona are called TCEs – short for Teacher-Course Evaluations. This guide provides assistance in selecting TCE questionnaires, understanding TCE reports of results, and using TCE results both for improving teaching and for performance appraisal. Part 1 describes questionnaires, procedures, and reports of results. Part 2 describes the statistical information contained in the results reports and covers key concepts for interpreting TCE results. Part 3 offers suggestions for using TCE results for summative (performance appraisal) and formative (improvement/development) purposes. OIRPS welcomes your comments and questions. We are committed to making our services responsive to the needs of faculty and administrators.

 

About Student Ratings of Instruction

 

Student ratings of instruction, properly constructed and administered, provide valid and reliable data for improving teaching as well as for documenting teaching performance for administrative review. This claim is supported by a large body of research (e.g., Aleamoni, 1981; Centra, 1989; Doyle, 1975; Marsh, 1992; Theall et al, 2001). Student ratings are currently used in over ninety-five percent of postsecondary institutions because they are: 1) multidimensional, 2) reliable and stable, 3) relatively valid against a variety of indicators of effective teaching, and 4) relatively unaffected by a number of variables hypothesized as possible biases.

 

Administrators use student ratings data both in making personnel decisions (summative evaluation) and in mentoring faculty regarding their effectiveness as teachers (formative evaluation). Faculty use student ratings both for documenting and for improving teaching. Summative and formative purposes, though related, are conceptually and practically different in important ways. Most importantly, appropriate use of ratings data for personnel decision-making requires careful data collection procedures and clear policies about interpretation, consistently applied. Improper uses of ratings expose units to the possibility of litigation as well as negatively affecting collegiality and productivity (Franklin and Theall, 1989).

 

Although ratings are necessary, they do not provide sufficient information for a comprehensive evaluation of teaching. Teaching is a multidimensional activity comprising course planning, classroom instruction, mentoring and advising of students, assessing student work, etc. Student ratings alone are insufficient because students are unable to observe or unqualified to judge many aspects of teaching (e.g., the instructor's content expertise and instructional design skills). Evaluation specialists recommend consideration of at least two additional sources of information, such as classroom observation reports, self-evaluation statements, peer assessments of course materials, and evaluations by instructional specialists (e.g., Arreola, 1995; Braskamp and Ory, 1994; Centra, 1993; and Theall et al, 2001).

 

TCE materials and services are intended to help faculty and administrators get the most benefit from TCE reports. In addition to this guide, A Short Guide to Evaluating Teaching http://aer.arizona.edu/Teaching/docs/ShortGuide.pdf addresses the broader dimensions of evaluating teaching, covering multiple methods in addition to student ratings, and illustrating how student ratings data may be integrated with information from other sources. It is addressed both to individual faculty and to department administrators and P&T committees.


 

Part 1. The TCE System

 

Obtaining TCE Services

 

Standard TCE services are available at no cost for all UA academic units except those in the College of Medicine. The Extended University supports TCE services for Summer and Winter Sessions, and Evening and Weekend Campus. Online TCE service is available for courses with significant instructional computing components, such as distance education courses. Contact OIRPS at 621-9585 for more information about online TCEs.

 

TCE services are available to all UA instructors for courses with an enrollment of five or more and are normally ordered through the instructor’s academic unit. TCE packets are not created for courses with an enrollment of fewer than five because "ratings of courses based on five or fewer completed forms are of questionable reliability and validity" (Braskamp and Ory, 1994, p. 188). Each participating unit has a TCE “contact” who coordinates TCE materials. (See Contacts [http://aer.arizona.edu/Teaching/Contacts/contacts.asp].) Each semester, the contact identifies faculty and courses for which TCE materials will be prepared. Contacts should inform faculty in writing about their unit’s policy on participation in the TCE process, who sees the results, and how the results are used. Faculty should provide contacts with updates on course information (enrollment changes; special circumstances that might affect the TCE process) as well as requests for particular questionnaires. It is critical that department contacts provide OIRPS with accurate information; otherwise, instructors may not receive the proper forms in a timely manner.

 

TCE Questionnaires

 

http://aer.arizona.edu/questionnaires/quesmain.asp

 

SHORT FORM AND LONG FORM

 

Most University of Arizona faculty may choose either the Short Form or the Long Form. The Short Form contains a small core of eleven global questions suitable for use in summative evaluation along with six questions about student demographics. The Long Form contains the same core and demographic questions plus more specific questions designed to provide detailed feedback. We are currently developing a system that will allow faculty to “customize” the Short Form by adding questions on its back. Check the OIRPS website (http://aer.arizona.edu/) for updates on this option.

 

When ratings results are to be used for summative purposes (performance appraisal), a small set of global questions provides an adequate basis for judgment while minimizing evaluator labor (Seldin, 1999; Theall et al, 2001). For formative purposes (development/improvement), a more detailed and comprehensive questionnaire is desirable. OIRPS reconciles these needs by providing the choice of forms. Regardless of the form chosen, only results for the core questions are provided to department heads and made available to the UA community. Results for other questions, as well as student written comments, are considered direct feedback to instructors and are inappropriate for consideration in performance appraisal. (Reports for graduate teaching assistants (GTAs) are treated differently; see below.)

 

Results from the core questions can be used to determine the overall effectiveness of an instructor, but they provide little in the way of specifics. Instructors satisfied with their ratings may wish to use the Short Form for most classes and the Long Form occasionally. Instructors actively involved in developing their teaching may want to use the Long Form or customization option more frequently.


 

GTA FORMS

 

Two forms have been developed for lab or discussion sections attached to lecture classes. These sections are most commonly taught by GTAs. Like the Long Form, they contain questions regarding instructor behaviors to provide adequate data for developing/improving teaching. However, the full complement of core questions is NOT included because several of the questions are unsuitable for attached sections (e.g., the value of the outside assignments). Results for GTAs are sent to the GTA’s unit head.

 

TEAM FORM

 

The Team Form is similar to the Short Form, except that the words “the instructor” are replaced by “the instructional team.” It is most appropriate for “simultaneous” team-taught courses, where two or more instructors co-teach the course throughout the semester. In this case, all instructors attend class meetings and more than one instructor may be responsible for a single class session.

 

The drawback of using the Team Form is that instructors will not have individual TCE results to put forward for administrative review. Instructors teaching in simultaneous teams may request that a separate TCE form be prepared for each instructor. The drawback to this approach is that students must fill out two or more forms for the same course, even though only a few questions ask about the instructor.

 

OIRPS recognizes that neither option is entirely satisfactory and is aware that team-teaching is a growing trend. We are developing a questionnaire that will enable students to separately evaluate the members of a teaching team without duplicating other questions.

 

The Team Form is less appropriate for “consecutive” teams, in which one or more instructors teach parts of the course consecutively: e.g., instructor A teaches the first month, instructor B the second month, instructor C the third month. In this case, rating the effectiveness of “the instructional team” is less appropriate. It is also problematical to wait until the end of the semester to evaluate instructors A and B, especially the former. With adequate advance notice, OIRPS can provide TCEs earlier in the semester, so that a TCE can be conducted for instructor A at the end of instructor A’s teaching responsibility. Contact OIRPS for more information.

 

CUSTOM FORMS

 

In addition to the Long and Short Forms, OIRPS offers a number of custom questionnaires, most of which have been developed for individual academic units. Contact OIRPS for more information.

TCE Questions

 

OVERALL QUESTIONS

 

Almost every questionnaire includes four "overall" questions:

1. What is your overall rating of this instructor's teaching effectiveness?

2. What is your overall rating of this course?

3. How much do you feel you have learned in this course?

4. What is your rating of this instructor compared with other instructors you have had?


 

These are single questions, not composite or average scores based on other questions. Overall questions are recommended for use in performance appraisal because they are applicable across the wide variety of teaching styles and course formats (Theall et al, 2001).

 

The overall instructor, course, and amount learned questions are usually highly inter-correlated. Occasionally, instructors receive somewhat higher ratings than their courses. This may happen with an unpopular required course where students recognize the teacher’s efforts but still feel negative about the course.

 

The comparative instructor question (4) is somewhat problematical since “compared to other instructors you have had” may be interpreted differently by different students (ALL other instructors? Other university instructors? Other instructors in the major?) Often, instructors are rated effective in their own right (per the overall instructor question), but less effective when compared to some reference set of “other instructors.” Instructors whose comparative effectiveness rating is as high as their overall effectiveness rating deserve congratulations. However, OIRPS recommends that only the overall instructor effectiveness question be used in summative evaluation.

 

WORKLOAD, VALUE, AND DIFFICULTY QUESTIONS

 

Virtually all questionnaires include four questions related to workload and value of work: “amount learned,” “difficulty level of the course,” “total hours spent on class-related work,” and “value of hours spent on this class.” These questions gauge the level of perceived challenge and provide clues about student motivation.

 

Analyses of ratings at UA (and nationwide) show that workload and difficulty per se have little if any relationship to overall ratings of instructors or courses (Theall et al, 2001). A course may be rated high in difficulty or high in preparation time for positive reasons (it was fast-paced and challenging) or negative ones (poor organization, confusing assignments). On the other hand, the question asking how many preparation hours students considered valuable to their education has a moderately strong positive association with overall ratings, particularly in applied fields such as engineering and business. Not surprisingly, the higher the number of hours considered valuable, the higher the course and instructor ratings are likely to be.

 

COURSE QUESTIONS

 

Course questions explore aspects of course design and organization, such as outside assignments, in-class activities, and class materials (texts, websites, etc.). Course question results can highlight problems that may be outside an instructor's direct control, such as a poor text. In such cases, the course question results may be relevant to personnel decision-making in the sense of “mitigating circumstances” and should be discussed in the narrative faculty prepare for promotion and tenure portfolios.

STUDENT QUESTIONS

 

All questionnaires include questions designed to obtain self-reports and demographic information, useful in understanding student responses.

 

OTHER QUESTIONS

 

The Long, Discussion, Lab, and most custom questionnaires include a series of questions about teaching behaviors which comprise a multi-dimensional profile of teaching (organization and delivery of instructional content, classroom interaction and rapport, classroom management, testing and feedback).

 

Administering TCEs

 

TIMING

 

TCEs are normally administered as paper questionnaires distributed within classes. (See below for information about online administration.) The standard TCE administration period is usually the three weeks preceding finals. It is generally best not to wait until the last day of class for administration. For best results, administer the TCE at the beginning of class on a day when attendance will be at a maximum. Do not tell students beforehand what day the evaluation will take place.

 

ONLINE ADMINISTRATION

 

At this time, TCEs are administered online only for distance education courses and courses taught in computer labs where a workstation is available for each student. OIRPS is considering possible pilots for a more widespread electronic system. If you would be interested in participating in an online pilot, please contact Gwen Johnson (qwj@u.arizona.edu).

 

INSTRUCTIONS FOR ADMINISTRATION OF PAPER QUESTIONNAIRES

 

TCE results can be compromised by improper administration of questionnaires. The instructions below are intended to ensure fairness and anonymity. Failure to follow these instructions may result in the TCE being declared invalid. Please contact OIRPS if you have a problem with this procedure or if you think your student monitor has failed to follow instructions.

 

1. Inspect the contents of your TCE packet when you first receive it. If there are any errors (quantity, form, name, etc.), ask your department contact to request new materials immediately.

 

2. Select a date for administering the TCE.

 

3. Select a student monitor. The monitor must be a member of the class, chosen the day the TCE is administered. Do not use your TA or unit staff as monitors.

 

4. Discuss the TCE process with the monitor. If you are adding your own materials to the packet, make sure your monitor knows to insert them along with the TCEs. Remind the monitor to follow the step-by-step instructions on the monitor packet. Your monitor should give you a yellow monitor card after turning in the questionnaires.

 

5. Encourage students to offer written comments. Tell students that you will not see the TCE results until after final course grades have been posted. You may also suggest that students disguise their handwriting.

 

6. Allow the monitor to administer the TCEs. Leave the room and remain out of the room while the TCE is being administered, returning only after the monitor has placed the materials in the packet and sealed it according to instructions.

 

7. Point out the official TCE drop box locations listed on the front of the monitor packet to make sure that the monitor knows where to deliver the completed materials.

 

8. Obtain the completed yellow monitor card from your monitor. This is your "receipt" documenting that you used a monitor and who that monitor was.


 

 

The TCE Reporting Process

 

INSTRUCTOR REPORTS

 

Beginning fall 2002, instructors will be able to review and print copies of their own reports by clicking on the Instructor Reports button on the OIRPS website (http://aer.arizona.edu/). Written comments will be placed in individually addressed envelopes marked “confidential,” to be picked up by the department contact early in spring semester for fall results and early in June for spring results. Unit heads are provided with summaries of the core question results for each course as well as a set of reports summarizing results for groups of courses (the Comparison Group Summaries). Results for questions other than core questions are considered confidential and provided only to instructors, as are the student written comments. OIRPS recommends against using student written comments in summative evaluation without stringent safeguards.

 

For discussion, see Using Student Written Comments in Summative Evaluation [http://aer.arizona.edu/Teaching/docs/shortGuide.pdf - page=32].

 

In addition to the standard reports for individual sections, instructors will be able to view and print two additional reports, the TCE History and the TCE Overall Effectiveness Graphics.

 

Individual Section Reports: Reports list results for core questions on the first page and results for other questions on one or more subsequent pages depending on the questionnaire chosen. The final page, the TCE Comparison Report, provides comparison group statistics and a graphic comparison between the instructor’s means for the core questions and the comparison group means for those questions. (This is discussed below under “Comparison Groups.”) Since most undergraduate courses are in two comparison groups, two graphs are provided, along with comparison group statistics. (Graduate courses are typically part of only one comparison group.) If no comparison data is available, no graphic reports are provided. You can see an example of the report at http://aer.arizona.edu/Teaching/Reports/section.htm.

 

TCE History: The TCE History is a listing of courses taught by the instructor for which a TCE was requested. It may be organized either chronologically or chronologically within levels (grad, upper division, lower division). Courses where no TCE was requested are not listed. You can see a sample TCE History at http://aer.arizona.edu/Teaching/reports/history.htm.

 

TCE Comparison Reports: The TCE Comparison Report is the final page of the Individual Section Report, providing comparison group statistics and a graphic comparison between the instructor’s means for the core questions and the comparison group means for those questions (see above). Instructors may print a set of Comparison Reports for all courses taught within the past five years independently of their Individual Section Reports. This set of reports may be useful in compiling materials for administrative reviews.

 

Overall Effectiveness Graphics: The TCE Overall Effectiveness Graphic provides an at-a-glance summary of how an individual’s courses within each comparison group compare to other courses in the comparison group. A separate TCE Overall Effectiveness Graphic is provided for each comparison group. Each TCE Overall Effectiveness Graphic shows how the instructor’s combined results for the overall teaching effectiveness question compare to results for other courses in the same subject area at the same level. Each graphic provides a description of the comparison group, a description of the instructor’s sample, and a graphic comparing the confidence interval for the overall teaching effectiveness question for the comparison group to the confidence interval for the instructor’s combined set of courses, as well as for each course individually. For more information, see our Guide to the TCE Overall Effectiveness Graphic [http://aer.arizona.edu/Teaching/docs/oegraphic.pdf].

 

Problem Reports: If a problem occurs in administering a TCE or if a TCE packet is not returned for processing, a green “Problem Report” is sent, with instructions for requesting a resolution (see Problem Report Codes at http://aer.arizona.edu/Teaching/reports/problems.htm). Some problems can be remedied and a report released. However, because of the number of TCEs conducted, problem resolution requests must be made by May 1 of the following semester for fall courses and by December 1 of the following semester for spring courses.

 

ADMINISTRATOR REPORTS

 

Beginning fall 2002, department contacts may review online and print a variety of reports that allow unit heads to overview department teaching performance. (These replace the reports sent to unit heads at the end of each semester.) Three reports are available, described below.

 

Comparison Group List: a list of comparison groups within the unit showing the number of sections and responses for each group along with descriptive statistics for the Overall Effectiveness Question for that group. You can see an example at http://aer.arizona.edu/Teaching/reports/complist.pdf.

 

Comparison Group Summaries: a separate summary report is provided for each comparison group listed on the Comparison Group List. Each Comparison Group Summary shows the mean at each decile for each of the core questions. Collectively, the Comparison Group Summaries show the range of variation across comparison groups in the unit. The Comparison Group Summaries make it possible to review TCE results for sets of courses and get a sense of the overall student perception of these courses. This report can be used to determine whether data in a comparison group is sufficiently variable to allow meaningful distinction among faculty based on ratings. It can also be used to document overall unit performance for unit or program reviews. You can see an example at  http://aer.arizona.edu/Teaching/reports/compsummary.pdf.

 

TCE Comparison Reports: The TCE Comparison report is the final page of the Individual Section Report, providing comparison group statistics and a graphic comparison between the instructor’s means for the core questions and the comparison group means for those questions. TCE Comparison Reports are available for all courses in the unit taught in a given semester. You can see an example at http://aer.arizona.edu/Teaching/reports/compar.pdf.

 

RATINGS RESULTS REPORT

 

Under Faculty Senate mandate, OIRPS publishes TCE results on the UAWeb at https://aer.arizona.edu/ASUA/. This document, updated each semester, is accessible to anyone in the UA community with a valid u.arizona.edu email account. The Ratings Results Report is designed primarily for student use, so it summarizes only the frequency of response to the core questions and the percentage TCE response. This report is not suitable for use in performance appraisal because it provides no estimate of error and no statistical basis for comparison to similar courses.

 

Comparison Groups

Comparison groups have been established to enable comparison of an instructor’s ratings to ratings for similar courses. Most TCE reports include a “TCE Comparison Report” as their final page. This report provides comparison group statistics and a graphic comparing the instructor’s means for the core questions and the comparison group means for those questions. Comparison groups are based on 1) subject, 2) course format (lab, studio, lecture, etc.), 3) level (lower division, upper division, graduate), and 4) number of students enrolled (below 5, 5-19, 40-59, 60 and above). This means that courses are compared only to courses with the same subject prefix, at the same level. Undergraduate courses are additionally compared to courses with the same subject prefix and level that are of comparable size. Disciplinary area, level, and size are factors known to be associated with systematic variation in ratings, although taken together, they rarely account for more than 5%-10% of total variance. See Systematic Variation in Ratings, below, for more information.

 

The usefulness of comparison data depends in part on the amount of data available. This is a function of both the number of comparable courses in the subject area and the number of semesters for which data is available. Reports for semesters prior to spring 2000 include %Rank and T scores when the comparison group has 50 or more sections. These statistics are omitted in spring 2000 and subsequent reports. If the comparison group has at least five sections, the comparison group mean, standard deviation, median, and 95% confidence interval are given. (See Reading the Results in Part 2 of this report.)

 

As more data becomes available for a unit, more focused comparison groups can be formed. Ideally, a comparison group includes courses of similar size, level, and status of student enrollment (required versus elective). Comparisons involving course format; teacher variables such as rank, years teaching, gender, etc.; and student variables can be made if systematic variation in ratings is found to be associated with such factors. (For more information, see Systematic Variation in Ratings in Part 2 of this report.) In general, small units will have fewer comparison groups; however, over time, more comparisons become possible.

 

Special TCE Services

 

To learn more about the following special TCE services, contact OIRPS.

· creating customized forms for special assessment needs

· receiving individual or group training in the interpretation and use of student ratings for teaching improvement and/or administrative review processes

· including other questionnaires to be administered at the same time as the TCE questionnaire



 

Part 2. Understanding TCE Results

 

Overview

 

Understanding student ratings requires an understanding of statistical concepts related to sampling, significance, and precision as well as an understanding of the characteristics of ratings as a measure of teaching performance. Because student ratings statistics do not have the precision typical of statistics in the sciences, it is always important to interpret them in the context of individual and unit patterns. OIRPS offers workshops and consultation on interpreting these statistics and using TCE results appropriately, as well as on other aspects of evaluating teaching.

 

The primary units of analysis in TCE reports are individual student responses within individual sections. Many reports also show summaries of results for the same questions from sets of similar courses. This section describes the statistics used in the reports and offers suggestions for their interpretation, along with information about general characteristics of student ratings. OIRPS recommends a three-step procedure for reviewing TCE reports:

 

Step 1: Check the sample

Step 2: Review individual results

Step 3: Review comparison statistics

 

Checking the Sample

 

Before using ratings, it is important to know how representative the available data is. Standards for samples depend on how the ratings will be used: they should be most stringent when ratings are used in performance review.

 

SAMPLE QUALITY WITHIN SECTIONS

 

For responses in a section to be meaningful for decision-making purposes, they must be representative of the entire class. Information about the sample is printed at the top of each TCE report: number enrolled, number responding, and percent responding. Use Table 1 to decide whether enough students responded for the sample to be usable.

 

The higher the proportion of respondents to those enrolled, the more reliable the results. In general, sections with less than a 50% response rate should not be used for performance appraisal. The smaller the class, the higher the percentage of responses needed to ensure that the sample is representative (see note 1). If the non-response rate seems high, there may be a systematic reason for student absence that might bias results. For example, if ratings are administered the day of a review session when attendance is optional, students for whom instruction has been most effective may be excluded.

 

If only a small fraction of students respond, the responses can only be considered the opinions of those few students – though it may be tempting to generalize if they are positive.

 

1 Interrater reliability coefficients (a measure of agreement within a class compared with agreement among raters across many classes) are typically in the .70s for 10 student raters, in the .80s for 15 raters, and in the .90s for 20 or more students. It is interesting to note that interrater reliability for student ratings is much higher than for peer ratings. In one study (Centra, 1975), the interrater reliability for 15 faculty colleagues was .57, a dramatic contrast with .85 for 15 students.


 

 

 

Table 1: Guidelines for Judging Samples Within Sections

Class size      Recommended response %
5-20            at least 80%, more recommended
20-30           at least 75%, more recommended
30-50           at least 66%, 75% or more recommended
50-100          at least 60%, 75% or more recommended
100 or more     more than 50%, 75% or more recommended
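The Table 1 check can be sketched in a few lines of Python. This is only an illustration, not an OIRPS tool: the function name sample_is_usable is hypothetical, and because the class-size ranges in Table 1 share their boundary values (20, 30, 50, 100), the sketch assumes a boundary size belongs to the smaller-class row, which carries the stricter minimum.

```python
def sample_is_usable(enrolled: int, responding: int) -> bool:
    """Return True if the response rate meets the Table 1 minimum.

    Assumption: a class size that sits on a shared boundary (20, 30, 50, 100)
    is judged by the smaller-class row, i.e. the stricter threshold.
    """
    if enrolled < 5:
        raise ValueError("Table 1 starts at a class size of 5")
    if not 0 <= responding <= enrolled:
        raise ValueError("responding must be between 0 and enrolled")

    percent = 100.0 * responding / enrolled

    if enrolled <= 20:
        return percent >= 80.0
    if enrolled <= 30:
        return percent >= 75.0
    if enrolled <= 50:
        return percent >= 66.0
    if enrolled <= 100:
        return percent >= 60.0
    # "100 or more" row: the table asks for MORE than 50%, so exactly 50% fails.
    return percent > 50.0
```

For example, a section with 20 enrolled and 16 responding (80%) meets the minimum, while a section of 120 with exactly 60 responding (50%) does not. Meeting the minimum is not the same as meeting the recommended rate, which Table 1 lists separately.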

 

 

 

SAMPLE QUALITY ACROSS SECTIONS

 

While the results from a single administration of a TCE questionnaire, particularly a long questionnaire, can provide useful information, such results apply to the course as one event in time only. Averaged results from comparable courses taken over several evaluations (each with an adequate sample of response) are more likely to fairly represent teaching ability. A minimum of five courses is recommended. It is also important to ensure that the courses selected are representative. If an instructor’s teaching load is half graduate courses and half undergraduate courses, the sample presented for review should be about half graduate and half undergraduate courses. Most importantly, no single score or set of scores from a single section should be used to judge teaching in performance appraisal.

 

SAMPLE QUALITY OF COMPARISON GROUPS

 

Questions to ask about comparison groups include: 

1) Are the courses in the comparison group reasonably comparable in content, size, and instructional methods? 

2) Are there enough courses in the comparison group? 

3) Were a significant number of courses that met the selection criteria for the comparison group not included because their instructors did not participate or because insufficient student response, lack of documented student monitoring, or other errors invalidated the data? 

4) How many different instructors taught courses included in the comparison group?

 

Reviewing the Section Results

 

FREQUENCIES AND PERCENT OF VALID RESPONSE

 

For each question, the distribution of student responses across the possible response choices is given in frequency of responses per option and percent of valid responses per option. Interpreting the data is largely common sense – how many students "said" what, in terms of the available response options for each question. Usually, students are in fairly good agreement in their ratings and scores cluster around two or three adjacent options.
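As a rough illustration of how the two figures relate, the sketch below tabulates one question's responses. It is not the OIRPS reporting code: the function name tabulate, the option letters "A" through "E", and the use of None for a blank answer are all assumptions made for the example. The point it shows is that percentages are taken over valid responses only, so blanks do not dilute them.

```python
from collections import Counter

# Assumed option letters for a five-option question (hypothetical layout).
OPTIONS = ["A", "B", "C", "D", "E"]


def tabulate(responses):
    """Return {option: (frequency, percent of valid responses)}.

    Blank or unreadable answers (anything not in OPTIONS, e.g. None)
    are dropped before percentages are computed.
    """
    valid = [r for r in responses if r in OPTIONS]
    counts = Counter(valid)
    n_valid = len(valid)
    table = {}
    for option in OPTIONS:
        frequency = counts[option]
        percent = 100.0 * frequency / n_valid if n_valid else 0.0
        table[option] = (frequency, percent)
    return table
```

With five forms returned and one left blank on this question, tabulate(["A", "A", "B", None, "C"]) reports "A" as 2 responses and 50% of valid responses, not 40% of forms returned.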

 

For positively stated questions representing behaviors associated with effective teaching, it is desirable for responses to cluster in the first two options, "almost always" and "usually." If a significant percentage of students respond "sometimes," "rarely," or "almost never," the question points to an area of teaching skill that likely needs attention. Responses should cluster similarly for questions with response scales worded "very useful" to "nearly useless."

 

For questions with normatively worded response options such as "among the best" to "among the worst," more caution is needed since the basis for comparison is unknown. For example, if a student has taken only exceptionally well-taught courses, a moderately well-taught course might seem poor by comparison.

 

MEANS, MEDIANS, AND STANDARD DEVIATIONS

 

Means and medians are measures of central tendency, showing the "middle" of a set of scores. The standard deviation (SD) is a measure of how variable scores are, i.e. how spread out they are around that "middle." Means and SDs appear on all reports in both section data and comparison data. Medians appear only in comparison data.

 

The mean for a question is the arithmetic average of student responses. For most TCE questions, means can range from 1 to 5. Most questions are reverse scaled: that is, the most positive option, "A," is scored as 5 points. The "Key" on each question tells how individual questions were scored.
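A minimal sketch of the calculation, assuming a five-option question on which "A" is scored 5 and "E" is scored 1. The mapping below is illustrative only; the "Key" printed with each question governs the actual scoring.

```python
from statistics import mean

# Assumed scoring key for a reverse-scaled question (A is the most positive option)
KEY = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}


def question_mean(responses, key=KEY):
    """Arithmetic average of the scored responses; unscored marks are ignored."""
    scores = [key[r] for r in responses if r in key]
    return mean(scores)


print(question_mean(["A", "A", "B", "C"]))  # (5 + 5 + 4 + 3) / 4
```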

 

The SD gives an approximate measure of agreement or disagreement among raters. Perfect agreement would yield an SD of 0. In a typical class, about two thirds of ratings fall within one rating point above or below the mean and the SD is 1.0 or less. If the SD for a question scaled with 5 points is higher than 1.2, the mean is not a good measure of student response.

 

High SDs occur when opinion in a class is strongly divided between very high and very low ratings or when opinion is dispersed across the entire response scale. Because students and teachers vary, it is possible for a teacher to be "among the best" for some and "among the worst" for others. In such cases, the mean does not represent a "typical" student opinion in any meaningful sense. (Consultation to explore the source(s) of consistently high SDs is available from OIRPS.)
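The contrast between a class in agreement and a divided class can be seen in a short Python sketch. The two sets of scores are invented, and the population SD (pstdev) is used as an assumption, since the guide does not say which SD formula the reports use. The 1.2 threshold is the one given above for a 5-point scale.

```python
from statistics import mean, pstdev

SD_LIMIT = 1.2  # above this, the guide treats the mean as a poor summary


def describe(scores):
    """Return (mean, SD, mean_is_usable) for one question's scores."""
    sd = pstdev(scores)
    return mean(scores), sd, sd <= SD_LIMIT


consensus = [4, 4, 5, 4, 3, 4, 5, 4]  # ratings cluster on adjacent options
divided = [5, 5, 5, 5, 1, 1, 1, 1]    # half "among the best", half "among the worst"

print(describe(consensus))
print(describe(divided))
```

In the divided class the mean lands on the middle of the scale, a rating that no student actually gave, which is why a high SD signals that the mean should not be read as a typical opinion.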

 

CONFIDENCE INTERVALS

 

Most OIRPS reports show a 95% confidence interval (CI) in parentheses to the right of the section means and comparison group means. While the SD gives an approximate measure of the amount of disagreement among students, the 95% CI shows the impact of the disagreement on the precision of the mean as a way of summarizing responses.

 

The 95% CI is similar to the "margin of error," a familiar feature of opinion polls which assigns a value, plus or minus, within which the "true" score occurs once all sources of error and disagreement are taken into account. There is a 95% chance that the true score for a question occurs somewhere in the interval between the two values.
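For readers who want to see how such an interval is typically computed, here is a sketch under stated assumptions. It uses the common normal approximation (mean plus or minus 1.96 standard errors) with the sample SD; the guide does not specify the exact method OIRPS uses, so the figures on a report may differ slightly.

```python
from math import sqrt
from statistics import mean, stdev

Z_95 = 1.96  # normal-approximation multiplier for 95% (an assumption)


def ci95(scores):
    """Approximate 95% confidence interval for the mean of one question."""
    m = mean(scores)
    half_width = Z_95 * stdev(scores) / sqrt(len(scores))
    return m - half_width, m + half_width


# Hypothetical section with ten respondents
low, high = ci95([5, 4, 4, 3, 4, 5, 4, 3, 4, 4])
print(f"mean 4.0, 95% CI ({low:.2f}, {high:.2f})")
```

The interval narrows as the number of respondents grows and as the SD shrinks, which is why small or divided classes produce less precise means.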

 

Reviewing the Comparison Statistics

 

For spring 2000 and subsequent reports, comparison group statistics appear on the final page along with one or more graphics showing how results for the section compare with results for the comparison group. (This page is titled “TCE Comparison Report.”) For reports issued prior to spring 2000, statistics for the comparison group appear on the Short Report in the column labeled "Comparison Group" (between the section statistics and the columns showing T scores and Percentile Rank Groups (%Rank)).

 

Comparison group descriptive statistics include the number of sections in the comparison group, the grand mean and its 95% CI, and the median of section means for each question. A comparison group mean is the grand mean of a set of section means, not the mean of student responses pooled across the sections. Similarly, the comparison group SD is the standard deviation of the section means. The median is the halfway point: half of all the section means in the comparison group fall above the median and the other half below.
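The difference between a grand mean of section means and a pooled mean is easiest to see with numbers. The Python sketch below uses three invented sections of unequal size; the function names are hypothetical, and the sample SD is used as an assumption.

```python
from statistics import mean, median, stdev


def comparison_group_stats(sections):
    """Summarize a comparison group from per-section lists of scores."""
    section_means = [mean(s) for s in sections]
    return {
        "n_sections": len(section_means),
        "grand_mean": mean(section_means),   # each section counts once
        "median": median(section_means),
        "sd": stdev(section_means),          # spread of section means, not of students
    }


def pooled_mean(sections):
    """Mean of all student responses lumped together (not what the reports show)."""
    return mean([score for s in sections for score in s])


sections = [[5, 5], [4, 4, 4, 4], [3, 3, 3, 3, 3, 3]]
print(comparison_group_stats(sections))
print(pooled_mean(sections))
```

Here the grand mean is 4.0 while the pooled mean is about 3.67, because pooling lets the largest section dominate. Using section means gives every section equal weight regardless of enrollment.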

 

Systematic Variation in Ratings

 

Although properly administered student ratings are quite dependable, research shows that there are predictable sources of systematic variation or even bias which should be considered when comparing scores. To address potential concern about three factors known to cause systematic variation in ratings (disciplinary differences, course level and course size), we have based our comparison groups on these variables. As our database grows, other factors may be taken into account. However, research shows that taken together, all the sources of variation listed typically account for less than 5% of variation in overall instructor ratings.

 

FACTORS LIKELY TO CAUSE SYSTEMATIC VARIATION IN RATINGS

 

Disciplinary Differences

 

Significant differences between ratings of courses in different disciplines are well documented. For example, courses in the humanities and fine arts tend to be rated more highly than those in physical and applied sciences. For this reason, most sources agree that ratings should not be compared across disciplines. (If cross-disciplinary comparisons of faculty are necessary, faculty standings within their own comparison groups can be compared.) Unless faculty have recommended combining similar subject areas, our reports always restrict comparisons to the subject area defined by the course subject code, e.g., ANTH, MUSI, POL, etc.
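One way to compare standings rather than raw means is to express each section mean relative to its own comparison group. The sketch below assumes the conventional T score definition (50 plus 10 times the z score); the guide does not state the formula behind the T scores on OIRPS reports, and all numbers are invented.

```python
def t_score(section_mean, group_mean, group_sd):
    """Conventional T score: 50 is the group average, 10 points is one SD."""
    z = (section_mean - group_mean) / group_sd
    return 50 + 10 * z


# Hypothetical instructors in two different subject areas
humanities = t_score(section_mean=4.4, group_mean=4.2, group_sd=0.2)
sciences = t_score(section_mean=4.0, group_mean=3.6, group_sd=0.4)
print(round(humanities, 1), round(sciences, 1))
```

Although the raw means differ by 0.4, both instructors stand one SD above their own group, so their standings are equivalent.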

 

Course Level

 

Lower division students tend to give the lowest ratings; graduate students tend to give the highest ratings.

 

Class Size

 

Small classes (fewer than 20 students) tend to receive the highest ratings, while large classes (40-100) tend to receive the lowest ratings. (Extremely large classes (over 100 students) tend to receive intermediate ratings, which suggests that students may have different criteria for evaluating them.)

 

Course Status

 

Students tend to give electives and courses in their majors slightly higher ratings than courses taken to fulfill a college or general education requirement. Research shows a correlation between overall ratings and whether the course is required versus elective. "Requiredness" was negatively associated with overall ratings of instructor (-.15), amount learned (-.18), and course (-.23).
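To put correlations of this size in perspective, squaring a correlation coefficient gives the proportion of variance the two measures share. The short Python calculation below applies that standard relationship to the three coefficients quoted above; the dictionary labels are shorthand chosen for the example.

```python
# Correlations with "requiredness" as quoted in the text
CORRELATIONS = {"instructor": -0.15, "amount learned": -0.18, "course": -0.23}


def percent_variance_explained(r):
    """Share of variance associated with a correlation r, as a percentage."""
    return 100 * r ** 2


for question, r in CORRELATIONS.items():
    print(f"{question}: r = {r:+.2f}, about {percent_variance_explained(r):.1f}% of variance")
```

Even the largest of the three coefficients corresponds to roughly 5% of the variation in ratings, so the effect is real but small.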


 

Semester or Summer Session

 

Summer Session ratings, on average, are significantly higher than fall or spring ratings for comparable courses at UA. Thus, unless otherwise noted, comparison groups do not include Summer Session data.

 

Course Content

 

Differences in ratings are occasionally associated with course content. For example, courses with quantitative content may receive slightly lower ratings than other courses at the same level in the same subject area. Similarly, courses that challenge strongly held beliefs may receive lower ratings from some students.

 

Years of Teaching Experience

 

Instructors with less than one year of experience tend to receive the poorest ratings. Teachers with between three and twelve years of experience tend to receive the best ratings, while those with more than twelve years tend to receive intermediate ratings.

 

Improper Administration of Questionnaires

 

Student ratings can be biased by failure to adhere to instructions for administering the questionnaire, such as failure of the instructor to leave the room during administration, failure to preserve student anonymity, administration of the evaluation during finals, and use of prejudicial introductory remarks. (The TCE monitoring system is a strategy to minimize such problems.)

 

FACTORS THAT HAVE LITTLE IF ANY INFLUENCE ON RATINGS

 

Scheduling Factors

 

Time of day and other scheduling factors appear to have little or no influence on ratings. (However, systematic differences in who attends classes at particular times could theoretically have some impact on ratings.)

 

Students’ Academic Ability

 

Academic ability as measured by grade point average has little relationship to student ratings. Evidently, poor students are just as appreciative of good teaching as abler students, while good students are just as critical of poor teaching as less able students. However, when there is great variety in students’ prior learning and abilities in a course, the instructor may end up concentrating on one group of students to the exclusion of others. In such a situation, the actual quality of teaching varies within the class and will probably be reflected in the ratings.

 

Gender

 

Researchers looking for correlations between ratings and gender have found significant variation, but in both directions. That is, some studies show female faculty receiving higher ratings while others show male faculty receiving higher ratings. In either case, the differences are typically trivial, accounting for less than 2% of the variation in ratings. Female students tend to give slightly higher ratings than male students and some studies have found correlations based on whether student and teacher gender are the same. At UA, female instructors tend to receive higher ratings in most subject areas. If you suspect a systematic pattern of gender bias in ratings for a particular course, please contact OIRPS.

 

Perceived Difficulty, Workload, and Expected Grades

 

The relationship between grades and ratings is complex. The preponderance of research evidence shows a very small positive correlation between ratings and expected grades. There is also some evidence that students will tend to give lower ratings when they expect grades lower than they usually get in other courses, as seen in their G.P.A.

 

A meta-analysis (Cohen, 1981) explored the relationship between overall instructor ratings and student achievement as measured by scores on an independently graded final exam in multiple sections of the same class taught by different instructors. Cohen found that students who received high scores on the final tended to rate their instructors highly (regardless of the instructor), suggesting that successful students tend to credit their instructors for their success.

 

Part 3. Using TCE Results

 

Overview

 

Student ratings data are used both for improving teaching (formative evaluation) and for making personnel decisions (summative evaluation). These two purposes call for somewhat different approaches to the ratings data. When TCE results are used in summative review, it is critical that standards be explicitly defined and consistently applied. This enables evaluators to make fair and reliable judgments while giving those evaluated a clear indication of the standards they will be judged against.

 

TCE Results in Summative Evaluation

 

SELECTING THE BASIS FOR COMPARISON: CRITERION-BASED OR NORM-BASED

 

Department plans for faculty performance appraisal should include an explicit (written) statement of the basis for judging TCE results. Essentially, there are two choices: criterion-based or norm-based. In criterion-based schemes, the performance of individuals is compared with fixed standards (e.g., ratings over 4.5 are deemed "outstanding"). In a strong teaching department, everyone could be deemed outstanding or excellent since individual scores are not affected by the scores of others. In norm-based schemes, the performance of individuals is compared with that of their peers (e.g., the top 10% of ratings are deemed "outstanding"). Norm-based schemes are conceptually similar to grading on the curve in that standards are relative to that of peers rather than absolute.
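The practical difference between the two schemes shows up most clearly in a strong department. The Python sketch below applies the two example rules from the paragraph above (ratings over 4.5, and the top 10% of ratings) to ten invented instructor means. The function names and the simple handling of ties are assumptions made for illustration.

```python
def criterion_outstanding(means, cutoff=4.5):
    """Everyone whose mean exceeds a fixed standard."""
    return {name for name, m in means.items() if m > cutoff}


def norm_outstanding(means, top_fraction=0.10):
    """The top fraction of the group, whatever their absolute scores (ties ignored)."""
    k = max(1, int(len(means) * top_fraction))
    ranked = sorted(means, key=means.get, reverse=True)
    return set(ranked[:k])


# Hypothetical strong department: all ten means are above 4.5
dept = {f"instructor_{i}": 4.55 + 0.04 * i for i in range(10)}

print(len(criterion_outstanding(dept)))  # all ten meet the fixed standard
print(norm_outstanding(dept))            # only the single highest mean qualifies
```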

 

Hybrid approaches with both criterial and normative elements are common. One hybrid approach is to use a criterion-based system but to set the criteria normatively, based on past performance within the unit. For example, if the unit average for a question is, over time, 4.1 ±.2, a criterion scheme that characterizes scores between 3.9 and 4.3 as "good" is reasonable. Another hybrid approach is to score some questions using absolute criteria but other questions normatively. Some questions typically produce less variable results than others (e.g., the question that reads, "I was treated with respect in this class") and are therefore poor candidates for a normative treatment. Another hybrid approach might involve using a criterial approach for new hires and a normative approach for merit comparison. (Of course, to ensure fairness, the same standards must be applied to all candidates within the category.)
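The first hybrid approach, a fixed criterion derived from the unit's own history, might be sketched as follows. The function names are hypothetical, and the center and half-width are simply the 4.1 and .2 from the example in the text.

```python
def good_band(unit_mean, half_width):
    """Fixed 'good' range derived from the unit's historical average."""
    return round(unit_mean - half_width, 2), round(unit_mean + half_width, 2)


def is_good(score, band):
    low, high = band
    return low <= score <= high


band = good_band(unit_mean=4.1, half_width=0.2)
print(band)                # the 3.9 to 4.3 range from the example
print(is_good(4.0, band))  # inside the band
print(is_good(4.5, band))  # above the band, so a higher category would apply
```

Once set, the band stays fixed for the review period, so an individual's classification does not shift with the scores of colleagues.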

 

Choosing appropriately between criterion-based and norm-based schemes requires understanding both the properties of the historical data within a unit and the characteristics of ratings as a measure of teaching effectiveness. OIRPS reports support both approaches by providing comparative statistics based on historical data as well as descriptive statistics. However, a normative approach is appropriate only when there is sufficient variation in individual scores (see below). Both approaches require that evaluators have specific quantitative skills; however, normative approaches are somewhat more complex.

 

USING A CRITERION-BASED APPROACH

 

In a criterion-based approach, standards are typically anchored to the semantic content of the question responses. Therefore, they can usually be considered face-valid (an exception is described in the next paragraph). An example of criteria based on face-valid interpretation of the ratings scales is presented in Table 2, where criteria describe results for single questions. It may be desirable to weight questions differently depending on instructional characteristics of courses or sets of courses. For example, a unit might decide to weight course materials lower than in-class activities, especially if texts are not fully under the control of the instructor. On the other hand, a unit wishing to credit faculty who develop web resources and elaborate course websites might weight course materials higher, whereas a unit wishing to place special emphasis on improving in-class activities might place extra weight on the ratings for that question as an incentive for improvement.
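Differential weighting amounts to a weighted average of question means. In the Python sketch below, the question labels and the particular weights are invented for illustration; each unit would set its own.

```python
def weighted_score(question_means, weights):
    """Weighted average of question means; weights need not sum to 1."""
    total_weight = sum(weights[q] for q in question_means)
    return sum(question_means[q] * weights[q] for q in question_means) / total_weight


means = {"in_class_activities": 4.0, "course_materials": 3.0}

# A unit that discounts materials because instructors do not choose the texts
print(weighted_score(means, {"in_class_activities": 2, "course_materials": 1}))

# A unit that wants to credit investment in course websites and resources
print(weighted_score(means, {"in_class_activities": 1, "course_materials": 2}))
```

The same two question means yield about 3.67 under the first weighting and about 3.33 under the second, which is why the weights should be written into the unit's evaluation plan in advance.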

 

Although the distribution of ratings data in most disciplines is approximately "normal" (ratings fall in a bell-shaped curve), the bell is almost always lopsided (in statistical terms, has negative skew). Since the mean will generally be higher than the natural center of the scale, a criterion-based approach that anchors the mid-point of a five-point classification scheme at the middle (3.0) may result in "grade inflation" because of this skew. Even when the midpoint of the scale (3.0) is semantically anchored to the word "average," the true average (i.e. the arithmetic mean) is around 3.6 for many questions in UA courses. In other words, someone rated "average" (3.0) by every student would in fact have below average ratings.
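A small numerical illustration of negative skew follows. The distribution of 100 ratings is invented to resemble the lopsided shape described above, and skewness is computed with the standard moment formula (third central moment divided by the 1.5 power of the second).

```python
from statistics import mean


def skewness(scores):
    """Moment-based skewness: negative when the long tail is on the low side."""
    m = mean(scores)
    n = len(scores)
    m2 = sum((x - m) ** 2 for x in scores) / n
    m3 = sum((x - m) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5


# Hypothetical ratings: most students choose 4 or 5, a few choose 1 or 2
ratings = [5] * 30 + [4] * 40 + [3] * 20 + [2] * 7 + [1] * 3

print(mean(ratings))      # well above the scale midpoint of 3.0
print(skewness(ratings))  # negative
```

With a distribution like this one, a rating of 3.0 sits well below the arithmetic mean even though it is the midpoint of the scale.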

 

The tendency for ratings distributions to be lopsided suggests that students tend to be lenient raters. In general, effective teachers should have overall teaching effectiveness ratings around 4.0 (usually effective). Ratings around 3.0, sometimes effective, should suggest a need for improvement.

 

USING A NORM-BASED APPROACH

 

To support units using the normative approach, OIRPS summarizes all of the historical data available within the unit to represent the most reliable "true" average over time. The Comparison Group Summaries provide comparison group norms including mean score equivalents for decile ranks to illustrate the distribution of means for each question within each comparison group. To decide whether a norm-based strategy is appropriate for a given unit, check to ensure that the historical data is sufficiently variable to support meaningful distinctions for individual faculty results. In some units, data may support five classifications for individual results (e.g., outstanding, superior, good, fair, poor) based on the overall unit distribution. In other units, data may support only three classifications (e.g., above average, average, below average). In extreme cases, the data may support no statistically significant difference among instructors. In some cases, a difference between an instructor's mean and a comparison mean may be statistically significant in the technical sense, but so small as to preclude meaningful interpretation. For example, it is often the case that a difference smaller than .4 represents no real difference in teaching performance even if it is statistically significant.
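The two checks described above, locating a mean within the decile distribution and screening out differences too small to matter, can be sketched in Python. The section means are invented, statistics.quantiles is used as one reasonable way to obtain decile cut points (the guide does not say how the Comparison Group Summaries compute theirs), and the .4 threshold is the rule of thumb from the text rather than a fixed standard.

```python
from statistics import quantiles

MIN_MEANINGFUL_DIFF = 0.4  # rule of thumb quoted in the guide


def decile_cutoffs(section_means):
    """Nine cut points dividing the comparison group's means into ten groups."""
    return quantiles(section_means, n=10)


def decile_rank(value, cutoffs):
    """1 for the lowest tenth of the group, 10 for the highest."""
    return 1 + sum(value > c for c in cutoffs)


def is_meaningful(instructor_mean, comparison_mean, min_diff=MIN_MEANINGFUL_DIFF):
    """Screen out differences too small to reflect real differences in teaching."""
    return abs(instructor_mean - comparison_mean) >= min_diff


group = [3.2, 3.5, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.5, 4.6]
cuts = decile_cutoffs(group)
print(decile_rank(4.4, cuts))
print(is_meaningful(4.3, 4.1), is_meaningful(4.7, 4.1))
```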

 

Also, the closer together scores are in a comparison group, the less likely that differences are meaningful. For example, if all scores within a unit fall between 4.5 and 5.0, with an average score of 4.7, it would be misleading to classify those with scores of less than 4.7 as "below average" since a ratings score of 4.5 is very good under any circumstance. In such cases, a criterion-referenced system is strongly recommended. OIRPS is available to analyze unit data and make recommendations concerning the use of normative statistics or to provide feedback on established unit procedures.

 

TCE Results in Formative Evaluation

 

Short Form results provide sufficient information to determine if teaching is excellent, acceptable, or problematic, but not enough information to determine causes. In general, for effective teachers, the mean for responses to the overall effectiveness question is 4.0 or higher. If the mean for this question is below 4.0, methods for collecting more detailed information are advisable: either a Long Form questionnaire, a mid-semester focus group evaluation, or an expert observation would be a logical next step. See A Short Guide to Evaluating Teaching, Chapter 2 (page 8), at http://aer.arizona.edu/Teaching/docs/shortGuideEval.pdf for further information about getting formative feedback about teaching effectiveness.


 

Table 2. Sample 5-Point Criterion-Referenced Scheme for Judging TCE Results in Summative Evaluation of Teaching*

| TCE Question | Outstanding (5 points) | Superior (4 points) | Good (3 points) | Fair, Improvement Needed (2 points) | Unacceptable (1 point) |
|---|---|---|---|---|---|
| What is your overall rating of the instructor's teaching effectiveness (almost always/almost never)? | 5.0 | 4.5 | 4.0 | 3.5 | Less than 3.5 |
| Rate the usefulness of the outside assignments. | 5.0 | 4.5 | 4.0 | 3.5 | Less than 3.5 |
| Rate the usefulness of the in-class activities. | 5.0 | 4.5 | 4.0 | 3.5 | Less than 3.5 |
| What is your overall rating of the instructor's teaching effectiveness compared with others? / What is your overall rating of this course? / How much do you feel you have learned in this course? | These questions should be used with caution because their response scales are worded in comparative terms. There may be systematic variation in student responses associated with the personal history of each student. For example, "as good as usual" may be very good indeed or not depending on what "usual" was for the respondent. If these questions are used, perhaps a 3-point category system (exceeds, meets, or fails to meet expectations) would be more appropriate. |
| Rate the materials used in this course.** | 5.0 | 4.5 | 4.0 | 3.5 | Less than 3.5 |
| I was treated with respect in this class. | 5.0 | 4.5 | 4.0 | 3.5 | Less than 3.5 |
| Course difficulty | This is not an evaluative question in the sense that difficulty is not inherently good or bad. Instead, consider if the level of perceived challenge is appropriate for the particular learners. |

* The point values associated with various levels of ratings for each question are provided as examples for converting ratings data into scores associated with a criterion-based classification scheme for summative evaluation. An additional column indicating weights for each question is useful if questions are not given equal weight.

** This item should not be used if the instructor had no control over the choice of textbooks and readings.
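As an illustration of how a scheme like Table 2 converts a question mean into points, consider the following Python sketch. It reads each value in the table as the lowest mean that earns that level, which is an assumption; the table does not say how means falling between the listed values should be treated, and a unit might reasonably choose differently.

```python
# (minimum mean, points), read from Table 2 under the lower-bound assumption
THRESHOLDS = [(5.0, 5), (4.5, 4), (4.0, 3), (3.5, 2)]


def points_for(question_mean):
    """Convert a question mean into 1 to 5 points."""
    for minimum, points in THRESHOLDS:
        if question_mean >= minimum:
            return points
    return 1  # "Less than 3.5" is scored as Unacceptable


def total_points(question_means):
    """Sum of points across questions, with all questions weighted equally."""
    return sum(points_for(m) for m in question_means.values())


# Hypothetical results for three of the evaluative questions
results = {"overall_effectiveness": 4.6, "outside_assignments": 4.1, "in_class": 3.4}
print({q: points_for(m) for q, m in results.items()})
print(total_points(results))
```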

 

 

 

 

References

 

Aleamoni, L.M. (1981) Student ratings of instructor and instruction. In J. Millman (ed.), Handbook on teacher evaluation. Beverly Hills, CA: Sage Publications, Inc.

Arreola, R.A. (1995) Developing a Comprehensive Faculty Evaluation System. Bolton, MA: Anker Publishing Company.

Braskamp, L.A. and J.C. Ory (1994) Assessing Faculty Work. San Francisco: Jossey-Bass Publishers.

Centra, J.A. (1975) Colleagues as raters of classroom instruction. Journal of Higher Education, 46: 327-337.

Centra, J.A. (1989) Faculty evaluation and faculty development in higher education. In J.C. Smart (ed.), Higher Education Handbook of Theory and Research, Vol. 5. New York: Agathon Press.

Centra, J.A. (1993) Reflective Faculty Evaluation. San Francisco: Jossey-Bass Publishers.

Cohen, P.A. (1981) Student ratings of instruction and achievement: a meta-analysis of multisection validity studies. Review of Educational Research, 51: 281-309.

Doyle, K.O. (1975) Student Evaluation of Instruction. Lexington, MA: D.C. Heath and Company.

Doyle, K.O. (1983) Evaluating Teaching. Lexington, MA: D.C. Heath and Company.

Franklin, J. and Theall, M. (1989) Who reads ratings: knowledge, attitude, and practice of users of student ratings of instruction. Paper presented at the 1988 Annual Meeting of the American Educational Research Association, San Francisco, CA. (ERIC Document Reproduction Services No. ED 306-241).

Marsh, H. (1992) Student evaluation of university teaching: a multidimensional perspective. In J.C. Smart (ed.), Higher Education Handbook of Theory and Research, Vol. 8. New York: Agathon Press.

Murray, H.G. (1991) Effective teaching behaviors in the college classroom. In J.C. Smart (ed.), Higher Education Handbook of Theory and Research, Vol. 7. New York: Agathon Press.

Seldin, P. et al. (1999) Changing Practices in Evaluating Teaching. Bolton, MA: Anker Publishing Co.

Theall, M. et al., eds. (2001) The Student Ratings Debate: Are They Valid? How Can We Best Use Them? New Directions for Institutional Research, No. 109. San Francisco: Jossey-Bass Publishers.

Teacher-Course Evaluations & Test Scoring
MLK 200 tel. (520) 621-7337,  fax (520) 626-4375
Questions? Comments? tceua@email.arizona.edu
All contents copyright © 2005 Arizona Board of Regents