Part 1. The TCE System
Part 2. Understanding TCE Results
Part 3. Using TCE Results
Student Ratings at the University of Arizona are called
TCEs – short for Teacher-Course Evaluations. This guide provides assistance in selecting
TCE questionnaires, understanding TCE reports of results, and using TCE results both for
improving teaching and for performance appraisal. Part 1 describes questionnaires, procedures, and
reports of results. Part 2 describes the statistical information contained in the results reports
and covers key concepts for interpreting TCE results. Part 3 offers suggestions for using TCE
results for summative (performance appraisal) and formative (improvement/development)
purposes. OIRPS welcomes your comments and questions. We are committed to making our services
responsive to the needs of faculty and administrators.
Student ratings of instruction, properly constructed and
administered, provide valid and reliable data for improving teaching as well as for documenting
teaching performance for administrative review. This claim is supported by a large body of research
(e.g., Aleamoni, 1981; Centra, 1989; Doyle, 1975; Marsh, 1992; Theall et al., 2001). Student
ratings are currently used in over ninety-five percent of postsecondary
institutions because they are: 1) multidimensional, 2) reliable and stable, 3) relatively valid against a variety of indicators
of effective teaching, and 4) relatively unaffected by a number of variables hypothesized as
possible biases.
Administrators use student ratings data both in making
personnel decisions (summative evaluation) and in mentoring faculty regarding their
effectiveness as teachers (formative evaluation). Faculty use student ratings both for
documenting and for improving teaching. Summative and formative purposes, though related, are
conceptually and practically different in important ways. Most importantly, appropriate use of
ratings data for personnel decision-making requires careful data collection procedures and clear
policies about interpretation, consistently applied. Improper use of ratings exposes units to the
possibility of litigation and can damage collegiality and productivity (Franklin and
Theall, 1989).
Although ratings are necessary, they do not provide
sufficient information for a comprehensive evaluation of teaching. Teaching is a multidimensional
activity comprising course planning, classroom instruction, mentoring and advising of students,
assessing student work, etc. Student ratings alone are insufficient because students are unable
to observe or unqualified to judge many aspects of teaching (e.g., the
instructor's content expertise and instructional design skills). Evaluation specialists recommend consideration of at least
two additional sources of information, such as classroom observation reports, self-evaluation
statements, peer assessments of course materials, and evaluations by instructional specialists
(e.g., Arreola, 1995; Braskamp and Ory, 1994; Centra, 1993; Theall et al., 2001).
TCE materials and services are intended to help faculty and
administrators get the most benefit from TCE reports. In addition to this guide, A Short
Guide to Evaluating Teaching (http://aer.arizona.edu/Teaching/docs/ShortGuide.pdf)
addresses the broader dimensions of evaluating teaching, covering multiple methods in
addition to student ratings and illustrating how student ratings data may be integrated with information
from other sources. It is addressed both to individual faculty and to
department administrators and P&T committees.
Standard TCE services are available at no cost for all UA
academic units except those in the College of Medicine. The Extended University supports TCE
services for Summer and Winter Sessions, and Evening and Weekend Campus. Online TCE
service is available for courses with significant instructional computing components, such as
distance education courses. Contact OIRPS at 621-9585 for more information about online
TCEs.
TCE services are available to all UA instructors for courses
with an enrollment of five or more and are
normally ordered through the instructor’s academic unit. TCE packets are not
created for courses with an enrollment of fewer than five because "ratings
of courses based on five or fewer completed forms are of questionable
reliability and validity" (Braskamp and Ory, 1994, p. 188).
Each participating unit has a TCE “contact” who coordinates TCE materials. (See Contacts [http://aer.arizona.edu/Teaching/Contacts/contacts.asp].) Each
semester, the contact identifies faculty and courses for
which TCE materials will be prepared. Contacts should inform faculty in writing about their
unit’s policy on participation in the TCE process, who sees the results, and how the results are
used. Faculty should provide contacts with updates on course information (enrollment changes; special
circumstances that might affect the TCE process) as well as requests for particular
questionnaires. It is critical that department contacts provide OIRPS
with accurate information; otherwise,
instructors may not receive the proper forms in a timely manner.
http://aer.arizona.edu/questionnaires/quesmain.asp
SHORT FORM AND LONG FORM
Most University of Arizona faculty may choose either the Short Form or the Long Form. The Short Form contains a small core of eleven global questions suitable for use in summative evaluation along with six questions about student demographics. The Long Form contains the same core and demographic questions plus more specific questions designed to provide detailed feedback. We are currently developing a system that will allow faculty to “customize” the Short Form by adding questions on its back. Check the OIRPS website (http://aer.arizona.edu/) for updates on this option.
When ratings results are to be used for summative purposes
(performance appraisal), a small set of global questions provides an adequate basis for judgment
while minimizing evaluator labor (Seldin, 1999; Theall et al., 2001). For formative purposes
(development/improvement), a more detailed and comprehensive questionnaire is desirable.
OIRPS
reconciles these needs by providing the choice of forms. Regardless of the form chosen, only
results for the core questions are provided to department heads and made available to the UA
community. Results for other questions, as well as student written comments, are
considered direct feedback to instructors and are inappropriate for consideration in performance appraisal.
(Reports for graduate teaching assistants (GTAs) are treated differently; see below.)
Results from the core questions can be used to determine
the overall effectiveness of an instructor, but they provide little in the way of
specifics. Instructors satisfied with their ratings may wish to use the Short Form for most classes and the
Long Form occasionally. Instructors actively involved in developing their teaching may want to
use the Long Form or customization option more frequently.
GTA FORMS
Two forms have been developed for lab or discussion sections attached to lecture classes. These sections are most commonly taught by GTAs. Like the Long Form, they contain questions regarding instructor behaviors to provide adequate data for developing/improving teaching. However, the full complement of core questions is NOT included because several of the questions are unsuitable for attached sections (e.g., the value of the outside assignments). Results for GTAs are sent to the GTA’s unit head.
The Team Form is similar to the Short Form, except that the words “the instructor” are replaced by “the instructional team.” It is most appropriate for “simultaneous” team-taught courses, where two or more instructors co-teach the course throughout the semester. In this case, all instructors attend class meetings and more than one instructor may be responsible for a single class session.
The drawback of using the Team Form is that instructors will not have individual TCE results to put forward for administrative review. Instructors teaching in simultaneous teams may request that a separate TCE form be prepared for each instructor. The drawback to this approach is that students must fill out two or more forms for the same course, even though only a few questions ask about the instructor.
OIRPS recognizes that neither option is entirely satisfactory and is aware that team-teaching is a growing trend. We are developing a questionnaire that will enable students to separately evaluate the members of a teaching team without duplicating other questions.
The Team Form is less appropriate for “consecutive” teams, in which two or more instructors teach parts of the course consecutively: e.g., instructor A teaches the first month, instructor B the second month, instructor C the third month. In this case, rating the effectiveness of “the instructional team” as a whole is less meaningful. It is also problematic to wait until the end of the semester to evaluate instructors A and B, especially the former. With adequate advance notice, OIRPS can provide TCEs earlier in the semester, so that a TCE can be conducted for instructor A at the end of instructor A’s teaching responsibility. Contact OIRPS for more information.
In addition to the Long and Short Forms, OIRPS offers a number of custom questionnaires, most of which have been developed for individual academic units. Contact OIRPS.
OVERALL QUESTIONS
Almost every questionnaire includes four "overall" questions:
1. What is your overall rating of this instructor's teaching effectiveness?
2. What is your overall rating of this course?
3. How much do you feel you have learned in this course?
4. What is your rating of this instructor compared with other instructors you have had?
These are single questions, not composite or average scores based on other questions. Overall questions are recommended for use in performance appraisal because they are applicable across the wide variety of teaching styles and course formats (Theall et al., 2001).
The overall instructor, course, and amount learned questions are usually highly inter-correlated. Occasionally, instructors receive somewhat higher ratings than their courses. This may happen with an unpopular required course where students recognize the teacher’s efforts but still feel negative about the course.
The comparative instructor question (4) is somewhat problematic, since “compared with other instructors you have had” may be interpreted differently by different students. (All other instructors? Other university instructors? Other instructors in the major?) Often, instructors are rated effective in their own right (per the overall instructor question), but less effective when compared to some reference set of “other instructors.” Instructors whose comparative effectiveness rating is as high as their overall effectiveness rating deserve congratulations. However, OIRPS recommends that only the overall instructor effectiveness question be used in summative evaluation.
WORKLOAD, VALUE, AND DIFFICULTY QUESTIONS
Virtually all questionnaires include four questions related to workload and value of work: “amount learned,” “difficulty level of the course,” “total hours spent on class-related work,” and “value of hours spent on this class.” These questions gauge the level of perceived challenge and provide clues about student motivation.
Analyses of ratings at UA (and nationwide) show that workload and difficulty per se have little if any relationship to overall ratings of instructors or courses (Theall et al., 2001). A course may be rated high in difficulty or high in preparation time for positive reasons (it was fast-paced and challenging) or negative ones (poor organization, confusing assignments). On the other hand, the question asking how many preparation hours students considered valuable to their education has a moderately strong positive association with overall ratings, particularly in applied fields such as engineering and business. Not surprisingly, the higher the number of hours considered valuable, the higher the course and instructor ratings are likely to be.
Course questions explore aspects of course design and organization, such as outside
assignments, in-class activities, and class materials (texts, websites, etc.). Course question
results can highlight problems that may be outside an instructor's direct control, such as a poor text. In
such cases, the course question results may be relevant to personnel decision-making in the sense of
“mitigating circumstances” and should be discussed in the narrative faculty prepare for promotion and
tenure portfolios.
All questionnaires include questions designed to obtain self-reports and demographic information, useful in understanding student responses.
The Long, Discussion, Lab, and most custom questionnaires include a series of
questions about teaching behaviors which comprise a multidimensional
profile of teaching (organization and delivery of instructional content, classroom interaction
and rapport, classroom management, testing and feedback).
TIMING
TCEs are normally administered as paper questionnaires
distributed within classes. (See below for information about online
administration.) The standard TCE administration period is usually the three
weeks preceding finals. It is generally best not to wait until the last day of
class for administration. For best results, administer the TCE at the beginning
of class on a day when attendance will be at a maximum. Do not tell students
beforehand what day the evaluation will take place.
ONLINE ADMINISTRATION
At this time, TCEs are administered online only for
distance education courses and courses taught in computer labs where a workstation is available
for each student. OIRPS is considering possible pilots for a more widespread
electronic system. If you would be interested in participating in an online
pilot, please contact Gwen Johnson (qwj@u.arizona.edu).
INSTRUCTIONS FOR ADMINISTRATION OF PAPER QUESTIONNAIRES
TCE results can be compromised by improper administration of questionnaires. The instructions below are intended to ensure fairness and anonymity. Failure to follow these instructions may result in the TCE being declared invalid. Please contact OIRPS if you have a problem with this procedure or if you think your student monitor has failed to follow instructions.
1. Inspect the contents of your TCE packet when you first receive it. If there are any errors (quantity, form, name, etc.), ask your department
contact to request new materials immediately.
2. Select a date for administering the TCE.
3. Select a student monitor. The monitor must be a member of the class, chosen the day the TCE is administered. Do not use your TA or unit staff
as monitors.
4. Discuss the TCE process with the monitor. If you are adding your own materials to the packet, make sure your monitor knows to insert them along with the TCEs. Remind the monitor to follow the step-by-step instructions on the monitor packet. Your monitor should give you a yellow monitor card after turning in the questionnaires.
5. Encourage students to offer written comments. Tell students that you will not see the TCE results until after final course grades have been posted. You may also suggest that students disguise their handwriting.
6. Allow the monitor to administer the TCEs. Leave the room and remain out of the room while the TCE is being administered, returning only after the monitor has placed the materials in the packet and sealed it according to instructions.
7. Point out the official TCE drop box locations listed on the front of the monitor packet to make sure that the monitor knows where to deliver the completed materials.
8. Obtain the completed yellow monitor card receipt from your monitor. This is your "receipt" documenting that you used a monitor and who that
monitor was.
INSTRUCTOR REPORTS
Beginning fall 2002, instructors will be able to review and
print copies of their own reports by clicking on the Instructor Reports button on the
OIRPS
website (http://aer.arizona.edu/). Written comments will be placed in individually addressed envelopes
marked “confidential,” to be picked up by the department contact early in spring
semester for fall results and early in June for spring results. Unit heads are provided with summaries
of the core question results for each course as well as a set of reports summarizing results for
groups of courses (the Comparison Group Summaries). Results for questions other than core
questions are considered confidential and provided only to instructors, as are the student
written comments. OIRPS recommends against using student written comments in summative
evaluation without stringent safeguards.
For discussion, see Using Student Written Comments in
Summative Evaluation
[http://aer.arizona.edu/Teaching/docs/shortGuide.pdf#page=32].
In addition to the standard reports for individual
sections, instructors will be able to view and print two additional reports, the TCE History and the TCE
Overall Effectiveness Graphics.
Individual Section Reports: Reports list results for core questions on the first page
and results for other questions on one or more subsequent pages
depending on the questionnaire chosen. The final page, the TCE Comparison Report, provides comparison
group statistics and a graphic comparison between the instructor’s means for the core
questions and the comparison group means for those questions. (This is discussed below under
“Comparison Groups.”) Since most undergraduate courses are in two comparison groups, two
graphs are provided, along with comparison group statistics. (Graduate courses are
typically part of only one comparison group.) If no comparison data is available, no graphic reports are
provided. Click to see an example of the report at http://aer.arizona.edu/Teaching/Reports/section.htm.
TCE History: The TCE
History is a listing of courses taught by the candidate for which a TCE was
requested. It may be organized either chronologically or chronologically within
levels (grad, upper division, lower division). Courses where no TCE was
requested are not listed. See a sample TCE History at http://aer.arizona.edu/Teaching/reports/history.htm.
TCE Comparison Reports: The
TCE Comparison report is the final page of the Individual Section Report, providing comparison group statistics and a
graphic comparison between the instructor’s means for the core questions and the
comparison group means for those questions (see above). Instructors may print a set of Comparison
Reports for all courses taught within the past five years independently of their Individual Section
Reports. This set of reports may be useful in compiling materials for administrative reviews.
Overall Effectiveness Graphics: The TCE Overall Effectiveness Graphic provides an
at-a-glance summary of how an individual’s courses within each comparison group
compare to other courses in the comparison group. A separate TCE Overall Effectiveness Graphic is provided for
each comparison group. Each TCE Overall Effectiveness
Graphic shows how the instructor’s combined results for the overall teaching effectiveness
question compare to results for other courses in the same subject area at the same level. Each
graphic provides a description of the comparison group, a description of the instructor’s sample,
and a graphic comparing the confidence interval for the overall teaching effectiveness
question for the comparison group to the confidence interval for the instructor’s combined set of
courses, as well as for each course individually. For more information, see our Guide to the TCE Overall Effectiveness
Graphic
[http://aer.arizona.edu/Teaching/docs/oegraphic.pdf].
Problem Reports: If a
problem occurs in administering a TCE or if a TCE packet is not returned for processing, a green “Problem Report” is sent,
with instructions for requesting a resolution (see Problem Report Codes at http://aer.arizona.edu/Teaching/reports/problems.htm).
Some problems can be remedied and a report released. However, because of the
number of TCEs conducted, problem resolution requests must be made by May 1 of the following semester
for fall courses and by December 1 of the following semester for spring courses.
ADMINISTRATOR REPORTS
Beginning fall 2002, department contacts may review online
and print a variety of reports that allow unit heads to overview department teaching
performance. (These replace the reports sent to unit heads at the end of each semester.) Three reports are
available, described below.
Comparison Group List:
a list of comparison groups within the unit showing the number of sections and responses for each group along with
descriptive statistics for the Overall Effectiveness Question for that group. You can see an
example at http://aer.arizona.edu/Teaching/reports/complist.pdf.
Comparison Group Summaries: a separate summary report is provided for each comparison group listed on the Comparison Group List. Each Comparison
Group Summary shows the mean at each decile for each of the core questions.
Collectively, the Comparison Group Summaries show the range of variation across
comparison groups in the unit. The Comparison Group Summaries make it possible
to review TCE results for sets of courses and get a sense of the overall
student perception of these courses. This report can be used to determine
whether data in a comparison group is sufficiently variable to allow meaningful
distinction among faculty based on ratings. It can also be used to document
overall unit performance for unit or program reviews. You can see an example at
http://aer.arizona.edu/Teaching/reports/compsummary.pdf.
TCE Comparison Reports: The
TCE Comparison report is the final page of the Individual Section Report, providing comparison group statistics and a
graphic comparison between the instructor’s means for the core questions and the
comparison group means for those questions. TCE Comparison Reports are available for all courses in the
unit taught in a given semester. You can see an example at http://aer.arizona.edu/Teaching/reports/compar.pdf.
RATINGS RESULTS REPORT
Under Faculty Senate mandate, OIRPS publishes TCE results on the UAWeb at https://aer.arizona.edu/ASUA/. This document, updated each semester, is accessible to anyone in the UA community with a valid u.arizona.edu email account. The Ratings Results Report is designed primarily for student use, so it summarizes only the frequency of response to the core questions and the percentage TCE response. This report is not suitable for use in performance appraisal because it provides no estimate of error and no statistical basis for comparison to similar courses.
Comparison groups have been established to enable comparison of an instructor’s ratings to ratings for similar courses. Most TCE reports include a “TCE Comparison Report” as their final page. This report provides comparison group statistics and a graphic comparing the instructor’s means for the core questions and the comparison group means for those questions. Comparison groups are based on 1) subject, 2) course format (lab, studio, lecture, etc.), 3) level (lower division, upper division, graduate), and 4) number of students enrolled (below 5, 5-19, 40-59, 60 and above). This means that courses are compared only to courses with the same subject prefix, at the same level. Undergraduate courses are additionally compared to courses with the same subject prefix and level that are of comparable size. Disciplinary area, level, and size are factors known to be associated with systematic variation in ratings, although taken together, they rarely account for more than 5%-10% of total variance. See Systematic Variation in Ratings, below, for more information.
The usefulness of comparison data depends in part on the amount of data available. This is a function of both the number of comparable courses in the subject area and the number of semesters for which data is available. Reports for semesters prior to spring 2000 include %Rank and T scores when the comparison group has 50 or more sections. These statistics are omitted in spring 2000 and subsequent reports. If the comparison group has at least five sections, the comparison group mean, standard deviation, median, and 95% confidence interval are given. (See Reading the Results in Part 2 of this report.)
As more data becomes available for a unit, more focused comparison groups can be formed. Ideally, a comparison group includes courses of similar size, level, and status of student enrollment (required versus elective). Comparisons involving course format; teacher variables such as rank, years teaching, gender, etc.; and student variables can be made if systematic variation in ratings is found to be associated with such factors. (For more information, see Systematic Variation in Ratings in Part 2 of this report.) In general, small units will have fewer comparison groups; however, over time, more comparisons become possible.
To learn more about the following special TCE services, contact OIRPS.
· creating customized forms for special assessment needs
· receiving individual or group training in the interpretation and use of student ratings for
teaching improvement and/or administrative review processes
· including other questionnaires to be administered at the same time as the TCE
questionnaire
Part 2. Understanding TCE Results
Understanding student ratings requires an understanding of statistical concepts related to sampling, significance, and precision as well as an understanding of the characteristics of ratings as a measure of teaching performance. Because student ratings statistics do not have the precision typical of statistics in the sciences, it is always important to interpret them in the context of individual and unit patterns. OIRPS offers workshops and consultation on interpreting these statistics and using TCE results appropriately, as well as on other aspects of evaluating teaching.
The primary units of analysis in TCE reports are individual student responses within individual sections. Many reports also show summaries of results for the same questions from sets of similar courses. This section describes the statistics used in the reports and offers suggestions for their interpretation, along with information about general characteristics of student ratings. OIRPS recommends a three-step procedure for reviewing TCE reports:
Step 1: Check the sample
Step 2: Review individual results
Step 3: Review comparison statistics
Before using ratings, it is important to know how representative the available data is. Standards for samples depend on how the ratings will be used: they should be most stringent when ratings are used in performance review.
For responses in a section to be meaningful for decision-making purposes, they must be representative of the entire class. Information about the sample is printed at the top of each TCE report: number enrolled, number responding, and percent responding. Use Table 1 to decide whether enough students responded for the sample to be usable.
The higher the proportion of respondents to those enrolled, the more reliable the results. In general, sections with less than a 50% response rate should not be used for performance appraisal. The smaller the class, the higher the percentage of responses needed to ensure that the sample is representative.1 If the non-response rate seems high, there may be a systematic reason for student absence that might bias results. For example, if ratings are administered the day of a review session when attendance is optional, students for whom instruction has been most effective may be excluded.
If only a small fraction of students respond, the responses can only be considered the opinions of those few students – though it may be tempting to generalize if they are positive.
1 Interrater reliability coefficients (a measure of agreement within a class compared with agreement among raters across many classes) are typically in the .70s for 10 student raters, in the .80s for 15 raters, and in the .90s for 20 or more students. It is interesting to note that interrater reliability for student ratings is much higher than for peer ratings. In one study (Centra, 1975), the interrater reliability for 15 faculty colleagues was .57, a dramatic contrast with .85 for 15 students.
Table 1: Guidelines for Judging Samples Within Sections

| Class size | Recommended response % |
|---|---|
| 5-20 | at least 80%, more recommended |
| 20-30 | at least 75%, more recommended |
| 30-50 | at least 66%, 75% or more recommended |
| 50-100 | at least 60%, 75% or more recommended |
| 100 or more | more than 50%, 75% or more recommended |
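The Table 1 guidelines can be expressed as a simple lookup. The following is a minimal sketch, not an OIRPS tool: the function names are hypothetical, and because the class-size ranges in Table 1 share their boundary values (20, 30, 50, 100), the sketch assumes a boundary size falls under the stricter (smaller-class) row.

```python
# Hypothetical helper based on Table 1. Boundary handling is an assumption:
# a class size that appears in two rows (e.g. 20) uses the stricter row.
TABLE_1 = [
    # (largest class size covered by the row, minimum response fraction)
    (20, 0.80),
    (30, 0.75),
    (50, 0.66),
    (100, 0.60),
]


def minimum_response_rate(enrolled: int) -> float:
    """Return the minimum response fraction Table 1 gives for this class size."""
    if enrolled < 5:
        raise ValueError("Table 1 starts at a class size of five")
    for largest_size, minimum in TABLE_1:
        if enrolled <= largest_size:
            return minimum
    return 0.50  # "100 or more" row


def sample_is_usable(enrolled: int, responding: int) -> bool:
    """Check the section's response rate against the Table 1 minimum."""
    rate = responding / enrolled
    minimum = minimum_response_rate(enrolled)
    if enrolled > 100:
        return rate > minimum  # Table 1 says "more than 50%"
    return rate >= minimum     # other rows say "at least"


print(sample_is_usable(18, 15))  # 15 of 18 is about 83%, above the 80% minimum
print(sample_is_usable(18, 14))  # 14 of 18 is about 78%, below the 80% minimum
```

A sample that passes this check meets only the minimum; Table 1 recommends a higher response rate in every row.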
While the results from a single administration of a TCE questionnaire, particularly a long questionnaire, can provide useful information, such results apply to the course as one event in time only. Averaged results from comparable courses taken over several evaluations (each with an adequate sample of response) are more likely to fairly represent teaching ability. A minimum of five courses is recommended. It is also important to ensure that the courses selected are representative. If an instructor’s teaching load is half graduate courses and half undergraduate courses, the sample presented for review should be about half graduate and half undergraduate courses. Most importantly, no single score or set of scores from a single section should be used to judge teaching in a performance appraisal.
SAMPLE QUALITY OF COMPARISON GROUPS
Questions to ask about comparison groups include:
1) Are the courses in the comparison group reasonably comparable in content, size, and instructional methods?
2) Are there enough courses in the comparison group?
3) Were a significant number of courses that met the selection criteria for the comparison group not included because their instructors did not participate or because insufficient student response, lack of documented student monitoring, or other errors invalidated the data?
4) How many different instructors taught courses included in the comparison group?
FREQUENCIES AND PERCENT OF VALID RESPONSE
For each question, the distribution of student responses across the possible response choices is given in frequency of responses per option and percent of valid responses per option. Interpreting the data is largely common sense – how many students "said" what, in terms of the available response options for each question. Usually, students are in fairly good agreement in their ratings and scores cluster around two or three adjacent options.
For positively stated questions representing behaviors associated with effective teaching, it is desirable for responses to cluster in the first two options, "almost always" and "usually." If a significant percentage of students respond "sometimes," "rarely," or "almost never," the question points to an area of teaching skill that likely needs attention. Responses should cluster similarly for questions with response scales worded "very useful" to "nearly useless."
For questions with normatively worded response options such as "among the best" to "among the worst," more caution is needed since the basis for comparison is unknown. For example, if a student has taken only exceptionally well-taught courses, a moderately well-taught course might seem poor by comparison.
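The frequency and percent-of-valid-response figures described in this section can be sketched in a few lines. This is an illustration only: the function name and the sample responses are made up, and the sketch assumes that a blank or unrecognized answer is simply left out of the "valid" total.

```python
from collections import Counter


def response_distribution(responses, options):
    """Return {option: (frequency, percent of valid responses)}.

    Assumption: anything not in `options` (e.g. a blank) is not a valid response.
    """
    valid = [r for r in responses if r in options]
    counts = Counter(valid)
    total = len(valid)
    return {
        option: (counts[option], 100.0 * counts[option] / total if total else 0.0)
        for option in options
    }


options = ["almost always", "usually", "sometimes", "rarely", "almost never"]
# Made-up section: 20 valid responses plus 2 blanks.
responses = (["almost always"] * 10 + ["usually"] * 6 + ["sometimes"] * 3
             + ["rarely"] * 1 + [None] * 2)

for option, (count, percent) in response_distribution(responses, options).items():
    print(f"{option:15s} {count:3d} {percent:5.1f}%")
```

In this made-up section, 80% of valid responses fall in the first two options, the clustering described above as desirable for positively stated questions.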
MEANS, MEDIANS, AND STANDARD DEVIATIONS
Means and medians are measures of central tendency, showing the "middle" of a set of scores. The standard deviation (SD) is a measure of how variable scores are, i.e. how spread out they are around that "middle." Means and SDs appear on all reports in both section data and comparison data. Medians appear only in comparison data.
The mean for a question is the arithmetic average of student responses. For most TCE questions, means can range from 1 to 5. Most questions are reverse scaled: that is, the most positive option, "A," is scored as 5 points. The "Key" on each question tells how individual questions were scored.
The SD gives an approximate measure of agreement or disagreement among raters. Perfect agreement would yield an SD of 0. In a typical class, about two thirds of ratings fall within one rating point above or below the mean and the SD is 1.0 or less. If the SD for a question scaled with 5 points is higher than 1.2, the mean is not a good measure of student response.
High SDs occur when opinion in a class is strongly divided between very high and very low ratings or when opinion is dispersed across the entire response scale. Because students and teachers vary, it is possible for a teacher to be "among the best" for some and "among the worst" for others. In such cases, the mean does not represent a "typical" student opinion in any meaningful sense. (Consultation to explore the source(s) of consistently high SDs is available from OIRPS.)
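To see why a high SD undermines the mean, it may help to compare two invented classes that have the same mean but very different levels of agreement. This sketch assumes the sample standard deviation (`statistics.stdev`); the guide does not say whether OIRPS reports use the sample or population formula. The 1.2 threshold is the rule of thumb given above.

```python
import statistics

def summarize(ratings, sd_flag=1.2):
    """Mean, SD, and a flag for when the mean is a poor summary."""
    mean = statistics.mean(ratings)
    sd = statistics.stdev(ratings)  # assumption: sample SD, not population SD
    return {"mean": mean, "sd": sd, "mean_is_questionable": sd > sd_flag}

# Two hypothetical classes of ten students, both with a mean of 3.9.
agreeing = [4, 4, 4, 3, 4, 5, 4, 3, 4, 4]   # ratings cluster on adjacent options
divided = [5, 5, 5, 5, 5, 5, 1, 1, 5, 2]    # opinion split between high and low

for name, ratings in (("agreeing", agreeing), ("divided", divided)):
    s = summarize(ratings)
    print(f"{name}: mean={s['mean']:.2f}, sd={s['sd']:.2f}, "
          f"questionable={s['mean_is_questionable']}")
```

In the divided class almost no student actually gave a rating near 3.9, which is the sense in which the mean fails to represent a "typical" opinion.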
CONFIDENCE INTERVALS
Most OIRPS reports show a 95% confidence interval (CI) in parentheses to the right of the section means and comparison group means. While the SD gives an approximate measure of the amount of disagreement among students, the 95% CI shows the impact of the disagreement on the precision of the mean as a way of summarizing responses.
The 95% CI is similar to the "margin of error," a familiar feature of opinion polls which assigns a value, plus or minus, within which the "true" score occurs once all sources of error and disagreement are taken into account. There is a 95% chance that the true score for a question occurs somewhere in the interval between the two values.
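The guide does not state the formula behind the reported intervals. One common approach, shown here purely as an assumption, is the normal approximation: the mean plus or minus about 1.96 standard errors, where the standard error is the SD divided by the square root of the number of responses. The ratings below are invented.

```python
import math
import statistics
from statistics import NormalDist

def confidence_interval(ratings, level=0.95):
    """Normal-approximation CI for the mean (assumed method, not OIRPS's)."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level is 0.95
    return mean - z * standard_error, mean + z * standard_error

# The same pattern of responses in a small section and in a section five times larger.
small = [5, 4, 4, 3, 5, 4, 2, 5]
large = small * 5

small_ci = confidence_interval(small)
large_ci = confidence_interval(large)
print(f"n={len(small)}: {small_ci[0]:.2f} to {small_ci[1]:.2f}")
print(f"n={len(large)}: {large_ci[0]:.2f} to {large_ci[1]:.2f}")
```

With the same level of disagreement, the larger section yields a narrower interval, so its mean is a more precise summary. For very small sections a t-based interval would be somewhat wider than this approximation.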
Reviewing the Comparison Statistics
For spring 2000 and subsequent reports, comparison group statistics appear on the final page along with one or more graphics showing how results for the section compare with results for the comparison group. (This page is titled “TCE Comparison Report.”) For reports issued prior to spring 2000, statistics for the comparison group appear on the Short Report in the column labeled "Comparison Group" (between the section statistics and the columns showing T scores and Percentile Rank Groups (%Rank)).
Comparison group descriptive statistics include the number of sections in the comparison group, the grand mean and its 95% CI, and the median of section means for each question. A comparison group mean is the grand mean of a set of section means, not the mean of student responses pooled across the sections. Similarly, the comparison group SD is the standard deviation of the section means. The median is the halfway point: half of all the means in the comparison group fall above the median and the other half below.
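The distinction between a grand mean of section means and a pooled mean of all student responses matters most when sections differ in size. The sketch below uses three invented sections to illustrate the difference; section names and ratings are hypothetical.

```python
import statistics

# Hypothetical ratings for one question in three sections of different sizes.
sections = {
    "section A": [5, 5, 4, 5],
    "section B": [3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3],
    "section C": [4, 4, 5, 4, 4, 4],
}

section_means = [statistics.mean(r) for r in sections.values()]

# Comparison group statistics as described above: each section counts once.
grand_mean = statistics.mean(section_means)
sd_of_section_means = statistics.stdev(section_means)
median_of_section_means = statistics.median(section_means)

# For contrast: pooling every student response lets large sections dominate.
pooled_mean = statistics.mean([r for ratings in sections.values() for r in ratings])

print(f"grand mean of section means: {grand_mean:.2f}")
print(f"pooled mean of all responses: {pooled_mean:.2f}")
print(f"median of section means:      {median_of_section_means:.2f}")
```

Here the large, lower-rated section pulls the pooled mean down, while the grand mean treats each section as a single observation.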
Systematic Variation in Ratings
Although properly administered student ratings are quite dependable, research shows that there are predictable sources of systematic variation or even bias which should be considered when comparing scores. To address potential concern about three factors known to cause systematic variation in ratings (disciplinary differences, course level and course size), we have based our comparison groups on these variables. As our database grows, other factors may be taken into account. However, research shows that taken together, all the sources of variation listed typically account for less than 5% of variation in overall instructor ratings.
Disciplinary Differences
Significant differences between ratings of courses in different disciplines are well documented. For example, courses in the humanities and fine arts tend to be rated more highly than those in physical and applied sciences. For this reason, most sources agree that ratings should not be compared across disciplines. (If cross-disciplinary comparisons of faculty are necessary, faculty standings within their own comparison groups can be compared.) Unless faculty have recommended combining similar subject areas, our reports always restrict comparisons to the subject area defined by the course subject code, e.g., ANTH, MUSI, POL, etc.
Course Level
Lower division students tend to give the lowest ratings; graduate students tend to give the highest ratings.
Class Size
Small classes (fewer than 20 students) tend to receive the highest ratings, while large classes (40-100) tend to receive the lowest ratings. (Extremely large classes (over 100 students) tend to receive intermediate ratings, which suggests that students may have different criteria for evaluating them.)
Course Status
Students tend to give electives and courses in their majors slightly higher ratings than courses taken to fulfill a college or general education requirement. Research shows a correlation between overall ratings and whether the course is required rather than elective: "requiredness" was negatively associated with overall ratings of instructor (-.15), amount learned (-.18), and course (-.23).
Semester or Summer Session
Summer Session ratings, on average, are significantly higher than fall or spring ratings for comparable courses at UA. Thus, unless otherwise noted, comparison groups do not include Summer Session data.
Course Content
Differences in ratings are occasionally associated with course content. For example, courses with quantitative content may receive slightly lower ratings than other courses at the same level in the same subject area. Similarly, courses that challenge strongly held beliefs may receive lower ratings from some students.
Years of Teaching Experience
Instructors with less than one year of experience tend to receive the poorest ratings. Teachers with between three and twelve years of experience tend to receive the best ratings, while those with more than twelve years tend to receive intermediate ratings.
Improper Administration of Questionnaires
Student ratings can be biased by failure to adhere to instructions for administering the questionnaire, such as failure of the instructor to leave the room during administration, failure to preserve student anonymity, administration of the evaluation during finals, and use of prejudicial introductory remarks. (The TCE monitoring system is a strategy to minimize such problems.)
FACTORS THAT HAVE LITTLE IF ANY INFLUENCE ON RATINGS
Scheduling Factors
Time of day and other scheduling factors appear to have little or no influence on ratings. (However, systematic differences in who attends classes at particular times could theoretically have some impact on ratings.)
Students’ Academic Ability
Academic ability as measured by grade point average has little relationship to student ratings. Evidently, poor students are just as appreciative of good teaching as abler students, while good students are just as critical of poor teaching as less able students. However, when there is great variety in students’ prior learning and abilities in a course, the instructor may end up concentrating on one group of students to the exclusion of others. In such a situation, the actual quality of teaching varies within the class and will probably be reflected in the ratings.
Gender
Researchers looking for correlations between ratings and gender have found significant variation, but in both directions. That is, some studies show female faculty receiving higher ratings while others show male faculty receiving higher ratings. In either case, the differences are typically trivial, accounting for less than 2% of the variation in ratings. Female students tend to give slightly higher ratings than male students and some studies have found correlations based on whether student and teacher gender are the same. At UA, female instructors tend to receive higher ratings in most subject areas. If you suspect a systematic pattern of gender bias in ratings for a particular course, please contact OIRPS.
Perceived Difficulty, Workload, and Expected Grades
The relationship between grades and ratings is complex. The preponderance of research evidence shows a very small positive correlation between ratings and expected grades. There is also some evidence that students will tend to give lower ratings when they expect grades lower than they usually get in other courses, as seen in their G.P.A.
A meta-analysis (Cohen, 1981) explored the relationship between overall instructor ratings and student achievement as measured by scores on an independently graded final exam in multiple sections of the same class taught by different instructors. Cohen found that students who received high scores on the final tended to rate their instructors highly (regardless of the instructor), suggesting that successful students tend to credit their instructors for their success.
Part 3. Using TCE Results
Student ratings data are used both for improving teaching (formative evaluation) and for making personnel decisions (summative evaluation). These two purposes call for somewhat different approaches to the ratings data. When TCE results are used in summative review, it is critical that standards be explicitly defined and consistently applied. This enables evaluators to make fair and reliable judgments while giving those evaluated a clear indication of the standards they will be judged against.
TCE Results in Summative Evaluation
SELECTING THE BASIS FOR COMPARISON: CRITERION-BASED OR NORM-BASED
Department plans for faculty performance appraisal should include an explicit (written) statement of the basis for judging TCE results. Essentially, there are two choices: criterion-based or norm-based. In criterion-based schemes, the performance of individuals is compared with fixed standards (e.g., ratings over 4.5 are deemed "outstanding"). In a strong teaching department, everyone could be deemed outstanding or excellent since individual scores are not affected by the scores of others. In norm-based schemes, the performance of individuals is compared with that of their peers (e.g., the top 10% of ratings are deemed "outstanding"). Norm-based schemes are conceptually similar to grading on the curve in that standards are relative to that of peers rather than absolute.
Hybrid approaches with both criterial and normative elements are common. One hybrid approach is to use a criterion-based system but to set the criteria normatively, based on past performance within the unit. For example, if the unit average for a question is, over time, 4.1 ±.2, a criterion scheme that characterizes scores between 3.9 and 4.3 as "good" is reasonable. Another hybrid approach is to score some questions using absolute criteria but other questions normatively. Some questions typically produce less variable results than others (e.g., the question that reads, "I was treated with respect in this class") and are therefore poor candidates for normative treatment. Another hybrid approach might involve using a criterial approach for new hires and a normative approach for merit comparison. (Of course, to ensure fairness, the same standards must be applied to all candidates within the category.)
Choosing appropriately between criterion-based and norm-based schemes requires understanding both the properties of the historical data within a unit and the characteristics of ratings as a measure of teaching effectiveness. OIRPS reports support both approaches by providing comparative statistics based on historical data as well as descriptive statistics. However, a normative approach is appropriate only when there is sufficient variation in individual scores (see below). Both approaches require that evaluators have specific quantitative skills; however, normative approaches are somewhat more complex.
USING A CRITERION-BASED APPROACH
In a criterion-based approach, standards are typically anchored to the semantic content of the question responses. Therefore, they can usually be considered face-valid (an exception is described in the next paragraph). An example of criteria based on face-valid interpretation of the ratings scales is presented in Table 2, where criteria describe results for single questions. It may be desirable to weight questions differently depending on instructional characteristics of courses or sets of courses. For example, a unit might decide to weight course materials lower than in-class activities, especially if texts are not fully under the control of the instructor. On the other hand, a unit wishing to credit faculty who develop web resources and elaborate course websites might weight course materials higher, whereas a unit wishing to place special emphasis on improving in-class activities might place extra weight on the ratings for that question as an incentive for improvement.
Although the distribution of ratings data in most disciplines is approximately "normal" (ratings fall in a bell-shaped curve), the bell is almost always lopsided (in statistical terms, has negative skew). Since the mean will generally be higher than the natural center of the scale, a criterion-based approach that anchors the mid-point of a five-point classification scheme at the middle (3.0) may result in "grade inflation" because of this skew. Even when the midpoint of the scale (3.0) is semantically anchored to the word "average," the true average (i.e. the arithmetic mean) is around 3.6 for many questions in UA courses. In other words, someone rated "average" (3.0) by every student would in fact have below average ratings.
The tendency for ratings distributions to be lopsided suggests that students tend to be lenient raters. In general, effective teachers should have overall teaching effectiveness ratings around 4.0 (usually effective). Ratings around 3.0, sometimes effective, should suggest a need for improvement.
USING A NORM-BASED APPROACH
To support units using the normative approach, OIRPS summarizes all of the historical data available within the unit to represent the most reliable "true" average over time. The Comparison Group Summaries provide comparison group norms including mean score equivalents for decile ranks to illustrate the distribution of means for each question within each comparison group. To decide whether a norm-based strategy is appropriate for a given unit, check to ensure that the historical data are sufficiently variable to support meaningful distinctions for individual faculty results. In some units, data may support five classifications for individual results (e.g., outstanding, superior, good, fair, poor) based on the overall unit distribution. In other units, data may support only three classifications (e.g., above average, average, below average). In extreme cases, the data may support no statistically significant difference among instructors. In some cases, a difference between an instructor's mean and a comparison mean may be statistically significant in the technical sense, but so small as to preclude meaningful interpretation. For example, it is often the case that a difference smaller than .4 represents no real difference in teaching performance even if it is statistically significant.
Also, the closer together scores are in a comparison group, the less likely that differences are meaningful. For example, if all scores within a unit fall between 4.5 and 5.0, with an average score of 4.7, it would be misleading to classify those with scores of less than 4.7 as "below average" since a ratings score of 4.5 is very good under any circumstance. In such cases, a criterion-referenced system is strongly recommended.
OIRPS is available to analyze unit data and make recommendations concerning the use of normative statistics or to provide feedback on established unit procedures.
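A unit considering a norm-based scheme could explore its own historical data along the lines sketched below. This is an illustration under stated assumptions, not the method behind the Comparison Group Summaries: the section means are invented, the decile cut points use the default interpolation of Python's `statistics.quantiles`, and the .4 gap is the rule of thumb mentioned above rather than a formal test.

```python
import statistics

# Hypothetical historical section means for one question in one comparison group.
historical_means = [3.2, 3.5, 3.6, 3.8, 3.9, 4.0, 4.0, 4.1,
                    4.2, 4.2, 4.3, 4.4, 4.5, 4.6, 4.8]

# Nine cut points that divide the distribution into ten decile groups.
decile_cuts = statistics.quantiles(historical_means, n=10)

def decile_rank(section_mean, cuts):
    """1 = lowest tenth of the comparison group, 10 = highest tenth."""
    return 1 + sum(section_mean > c for c in cuts)

def practically_different(section_mean, comparison_mean, min_gap=0.4):
    """Rule of thumb from the text: smaller gaps often mean no real difference."""
    return abs(section_mean - comparison_mean) >= min_gap

comparison_mean = statistics.mean(historical_means)
spread = max(historical_means) - min(historical_means)

print("decile cut points:", [round(c, 2) for c in decile_cuts])
print("range of section means:", round(spread, 2))
print("decile rank of a 4.45 section mean:", decile_rank(4.45, decile_cuts))
print("4.45 vs. group mean meaningful?", practically_different(4.45, comparison_mean))
```

If the cut points sit very close together, or the whole range is narrow, decile ranks will separate instructors whose ratings are practically indistinguishable, which is the situation in which a criterion-referenced system is the safer choice.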
Short Form results provide sufficient information to determine if teaching is excellent, acceptable, or problematic, but not enough information to determine causes. In general, for effective teachers, the mean for responses to the overall effectiveness question is 4.0 or higher. If the mean for this question is below 4.0, methods for collecting more detailed information are advisable: either a Long Form questionnaire, a mid-semester focus group evaluation, or an expert observation would be a logical next step. For further information about getting formative feedback about teaching effectiveness, see A Short Guide to Evaluating Teaching, Chapter 2 (page 8): http://aer.arizona.edu/Teaching/docs/shortGuideEval.pdf
Table 2. Sample 5-Point Criterion-Referenced Scheme for Judging TCE Results in Summative Evaluation of Teaching*

| TCE Question | Outstanding | Superior | Good | Fair (Improvement Needed) | Unacceptable |
|---|---|---|---|---|---|
| | 5 points | 4 points | 3 points | 2 points | 1 point |
| What is your overall rating of the instructor's teaching effectiveness (almost always/almost never)? | 5.0 | 4.5 | 4.0 | 3.5 | Less than 3.5 |
| Rate the usefulness of the outside assignments. | 5.0 | 4.5 | 4.0 | 3.5 | Less than 3.5 |
| Rate the usefulness of the in-class activities. | 5.0 | 4.5 | 4.0 | 3.5 | Less than 3.5 |
| What is your overall rating of the instructor's teaching effectiveness compared with others? | These questions should be used with caution because their response scales are worded in comparative terms. There may be systematic variation in student responses associated with the personal history of each student. For example, "as good as usual" may be very good indeed or not depending on what "usual" was for the respondent. If these questions are used, perhaps a 3-point category system (exceeds, meets, or fails to meet expectations) would be more appropriate. | | | | |
| What is your overall rating of this course? | (Same caution as for the preceding question.) | | | | |
| How much do you feel you have learned in this course? | (Same caution as for the preceding question.) | | | | |
| Rate the materials used in this course.** | 5.0 | 4.5 | 4.0 | 3.5 | Less than 3.5 |
| I was treated with respect in this class. | 5.0 | 4.5 | 4.0 | 3.5 | Less than 3.5 |
| Course difficulty | This is not an evaluative question in the sense that difficulty is not inherently good or bad. Instead, consider if the level of perceived challenge is appropriate for the particular learners. | | | | |

* The point values associated with various levels of ratings for each question are provided as examples for converting ratings data into scores associated with a criterion-based classification scheme for summative evaluation. An additional column indicating weights for each question is useful if questions are not given equal weight.
** This item should not be used if the instructor had no control over the choice of textbooks and readings.
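The conversion from question means to points in Table 2, and the optional weighting mentioned in its first footnote, might be sketched as follows. Two things here are assumptions rather than part of the guide: each value in the table is read as the minimum mean needed for that category, and the question means and weights are invented for illustration.

```python
# Assumed reading of Table 2: each listed value is the minimum mean for the category.
CUTOFFS = [
    (5.0, 5, "Outstanding"),
    (4.5, 4, "Superior"),
    (4.0, 3, "Good"),
    (3.5, 2, "Fair (Improvement Needed)"),
]

def classify(mean):
    """Return (points, label) for a question mean."""
    for cutoff, points, label in CUTOFFS:
        if mean >= cutoff:
            return points, label
    return 1, "Unacceptable"

def weighted_score(question_means, weights):
    """Weighted average of points across questions."""
    total_weight = sum(weights[q] for q in question_means)
    weighted_points = sum(classify(m)[0] * weights[q] for q, m in question_means.items())
    return weighted_points / total_weight

# Hypothetical section results and unit-chosen weights.
question_means = {
    "overall teaching effectiveness": 4.6,
    "outside assignments": 4.1,
    "in-class activities": 3.7,
}
weights = {
    "overall teaching effectiveness": 2.0,
    "outside assignments": 1.0,
    "in-class activities": 1.0,
}

for question, mean in question_means.items():
    points, label = classify(mean)
    print(f"{question}: mean {mean} -> {points} points ({label})")

overall = weighted_score(question_means, weights)
print(f"weighted score: {overall:.2f}")
```

Whatever cutoffs and weights a unit adopts, writing them down in this explicit form makes it easier to apply the same standards to everyone evaluated.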
References
Aleamoni, L.M. (1981) Student ratings of instructor and instruction. In J. Millman (ed.), Handbook on teacher evaluation. Beverly Hills, CA: Sage Publications, Inc.
Arreola, R.A. (1995) Developing a Comprehensive Faculty Evaluation System. Bolton, MA: Anker Publishing Company.
Braskamp, L.A. and J.C. Ory (1994) Assessing Faculty Work. San Francisco: Jossey-Bass Publishers.
Centra, J.A. (1975) Colleagues as raters of classroom instruction. Journal of Higher Education, 46: 327-337.
Centra, J.A. (1989) Faculty evaluation and faculty development in higher education. In J.C. Smart (ed.), Higher Education Handbook of Theory and Research, Vol. 5. New York: Agathon Press.
Centra, J.A. (1993) Reflective Faculty Evaluation. San Francisco: Jossey-Bass Publishers.
Cohen, P.A. (1981) Student ratings of instruction and achievement: a meta-analysis of multisection validity studies. Review of Educational Research, 51: 281-309.
Doyle, K.O. (1975) Student Evaluation of Instruction. Lexington, MA: D.C. Heath and Company.
Doyle, K.O. (1983) Evaluating Teaching. Lexington, MA: D.C. Heath and Company.
Franklin, J. and Theall, M. (1989) Who reads ratings: knowledge, attitude, and practice of users of student ratings of instruction. Paper presented at the 1988 Annual Meeting of the American Educational Research Association, San Francisco, CA. (ERIC Document Reproduction Services No. ED 306-241).
Marsh, H. (1992) Student evaluation of university teaching: a multidimensional perspective. In J.C. Smart (ed.), Higher Education Handbook of Theory and Research, Vol. 8. New York: Agathon Press.
Murray, H.G. (1991) Effective teaching behaviors in the college classroom. In J.C. Smart (ed.), Higher Education Handbook of Theory and Research, Vol. 7. New York: Agathon Press.
Seldin, P. et al. (1999) Changing Practices in Evaluating Teaching. Bolton, MA: Anker Publishing Co.
Theall, M. et al., eds. (2001) The Student Ratings Debate: Are They Valid? How Can We Best Use Them? New Directions for Institutional Research, No. 109. San Francisco: Jossey-Bass Publishers.
Teacher-Course Evaluations & Test Scoring, MLK 200, tel. (520) 621-7337, fax (520) 626-4375
Questions? Comments? tceua@email.arizona.edu
All contents copyright © 2005 Arizona Board of Regents