Explanation of Model Performance Scores



The maps of model performance are based on seasonal (3-month averaged) anomalies. Anomalies are departures from average seasonal conditions. Model anomalies are computed with respect to the model climatology, and observed anomalies with respect to the observed climatology. The model simulations of the climate have been forced at the lower boundary by globally observed sea surface temperatures (SSTs). Maps are currently available for anomaly correlations and R.O.C. (Relative Operating Characteristics) scores.
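As a minimal sketch of the anomaly definition above, the following computes anomalies for a single grid point, each series relative to its own climatology. The rainfall values and the function name `anomalies` are hypothetical and chosen only for illustration; the climatology is assumed here to be the simple mean over all available years.

```python
from statistics import mean


def anomalies(seasonal_values):
    """Return departures of each seasonal value from the series' own mean.

    Assumption: the climatology is the mean over all years supplied.
    """
    climatology = mean(seasonal_values)
    return [value - climatology for value in seasonal_values]


# Hypothetical 3-month rainfall totals (mm) at one grid point, one per year.
observed = [100.0, 120.0, 80.0, 110.0, 90.0]
simulated = [150.0, 165.0, 140.0, 160.0, 135.0]  # model with a wet bias

obs_anom = anomalies(observed)    # relative to observed climatology
sim_anom = anomalies(simulated)   # relative to model climatology
```

Because each series is referenced to its own climatology, the constant wet bias in the hypothetical model does not appear in its anomalies.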

ANOMALY CORRELATIONS:

The anomaly correlations are temporal correlations, at each grid point, between observed and simulated anomalies. The model simulated value is the ensemble average of 10 ensemble members for ECHAM3 and CCM3 and 13 ensemble members for NCEP-MRF9. The individual ensemble members differ from one another only in the initial state of the atmosphere (i.e. weather). Statistical significance of the correlations is determined by comparing the actual correlation with the correlations recalculated for 1000 random shufflings of the years, which indicates the probability of obtaining the actual correlation by chance. For a 90 per cent confidence level, less than a 10 per cent probability exists that the actual correlation occurred by chance.
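The procedure above can be sketched for a single grid point as follows. This is an illustration under stated assumptions, not the code used to produce the maps: the data are hypothetical, only two ensemble members are shown for brevity, and the test is assumed to be one-sided (counting shuffles whose correlation is at least as large as the actual one).

```python
import math
import random


def correlation(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ensemble_mean(members):
    """Average the ensemble members year by year."""
    return [sum(values) / len(values) for values in zip(*members)]


def shuffle_test(obs, sim, n_shuffles=1000, seed=0):
    """Return the actual correlation and the fraction of shuffles matching it.

    Assumption: one-sided test; the observed years are shuffled while the
    simulated series is left in place.
    """
    rng = random.Random(seed)
    actual = correlation(obs, sim)
    shuffled = list(obs)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if correlation(shuffled, sim) >= actual:
            at_least_as_large += 1
    return actual, at_least_as_large / n_shuffles


# Hypothetical anomalies for ten years at one grid point.
obs = [0.0, 20.0, -20.0, 10.0, -10.0, 5.0, -5.0, 15.0, -15.0, 0.0]
members = [
    [1.0, 12.0, -9.0, 4.0, -6.0, 3.0, -1.0, 8.0, -7.0, -1.0],
    [-1.0, 8.0, -11.0, 6.0, -4.0, 1.0, -3.0, 6.0, -9.0, 1.0],
]
sim = ensemble_mean(members)
r, p_chance = shuffle_test(obs, sim)
significant_at_90 = p_chance < 0.10
```

Here `p_chance` estimates the probability of obtaining a correlation at least as high as the actual one by chance, so a value below 0.10 corresponds to the 90 per cent confidence level.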

A perfect correlation between observed and simulated variability is 1.0. A perfect correlation is effectively impossible to obtain, however, since Nature has a component of weather/noise that cannot be separated from the boundary (SST)-forced climate signal suggested by the ensemble-averaged model results. For regions where the changes in SST do influence changes in the climate and where the model simulates the physics of the climate variability well, a high (e.g. statistically significant) correlation generally exists. High [positive] correlations indicate that the model simulated anomalies are mimicking the observed anomalies, with the right sign (positive or negative departures from average) and with the correct relative amplitudes. The magnitude of the simulated variability need not be equal to the observed variability. Anomaly correlations do not always give a true sense of a model's potential; for example, simulated anomalies that are similar to observed in being near average, but with the wrong sign, degrade the correlation. Also, a model may do well under certain circumstances, such as during wet years, but not in others, and this information will be lost in a measure such as anomaly correlation.

R.O.C. SCORES:

The relative (or receiver) operating characteristic (ROC) curve (Swets, 1973; Mason, 1982; Harvey, et al., 1992; Mason and Graham, 1999) is a method of representing forecast skill that is based on a simple 2x2 contingency table. An event of interest is pre-defined, and can be either dichotomous (e.g. the occurrence or non-occurrence of precipitation), or defined from continuous data by exceedance over a threshold (e.g. greater than 120 per cent of average rainfall), or between two limits (e.g. within 100 mm of average rainfall). A series of warnings or no-warnings for the event of interest is compared with the series of events and non-events. The ROC curve compares hit and false alarm rates, which respectively indicate the proportion of events for which a warning was provided correctly, and the proportion of non-events for which a warning was provided incorrectly. The hit rate is sometimes known as the probability of detection, or prefigurance (Olson, 1965; Panofsky and Brier, 1965; Murphy and Winkler, 1987; Doswell, et al., 1990; Harvey, et al., 1992; Wilks, 1995), and provides an estimate of the probability that an event will be forewarned. The false-alarm rate estimates the probability that for a non-event a warning will be provided incorrectly.
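The 2x2 contingency table and the two rates can be sketched as below. The rainfall values, warnings, and function names are hypothetical; the event is defined by exceedance of 120 per cent of an assumed average, following the example in the text. The sketch assumes the series contains at least one event and one non-event, otherwise the rates are undefined.

```python
def exceeds_threshold(values, average, fraction=1.2):
    """Flag an event wherever a value exceeds `fraction` times the average."""
    return [value > fraction * average for value in values]


def hit_and_false_alarm_rates(warnings, events):
    """Build the 2x2 contingency table and return (hit rate, false-alarm rate)."""
    pairs = list(zip(warnings, events))
    hits = sum(1 for w, e in pairs if w and e)
    misses = sum(1 for w, e in pairs if not w and e)
    false_alarms = sum(1 for w, e in pairs if w and not e)
    correct_rejections = sum(1 for w, e in pairs if not w and not e)
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_rejections)
    return hit_rate, false_alarm_rate


# Hypothetical seasonal rainfall (mm) with an assumed average of 100 mm.
rainfall = [130.0, 90.0, 125.0, 100.0, 140.0, 80.0, 95.0, 150.0, 105.0, 85.0]
events = exceeds_threshold(rainfall, average=100.0)

# Hypothetical series of warnings (True) and no-warnings (False).
warnings = [True, False, True, True, False, False, False, True, False, False]

hit_rate, false_alarm_rate = hit_and_false_alarm_rates(warnings, events)
```

In this example three of the four events were forewarned, and one of the six non-events received a warning.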

For a system that has no skill, the warnings and events are by definition independent occurrences, and so the probability that a warning was provided is not contingent upon an event occurring or not occurring. In other words, the probability that a warning was provided is unrelated to the outcome. Therefore, when there is no skill the hit and false-alarm rates are both equal to the prior probability of a warning being provided (Murphy and Winkler, 1987). This equality occurs when warnings are issued randomly, and when perpetual warnings or no-warnings are provided. When the forecast system has some skill, the hit rate exceeds the false-alarm rate; negative skill is indicated when the false-alarm rate exceeds the hit rate.
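The independence argument can be written out as a short worked equation. The symbols W (a warning is provided), E (the event occurs) and \bar{E} (the event does not occur) are notation introduced here for illustration. If W is independent of E, then:

```latex
\begin{align*}
\text{hit rate} &= P(W \mid E) = \frac{P(W \cap E)}{P(E)} = \frac{P(W)\,P(E)}{P(E)} = P(W), \\
\text{false-alarm rate} &= P(W \mid \bar{E}) = \frac{P(W \cap \bar{E})}{P(\bar{E})} = \frac{P(W)\,P(\bar{E})}{P(\bar{E})} = P(W).
\end{align*}
```

For perpetual warnings P(W) = 1, so both rates equal 1; for perpetual no-warnings P(W) = 0, so both rates equal 0.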

For probabilistic forecasts, a warning can be issued when the forecast probability for a pre-defined event exceeds some threshold (Mason, 1979). Different warning thresholds can be used for the pre-defined event, and a set of hit and false-alarm rates can then be determined. This set of hit rates is plotted against the corresponding false-alarm rates to generate the ROC curve. While there are a number of indices for summarizing the performance (Mason, 1982), the area under the curve is the most commonly used (and simplest to calculate), and has become known as the ROC score (Mason and Graham, 1999).
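The construction of the curve and its area can be sketched as follows. The forecast probabilities, outcomes, and thresholds are hypothetical, and the area is computed here with the trapezoidal rule after adding the end points (0, 0) and (1, 1); that integration rule is an assumption of this sketch rather than a method stated in the text.

```python
def roc_points(probabilities, events, thresholds):
    """Return one (false-alarm rate, hit rate) pair per warning threshold.

    A warning is issued when the forecast probability exceeds the threshold.
    """
    n_events = sum(1 for e in events if e)
    n_non_events = len(events) - n_events
    points = []
    for threshold in thresholds:
        hits = sum(
            1 for p, e in zip(probabilities, events) if p > threshold and e
        )
        false_alarms = sum(
            1 for p, e in zip(probabilities, events) if p > threshold and not e
        )
        points.append((false_alarms / n_non_events, hits / n_events))
    return points


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule (an assumed choice)."""
    curve = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical forecast probabilities and whether the event then occurred.
probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
events = [True, True, False, True, False, True, False, False]

points = roc_points(probabilities, events, thresholds=[0.25, 0.5, 0.75])
score = roc_area(points)
```

Lowering the warning threshold raises both the hit rate and the false-alarm rate, which traces the curve from the lower left to the upper right.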

In general, for skilful forecast systems, the ROC curve bends toward the top left, where hit rates are larger than false-alarm rates, and the total area under the curve is then greater than 0.5. Where the curve lies close to the diagonal (the line along which hit and false-alarm rates are equal), the forecast system does not provide any useful information, and the area beneath the curve is approximately 0.5. If the curve lies below the diagonal, negative skill is indicated.

References:

Doswell, C. A., R. Davies-Jones, and D. L. Keller, 1990: On summary measures of skill in rare event forecasting based on contingency tables. Weather and Forecasting, 5, 576-585.

Harvey, L. O., K. R. Hammond, C. M. Lusk, and E. F. Mross, 1992: The application of signal detection theory to weather forecasting behavior. Monthly Weather Review, 120, 863-883.

Mason, I., 1979: On reducing probability forecasts to yes/no forecasts. Monthly Weather Review, 107, 207-211.

Mason, I., 1982: A model for assessment of weather forecasts. Australian Meteorological Magazine, 30, 291-303.

Mason, S. J. and N. E. Graham, 1999: Conditional probabilities, relative operating characteristics and relative operating levels. Weather and Forecasting, in press.

Murphy, A. H. and R. L. Winkler, 1987: A general framework for forecast verification. Monthly Weather Review, 115, 1330-1338.

Olson, R. H., 1965: On the use of Bayes theorem in estimating false alarm rates. Monthly Weather Review, 93, 557-558.

Swets, J. A., 1973: The relative operating characteristic in psychology. Science, 182, 990-1000.

Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.