Sunday, 1 January 2012

Rohde et al. 2011ish - Berkeley Earth Temperature Averaging Process

Rohde et al. (2011) describe a process for estimating the temperature at any point on the Earth’s land surface using a discontinuous network of stations of any geographical distribution. The method was applied to the GHCN network and an estimate of the global average temperature was calculated back to 1800. Uncertainties in the global average that account for spatial sampling and data errors were also calculated.
The paper is important in two ways. Firstly, from a scientific perspective, it takes a new and statistically innovative look at the problem of estimating global land temperatures. That it confirms global average land surface air temperature trends is unsurprising; its greatest scientific impact will be in helping to elucidate trends and variability at smaller scales, where uncertainties are much larger. The record is 50 years longer than the longest current estimate, produced by CRU, and the proposed uncertainty range is narrower than those estimated for other data sets. Secondly, it has a role to play in the wider discussion of climate change, about which I will say nothing other than to acknowledge it.
The paper has not yet been peer-reviewed in the conventional sense, but has instead been placed on the web, along with a bundle of data and code, in parallel with the journal submission. A number of other informal reviews already exist and my comments will no doubt overlap with some of them.
The first thing to note is that the global average land surface air temperature produced by the Berkeley group is very close to those produced by NASA GISS, NOAA NCDC and Hadley CRU, at least as far back as 1900. This is unsurprising. Before 1900 the global network thins out significantly and the four estimates diverge. In the Rohde paper, the post-1900 agreement does not appear as strong because the four data sets are not compared in an exactly like-for-like fashion. The GISS and CRU estimates shown are estimates of the global average temperature based on land stations rather than estimates of the global average land temperature. As I understand it, the GISS estimate has been corrected, but the CRU estimate is still the wrong one for comparison.
Before 1900, the uncertainties are generally larger, but there is a persistent decadal-scale difference in the data, with the Berkeley data set running cooler than the other data sets. This difference is most noticeable in comparison with the longer CRU data set, which is around 0.5 K warmer in 1850.
What did they do?
The averaging process is based on Kriging, a method that deals naturally with uneven geographical sampling. Temperatures are decomposed into a global average, a climatological average, and a local temperature anomaly. The climatological average is further broken down as a function of elevation and latitude (which together explain 95% of the variance), with the remaining local deviations modelled using a simple local correlation function. The local temperature anomalies are likewise assumed to follow a simple local correlation function, one that decays more or less exponentially with distance.
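To make the kriging idea concrete, here is a minimal sketch in Python. It is not the Berkeley code. It assumes stations strung along a one-dimensional line, anomalies with zero mean and unit variance, and a purely exponential correlation with a made-up 1000 km length scale; the function names (correlation, solve, krige) are mine and purely illustrative.

```python
import math

def correlation(d, length_scale=1000.0):
    """Assumed correlation between two points d km apart (pure exponential decay)."""
    return math.exp(-d / length_scale)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def krige(station_pos, station_anom, target):
    """Simple kriging: weights w solve K w = k0, and the estimate is sum(w * anomaly)."""
    k = [[correlation(abs(p - q)) for q in station_pos] for p in station_pos]
    k0 = [correlation(abs(p - target)) for p in station_pos]
    w = solve(k, k0)
    return sum(wi * ai for wi, ai in zip(w, station_anom)), w

# Hypothetical stations at 0, 500 and 2000 km with made-up anomalies (K).
estimate, weights = krige([0.0, 500.0, 2000.0], [1.0, -0.5, 0.3], 1200.0)
```

Asking for the estimate at a station's own position returns that station's value, and far from every station the estimate relaxes to the assumed mean of zero. The weights depend only on the station geometry and the correlation function, which is why the choice of that function matters so much.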
A number of add-ons are included to deal with station moves, stations that are unrepresentative of the local climate, and data outliers. Station moves are handled with what the authors call the ‘scalpel’, which cuts station records where a neighbour comparison implies a discontinuity in the series. The two fragments either side of the cut are then treated as individual stations. Non-representative series (defined as those that diverge from the estimated value at the location of the station) are down-weighted iteratively.
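A toy version of the scalpel idea might look like the sketch below. It is my own simplification, not the authors' algorithm: it assumes a single possible break per record, a ready-made neighbour composite (reference), and an arbitrary 0.5 K threshold. The names find_break and scalpel are hypothetical.

```python
def find_break(station, reference, min_len=3):
    """Find the split that maximises the step in the mean of (station - reference)."""
    diff = [s - r for s, r in zip(station, reference)]
    best_idx, best_step = None, 0.0
    for i in range(min_len, len(diff) - min_len + 1):
        left = sum(diff[:i]) / i
        right = sum(diff[i:]) / (len(diff) - i)
        step = abs(right - left)
        if step > best_step:
            best_idx, best_step = i, step
    return best_idx, best_step

def scalpel(station, reference, threshold=0.5):
    """Cut the record in two if the step against the reference exceeds the threshold."""
    idx, step = find_break(station, reference)
    if idx is None or step < threshold:
        return [station]  # no convincing discontinuity: keep the record whole
    return [station[:idx], station[idx:]]  # fragments treated as separate stations

# Made-up example: a 1 K jump halfway through, against a flat neighbour composite.
fragments = scalpel([0.0] * 5 + [1.0] * 5, [0.0] * 10)
```

Even in this toy form the analysis choices are visible: the minimum fragment length, the threshold and the construction of the reference series all influence where, and whether, a record is cut.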
These components are poured into the statistical meat grinder and an estimate of the global average temperature pops out. Although the authors claim to have removed the need for gridding the data, they evaluate the integrals over the Earth’s surface numerically, which amounts to the same thing.
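The point about numerical integration can be illustrated with a short sketch. This is a generic midpoint-rule average on a latitude-longitude grid with cos(latitude) area weights, assumed for illustration only; I do not know which quadrature the authors actually use, and a land-only average would additionally need a land mask. The function name global_mean and the grid sizes are mine.

```python
import math

def global_mean(field, n_lat=90, n_lon=180):
    """Area-weighted mean of field(lat, lon) over a sphere, evaluated at cell centres."""
    total = 0.0
    weight_sum = 0.0
    for i in range(n_lat):
        lat = -90.0 + (i + 0.5) * 180.0 / n_lat
        w = math.cos(math.radians(lat))  # cell area shrinks towards the poles
        for j in range(n_lon):
            lon = -180.0 + (j + 0.5) * 360.0 / n_lon
            total += w * field(lat, lon)
            weight_sum += w
    return total / weight_sum

# A made-up smooth field: warm at the equator, cold at the poles (degrees C).
mean_temp = global_mean(lambda lat, lon: 30.0 * math.cos(math.radians(lat)) - 5.0)
```

Evaluating an interpolated field at a regular set of points and weighting by area is, in practice, a grid, whatever it is called.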
To estimate data-based uncertainties they use subsets of the data and assess how the spread of the estimates based on fewer stations affects the global average. They also make a separate estimation of the spatial sampling uncertainty by applying the historical spatial weighting functions that are output by their averaging process to later, more completely sampled epochs.
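As a rough illustration of the subset idea, the sketch below shuffles the stations into groups, recomputes a plain average with each group left out in turn, and reports the spread of those estimates. The group count of 8, the plain mean standing in for the full averaging process, and the name subsample_spread are all assumptions of mine; the authors' procedure, and any scaling they apply to turn the spread into an uncertainty, may differ.

```python
import random
import statistics

def subsample_spread(values, n_groups=8, seed=0):
    """Mean and spread of averages recomputed with one group of stations left out."""
    rng = random.Random(seed)
    idx = list(range(len(values)))
    rng.shuffle(idx)  # random assignment of stations to groups
    estimates = []
    for g in range(n_groups):
        kept = [values[i] for k, i in enumerate(idx) if k % n_groups != g]
        estimates.append(statistics.mean(kept))
    return statistics.mean(estimates), statistics.pstdev(estimates)

# Made-up station anomalies (K) for a single month.
rng = random.Random(42)
anomalies = [rng.gauss(0.3, 1.0) for _ in range(400)]
centre, spread = subsample_spread(anomalies)
```

Note what this kind of test can and cannot see: it measures sensitivity to which stations are included, but an error shared by all stations leaves every subset estimate shifted by the same amount and so does not show up in the spread.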
By-products of the process include an estimate of the annual climatological land temperature (about 8.5 ± 0.5 °C), and an estimate of the bias at each station.
The chief criticism I have concerns their uncertainty analysis. They claim to have narrowed the uncertainty range for global average land temperature estimates, but at certain points in the manuscript they acknowledge uncertainties that they explicitly do not tackle or cannot tackle on their own. These include the problems associated with prevalent and widespread biases, the fixed parameters and analysis choices within their framework, and the structural uncertainty. All of these problems will have a greater effect the further back in time the analysis is taken.
The most obvious fixed parameter is the correlation function used to do the kriging. This involves a whole suite of choices, principally the choice to use a 4th-order polynomial in the exponential and the choice to use correlations rather than covariances. As they note, this latter choice makes sense if variances do not vary rapidly with distance. That assumption is not supported in the text and is likely to have a greater effect in the earlier record, where the stations are predominantly coastal and are being used to infer continental temperature variability. Fixing the form of the correlation function hides a lot of local behaviour, which is nevertheless shown in their Figures 2 and 3.
It is also not clear what their correlation function actually shows. It is calculated from pairs of neighbouring stations and the correlation at zero separation is interpreted in terms of data error: two stations at the same notional location would exhibit different variability due to the exact circumstances of the station siting. However, in their formulation there is no allowance for data error, and the correlation function should represent the underlying temperature field that they are trying to estimate rather than the measured temperature field (the true temperatures will generally have higher correlations than the measured temperatures).
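The parenthetical claim follows from a standard result, sketched here under simplifying assumptions of my own: both stations have the same true variance, and their measurement errors are independent of each other and of the true temperatures, with equal variance. In that case the measured correlation is the true correlation scaled by signal variance over total variance. The function names are hypothetical.

```python
def measured_correlation(true_corr, signal_var, error_var):
    """Correlation between two noisy series whose true values correlate at true_corr."""
    return true_corr * signal_var / (signal_var + error_var)

def implied_error_var(corr_at_zero, signal_var):
    """Error variance implied by a measured correlation below 1 at zero separation."""
    # At zero separation the true correlation is 1, so r0 = s / (s + e).
    return signal_var * (1.0 - corr_at_zero) / corr_at_zero

# Made-up numbers: true correlation 0.9, signal variance 1.0 K^2, error variance 0.15 K^2.
r_measured = measured_correlation(0.9, 1.0, 0.15)
```

With any non-zero error variance the measured correlation sits below the true one at every separation, so a function fitted to station pairs describes the measured field, not the underlying field being estimated.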
Other choices are the use of neighbour composites to decide where station breaks are for the scalpel, and the ad hoc weighting procedures for the station reliability and outlier assessments. It has already been noted elsewhere that, as with most unsupervised algorithms, the scalpel occasionally makes unusual changes to stations that seem counterintuitive.
Exactly how important these choices – and others – are is impossible to ascertain by simply reading the text. The method is too complex for me to imagine my way through it. As it is, the uncertainty ranges, particularly in the early data where biases are expected to be larger, seem too narrow and the fields that are produced seem too smooth. This leaves a large question mark hanging over the larger variability in the early 19th century and the usefulness of the analysis at small scales. Without a more thorough quantification of the sensitivity of the analysis to choices and parameter settings, it is not possible to place a great deal of trust in the estimates of the earliest data or the smaller scale features.
Regarding their interpretation of their analysis as being the most comprehensive and the best: this may well be true, but it does not get them over the problem that there is uncertainty inherent in their choice of processing algorithms. Even if the other data sets are inferior, and there are no grounds for supposing that they are, they still help to map out the structural uncertainty because their approaches to the problem are all very different.
The authors note that another way to assess structural uncertainty is to look at factors that are thought to affect the quantity in question. They make reference to two other submitted publications (also found on their website) which deal with urbanisation and with station siting in the US. These other analyses will be dealt with separately, but do not, I think, shed a great deal of light on these already well-illuminated areas of study.
A new method has been devised which offers a great opportunity to more fully understand the uncertainties in estimates of global average land temperature going back, for the first time, to 1800. However, as might be expected with a new approach, it is not clear that the potential has been realised. Without a more thorough assessment of the algorithm’s behaviour and associated uncertainties it is not possible to assess its success in reducing those uncertainties and sharpening our view of historical climate change.
Where it fits into our understanding is of interest, although it is perhaps too early to say. The method is closest methodologically to GISS, using local structure rather than large-scale structure to interpolate the data. Also like GISS, the analysis makes use of shorter data records. As with GISS, the fields produced have a certain smoothness that will underestimate local variability. This is possibly more representative of the platonic-ideal large-scale temperature fields that these groups are trying to assess, but that is a matter for debate (what the hell are they actually measuring?). The lack of smoothing in the CRU data set suggests that this might still be a better choice for looking at small-scale variability, but the analysis will be susceptible to micro-siting issues. The NOAA analysis, by making use of teleconnections to reconstruct the large-scale temperature field, might be the more reliable record earlier on. There are various ways to test these suppositions and the ISTI plans to look at some of these using idealised benchmark tests.
