This data set was derived by Douglas and Delampady (1990) from the Eastern Lakes Survey of 1984. It contains measurements of 1789 lakes in three Eastern US regions: Northeast, Upper Midwest and Southeast. Of interest is the dependence of the water pH level () on the calcium concentration in milligrams per liter () and the geographical location ( where =latitude and =longitude). Gu and Wahba (1993a) analyzed this data set using SS ANOVA models. As in Gu and Wahba (1993a), we use a subset of 112 lakes in the southern Blue Ridge mountains area.
We use this data set to illustrate how to fit a cubic spline, a cubic spline with correlated random errors, a SLM and an SS ANOVA model.
First, we fit a cubic spline to using one variable calcium ()
> acid.fit1 <- ssr(ph~t1, rk=cubic(t1), data=acid, scale=T) > summary(acid.fit1) ... GCV estimate(s) of smoothing parameter(s) : 3.84e-06 Equivalent Degrees of Freedom (DF): 8.21 Estimate of sigma: 0.281 > anova(acid.fit1) Testing H_0: f in the NULL space test.value simu.size simu.p-value LMP 0.003634651 100 0.04 GCV 0.008239078 100 0.02
Both p-values from the LMP and GCV tests are small, indicating that a simple linear model is not sufficient to describe the relationship. This can be confirmed by looking at the fitted function and its projections onto and . A linear model is equivalent to the projection onto being zero. The following statements compute posterior means and standard deviations for the projections onto with terms=c(1,1,0), the projections onto with terms=c(0,0,1), and the overall fit with terms=c(1,1,1).
> grid <- data.frame(t1=seq(min(acid$t1), max(acid$t1), len=100)) > tm <- matrix(c(1,1,0,0,0,1,1,1,1), ncol=3, byrow=T) > p.acid.fit1 <- predict(acid.fit1, grid, terms=tm)
Figure shows the fitted function and its projections. The nonlinear part is small, but different from zero.
pH observations close in geographic locations are often correlated. Suppose that we want to use spherical spatial correlation structure with nugget effect for the variable location . That is, regard random errors as a function of , and assume
> acid$s1 <- (acid$t1-min(acid$t1))/diff(range(acid$t1)) > acid.fit2 <- ssr(ph~s1, rk=cubic(s1), data=acid, corr=corSpher(form=~x1+x2,nugget=T), spar="m") > acid.fit2 ... Coefficients (d): (Intercept) s1 6.221757 1.264431 GML estimate(s) of smoothing parameter(s) : 4.645508 Equivalent Degrees of Freedom (DF): 2.000284 Estimate of sigma: 0.2994075 Correlation structure of class corSpher representing range nugget 0.03646616 0.68530949
We plot in Figure fitted curves from acid.fit1, acid.fit2 and acid.fit5. Notice the effect of the correlation on the smoothing parameter and the fit. Equivalent degrees of freedom for decreased from 8.21 to 2.00. The new fit is linear. The smaller smoothing parameter in the first fit might be caused by the spatial correlation.
We can also model the effect of location directly as a covariate. We consider two approaches to model location effect: use random effects with exponential spatial correlation structure (similar to Kriging), and use thin plate splines.
For the first approach, we consider
Denote as the vector of design points for . Let be the vector of the process evaluated at design points. are random effects and , where depends parameters and . Again, the SLM model () cannot be fitted directly using slm since depends on the range parameter nonlinearly. We fit model () in two steps. We first regard as part of random error, estimate the range parameter, and calculate the estimated covariance matrix without nugget effect:
> temp <- ssr(ph~t1+x1+x2, rk=tp(t1), data=acid, corr=corExp(form=~x1+x2, nugget=T), spar="m") > tau <- coef(temp$cor.est, F) > D <- corMatrix(initialize(corExp(tau[1],form=~x1+x2), data=acid))
Consider the estimated as the true covariance matrix. Then we can calculate the Cholesky decomposition of as , and transform the random effects , where . Now we are ready to fit the transformed SLM:
> Z <- chol.new(D) > acid.fit3 <- slm(ph~t1+x1+x2, rk=tp(t1), data=acid, random=list(pdIdent(~Z-1))) > summary(acid.fit3) ... Coefficients (d): (Intercept) t1 x1 x2 6.5483884 0.6795706 -8.3694493 2.1135023 GML estimate(s) of smoothing parameter(s) : 0.0005272899 Equivalent Degrees of Freedom (DF): 4.000179 Estimate of sigma: 0.002725062
We then calculate the estimated effect of calcium:
grid1 <- data.frame(t1=seq(min(acid$t1), max(acid$t1), len=100), x1=min(acid$x1), x2=min(acid$x2)) p.acid.fit3.t1 <- intervals(acid.fit3, grid1, terms=c(0,1,0,0,1))
Suppose that we want to calculate the location effect on grid points . Let be the vector of the process evaluated at elements in . Let . Then .
grid2 <- expand.grid( x1=seq(min(acid$x1)-.001,max(acid$x1)+.001, len=20), x2=seq(min(acid$x2)-.001,max(acid$x2)+.001, len=20)) newdata <- data.frame(y1=c(acid$x1,grid2$x1), y2=c(acid$x2,grid2$x2)) RD <- corMatrix(initialize(corExp(tau[1], form=~y1+y2), data=newdata)) R <- RD[(length(acid$x1)+1):length(newdata$y1),1:length(acid$x1)] u.new <- R%*%t(solve(Z))%*%as.vector(acid.fit3$lme.obj$coef$random[[2]]) p.acid.fit3.t2 <- acid.fit3$lme.obj$coef$fixed[3]*grid2$x1+ acid.fit3$lme.obj$coef$fixed[4]*grid2$x2+u.new
Figure plots the estimated main effects of and on the left panel.
As the second approach, we consider the same thin plate spline
model as in Gu and Wahba (1993a). Specifically,
we use a TPS with and to model
the effect of calcium, and a TPS with and to model
the effect of location. Then we have the following
SS ANOVA model
> acid.fit4 <- ssr(ph~t1+x1+x2, rk=list(tp(t1), tp(list(x1,x2)), rk.prod(kron(t1),tp(list(x1,x2))), rk.prod(kron(x1),tp(t1)), rk.prod(kron(x2),tp(t1)), rk.prod(tp(t1),tp(list(x1,x2)))), data=acid, spar="m") > summary(acid.fit4) ... Coefficients (d): (Intercept) t1 x1 x2 6.555506 0.6235892 -8.783944 1.974596 GML estimate(s) of smoothing parameter(s) : 1.816737e+04 3.600865e-03 2.710276e+01 2.759507e+00 1.992876e+01 1.440841e-01 Equivalent Degrees of Freedom (DF): 10.7211 Estimate of sigma: 0.2560693
Since the smoothing parameters corresponding to the interaction terms are large, it is easy to check that these interaction terms are small. Thus we reduce to the following additive model with main effects only
> acid.fit5 <- update(acid.fit3, rk=list(tp(t1), tp(list(x1,x2)))) > summary(acid.fit5) ... ... Coefficients (d): (Intercept) t1 x1 x2 6.555502 0.6235885 -8.783569 1.974339 GML estimate(s) of smoothing parameter(s) : 6.045750e+03 3.600423e-03 Equivalent Degrees of Freedom (DF): 10.72108 Estimate of sigma: 0.2560684 > grid3 <- data.frame(t1=seq(min(acid$t1), max(acid$t1), len=100), x1=min(acid$x1), x2=min(acid$x2)) > p.acid.fit5.t1 <- predict(acid.fit5, grid3, terms=c(0,1,0,0,1,0)) > grid4 <- expand.grid(t1=min(acid$t1), x1=seq(min(acid$x1), max(acid$x1), len=20), x2=seq(min(acid$x2), max(acid$x2), len=20)) > p.acid.fit5.t2 <- predict(acid.fit5, grid4, terms=c(0,0,1,1,0,1))
Figure plots the estimated main effects of and on the right panel. It is seen that the estimates of the calcium main effects are almost identical. The estimates of the location main effects have a similar pattern. The estimate from acid.fit5 is smoother.
> grid3 <- data.frame(t1=seq(min(acid$t1), max(acid$t1), len=100), x1=min(acid$x1), x2=min(acid$x2)) > p.acid.fit5.t1 <- predict(acid.fit5, grid3, terms=c(0,1,0,0,1,0)) > grid4 <- expand.grid(t1=min(acid$t1), x1=seq(min(acid$x1), max(acid$x1), len=20), x2=seq(min(acid$x2), max(acid$x2), len=20)) > p.acid.fit5.t2 <- predict(acid.fit5, grid4, terms=c(0,0,1,1,0,1))