Robust automatic jitter and smoothed bootstrap in R

July 3, 2017

The jitter {base} R function is useful for smoothed bootstrapping, but manual tuning of the jitter amount is often necessary. Should we change the jitter factor, select the alternative jitter amount estimator or directly specify the jitter amount? I thought of using order statistics for robust estimation of the jitter amount using the range-based estimator. This robust automatic jitter is demonstrated for two smoothed bootstrap examples.

Robust automatic jitter

From the R documentation, if jitter amount = NULL (which is the default) then amount = factor * d/5 where d is the smallest difference between adjacent unique x values. This jitter amount is typically smaller than the alternative estimator examined in this post.

If jitter amount = 0 then amount = factor * (max(x) – min(x)) / 50. When the range (max(x) – min(x)) is large, the jitter amount is large. Knowing that the sample range is strongly affected by sample size, outliers and not a robust estimator of scale, I thought to use the interquartile range IQR(x) instead.

To investigate the relationship between the range and IQR(x), I assumed a normal distribution for x and simulated some data below. The sample range increases with the sample size and for smaller sample sizes the range is about three times IQR(x). Accordingly, the proposed robust jitter amount = 1*3*IQR(x)/50 ≈ IQR(x)/17 (with factor = 1).

n IQR Range Range/IQR
10 1.2 2.9 2.8
20 1.2 3.7 3.1
30 1.2 4.1 3.4
40 1.3 4.3 3.4
50 1.3 4.6 3.5
100 1.3 4.9 3.8
Interquartile range (IQR) and range for the normal distribution. For smaller sample sizes, the range is about three times the IQR.

Here’s the code for the above table:

res=matrix(nrow=99, ncol=4) # results matrix
print(paste(“n”,”IQR”, “Range”, “Range/IQR”)) # header
for (i in c(10,20,30,40,50,100)){ # sample sizes
   for (j in 1:99){ # simulated data sample sizes
     x=rnorm(i)
     res[j,1]=I # sample size
     res[j,2]=IQR(x) # inter-quartile range
     res[j,3]=max(x)-min(x) # range
     res[j,4]=(max(x)-min(x))/IQR(x) # ratio of above
   }
   # print results for sample size i
   print(paste(min(res[,1]), round(mean(res[,2]),1), round(mean(res[,3]),1), round(mean(res[,4]),1)))
}

Two examples

In the two examples below, the objective was to estimate the shift in location for some paired data that were strongly skewed and included outliers. I used the median paired difference as a robust estimator for the shift in location and bootstrapping to estimate the confidence interval for the median paired difference.

For small samples, the bootstrap distribution for the median often appears with “holes” in it because the median of a bootstrap resample is usually one of the few observations in the middle of the original sample. Adding a small amount of jitter to each resample helps to fill the “holes”. Adding too much jitter results in wide bootstrap distributions. The examples below show that the proposed automatic jitter is robust against outliers and offers a reasonable compromise between smoothing and detail.

The bootstrap distribution is fairly smooth (bias = +0.04, Percentile 95% CI = −0.8450, −0.1000), jitter amount = NULL had little effect, jitter amount = 0 is perhaps too smooth (bias = +0.03, 95% CI = −0.8750, −0.0816), robust jitter is intermediate (bias = +0.05, 95% CI = −0.8141, −0.0956).

The bootstrap distribution is “holy” and percentile confidence intervals are confined to the spikes (bias = −0.08, Percentile 95% CI = −0.79, −0.13), jitter amount = NULL had little effect, jitter amount = 0 is too smooth because the range (max(x) − min (x)) was inflated by outliers in the data (bias = −0.12, 95% CI = −0.7708, −0.0805), robust jitter has the narrowest confidence interval (bias = −0.09, 95% CI = −0.7597, −0.0776).

Here’s some of the R code for the above plots:

## packages
library(MASS) # for truehist
library(boot) # for bootstrapping

## median functions for boot
# median difference function
dif.median1=function(data, i) {median(data$x1[i]-data$x2[i])}
# median amount=NULL jittered difference function for boot
# this is the R default
dif.median2=function(data,i) {median(jitter(data$x1[i]-data$x2[i]))}
# median amount=0 jittered difference function for boot
# this is the S default
dif.median3=function(data,i) {median(jitter(data$x1[i]-data$x2[i], amount=0))}
# median robust jittered difference function for boot
dif.median4=function(data,i) {median(jitter(data$x1[i]-data$x2[i], amount=myamount))}

## bootstrap median difference
# replace dif.median1 with any of the above median functions
dif.boot=boot(mydata,dif.median1,R=999)
mean(dif.boot$t) # bootstrap estimate, should be close to the median difference
boot.ci(dif.boot) # bootstrap confidence intervals
truehist(dif.boot$t, nbins=”FD”)
abline(v=median(dif.boot$t), lty=”dashed”)