Pinewood Derby

28 Oct 2017

One of my colleagues has his own Pinewood Derby track. He picked it up when he was teaching high school science, and used Pinewood Derby as a science activity. He also uses an electronic timing system that records all of the race data.

Let’s explore!

Get the data

library(dplyr)
library(ggplot2)
library(lme4)
library(magrittr)
library(pixiedust)

options(pixiedust_print_method = "markdown")

load("PinewoodDerby.Rdata")

PinewoodDerby <- 
  PinewoodDerby %>% 
  mutate(car = factor(car))

The columns we'll use are car (an identifier for each car), lane (which lane the car ran in), and time (the finish time for that run).
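If you want a quick look at the structure yourself, something like this will do (just a sanity check on the data frame loaded above):

# Each row is a single run: which car, which lane, and the finish time.
glimpse(PinewoodDerby)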

The first question that was brought up when discussing these data with my colleagues was whether the lanes themselves are biased. That is, is there any evidence that one lane runs faster than another? A simple boxplot (below) doesn't seem to show much evidence of this.

library(ggplot2)
ggplot(data = PinewoodDerby,
       mapping = aes(x = lane,
                     y = time,
                     colour = factor(lane))) + 
  geom_boxplot() + 
  geom_jitter(width = 0.1)

We can use a couple of different models to confirm this finding. A simple approach is to use a linear model.

fit_lm <- 
  lm(time ~ lane, 
     data = PinewoodDerby)

fit_lm %>% 
  dust()

| term        | estimate  | std.error | statistic  | p.value   |
|-------------|-----------|-----------|------------|-----------|
| (Intercept) | 3.5709875 | 0.0953411 | 37.4548689 | 0         |
| lane2       | 0.0813344 | 0.1348326 | 0.6032247  | 0.547461  |
| lane3       | 0.0736063 | 0.1348326 | 0.5459082  | 0.5861098 |
| lane4       | 0.1048    | 0.1348326 | 0.7772599  | 0.4384858 |

Those are pretty high p-values, which don’t give us much reason to believe that the lanes are different. Another approach we could take is to use a mixed effects model. This lets us treat each car as a random effect.

fit_me <- 
  lmer(time ~ (1|car) + lane,
       data = PinewoodDerby)

fit_me %>% 
  dust()

| term                    | estimate  | std.error | statistic  | group    |
|-------------------------|-----------|-----------|------------|----------|
| (Intercept)             | 3.5709875 | 0.0953411 | 37.4548684 | fixed    |
| lane2                   | 0.0813344 | 0.08009   | 1.0155376  | fixed    |
| lane3                   | 0.0736063 | 0.08009   | 0.9190446  | fixed    |
| lane4                   | 0.1048    | 0.08009   | 1.3085285  | fixed    |
| sd_(Intercept).car      | 0.4338744 | NA        | NA         | car      |
| sd_Observation.Residual | 0.3203599 | NA        | NA         | Residual |
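The last two rows of that table are the variance components. If you'd rather pull them straight from the fitted model, VarCorr() does the same thing (nothing new here, just a convenience):

# Standard deviations for the car random intercept and the residual;
# these match the sd_ rows in the table above.
VarCorr(fit_me)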

What’s interesting is that the coefficients didn’t change at all between the two models, but the t-statistics are a bit larger in the mixed model. The mixed model sees a somewhat stronger case for the lanes being a little different, and it gets there by recognizing that there are differences between the cars themselves. In fact, we can get very similar results with a linear model that adjusts for car directly.

fit_lm_alt <- lm(time ~ car + lane, 
                 data = PinewoodDerby)

Unfortunately, this model spends 34 degrees of freedom (because there are so many cars) when there are only 128 observations in the data set. Using that many degrees of freedom risks overfitting the model, and probably isn’t a good idea. Then again, stuffing all of that information into one degree of freedom as a random effect is kind of squishy too, but it accomplishes much the same theoretical adjustment, and statisticians seem comfortable letting that slide.
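If you want to check that the two ways of adjusting for car really land in the same place, you can line the lane coefficients up side by side (a quick sketch; in a balanced design they should be nearly identical):

# Lane coefficients after adjusting for car two different ways:
# once with car as a fixed effect, once as a random intercept.
coef(summary(fit_lm_alt))[c("lane2", "lane3", "lane4"), ]
coef(summary(fit_me))[c("lane2", "lane3", "lane4"), ]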

The next question of interest is whether the mixed effects model really has any advantage over the linear model. We can compare the AIC values of the two models: the linear model has an AIC of 211.12, and the mixed effects model an AIC of 161.20. The mixed effects model appears to have the better fit, though it’s hard to tell from AIC alone whether the improvement is meaningful. For that we’d want a likelihood ratio test, which is a topic for another day.
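Those AIC values come straight from the AIC() extractor. One caveat worth flagging: lmer() fits by REML by default, so for a strict comparison you might refit with REML = FALSE first.

# AIC for each model; smaller is better.
AIC(fit_lm)
AIC(fit_me)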

Ultimately, which model fits better isn’t of much practical use to us, as both models suggest that there isn’t a significant difference between lanes. If they had produced different conclusions, it would be worthwhile to explore further. For now, we can be reasonably confident that race results are not being determined by which lane the cars run on.