One of my colleagues has his own Pinewood Derby track. He picked it up when he was teaching high school science, and used Pinewood Derby as a science activity. He also uses an electronic timing system that keeps produces all of the race data.
library(dplyr) library(ggplot2) library(lme4) library(magrittr) library(pixiedust) options(pixiedust_print_method = "markdown") load("PinewoodDerby.Rdata") PinewoodDerby <- PinewoodDerby %>% mutate(car = factor(car))
The data contains the following columns:
caran ID number for the derby car.
racerInitials of the racer.
racerhave a one-to-one relationship; either one could be used as an identifer for the other.
groupA description of the racing group.
roundRound of competition. Always
1in these data.
heatThe heat number. Each heat consists of four cars. Each car runs once on each lane.
laneThe lane number on which the car was run.
timeThe number of seconds to cross the finish line.
speedThe speed of the vehicle. I’m not sure of the units on this.
placeThe finish place within the heat.
The first question that was brought up when discussing these data with my colleagues was whether the lanes themselves are biased. That is, is there evidence that any lane runs faster than another. A simple boxplot (below) doesn’t seem to show much evidence of this.
library(ggplot2) ggplot(data = PinewoodDerby, mapping = aes(x = lane, y = time, colour = factor(lane))) + geom_boxplot() + geom_jitter(width = 0.1)
We can use a couple of different models to confirm this finding. A simple approach is to use a linear model.
fit_lm <- lm(time ~ lane, data = PinewoodDerby) fit_lm %>% dust()
That’s a pretty high p-value, which doesn’t give us much reason to believe that the lanes are different. Another approach we could take is to use a mixed effects model. This would let us declare each car as a random factor.
fit_me <- lmer(time ~ (1|car) + lane, data = PinewoodDerby) fit_me %>% dust()
What’s interesting between these two models is that the coefficients didn’t change at all between them, but the t-statistics are a bit larger in the mixed model. The mixed model thinks there’s a stronger case for the lanes being a little bit different. It makes this conclusion by recognizing that there are differences between the cars themselves. In fact, we can get very similar results with the linear model
fit_lm_alt <- lm(time ~ car + lane, data = PinewoodDerby)
Unfortunately, this model requires 34 degrees of freedom (because of so many cars), when there are only 128 observations in the data set. Including that many degrees of freedom could overfit the model, and probably isn’t a good idea. Then again, stuffing all of that data into one degree of freedom as a random effect is kind of squishy, but it’s quite the same theoretical adjustment and statisticians seem to be comfortable letting that slide.
The next question of interest is whether the mixed effects model really has any advantage over the linear model. We can compare the AIC values between the two models. This shows the linear model has an AIC of 211.1232236, and the mixed effects model of 161.2001289. The mixed effects model appears to have a slightly better fit, though it’s hard to tell if this is a meaningful increase. For that we’d have to look at a likelihood ratio test, which is a topic for another day.
Ultimately, which model has a better fit isn’t of much use to us, as both models suggest that there isn’t a significant difference between lanes. If they had produced different conclusions, it would be worthwhile to explore further. For now, we can be confident that race results are being determined by which lane the cars run on.