Replacing the knitr engine for Stan
2020-08-03 UPDATE: Good news! A version of this engine is now included in versions 0.1.1 and later of {CmdStanR}. Use
cmdstanr::register_knitr_engine()
at the top of the R Markdown document to register it as the engine for stan
chunks. See the vignette
R Markdown CmdStan Engine for examples. Shoutout to the maintainers Jonah Gabry & Rok Češnovar for a super positive code review experience with the pull request for this.
I originally dabbled with custom {knitr} engine creation last month, when I made {dotnet} which enables R Markdown users to write chunks with C# and F# programs in them.
If you’re interested in creating your own, the R Markdown Cookbook has a chapter on that. You can also check out the built-in engines, {reticulate}’s Python engine, {JuliaCall}’s Julia engine, and {dotnet}’s .NET engine for examples.
If you browse the built-in engines, you’ll come across one for Stan and you’ll see that it uses {RStan}, which may be an issue on some computers. What would a CmdStan & {CmdStanR}-based engine for Stan look like if we wanted these benefits:
Is there a good summary somewhere of advantages of it over RStan?
Biggest advantages of CmdStanR (pronounced “command stan R”) are:
- New Stan features available immediately vs. after long delay with RStan
- Fewer R/RStudio crashes
More here
— Stan developers, July 30, 2020
Let’s make one!
Prerequisites
See Getting started with CmdStanR for general installation instructions and Max Mantei’s blogpost for installation on Windows.
library(knitr)
library(cmdstanr)
Engine
We want to keep the essence of the feature
the Stan model within the code chunk is compiled into a
stanmodel
object, and is assigned to a variable with the name given by theoutput.var
option.
from the existing Stan engine.
The resulting objects will be fundamentally different because {RStan} creates in an in-memory Stan model, while {CmdStanR} creates an object that holds the path to an on-disk executable, but the idea is the same – we should be able to write Stan model code in an R Markdown chunk and get back an object in R that we can use for sampling from the joint posterior distribution in the same document, like so:
```{stan, output.var="ex1"}
// example from r markdown manual
```
```{r}
fit <- ex1$sample()
fit$summary()
```
In addition to that, we want to be able to cache the chunk and keep the compiled executable so we wouldn’t have to compile every time the R Markdown document is knit/rendered.
knit_engines$set(
stan = function(options) {
code <- knitr:::one_string(options$code)
out <- options$output.var
if (!is.character(out) || length(out) != 1L) stop(
"the chunk option output.var must be a character string ",
"providing a name for the returned `CmdStanModel` object."
)
if (options$eval) {
if (options$cache) {
cache_path <- knitr:::valid_path(
options[["cache.path"]],
options$label
)
if (!dir.exists(cache_path)) {
dir.create(cache_path)
}
model_file <- tempfile(
fileext = ".stan",
tmpdir = cache_path
)
cat(code, "\n", file = model_file)
} else {
model_file <- write_stan_file(code)
}
csm <- cmdstan_model(model_file)
assign(out, csm, envir = knit_global())
}
engine_output(options, code, '')
}
)
Demonstration
Let’s try it out! Here’s our {stan bernoulli, output.var="bernoulli", cache=TRUE}
chunk:
data {
int<lower=0> N;
array[N] int<lower=0,upper=1> y;
}
parameters {
real<lower=0,upper=1> theta;
}
model {
theta ~ beta(0.5, 0.5);
y ~ bernoulli(theta);
}
2022-03-22 NOTE:
As of CmdStan 2.27.0 there’s a new array
keyword. Declaration of arrays by placing brackets after a variable name is deprecated and will be removed in Stan 2.32.0. The code above is using that new syntax. For pre-2.27.0 versions of CmdStan the following change needs to be made:
data {
int<lower=0> N;
- array[N] int<lower=0,upper=1> y;
+ int<lower=0,upper=1> y[N];
}
Let’s verify:
bernoulli$print()
## data {
## int<lower=0> N;
## array[N] int<lower=0,upper=1> y;
## }
## parameters {
## real<lower=0,upper=1> theta;
## }
## model {
## theta ~ beta(0.5, 0.5);
## y ~ bernoulli(theta);
## }
And now it’s time for inference:
y <- c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1)
fit <- bernoulli$sample(
data = list(
N = length(y),
y = y
),
refresh = 0
)
## Running MCMC with 4 sequential chains...
##
## Chain 1 finished in 0.0 seconds.
## Chain 2 finished in 0.0 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.0 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.0 seconds.
## Total execution time: 0.5 seconds.
fit
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## lp__ -6.42 -6.13 0.74 0.32 -7.95 -5.90 1.00 1630 1471
## theta 0.23 0.21 0.12 0.12 0.06 0.45 1.00 1317 1370
Neat!
- Posted on:
- July 30, 2020
- Length:
- 4 minute read, 755 words