Replacing the knitr engine for Stan

2020-08-03 UPDATE: Good news! A version of this engine is now included in versions 0.1.1 and later of {CmdStanR}. Use cmdstanr::register_knitr_engine() at the top of the R Markdown document to register it as the engine for stan chunks. See the vignette R Markdown CmdStan Engine for examples. Shoutout to the maintainers Jonah Gabry & Rok Češnovar for a super positive code review experience with the pull request for this.

I originally dabbled with custom {knitr} engine creation last month, when I made {dotnet} which enables R Markdown users to write chunks with C# and F# programs in them.

If you’re interested in creating your own, the R Markdown Cookbook has a chapter on that. You can also check out the built-in engines, {reticulate}’s Python engine, {JuliaCall}’s Julia engine, and {dotnet}’s .NET engine for examples.

If you browse the built-in engines, you’ll come across one for Stan and you’ll see that it uses {RStan}, which may be an issue on some computers. What would a CmdStan & {CmdStanR}-based engine for Stan look like if we wanted these benefits:

Let’s make one!

Prerequisites

See Getting started with CmdStanR for general installation instructions and Max Mantei’s blogpost for installation on Windows.

library(knitr)
library(cmdstanr)

Engine

We want to keep the essence of the feature

the Stan model within the code chunk is compiled into a stanmodel object, and is assigned to a variable with the name given by the output.var option.

from the existing Stan engine.

The resulting objects will be fundamentally different because {RStan} creates in an in-memory Stan model, while {CmdStanR} creates an object that holds the path to an on-disk executable, but the idea is the same – we should be able to write Stan model code in an R Markdown chunk and get back an object in R that we can use for sampling from the joint posterior distribution in the same document, like so:

```{stan, output.var="ex1"}
// example from r markdown manual
```

```{r}
fit <- ex1$sample()
fit$summary()
```

In addition to that, we want to be able to cache the chunk and keep the compiled executable so we wouldn’t have to compile every time the R Markdown document is knit/rendered.

knit_engines$set(
  stan = function(options) {
    code <- knitr:::one_string(options$code)
    out <- options$output.var
    if (!is.character(out) || length(out) != 1L) stop(
      "the chunk option output.var must be a character string ",
      "providing a name for the returned `CmdStanModel` object."
    )
    if (options$eval) {
      if (options$cache) {
        cache_path <- knitr:::valid_path(
          options[["cache.path"]],
          options$label
        )
        if (!dir.exists(cache_path)) {
          dir.create(cache_path)
        }
        model_file <- tempfile(
          fileext = ".stan",
          tmpdir = cache_path
        )
        cat(code, "\n", file = model_file)
      } else {
        model_file <- write_stan_file(code)
      }
      csm <- cmdstan_model(model_file)
      assign(out, csm, envir = knit_global())
    }
    engine_output(options, code, '')
  }
)

Demonstration

Let’s try it out! Here’s our {stan bernoulli, output.var="bernoulli", cache=TRUE} chunk:

data { 
  int<lower=0> N; 
  array[N] int<lower=0,upper=1> y;
} 
parameters {
  real<lower=0,upper=1> theta;
} 
model {
  theta ~ beta(0.5, 0.5);
  y ~ bernoulli(theta);
}

NOTE: As of CmdStan 2.27.0 there’s a new array keyword. Declaration of arrays by placing brackets after a variable name is deprecated and will be removed in Stan 2.32.0. The code above is using that new syntax. For pre-2.27.0 versions of CmdStan the following change needs to be made:

data { 
  int<lower=0> N; 
-  array[N] int<lower=0,upper=1> y;
+  int<lower=0,upper=1> y[N];
} 

Let’s verify:

bernoulli$print()
## data { 
##   int<lower=0> N; 
##   array[N] int<lower=0,upper=1> y;
## } 
## parameters {
##   real<lower=0,upper=1> theta;
## } 
## model {
##   theta ~ beta(0.5, 0.5);
##   y ~ bernoulli(theta);
## }

And now it’s time for inference:

y <- c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1)

fit <- bernoulli$sample(
  data = list(
    N = length(y),
    y = y
  ),
  refresh = 0
)
## Running MCMC with 4 sequential chains...
## 
## Chain 1 finished in 0.0 seconds.
## Chain 2 finished in 0.0 seconds.
## Chain 3 finished in 0.0 seconds.
## Chain 4 finished in 0.0 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.0 seconds.
## Total execution time: 0.5 seconds.
fit
##  variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
##     lp__  -6.42  -6.13 0.74 0.32 -7.95 -5.90 1.00     1630     1471
##     theta  0.23   0.21 0.12 0.12  0.06  0.45 1.00     1317     1370

Neat!

Posted on:
July 30, 2020
Length:
4 minute read, 746 words
Tags:
R bayesian
See Also:
Wikipedia Preview for R Markdown documents
Even faster matrix math in R on macOS with M1
Making Of: Session Tick visualization