Probabilistic programming languages for statistical inference

This post was inspired by a question about JAGS vs BUGS vs Stan.

Explaining the differences would be too much for Twitter, so I'm just gonna give a quick explanation here.

BUGS (Bayesian inference Using Gibbs Sampling)

I was taught to do Bayesian stats using WinBUGS, which is now a very outdated (but stable) piece of software for Windows. There's also OpenBUGS, an open-source version that runs on Macs and Linux PCs. One benefit: academic papers and textbooks from the '80s, '90s, and early 2000s that use Bayesian stats often include models written in BUGS. For example, Bayesian Data Analysis (1st and 2nd editions) and Data Analysis Using Regression and Multilevel/Hierarchical Models use BUGS.

JAGS (Just Another Gibbs Sampler)

JAGS, like OpenBUGS, is available across multiple platforms. The language it uses is basically BUGS, but with a few minor differences, so you have to translate BUGS models for JAGS before you can run them.

I used JAGS during my time at the University of Pittsburgh's neuropsych research program because we used Macs, because I liked that JAGS was written from scratch, and because I preferred the R interface to JAGS over the R interfaces to WinBUGS/OpenBUGS.

Stan

Stan is a newcomer and it's pretty awesome. It has a bunch of interfaces to modern data analysis tools. The language syntax was designed from scratch by people who had written BUGS programs, thought it could be better, and were inspired by R's vectorized functions. It's strict about data types (integer vs. real number) and about parameters vs. transformed parameters, which might make it harder to get into than BUGS, which, like R, gives you a lot of leeway. But I personally like constraints and precision, since they're what allow Stan to be hella fast: it compiles your models into C++ (hence the need for strictness). I also really like Stan's Shiny app for exploring posterior samples, which also supports MCMC output from JAGS and others.

The latest (3rd) edition of Bayesian Data Analysis has examples in Stan and Statistical Rethinking uses R and Stan, so if you're using modern textbooks to learn Bayesian statistics, you're more likely to find examples in Stan.

There are two pretty cool R interfaces to Stan that make it easier to specify your models. The first is rethinking (which accompanies the Statistical Rethinking book I mentioned earlier), and then there's brms, which uses a formula syntax similar to lme4's.

Stan has active development and a lively discussion board, so if you run into issues with a particular model or distribution, or if you're trying to do something Stan doesn't support, you can reach out there. You'll get help, and they might even add support for whatever it was you were trying to do.

Putting the R in romantic

I've used R for a lot of tasks unrelated to statistics or data analysis. For example, it's usually a lot easier for me to write an intelligent batch file/folder renamer or copier as an R script than as a bash script.
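To give a flavor of what I mean, here's a hedged sketch of a tiny renamer; the file names and naming scheme are made up for illustration. It zero-pads the numbers in names like "photo 1.jpg" so they sort correctly:

```r
# Hypothetical example: zero-pad numbers in files like "photo 1.jpg"
old_names <- c("photo 1.jpg", "photo 12.jpg", "photo 103.jpg")
# extract the number from each name
nums <- as.integer(sub("^photo (\\d+)\\.jpg$", "\\1", old_names))
# rebuild the names with three-digit padding
new_names <- sprintf("photo %03d.jpg", nums)
# file.rename(old_names, new_names)  # uncomment to actually rename on disk
new_names
```

The `file.rename()` call is commented out so the sketch is safe to run anywhere.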

Earlier today I made a collection of photos that I wanted to put on a digital picture frame to mail to my partner. I also made a set of messages that I wanted to show up randomly. What I needed to do was to shuffle the set of 260+ images in such a way that a subset of them would not show up consecutively.

To make referencing the images easier, let's call the overall set of $n$ images $Y = \{y_1, \ldots, y_n\}$, and let $X \subset Y$ be the images we do not want to appear consecutively after the shuffling. Let $Y' = (y_{(1)}, \ldots, y_{(n)})$ be the shuffled sequence of images.

This was really easy to accomplish in R. I started with k <- 0; set.seed(k) and shuffled all the images (using sample.int()). Then I checked whether our very specific requirement was met.

If we did end up with a pair of consecutive images from $X$, we increment $k$ by 1 and repeat the procedure until $\{y_{(i-1)}, y_{(i)}\} \not\subset X ~\forall~ i = 2, \ldots, n$.
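The whole procedure fits in a few lines of R. This is a minimal sketch with made-up file names; `special` stands in for the subset $X$ of images that must not appear back to back:

```r
# 260 hypothetical image files; 'special' plays the role of X
images <- sprintf("photo_%03d.jpg", 1:260)
set.seed(1)
special <- sample(images, 40)

k <- 0
repeat {
  set.seed(k)
  shuffled <- images[sample.int(length(images))]
  in_x <- shuffled %in% special
  # valid iff no two consecutive elements are both in X
  if (!any(in_x[-1] & in_x[-length(in_x)])) break
  k <- k + 1
}
```

Using `set.seed(k)` with an incrementing `k` makes the whole search reproducible: rerunning the script lands on the same final shuffle.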

I think what makes R really nice for tasks like this is its vectorized functions, like which(), order(), duplicated(), sample(), sub(), and grepl(), binary operators like %in%, and data.frames that you can expand to include additional data, such as indicators of whether row $m$ is related to row $m-1$.
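As a small illustration of that last point (the data and the "related to the previous row" definition here are made up), a data.frame grows a couple of indicator columns in two vectorized lines:

```r
# hypothetical file list mixing photos and message images
files <- data.frame(
  name = c("IMG_001.jpg", "IMG_002.jpg", "msg_01.png", "msg_02.png"),
  stringsAsFactors = FALSE
)
# flag the message images
files$is_message <- grepl("^msg_", files$name)
# indicator of whether row m follows a message in row m - 1
files$follows_message <- c(FALSE, files$is_message[-nrow(files)])
# rows where a message immediately follows another message
which(files$is_message & files$follows_message)
```

No loops needed: the shift is just dropping the last element and prepending FALSE.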

Next time you face something repetitive and time-consuming on the computer, I urge you to consider writing a script/program to do it for you, especially if you know R but have never thought of it as a tool for things like file organization.

Cheers~

Mostly-free resources for learning data science

In the past year or two, several friends have approached me about learning statistics because their employer/organization was moving toward a more data-driven approach to decision making. (This brought me a lot of joy.) I firmly believe you don't need a fancy degree and tens of thousands of dollars in tuition debt to be able to engage with data, glean insights, and make inferences from it. And thanks to many wonderful statisticians on the Internet, there is now a plethora of freely accessible resources that enable curious minds to learn the art and science of statistics.

First, I recommend installing R and RStudio. They're free, and they're what I use for almost all of my statistical analyses. Most of the links in this post involve learning by doing statistics in R.

Okay, now on to learning stats…

There's Data Analysis and Statistical Inference (plus an interactive companion course) by Mine Çetinkaya-Rundel (Duke University). She has also co-authored the OpenIntro Statistics book (available for free as a PDF).

Free, self-paced online courses from trustworthy institutions:

Not free online courses from trustworthy institutions:

Free books and other resources:

Book recommendations:

  • Introductory Statistics with R by Peter Dalgaard
  • Doing Data Science: Straight Talk from the Frontline by Cathy O'Neil and Rachel Schutt
  • Statistics in a Nutshell by Sarah Boslaugh
  • Principles of Uncertainty by Jay Kadane (free PDF at http://uncertainty.stat.cmu.edu/)
  • Statistical Rethinking: A Bayesian Course with Examples in R and Stan by Richard McElreath

Phew! Okay, that should be enough. Feel free to suggest more in the comments below.

Freelancing Hourly Rate Calculator (Shiny app)

The other day I got tired of coming up with basically random hourly rate estimates for freelance projects, because I had never actually sat down to figure out what the hell my hourly rate should be. I found a great blog post, How to Calculate Hourly Freelance Rates for Web Design, Development Work, and made a spreadsheet with the appropriate formulas.

But then I wanted to combine the explanation from the blog post with the dynamic aspect of the spreadsheet. So I opened up R and wrote a Shiny app where you can specify all the different numbers and percentages, and it updates the plots and the breakdown of how the final rate was calculated.
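The core calculation behind rate calculators like this is simple enough to sketch in a few lines of R. To be clear, this is my own hedged approximation of the general approach, not the app's actual code, and every number below is a made-up example:

```r
# assumed example inputs, not the app's defaults
target_salary <- 60000   # desired annual pre-tax income
annual_costs  <- 10000   # software, hardware, insurance, etc.
profit_margin <- 0.10    # cushion on top of salary + costs
weeks_working <- 48      # weeks per year after vacation/sick time
billable_hours_per_week <- 25  # not every working hour is billable

billable_hours <- weeks_working * billable_hours_per_week
hourly_rate <- (target_salary + annual_costs) * (1 + profit_margin) / billable_hours
round(hourly_rate, 2)  # 64.17 with these example numbers
```

The punchline most people miss is the denominator: dividing by billable hours rather than total working hours is what pushes the rate well above salary / 2000.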

If you want to figure out what you should be charging your clients, go to http://bearloga.shinyapps.io/freelancr/

Words, words, words

I needed a list of adverbs/adjectives that start with "do." First I tried Wolfram|Alpha, but it couldn't filter the list to adjectives, and there's no way to build a query pipeline (at least with a free account). I ended up using the wordnet package in R:

require(magrittr) # install.packages('magrittr')
require(wordnet)  # install.packages('wordnet')
# terms starting with "do"; swap 'ADVERB' for 'ADJECTIVE' as needed
getTermFilter('StartsWithFilter', 'do', TRUE) %>%
    getIndexTerms('ADVERB', 1e4, .) %>%
    sapply(getLemma) %>%
    paste(collapse = ', ')

Output: doctrinally, doggedly, doggo, dogmatically, dolce, dolefully, doltishly, domestically, domineeringly, dorsally, dorsoventrally, dottily, double, double quick, double time, doubly, doubtfully, doubtless, doubtlessly, dourly, dowdily, down, down the stairs, downfield, downhill, downright, downriver, downstage, downstairs, downstream, downtown, downward, downwardly, downwards, downwind

P.S. If you're on OS X, you can use MacPorts to install WordNet with: sudo port install wordnet

Then select the port-installed dictionary in R with: setDict('/opt/local/share/WordNet-3.0/dict')