Faster matrix math in R on macOS

If you want faster matrix operations in R on your Mac, you can use Apple's BLAS (Basic Linear Algebra Subprograms) library from the Accelerate framework instead of the reference implementation that ships with the R binary from CRAN. (This doesn't apply if you built R from source yourself.) CRAN recommends against this, saying:

Although fast, it is not under our control and may possibly deliver inaccurate results. [Source: R for Mac OS X FAQ]

So proceed at your own discretion, I suppose? To switch to Apple's BLAS:

cd /Library/Frameworks/R.framework/Resources/lib
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib

You can verify by running sessionInfo() in R. You should see:

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
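
If you just want those two paths without the rest of the sessionInfo() output, the object it returns stores them in its BLAS and LAPACK elements (in R 3.4 and later):

# Print just the linked BLAS/LAPACK library paths (R >= 3.4)
si <- sessionInfo()
si$BLAS
si$LAPACK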

P.S. Use ln -sf libRblas.0.dylib libRblas.dylib (from the same directory) to revert to the default.

Benchmark 1

library(microbenchmark)
d <- 1e3  # work with a 1000 x 1000 matrix
x <- matrix(rnorm(d^2), d, d)
# Time a matrix product, a matrix inversion, and a singular value decomposition:
microbenchmark(tcrossprod(x), solve(x), svd(x), times = 10L)

With R's default (libRblas.0.dylib):

Unit: milliseconds
          expr       min        lq      mean    median        uq       max
 tcrossprod(x)  276.9307  278.8472  283.9451  282.5854  288.7777  299.6700
      solve(x)  681.5183  696.7888  703.4895  703.6030  712.5728  724.7195
        svd(x) 2672.3574 2692.6614 2701.0442 2698.4279 2722.2180 2730.9727

With Apple's vecLib (libBLAS.dylib):

Unit: milliseconds
          expr        min        lq      mean    median        uq       max
 tcrossprod(x)   8.642364  10.59531  10.60989  10.80781  11.35995  11.62067
      solve(x)  37.205422  41.38051  43.77767  42.34975  42.92999  54.71387
        svd(x) 278.528673 290.26926 304.74482 307.49666 317.56819 325.29980

Benchmark 2

library(microbenchmark)
set.seed(20190604)
n <- 1e3  # number of observations
p <- 1e2  # number of predictors
b <- rnorm(p + 1, 0, 10)  # true coefficients, including the intercept
x <- matrix(runif(n * p, -10, 10), ncol = p, nrow = n)
y <- cbind(1, x) %*% b + rnorm(n, 0, 2)  # simulate from the linear model
microbenchmark(lm(y ~ x))

With R's default (libRblas.0.dylib):

Unit: milliseconds
           expr     min       lq     mean   median       uq      max neval
      lm(y ~ x) 17.0145 20.27288 22.58367 22.16553 24.49879 37.37704   100

With Apple's vecLib (libBLAS.dylib):

Unit: milliseconds
           expr      min       lq    mean   median       uq      max neval
      lm(y ~ x) 5.478486 7.888307 9.210001 8.84202 10.30503 15.80064   100

Resources for learning to visualize data with R/ggplot2

"I'm currently learning visualisation with R/ggplot2 and was wondering whether you could share tips/links/videos/books/resources that helped you in your journey"

— Raya via Twitter

Tips

The only tip I'll give is that you should strive to make every chart look exactly how you want it to look and say exactly what you want it to say. You will learn in the process of doing. When it's time to visualize the data and you have an idea for a very specific look and story, don't give it up or compromise on your vision just because you don't know how to do it. Trust me, there is so much documentation out there and so many posts on Stack Overflow that you will be able to figure it out. (But also it's totally fine to get 90-95% of the way there and call it done if that last 5-10% sprint is driving you bonkers.)

Here are some charts I made in the past two and a half years at The Wikimedia Foundation:

In each case I distinctly remember wanting to express an idea in a specific way and having to learn something new, or having to think about a problem in a new way. If you click on one of those, you will see a link to its source code on GitHub that you are welcome to look at and learn from.


Resources

I would recommend checking out the following:

…and many more! There are a lot of folks sharing their knowledge and techniques out there, and I cannot stress enough how many of them you can find by going through Mara's exhaustive catalogue of cool things made by people in R.


Also, it just dawned on me that I can totally make this a series. I've previously written blog posts as responses to the following questions:

So I've gone back and added the new "Ask Popov" tag to those. Hopefully there will be more questions because these advice posts feel good to write!

Installing GPU version of TensorFlow™ for use in R on Windows

The other night I got a TensorFlow™ (TF) and Keras-based text classifier in R to run successfully on my gaming PC, which has Windows 10 and an NVIDIA GeForce GTX 980 graphics card, so I figured I'd write up a full walkthrough: I had to make some minor detours, and the official instructions assume -- in my opinion -- a level of knowledge that might make the process inaccessible to some folks.

Why would you want to install and use the GPU version of TF? "TensorFlow programs typically run significantly faster on a GPU than on a CPU." Graphics processing units (GPUs) are typically used to render 3D graphics for video games. As a result of the race for real-time rendering of more and more realistic-looking scenes, they have gotten really good at performing vector/matrix operations and linear algebra. While CPUs are still better for general purpose computing and there is some overhead in transferring data to/from the GPU's memory, GPUs are a more powerful resource for performing those particular calculations.

Notes: For installing on Ubuntu, you can follow RStudio's instructions. If you're interested in a Python-only (sans R) installation on Linux, follow NVIDIA's instructions.

Prerequisites

  • An NVIDIA GPU with CUDA Compute Capability 3.0 or higher. Check your GPU's compute capability here. For more details, refer to Requirements to run TensorFlow with GPU support.
  • A recent version of R -- latest version is 3.4.0 at the time of writing.
    • For example, I like using Microsoft R Open (MRO) on my gaming PC with a multi-core CPU because MRO includes and links to the multi-threaded Intel Math Kernel Library (MKL), which parallelizes vector/matrix operations.
    • I also recommend installing and using the RStudio IDE.
    • You will need devtools: install.packages("devtools", repos = c(CRAN = "https://cran.rstudio.com"))
  • Python 3.5 (required for TF at the time of writing) via Anaconda (recommended):
    1. Install Anaconda3 (in my case it was Anaconda3 4.4.0), which will install Python 3.6 (at the time of writing), but we'll take care of that in step 3.
    2. Add Anaconda3 and Anaconda3/Scripts to your PATH environment variable so that python.exe and pip.exe can be found, in case you did not check that option during installation. (See these instructions for how to do that.)
    3. Install Python 3.5 by opening up the Anaconda Prompt (look for it in the Anaconda folder in the Start menu) and running conda install python=3.5
    4. Verify by running python --version

Setting Up

CUDA & cuDNN

  1. Make sure you've got the latest NVIDIA drivers.
  2. Install CUDA Toolkit 8.0 (or later).
  3. Download and extract CUDA Deep Neural Network library (cuDNN) v5.1 (specifically), which requires signing up for a free NVIDIA Developer account.
  4. Add the path to the bin directory (where the DLL is) to the PATH system environment variable. (See these instructions for how to do that.) For example, mine is C:\cudnn-8.0\bin
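
As a quick sanity check for that last step, you can confirm from within R that the cuDNN bin directory made it onto the PATH (the pattern below just matches my example directory name):

# Should be TRUE if the cuDNN bin directory is on the PATH
grepl("cudnn", Sys.getenv("PATH"), ignore.case = TRUE)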

TF & Keras in R

Once you've got R, Python 3.5, CUDA, and cuDNN installed and configured:

  1. You may need to install the development version of the processx package: devtools::install_github("r-lib/processx"). Everything installed fine for me originally, but devtools::update_packages() later complained about processx being missing, so I'm including this optional step.
  2. Install reticulate package for interfacing with Python in R: devtools::install_github("rstudio/reticulate")
  3. Install tensorflow package: devtools::install_github("rstudio/tensorflow")
  4. Install GPU version of TF (see this page for more details):
    library(tensorflow)
    install_tensorflow(gpu = TRUE)
  5. Verify by running:
    use_condaenv("r-tensorflow")
    sess <- tf$Session()
    hello <- tf$constant('Hello, TensorFlow!')
    sess$run(hello)  # should return "Hello, TensorFlow!"
  6. Install keras package: devtools::install_github("rstudio/keras")

You should be able to run RStudio's examples now.
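
If you want to double-check that operations are actually being placed on the GPU, one option -- assuming the TF 1.x-style API that was current at the time of writing -- is to turn on device placement logging, which prints which device each operation runs on:

library(tensorflow)
use_condaenv("r-tensorflow")
# Log which device (CPU or GPU) each operation is assigned to
sess <- tf$Session(config = tf$ConfigProto(log_device_placement = TRUE))
sess$run(tf$constant('Hello, TensorFlow!'))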

Hope this helps! :D

Putting the R in romantic

I've used R for a lot of tasks unrelated to statistics or data analysis. For example, it's usually a lot easier for me to write an intelligent batch file/folder renamer or copier as an R script than as a bash script.
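
For instance, here is a minimal sketch of the kind of renamer I mean (the directory and filename pattern are made up for illustration): it zero-pads the numbers in photo filenames so that they sort correctly.

# Rename photo_1.jpg, ..., photo_12.jpg to photo_001.jpg, ..., photo_012.jpg
files <- list.files("photos", pattern = "^photo_[0-9]+\\.jpg$")
nums <- as.integer(sub("^photo_([0-9]+)\\.jpg$", "\\1", files))
file.rename(file.path("photos", files),
            file.path("photos", sprintf("photo_%03d.jpg", nums)))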

Earlier today I made a collection of photos that I wanted to put on a digital picture frame to mail to my partner. I also made a set of messages that I wanted to show up randomly. What I needed to do was to shuffle the set of 260+ images in such a way that a subset of them would not show up consecutively.

To make referencing the images easier, let the overall set of $n$ images be $Y = \{y_1, \ldots, y_n\}$, and let $X \subset Y$ be the images we do not want to appear consecutively after the shuffling. Let $Y' = (y_{(1)}, \ldots, y_{(n)})$ denote the shuffled sequence of images.

This was really easy to accomplish in R. I started with k <- 0; set.seed(k), shuffled all the images using sample.int(), and then checked whether that very specific requirement was met.

If a pair of consecutive images from $X$ did end up next to each other, I incremented $k$ by 1 and repeated the procedure until $\{y_{(i-1)}, y_{(i)}\} \not\subset X ~\forall~i = 2, \ldots, n$.
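
Here is a minimal sketch of that rejection loop; the vectors images and special (playing the role of $X$) are hypothetical stand-ins for my actual filenames:

# Hypothetical stand-ins for the real filenames
images <- sprintf("photo_%03d.jpg", 1:260)
special <- images[1:40]  # the subset X we don't want appearing consecutively

k <- 0
repeat {
  set.seed(k)
  shuffled <- images[sample.int(length(images))]
  # TRUE wherever an image belongs to 'special':
  hits <- shuffled %in% special
  # accept the shuffle if no two consecutive positions are both in 'special'
  if (!any(hits[-1] & hits[-length(hits)])) break
  k <- k + 1
}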

I think what makes R really nice for tasks like this is its vectorized functions -- which(), order(), duplicated(), sample(), sub(), and grepl() -- plus binary operators like %in%, as well as data.frames that you can expand with additional columns, such as indicators of whether row $m$ is related to row $m-1$.

Next time you have to do something repetitive and time-consuming on the computer -- like file organization -- and you know R, I urge you to consider writing a script to do it for you, even if you haven't thought of R for that kind of task before.

Cheers~

Mostly-free resources for learning data science

In the past year or two I've had several friends approach me about learning statistics because their employer/organization was moving toward a more data-driven approach to decision making. (This brought me a lot of joy.) I firmly believe you don't need a fancy degree and tens of thousands of dollars in tuition debt to be able to engage with data, glean insights, and make inferences from it. And thanks to many wonderful statisticians on the Internet, there is now a plethora of freely accessible resources that enable curious minds to learn the art and science of statistics.

First, I recommend installing R and RStudio. They're free, and they're what I use for almost all of my statistical analyses. Most of the links in this post involve learning by doing statistics in R.

Okay, now on to learning stats…

Free, self-paced online courses from trustworthy institutions:

Not free online courses from trustworthy institutions:

Free books and other resources:

Book recommendations:

  • Introductory Statistics with R by Peter Dalgaard
  • Doing Data Science: Straight Talk from the Frontline by Cathy O'Neil
  • Statistics in a Nutshell by Sarah Boslaugh
  • Principles of Uncertainty by Jay Kadane (free PDF at http://uncertainty.stat.cmu.edu/)
  • Statistical Rethinking: A Bayesian Course with Examples in R and Stan by Richard McElreath

Phew! Okay, that should be enough. Feel free to suggest more in the comments below.