The journey so far…

I recently received an email which said, "I'm interested in learning more about you and your journey to where you are today," so I thought I'd describe how I went from studying visual arts to analyzing data at Wikimedia Foundation (WMF).

I entered California State University - Fullerton (CSUF) intending to double major in visual arts (concentration in illustration) and mathematics (concentration in pure math). I had been doing art for years at that point (here is some of my stuff), so I wanted to become a professional. I also wanted to study math because I really enjoyed AP Calculus BC, but I specifically wanted to study pure mathematics because I wanted to go into cryptanalysis after reading The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography.

Since I was putting quite a lot of hours into my job at a grocery store to pay for school, I realized after my first semester that I just didn't have the time or energy to pursue both of my passions. So I had a decision to make, and I chose math over art. That was a major turning point in my life, and I still second-guess myself about it every now and then (despite a successful career in data science).

I had several amazing math teachers in high school. One of them introduced us to rings, groups, and fields from abstract algebra in a two-week detour during Honors Algebra 2, and I really liked that stuff! Even though I hated probability problems all throughout high school, a year into my undergraduate education something in my head flipped. I found a calling in another branch of mathematics — a more applied branch grounded in probability — and I switched my concentration from Pure Mathematics to Statistics.

Dr. Sam Behseta took me under his wing as my advisor and invited me, along with two other students, to get involved in an undergraduate research project on the application of statistics to neuroscience. We assessed several probability-distance measures, such as the Kullback–Leibler divergence, for comparing neuron firing counts from peristimulus time histograms (PSTHs). I went on to do another undergraduate research project with him, this time using model-based clustering to find common patterns in PSTHs.

I wasn't sure about going on to do a PhD in Statistics, but I wanted to learn more. I applied for a Statistician position at Valve Software, but was rejected after the phone screen; I needed to learn more before I could get a job in the industry. I applied to Carnegie Mellon University's Master of Statistical Practice (MSP) program, run by Drs. Joel Greenhouse and Howard Seltman, and got extremely lucky: I was actually admitted. I packed up my things and moved across the country to Pittsburgh, PA, where I learned data mining, time series analysis, and survival analysis, and even got to contribute my skills to the Carnegie Mellon University Department of Statistics Census Research Node (one of the few NSF-Census Research Network nodes working with the U.S. Census Bureau in preparation for the 2020 census).

The focus of the MSP program was statistical consulting, and our final project involved actual consulting for actual clients. In my case, my partner and I performed a statistical analysis of fMRI data for Dr. Mina Cikara. At the same time, Dr. James T. Becker at the University of Pittsburgh (in partnership with CMU via CNBC) was looking for a jack-of-all-trades for his neuropsychology research program (NRP) at the University of Pittsburgh Medical Center (UPMC). I had been doing neuroscience-related work for two years at that point and I liked it, so this opportunity was a natural fit.

I worked at NRP/UPMC for the next two years, analyzing MRI scans for studies of Alzheimer's disease and HIV-associated neurocognitive disorders. I also organized the past twenty years of data, performed ad-hoc analyses, and developed a novel method of calculating associations between MRI data and Bayesian posterior parameter estimates. But this was a soft-money position (I was hired on a grant) and I couldn't stay with them longer than those two years, so I started looking for other opportunities.

In the months leading up to that, I had serendipitously become friends with Oliver Keyes — the sole data analyst in the Discovery department at WMF at the time — on Twitter. We connected through our mutual love of R and social justice, and when they found out that I was looking for a job, they suggested I apply for the opening for Discovery's second data analyst, because my statistics-focused skillset complemented theirs. Of course I applied to work at Wikipedia! After going through the interview process (which was the basis for the one I would later write about), I was offered the job, and Oliver and I became a team. This was two and a half years ago.

I don't know what's next in store for me. Two years in, and I'm still very happy at WMF and I get to perform lots of really cool analyses. Heck, I'm even supported in learning (and encouraged to learn) new things that aren't directly related to my job description.

Cheers~

Advice for graduates applying for data science jobs

Getting into a technical field like data science is really difficult when you're fresh out of school. Even on the off-chance that your potential employer gets the hiring process right, most organizations are still going to place a considerable amount of weight on experience over schooling. Sure, there are certain schools that make it a lot easier to go from academia to industry, but otherwise you're dealing with the classic catch-22: you need experience to get the job, and you need a job to get the experience.

Something that can help you – and something I would notice when reviewing applications – is having something original and interesting (even if just to you) to show and talk about. It doesn't have to be published original research. It doesn't have to be a thesis. It just has to show that you can:

  • Work with real data: In most academic programs, methods are taught using clean, ready-to-use data. So it's important to show that you can take some data you found somewhere and process it into something you can glean insights from. It also gives you a chance to work with data about a topic that you personally find interesting, and there are plenty of publicly available datasets out there to choose from.
  • Explore it: Once you have a dataset that actually excites you, you should perform some exploratory data analysis (EDA). Produce at least one (thoroughly labeled) visualization that shows some interesting pattern or relationship (see the sketch after this list for a bare-bones example). I want to see your curiosity. I want to see an understanding that you can't just jump into model-fitting without developing some familiarity with your data first.
  • Analyze it: You're going to lose a lot of your audience's interest if you just show and talk about how you followed the steps of some tutorial verbatim. If you learn from the tutorial and then apply that methodology to a different dataset, that's basically what "experience" means. And don't use an overly complicated algorithm/model if the goal doesn't require it. You might get incredible accuracy classifying with deep learning, but you'll probably have a more interesting story to tell from inference with a logistic regression. Heck, at Wikimedia we use logistic regression in our anti-harassment research.
  • Present your work: It can be a neat report with an executive summary (abstract), or it can be an interactive visualization or a slide deck. Just something better than a zip archive of scripts or Jupyter notebooks.
  • Explain your work (however complex) and results in a way that can be understood: This is where the first point is really important. If you're describing your analysis of data from a topic you're familiar with and interested in, you're going to have a much easier time explaining it to a stranger. Be prepared to talk about it to a non-technical person. Be prepared to talk about it to a technical person who may not be familiar with your particular methodology. Your interviewer may have done a lot of computational linguistics & NLP but no survival analysis, so get ready to give a brief lesson on Kaplan–Meier curves (and vice versa).
  • Perform an analysis from start to finish: Because that's what we look for when we assign a take-home task to our candidates.
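
For example, here is a bare-bones sketch of the "explore it, then analyze it" workflow in R. It uses the built-in mtcars dataset purely as a stand-in; for your own project you would substitute data you found and actually care about, and the model here (a logistic regression of transmission type on weight and horsepower) is just an illustration, not a recipe:

    # Minimal EDA-then-model sketch using R's built-in mtcars dataset
    library(ggplot2)

    data(mtcars)
    str(mtcars)      # get familiar with the variables before any modeling
    summary(mtcars)

    # EDA: one thoroughly labeled visualization of an interesting relationship
    ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
      geom_point(size = 2) +
      labs(
        title = "Heavier cars get fewer miles per gallon",
        x = "Weight (1000 lbs)",
        y = "Miles per gallon",
        color = "Transmission\n(0 = automatic, 1 = manual)"
      )

    # Inference with a simple, interpretable model: which characteristics are
    # associated with having a manual transmission?
    fit <- glm(am ~ wt + hp, data = mtcars, family = binomial())
    summary(fit)     # coefficients, standard errors, p-values
    exp(coef(fit))   # odds ratios are easier to talk about in an interview

The point is not this particular dataset or model; it is showing that you looked at the data before modeling it and that you can explain what the fitted coefficients mean.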

A lot of the time, job postings will include a number of years of experience as a requirement, but that's not as much of a need-to-have as you or they might think. Secretly, it's actually a nice-to-have, because "experience" is mostly a proxy for "candidate has previously used real data to solve a problem in a way that can be understood and used to inform a decision-making process." If you don't have experience, you can still demonstrate that you've done what a data scientist does.

Good luck~

Acknowledgement: I would like to thank Angela Bassa (Director of Data Science at iRobot) for her input on this post. In particular, the last paragraph is based entirely on her suggestions. She also created the Data Helpers website that lists data professionals who are able to answer questions, promote, or mentor newcomers into the field.

Installing GPU version of TensorFlow™ for use in R on Windows

The other night I got a TensorFlow™ (TF) and Keras-based text classifier in R running successfully on my gaming PC, which has Windows 10 and an NVIDIA GeForce GTX 980 graphics card, so I figured I'd write up a full walkthrough. I had to make a few minor detours, and the official instructions assume (in my opinion) a level of knowledge that might make the process inaccessible to some folks.

Why would you want to install and use the GPU version of TF? "TensorFlow programs typically run significantly faster on a GPU than on a CPU." Graphics processing units (GPUs) are typically used to render 3D graphics for video games. As a result of the race to render more and more realistic-looking scenes in real time, they have gotten really good at performing vector/matrix operations and linear algebra. While CPUs are still better for general-purpose computing and there is some overhead in transferring data to/from the GPU's memory, GPUs are a more powerful resource for performing those particular calculations.

Notes: For installing on Ubuntu, you can follow RStudio's instructions. If you're interested in a Python-only (sans R) installation on Linux, follow NVIDIA's instructions.

Prerequisites

  • An NVIDIA GPU with CUDA Compute Capability 3.0 or higher. Check your GPU's compute capability here. For more details, refer to Requirements to run TensorFlow with GPU support.
  • A recent version of R (the latest version is 3.4.0 at the time of writing).
    • For example, I like using Microsoft R Open (MRO) on my gaming PC with a multi-core CPU because MRO includes and links to the multi-threaded Intel Math Kernel Library (MKL), which parallelizes vector/matrix operations.
    • I also recommend installing and using the RStudio IDE.
    • You will need devtools: install.packages("devtools", repos = c(CRAN = "https://cran.rstudio.com"))
  • Python 3.5 (required for TF at the time of writing) via Anaconda (recommended):
    1. Install Anaconda3 (in my case, Anaconda3 4.4.0). This will install Python 3.6 (at the time of writing), but we'll take care of that in step 3.
    2. Add Anaconda3 and Anaconda3/Scripts to your PATH environment variable so that python.exe and pip.exe can be found, in case you did not check that option during the installation process. (See these instructions for how to do that.)
    3. Install Python 3.5 by opening up the Anaconda Prompt (look for it in the Anaconda folder in the Start menu) and running conda install python=3.5
    4. Verify by running python --version in the Anaconda Prompt; it should report Python 3.5.x.

Setting Up

CUDA & cuDNN

  1. Make sure you have up-to-date NVIDIA drivers installed.
  2. Install CUDA Toolkit 8.0 (or later).
  3. Download and extract the CUDA Deep Neural Network library (cuDNN) v5.1 specifically, which requires signing up for a free NVIDIA Developer account.
  4. Add the path to the bin directory (where the DLL is) to the PATH system environment variable. (See these instructions for how to do that.) For example, mine is C:\cudnn-8.0\bin
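
As a quick sanity check, you can confirm from an R session that the cuDNN bin directory made it onto your PATH. This is just a convenience sketch; the directory name below matches the example in step 4, so adjust it to wherever you actually extracted cuDNN:

    # Check that the cuDNN bin directory from step 4 is on the PATH
    cudnn_bin <- "C:\\cudnn-8.0\\bin"   # adjust to your extraction path
    path_entries <- strsplit(Sys.getenv("PATH"), ";", fixed = TRUE)[[1]]
    cudnn_bin %in% path_entries          # exact-match check; should print TRUE
    list.files(cudnn_bin)                # the cuDNN DLL should be listed here

Note that R only reads PATH when it starts, so restart R (or RStudio) after editing the environment variable.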

TF & Keras in R

Once you've got R, Python 3.5, CUDA, and cuDNN installed and configured:

  1. (Optional) You may need to install the development version of the processx package: devtools::install_github("r-lib/processx"). Everything installed fine for me originally, but when I later ran devtools::update_packages() it complained about processx missing, so I'm including this step just in case.
  2. Install the reticulate package for interfacing with Python from R: devtools::install_github("rstudio/reticulate")
  3. Install the tensorflow package: devtools::install_github("rstudio/tensorflow")
  4. Install GPU version of TF (see this page for more details):
    library(tensorflow)
    install_tensorflow(gpu = TRUE)
  5. Verify by running:
    use_condaenv("r-tensorflow")
    sess <- tf$Session()
    hello <- tf$constant('Hello, TensorFlow!')
    sess$run(hello)
  6. Install keras package: devtools::install_github("rstudio/keras")

You should be able to run RStudio's examples now.
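
If you want to confirm that TF is actually seeing the GPU and exercise Keras end to end before moving on to a real model, a quick smoke test like the following should work. The data and layer sizes here are arbitrary and purely for demonstration:

    library(tensorflow)
    library(keras)

    # Should print TRUE on a working GPU setup
    tf$test$is_gpu_available()

    # Toy Keras model on random data, just to exercise the installation
    x_train <- matrix(runif(1000 * 20), nrow = 1000, ncol = 20)
    y_train <- sample(0:1, 1000, replace = TRUE)

    model <- keras_model_sequential() %>%
      layer_dense(units = 32, activation = "relu", input_shape = c(20)) %>%
      layer_dense(units = 1, activation = "sigmoid")

    model %>% compile(
      optimizer = "adam",
      loss = "binary_crossentropy",
      metrics = c("accuracy")
    )

    model %>% fit(x_train, y_train, epochs = 5, batch_size = 32)

While fit() is running, you can watch GPU utilization in Task Manager's Performance tab (or with nvidia-smi, if it's on your PATH) to confirm the GPU is doing the work.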

Hope this helps! :D

Yo, NieR: Automata is super awesome

This weekend I got super into a new videogame called NieR: Automata (available on PS4 and PC). I saw a bunch of folks tweeting nothing but praise about it, so I decided to check out the demo on PSN. I was so blown away by it that I actually got into my car, drove to the nearest GameStop, and picked up a copy. I cannot remember the last time a game demo did that to me, if ever. This game is ⚡️E⚡️X⚡️T⚡️R⚡️E⚡️M⚡️E⚡️L⚡️Y⚡️ 💥 ⚡️G⚡️O⚡️O⚡️D⚡️, and I highly recommend it if you're into games like DmC: Devil May Cry and other PlatinumGames titles.

It borrows so many ideas from so many games and genres, but the outcome doesn't feel like a Frankenstein's monster. It all feels cohesive.

The little touches in this game are really endearing. Like when 2B gets off a ladder and does a flip onto a platform, or when she occasionally slides down the side of a ladder. The animations feel at once both completely superfluous but also absolutely necessary.

NieR: Automata is a game that I'm glad to not be reviewing, because I would be staring at an empty document, thinking, "They should have sent a poet."[1]