Resources for learning to visualize data with R/ggplot2

"I'm currently learning visualisation with R/ggplot2 and was wondering whether you could share tips/links/videos/books/resources that helped you in your journey"

— Raya via Twitter

Tips

The only tip I'll give is that you should strive to make every chart look exactly how you want it to look and say exactly what you want it to say. You will learn in the process of doing. When it's time to visualize the data and you have an idea for a very specific look and story, don't give it up or compromise on your vision just because you don't know how to do it. Trust me, there is so much documentation out there and so many posts on Stack Overflow that you will be able to figure it out. (But also it's totally fine to get 90-95% of the way there and call it done if that last 5-10% sprint is driving you bonkers.)
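For example, here's a tiny, made-up sketch of what that kind of control looks like in ggplot2 — every label, color, and gridline is a deliberate choice (the data and styling here are invented purely for illustration):

```r
library(ggplot2)

# Invented data purely for illustration:
df <- data.frame(
  month     = factor(month.abb[1:6], levels = month.abb[1:6]),
  pageviews = c(120, 135, 150, 149, 180, 210)
)

ggplot(df, aes(month, pageviews, group = 1)) +
  geom_line(color = "#3366CC") +
  geom_point(size = 3) +
  labs(
    title = "Monthly pageviews are trending up",
    subtitle = "Make the chart say exactly what you want it to say",
    x = NULL, y = "Pageviews (thousands)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()  # drop gridlines you didn't ask for
  )
```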

Here are some charts I made in the past two and a half years at the Wikimedia Foundation:

In each case I distinctly remember wanting to express an idea in a specific way and having to learn something new or learn how to think about a problem in a new way. If you click on one of those, you will see a link to its source code on GitHub that you are welcome to look at and learn from.


Resources

I would recommend checking out the following:

…and many more! There are a lot of folks sharing their knowledge and techniques out there, and I cannot stress enough how many of them you can discover by going through Mara's exhaustive catalogue of cool things made by people in R.


Also, it just dawned on me that I can totally make this a series. I've previously written blog posts as responses to the following questions:

So I've gone back and added the new "Ask Popov" tag to those. Hopefully there will be more questions because these advice posts feel good to write!

The journey so far…

I recently received an email that said, "I'm interested in learning more about you and your journey to where you are today," so I thought I'd describe how I went from studying visual arts to analyzing data at the Wikimedia Foundation (WMF).

Growing up I excelled in visual arts and mathematics at school, and they continued to be my strongest subjects. My parents and I immigrated to the US from Russia when I was 10, and I spent the first few years focused on learning English – which was especially difficult because I was the only Russian-speaking person at my school. I was okay at English by the time I entered 6th grade, having learned a lot of it from The Simpsons of all things. That was also the year I joined band and started learning trombone, but that wouldn't last.

When I started 7th grade (in junior high school aka middle school), I didn't want to continue band so for my elective I opted into web design and that got me writing HTML and JavaScript. Soon after that I'd go home and learn to make apps in Flash using ActionScript that did server-side stuff with PHP and MySQL. I continued to develop my creative side by learning 3D modeling & rendering in LightWave, video editing in Premiere Pro, graphic design in Illustrator, and illustration in Photoshop and used those skills to make short films with my friends.

I was blessed with supportive and trusting parents. Back in Russia my friends and I would stay out late without adult supervision and it was fine. I earned my parents' trust by showing them I was responsible and didn't get into trouble, so I enjoyed a lot of autonomy in the US while my parents were at work. I led a pretty balanced and diversified life. After school I split my time pretty evenly between hanging out with friends, learning cool tech things, doing homework, and playing PC games like The Sims, The Witcher, and Counter-Strike. At school I'd take the Honors/AP version of every class I could and perform in plays & musicals, while also hanging out with what some might call "bad apples" (guys who smoked and got into fights or joined gangs).

I entered California State University - Fullerton (CSUF) with the intention to double major in visual arts (concentration in illustration) and mathematics (concentration in pure math). I'd been doing art for years at that point (here is some of my stuff), so I wanted to become a professional. I also wanted to study math because I really enjoyed AP Calculus BC, but I specifically wanted to study pure mathematics because I wanted to go into cryptanalysis after reading The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography.

Since I was putting quite a lot of hours into my job at a grocery store to pay for school, I realized after my first semester that I just didn't have time or energy to pursue both of my passions. So I had a decision to make, and I chose math over art. That was a major turning point in my life, and I still second-guess myself about it every now and then (despite a successful career in data science).

I had several amazing math teachers in high school, and one of them introduced us to rings, groups, and fields from abstract algebra in a two-week detour during Honors Algebra 2, and I really liked that stuff! Even though I hated probability problems all throughout high school, a year into my undergraduate education, something in my head flipped. I found a calling in another branch of mathematics — a more applied branch based on probability — and I switched my concentration from Pure Mathematics to Statistics.

Dr. Sam Behseta took me under his wing as my advisor and invited me to get involved, along with two other students, in an undergraduate research project dealing with the application of statistics to neuroscience. We assessed several probability-distance measures, such as the Kullback–Leibler divergence, for comparing neuron firing counts from peristimulus time histograms (PSTHs). I went on to do another undergraduate research project with him, this time using model-based clustering to find common patterns in PSTHs.

I wasn't sure about going on to do a PhD in Statistics, but I wanted to learn more. I applied for a Statistician position at Valve Software but was rejected after the phone screen; I needed to learn more before I could get a job in the industry. I applied to Carnegie Mellon University's Master in Statistical Practice (MSP) program run by Drs. Joel Greenhouse and Howard Seltman and got extremely lucky: I was actually admitted. I packed up my things and moved across the country to Pittsburgh, PA, where I learned data mining, time series analysis, and survival analysis, and even got to contribute my skills to the Carnegie Mellon University Department of Statistics Census Research Node (one of the few NSF-Census Research Network nodes working with the U.S. Census Bureau in preparation for the 2020 census).

The focus of the MSP program was statistical consulting, and our final project involved actual consulting for actual clients. In my case, my partner and I performed statistical analysis of fMRI data for Dr. Mina Cikara. At the same time, Dr. James T. Becker at the University of Pittsburgh (in partnership with CMU via CNBC) was looking for a jack-of-all-trades for his neuropsychology research program (NRP) at the University of Pittsburgh Medical Center (UPMC). I'd been doing neuroscience-related stuff for two years at that point and I liked it, so this opportunity was a natural fit.

I worked at NRP/UPMC for the next two years, analyzing MRI scans for studies of Alzheimer's disease and HIV-associated neurocognitive disorders. I also organized the program's past twenty years of data, performed ad hoc analyses, and developed a novel method of calculating associations between MRI data and Bayesian posterior parameter estimates. But this was a soft-money position (I was hired on a grant) and I couldn't stay with them longer than those two years, so I started looking for other opportunities.

In the months leading up to that, I had serendipitously become friends with Os Keyes — the sole data analyst in the Discovery department at WMF at the time — on Twitter. We connected through our mutual love for R and social justice, and when they found out that I was looking for a job, they suggested I apply to the opening for Discovery's second data analyst because my statistics-focused skillset complemented theirs. Of course I applied to work at Wikipedia! After going through the interview process (which was the basis for the one I would later write about), I was offered the job and Os and I became a team. This was two and a half years ago.

I don't know what's next in store for me. Two and a half years in, I'm still very happy at WMF and I get to perform lots of really cool analyses. Heck, I'm even supported in learning (and encouraged to learn) new things that aren't directly related to my job description.

Cheers~

Update (2018-08-17): The thing that was missing from the original version of this post is that, as my employment at NRP/UPMC was coming to an end, I applied to several PhD programs in Statistics and Statistical Computing. The schools I applied to included University of Washington, UC Davis, and my alma mater CMU. I was rejected from all of them, which was absolutely crushing at the time and sent me to a dark mental and emotional place. With nothing lined up, I decided to stop pursuing further education and instead started looking for jobs in the industry.

Advice for graduates applying for data science jobs

2019/08/01 update: things were a little different when I wrote this in 2017. These days I constantly see new/junior data scientists get rejected because they don't have the experience. Even those with an impressive portfolio of projects showing off their technical know-how get a thumbs down. I firmly believe this is a failure of employers, not of the new generation of recently graduated data scientists entering the field. As I tweeted earlier today:

most employers still have no idea why they need a data scientist (just that they do) nor how to support them once hired, which is why nobody wants to hire junior ones and only want to hire experienced ones who will "just know what to do" & find ways to support themselves

The point being that despite the wealth of information out there about the ways in which data science can bring value to an organization (e.g. What Data Scientists Really Do, According to 35 Data Scientists by Hugo Bowne-Anderson) and what information architecture is required to make that happen, employers are hiring senior data scientists (not always at a senior salary) because they feel like that excuses them from providing guidance, direction, and support. Those data scientists then have to find ways to make improvements and impact while also building the data infrastructure themselves (or trying to convince higher-ups to give them money to hire dedicated data engineers).

All of this to say: it's an immensely shitty situation and I'm sorry your (often very impressive!) resumes are being passed on simply because you haven't been doing this for 5+ years. So please ignore everything below the line and instead head over to Vicki Boykis's Data science is different now post where she suggests next steps for you:

  1. Don't shoot for a data science job
  2. Be prepared for most of your data scientist work to not be data science. Adjust your skillset for that.

She explains them in depth in the post, so – again – I encourage you to read it yourself.


Getting into a technical field like data science is really difficult when you're fresh out of school. On the off-chance that your potential employer actually gets the hiring process right, most organizations are still going to place a considerable amount of weight on experience over schooling. Like, yeah there are certain schools that make it a lot easier to go from academia to industry, but otherwise you're dealing with the classic catch-22 situation.

Something that can help you – and what I would notice when reviewing applications – is having something original and interesting (even if just to you) to show and talk about. It doesn't have to be published original research. It doesn't have to be a thesis. It just has to show that you can:
  • Work with real data: In most academic programs, methods are taught using clean, ready-to-use data. So it's important to show that you can take some data you found somewhere and process it into something you can glean insights from. It also gives you a chance to work with data about a topic that you personally find interesting. Possible sources of data include:
  • Explore it: Once you have a dataset that actually excites you, you should perform some EDA. Produce at least one (thoroughly labeled) visualization that shows some interesting pattern or relationship. I want to see your curiosity. I want to see an understanding that you can't just jump into model-fitting without developing some familiarity with your data first.
  • Analyze it: You're going to lose a lot of the reviewer's interest if you just show and talk about how you followed the steps of some tutorial verbatim. If you learn from the tutorial and then apply that methodology to a different dataset, that's basically what "experience" means. And don't try to use an overly complicated algorithm/model if the goal doesn't require it. You might get incredible accuracy classifying with deep learning, but you'll probably have a more interesting story to tell from inference with a logistic regression (see the sketch after this list). Heck, at Wikimedia we use that in our anti-harassment research.
  • Present your work: It can be a neat report with an executive summary (abstract), an interactive visualization, or a slide deck. Just something better than a zip archive of scripts or Jupyter notebooks.
  • Explain your work (however complex) and results in a way that can be understood: This is where the first point is really important. If you're describing your analysis of data from a topic you're familiar with and interested in, you're going to have a much easier time explaining it to a stranger. Be prepared to talk about it to a non-technical person. Be prepared to talk about it to a technical person who may not be familiar with your particular methodology. Your interviewer may have done a lot of computational linguistics & NLP but no survival analysis, so get ready to give a brief lesson on Kaplan–Meier (K-M) curves (and vice versa).
  • Perform an analysis from start to finish: Because that's what we look for when we assign a take-home task to our candidates.
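To make the logistic regression point concrete, here's a minimal sketch with entirely made-up data and variable names (this is not our anti-harassment model, just an illustration of inference-over-accuracy):

```r
# Hypothetical question: does account age predict whether an edit gets reverted?
# (All names and numbers below are invented.)
set.seed(42)
edits <- data.frame(
  account_age_days = rexp(500, rate = 1 / 100),
  is_anonymous     = rbinom(500, 1, 0.3)
)
p <- plogis(0.5 - 0.01 * edits$account_age_days + 1.0 * edits$is_anonymous)
edits$reverted <- rbinom(500, 1, p)

fit <- glm(reverted ~ account_age_days + is_anonymous,
           data = edits, family = binomial())
summary(fit)

# Exponentiated coefficients are odds ratios -- e.g. the multiplicative change
# in the odds of being reverted per additional day of account age:
exp(coef(fit))
```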
A lot of the time, job postings will include a number of years as a requirement, but that's not as need-to-have as you or they might think. Secretly, it's actually a nice-to-have, because "experience" is mostly a proxy for "candidate has previously used real data to solve a problem in a way that can be understood and used to inform a decision-making process." If you don't have experience, you can still demonstrate that you've done what a data scientist does.

Good luck~

Acknowledgement: I would like to thank Angela Bassa (Director of Data Science at iRobot) for her input on this post. In particular, the last paragraph is based entirely on her suggestions. She also created the Data Helpers website that lists data professionals who are able to answer questions, promote, or mentor newcomers into the field.

Probabilistic programming languages for statistical inference

This post was inspired by a question about JAGS vs BUGS vs Stan:

Explaining the differences would be too much for Twitter, so I'm just gonna give a quick explanation here.

BUGS (Bayesian inference Using Gibbs Sampling)

I was taught to do Bayesian stats using WinBUGS, which is now a very outdated (but stable) piece of software for Windows. There's also OpenBUGS, an open-source version that can run on Macs and Linux PCs. The main benefit: academic papers and textbooks from the '80s, '90s, and early 2000s that use Bayesian stats might include models written in BUGS. For example, Bayesian Data Analysis (1st and 2nd editions) and Data Analysis Using Regression and Multilevel/Hierarchical Models use BUGS.

JAGS (Just Another Gibbs Sampler)

JAGS, like OpenBUGS, is available across multiple platforms. The language it uses is basically BUGS, but with a few minor differences that require you to tweak BUGS models before JAGS can run them.

I used JAGS during my time at the University of Pittsburgh's neuropsych research program because we used Macs, I liked that JAGS was written from scratch, and I preferred the R interface to JAGS over the R interfaces to WinBUGS/OpenBUGS.
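To give you an idea, here's a minimal sketch of that R-to-JAGS workflow via the rjags package, using a toy Beta-Bernoulli model (not one of the models from my actual work):

```r
library(rjags)  # requires a separate JAGS installation

# A toy Beta-Bernoulli model written in the (BUGS-like) JAGS language:
model_string <- "
model {
  theta ~ dbeta(1, 1)
  for (i in 1:N) {
    y[i] ~ dbern(theta)
  }
}
"

jags_data <- list(y = c(1, 0, 1, 1, 0, 1, 1, 1), N = 8)
model <- jags.model(textConnection(model_string), data = jags_data, n.chains = 4)
update(model, 1000)  # burn-in
samples <- coda.samples(model, variable.names = "theta", n.iter = 5000)
summary(samples)
```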

Stan

Stan is a newcomer and it's pretty awesome. It has a bunch of interfaces to modern data analysis tools. The language syntax was designed from scratch by people who had written BUGS programs, thought the language could be better, and were inspired by R's vectorized functions. It's strict about data types (integer vs. real number) and about parameters vs. transformed parameters, which might make it harder to get into than BUGS, which gives you a lot of leeway (kind of like R does). But I personally like constraints and precision, since that strictness is what allows Stan to be hella fast: it compiles your models into C++. I also really like Stan's Shiny app for exploring the posterior samples, which also supports MCMC output from JAGS and others.
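Here's the same sort of toy Beta-Bernoulli model written in Stan (run from R via rstan) so you can see what that strictness looks like — note the explicitly declared types and constraints:

```r
library(rstan)

stan_code <- "
data {
  int<lower=0> N;              // must be an integer
  int<lower=0, upper=1> y[N];  // constrained integer array
}                              // (newer Stan versions write this as: array[N] int)
parameters {
  real<lower=0, upper=1> theta;  // constrained real
}
model {
  theta ~ beta(1, 1);
  y ~ bernoulli(theta);
}
"

fit <- stan(model_code = stan_code,
            data = list(N = 8, y = c(1, 0, 1, 1, 0, 1, 1, 1)),
            chains = 4, iter = 2000)
print(fit)
```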

The latest (3rd) edition of Bayesian Data Analysis has examples in Stan and Statistical Rethinking uses R and Stan, so if you're using modern textbooks to learn Bayesian statistics, you're more likely to find examples in Stan.

There are two pretty cool R interfaces to Stan that make it easier to specify your models. The first is rethinking (which accompanies the Statistical Rethinking book I linked to earlier), and then there's brms, which uses a formula syntax similar to lme4's.
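For instance, a multilevel model in brms reads almost exactly like its lme4 counterpart; the data and variable names below are made up for illustration:

```r
library(brms)

# Made-up data purely for illustration:
d <- data.frame(
  subject   = factor(rep(1:20, each = 5)),
  treatment = rep(c("control", "treated"), length.out = 100),
  score     = rnorm(100)
)

# The lme4 version would be: lmer(score ~ treatment + (1 | subject), data = d)
fit <- brm(score ~ treatment + (1 | subject), data = d, family = gaussian())
summary(fit)
```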

Stan has an active discussion board and active development, so if you run into issues with a particular model or distribution, or if you're trying to do something that Stan doesn't support, you can reach out there. You'll receive help, and maybe they'll even add support for whatever it is you were trying to do.

Mostly-free resources for learning data science

In the past year or two I've had several friends approach me about learning statistics because their employer/organization was moving toward a more data-driven approach to decision making. (This brought me a lot of joy.) I firmly believe you don't actually need a fancy degree and tens of thousands of dollars in tuition debt to be able to engage with data, glean insights, and make inferences from it. And now, thanks to many wonderful statisticians on the Internet, there is a plethora of freely accessible resources that enable curious minds to learn the art and science of statistics.

First, I recommend installing R and RStudio. They're free, and they're what I use for almost all of my statistical analyses. Most of the links in this post involve learning by doing statistics in R.
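If you've never touched R before, here's the kind of two-minute first session I mean, using a dataset that ships with R:

```r
# mtcars ships with R, so this runs out of the box
data(mtcars)
summary(mtcars)                   # numeric summaries of every column
hist(mtcars$mpg)                  # quick look at the distribution of fuel economy
t.test(mpg ~ am, data = mtcars)   # do automatic and manual cars differ in mpg?
```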

Okay, now on to learning stats…

Free, self-paced online courses from trustworthy institutions:

Not free online courses from trustworthy institutions:

Free books and other resources:

Book recommendations:

  • Introductory Statistics with R by Peter Dalgaard
  • Doing Data Science: Straight Talk from the Frontline by Cathy O'Neil and Rachel Schutt
  • Statistics in a Nutshell by Sarah Boslaugh
  • Principles of Uncertainty by Jay Kadane (free PDF at http://uncertainty.stat.cmu.edu/)
  • Statistical Rethinking: A Bayesian Course with Examples in R and Stan by Richard McElreath

Phew! Okay, that should be enough. Feel free to suggest more in the comments below.