Faster matrix math in R on macOS

If you want faster matrix operations in R on your Mac, you can use Apple's BLAS (Basic Linear Algebra Subprograms) library from the Accelerate framework instead of the reference implementation that ships with the R binary from CRAN (unless you built R from source yourself). CRAN recommends against this, saying:

Although fast, it is not under our control and may possibly deliver inaccurate results. [Source: R for Mac OS X FAQ]

So proceed at your own discretion, I suppose? To switch to Apple's BLAS:

cd /Library/Frameworks/R.framework/Resources/lib
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib

You can verify the change by running sessionInfo() in R. You should see something like:

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
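
If you prefer to grab those paths programmatically rather than eyeballing the printed output, sessionInfo() should also expose them as list components in R 3.4.0 and later (a minimal sketch):

si <- sessionInfo()
si$BLAS    # path of the BLAS library in use
si$LAPACK  # path of the LAPACK library in use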

P.S. The switch only re-points the libRblas.dylib symlink and leaves the original library untouched, so you can revert to the default with ln -sf libRblas.0.dylib libRblas.dylib in the same directory.

Benchmark 1

library(microbenchmark)
d <- 1e3                      # 1000 x 1000 matrix
x <- matrix(rnorm(d^2), d, d)
microbenchmark(tcrossprod(x), solve(x), svd(x), times = 10L)

With R's default (libRblas.0.dylib):

Unit: milliseconds
          expr       min        lq      mean    median        uq       max
 tcrossprod(x)  276.9307  278.8472  283.9451  282.5854  288.7777  299.6700
      solve(x)  681.5183  696.7888  703.4895  703.6030  712.5728  724.7195
        svd(x) 2672.3574 2692.6614 2701.0442 2698.4279 2722.2180 2730.9727

With Apple's vecLib (libBLAS.dylib):

Unit: milliseconds
          expr        min        lq      mean    median        uq       max
 tcrossprod(x)   8.642364  10.59531  10.60989  10.80781  11.35995  11.62067
      solve(x)  37.205422  41.38051  43.77767  42.34975  42.92999  54.71387
        svd(x) 278.528673 290.26926 304.74482 307.49666 317.56819 325.29980
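
Comparing medians, that's roughly a 26× speedup for tcrossprod(), 17× for solve(), and 9× for svd().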

Benchmark 2

library(microbenchmark)
set.seed(20190604)
n <- 1e3                                # number of observations
p <- 1e2                                # number of predictors
b <- rnorm(p + 1, 0, 10)                # true coefficients (including intercept)
x <- matrix(runif(n * p, -10, 10), ncol = p, nrow = n)
y <- cbind(1, x) %*% b + rnorm(n, 0, 2) # simulated response with Gaussian noise
microbenchmark(lm(y ~ x))              # 100 evaluations by default

With R's default (libRblas.0.dylib):

Unit: milliseconds
           expr     min       lq     mean   median       uq      max neval
      lm(y ~ x) 17.0145 20.27288 22.58367 22.16553 24.49879 37.37704   100

With Apple's vecLib (libBLAS.dylib):

Unit: milliseconds
           expr      min       lq    mean   median       uq      max neval
      lm(y ~ x) 5.478486 7.888307 9.210001 8.84202 10.30503 15.80064   100
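
Comparing medians again, that's about a 2.5× speedup for the lm() fit.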

My recipe for the best breakfast potatoes (and terrific bacon)

Everyone I treat to these bomb-ass potatoes tells me how amazing they are, and it's a bit of an elaborate process to describe, so I decided to write it up here. There are actually two recipes in this post and one is (kind of) a prerequisite for the other, but if you're vegetarian/vegan or don't eat pork for religious (or other) reasons, feel free to skip to the second stage.

Stage 1: Bacon (Optional)

As with many great meals, the process starts with bacon. I don't fry or microwave my bacon: I bake it. The key is to get thick-cut bacon – not the regular kind – and then lay it out on a grill rack. You can use a pan to catch the grease, but I usually make one out of (recycled) aluminum foil because even though I end up saving most of the grease, I still don't want any leftover grease going down the drain and clogging up the pipes.

Once I have the thick-cut bacon laid out, I sprinkle it with garlic powder and crushed red pepper flakes before putting it in the oven at 375°F for 20-30 minutes (depending on the thickness of the bacon). About 15-20 minutes in, I flip the bacon to get both sides crispy. This not only yields perfectly crisped (not too dry) garlic-infused bacon with a slight kick, it also yields amazing grease for frying. I keep a cup in the fridge with the saved grease and use it instead of butter or oil when I fry eggs and potatoes.

Stage 2: Potatoes

You're gonna want to use a mandoline slicer because, while cutting by hand with a knife is totally fine, a dedicated tool definitely helps get nice, even, thin slices. Depending on what kind of potato you use, you may not need to peel them, but once they're ready, slice those bad boys up and submerge them in a saline solution. I put the sliced potatoes in a big bowl first, fill it up with water so they're fully submerged, dump a bunch (a couple of tablespoons) of salt in there, and then mix it with my hands. Then I cover the bowl and let it sit for at least 45 minutes, but preferably 1-1½ hours.

This is the key step and it relies on osmosis. Basically, the salt water solution extracts moisture from those potato slices and enables them to get a real good crisp when you fry them. (BTW: if you want to use Wikipedia hover-cards – also known as page previews – on your site or blog, see these details and instructions.)

Once the potatoes have soaked in the solution long enough, the trick is to transfer the slices to a salad spinner, run them under regular water to wash away the salt on the surface (otherwise you'll get extremely salty 'tatoes), and then use the spinner to remove that excess moisture. For me this is enough but if you want to go the extra mile (why not at this point, right???) you can also dry them on a paper towel.

Now you're ready to use that garlic-infused bacon grease with a kick (or vegetable oil) to fry those potatoes. The bigger the frying pan you use the better – you don't want layers upon layers of potato slices as that will mess with heat distribution – so as a rule of thumb I'd say make sure the pile of potatoes you end up with isn't more than three slices high. Then 15-20 minutes of frying (with a lid covering them for the first 5-10 minutes) on medium-high heat should be enough for a perfect, not-too-dry crisp; and as always: monitor, stir, and taste often so as to not under- or overcook. (Just don't taste in the first 10 minutes because raw potato is toxic.)

Data Analyst vs Data Scientist: Industry Perspectives

Both "Data Analyst" (DA) and "Data Scientist" (DS) are titles that vary greatly between industries and even amongst individual organizations within industries. As the roles behind titles change over time, it is natural for some teams to ask themselves the following questions: should we have distinct roles or just stick to one? How would we differentiate the roles in a way that fulfills our organization's needs and is generally consistent with similar organizations? Do we want to consider a DS to be equivalent to a Sr. DA, the only difference being the title? Answering these questions not only establishes clear responsibilities and expectations, but enables hiring managers and recruiters to communicate clearly with potential applicants in the future (in job postings, for example).

"data scientist vs data analyst" search results

Search the Internet for "data scientist vs data analyst" and you will find plenty of people who don't know what the difference is (or whether there even is one anymore), and plenty of people who think they know the definitions and differences. You will find an abundance of opinions but very little consistency! When I asked my followers on social media what they personally think the differences are, not everyone shared the same opinion, but some interesting camps of thought emerged.

This is my effort to summarize the many replies I received; here are the important points, recurring themes, and somewhat overlapping camps of thought:

  • Single/primary distinction: DS is a DA who can code
    • In summary: the kinds of questions a DA can answer and the kinds of tasks a DA can work on are a subset of a DS's, because GUI tools limit what can be done, while a DS – by knowing programming – can answer way more kinds of questions and work on way more kinds of tasks.
    • Leads to reproducibility and scalability
    • See discussion with Hadley Wickham
  • Single/primary distinction: statistical and machine learning (ML) modeling
  • No DAs, just two types of DSs: the “Type A” vs “Type B” distinction (see Doing Data Science at Twitter) came up a few times
  • Emily Robinson brought up that “Data Scientist” is now also used as an umbrella term and specialties are specified in the title as needed
  • Some big tech companies like Facebook, Spotify, and some departments within Apple are moving away from having DAs to just having DSs
  • Practical considerations for NY/SF/Austin tech scene:
    • “DS title will need a higher salary.”
    • “You will lose talent because of the DA title. It is seen as less prestigious.”
    • “You may have to work harder for diverse pool of applicants w/ DS title.”
      • “That latter comes from one company I know who’s had a harder time getting female applicants for DS positions vs DA (when they’re fairly similar responsibilities)”
  • Lucas Meyer voiced support for a classic: Drew Conway’s infamous Venn diagram
  • A coworker of mine shared that at one of his previous employers, the organization identified three data scientist personas/profiles:
    • DS, Operations provides data & insights for resourcing decisions through ad-hoc analyses, dashboards, defining KPIs, and A/B testing.
      • This is the role of a Data Scientist in Product who creates reports and dashboards for management and executives. - MP
    • DS, Product delivers data science as a product (not to be confused with Data Scientists in Product). These folks build predictive models, AIs, and matchmaking systems.
      • In some orgs this might be an ML Engineer or an AI Engineer or just a Data Scientist? - MP
    • DS, Research experiments and innovates. Not everything they work on ends up in production or utilized, but they are free to be creative and take chances.
      • In some orgs this might be the Research Scientist? - MP
    • Thinking of it this way, you might envision a scenario/pipeline wherein a Research DS prototypes a new recommender system (RS) algorithm, then an Operations DS helps determine (through A/B testing and qualitative user research together with a Design/UX Researcher) whether it's worth the costs to productionize (perhaps with the input of a Business/Financial Analyst), and then a Product DS scales the RS (possibly in collaboration with a Data Engineer) and deploys it to production. - MP

Closing thoughts

I hope this is an eye-opening moment for some, and that they now realize there's no single distinction everyone agrees on. Everyone comes into it with their own backgrounds, experiences, thought processes, and ideas. None of these are wrong! So if you're in a hiring position, please remember to be specific when writing a job description. You can't just write "Data Analyst" or "Data Scientist" at the top and expect everyone else to share your assumptions; that's a recipe for misunderstanding and failure.

I would like to thank everyone who responded, especially Emily Robinson and Renee M. P. Teate. Thank you everybody for taking the time to write and, in some cases, discuss nuances in spun-off threads! If you want to explore all the replies yourself, here's the root of the thread.

I would also like to point out that this is not even representative of how data professionals perceive these roles globally. All of the responses were from English-literate people, most (if not all) were from people living and working in the U.S., and many of them are specifically people who follow me on Twitter. I know for a fact that there are many more data professionals (data engineers have opinions on this too!) who aren't in any of those groups. These are professionals who have their own perceptions and who operate in different cultures and under different expectations all across the world, and someone out there is probably writing a similar post within their own community.

Resources for learning to visualize data with R/ggplot2

"I'm currently learning visualisation with R/ggplot2 and was wondering whether you could share tips/links/videos/books/resources that helped you in your journey"

Raya via Twitter

Tips

The only tip I'll give is that you should strive to make every chart look exactly how you want it to look and say exactly what you want it to say. You will learn in the process of doing. When it's time to visualize the data and you have an idea for a very specific look and story, don't give it up or compromise on your vision just because you don't know how to do it. Trust me, there is so much documentation out there and so many posts on Stack Overflow that you will be able to figure it out. (But also it's totally fine to get 90-95% of the way there and call it done if that last 5-10% sprint is driving you bonkers.)
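
For example, here is a minimal sketch (using R's built-in mtcars data; the dataset and every styling choice here are mine, purely for illustration) of the kind of fiddling I mean – tweaking scales, labels, and theme elements until the chart looks exactly how you want:

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.8) +                 # larger, slightly transparent points
  scale_color_brewer("Cylinders", palette = "Set1") + # custom legend title & palette
  labs(
    title = "Heavier cars are less fuel-efficient",
    subtitle = "1974 Motor Trend road tests (mtcars)",
    x = "Weight (1000 lbs)", y = "Miles per gallon"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",               # move the legend under the plot
    plot.title = element_text(face = "bold")
  )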

Here are some charts I made in the past two and a half years at The Wikimedia Foundation:

In each case I distinctly remember wanting to express an idea in a specific way and had to learn something new or learn how to think about a problem in a new way. If you click on one of those, you will see a link to its source code on GitHub that you are welcome to look at and learn from.


Resources

I would recommend checking out the following:

…and many more! There are a lot of folks sharing their knowledge and techniques out there, and I cannot stress enough how many of them you can discover by going through Mara's exhaustive catalogue of cool things people have made in R.


Also, it just dawned on me that I can totally make this a series. I've previously written blog posts as responses to the following questions:

So I've gone back and added the new "Ask Popov" tag to those. Hopefully there will be more questions because these advice posts feel good to write!

The journey so far…

I recently received an email which said, "I'm interested in learning more about you and your journey to where you are today," so I thought I'd describe how I went from studying visual arts to analyzing data at the Wikimedia Foundation (WMF).

Growing up I excelled in visual arts and mathematics at school, and they continued to be my strongest subjects. My parents and I immigrated to the US from Russia when I was 10, and I spent the first few years focused on learning English – which was especially difficult because I was the only Russian-speaking person at my school. I was okay at English by the time I entered 6th grade, having learned a lot of it from The Simpsons of all things. That was also the year I joined band and started learning trombone, but that wouldn't last.

When I started 7th grade (in junior high school aka middle school), I didn't want to continue band so for my elective I opted into web design and that got me writing HTML and JavaScript. Soon after that I'd go home and learn to make apps in Flash using ActionScript that did server-side stuff with PHP and MySQL. I continued to develop my creative side by learning 3D modeling & rendering in LightWave, video editing in Premiere Pro, graphic design in Illustrator, and illustration in Photoshop and used those skills to make short films with my friends.

I was blessed with supportive and trusting parents. Back in Russia my friends and I would stay out late without adult supervision and it was fine. I earned my parents' trust by showing them I was responsible and didn't get into trouble, so I enjoyed a lot of autonomy in the US while my parents were at work. I led a pretty balanced and diversified life. After school I split my time pretty evenly between hanging out with friends, learning cool tech things, doing homework, and playing PC games like The Sims, The Witcher, and Counter-Strike. At school I'd take the Honors/AP version of every class I could and perform in plays & musicals, while also hanging out with what some might call "bad apples" (guys who smoked and got into fights or joined gangs).

I entered California State University - Fullerton (CSUF) with the intention to do a double major in visual arts (concentration in illustration) and mathematics (concentration in pure math). I had been doing art for years at that point (here is some of my stuff), so I wanted to become a professional. I also wanted to study math because I really enjoyed AP Calculus BC, but I specifically wanted to study pure mathematics because I wanted to go into cryptanalysis after reading The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography.

Since I was putting quite a lot of hours into my job at a grocery store to pay for school, I realized after my first semester that I just didn't have time or energy to pursue both of my passions. So I had a decision to make, and I chose math over art. That was a major turning point in my life, and I still second-guess myself about it every now and then (despite a successful career in data science).

I had several amazing math teachers in high school; one of them introduced us to rings, groups, and fields from abstract algebra in a two-week detour during Honors Algebra 2, and I really liked that stuff! Even though I hated probability problems all throughout high school, a year into my undergraduate education something in my head flipped. I found a calling in another branch of mathematics – a more applied branch based on probability – and I switched my concentration from Pure Mathematics to Statistics.

Dr. Sam Behseta took me under his wing as my advisor and invited me – along with two other students – to get involved in an undergraduate research project applying statistics to neuroscience. We assessed several probability-distance measures, such as Kullback–Leibler divergence, for comparing neuron firing counts from peristimulus time histograms (PSTHs). I went on to do another undergraduate research project with him, this time using model-based clustering to find common patterns in PSTHs.

I wasn't sure about going on to do a PhD in Statistics, but I wanted to learn more. I applied for a Statistician position at Valve Software, but was rejected after the phone screen; I needed to learn more before I could get a job in the industry. I applied to Carnegie Mellon University's Master of Statistical Practice (MSP) program run by Drs. Joel Greenhouse and Howard Seltman and got extremely lucky because I was actually admitted. I packed up my things and moved across the country to Pittsburgh, PA, where I learned data mining, time series analysis, and survival analysis, and even got to contribute my skills to the Carnegie Mellon University Department of Statistics Census Research Node (one of the few NSF-Census Research Network nodes working with the U.S. Census Bureau in preparation for the 2020 census).

The focus of the MSP program was statistical consulting, and our final project involved actual consulting for actual clients; in my case, my partner and I performed statistical analysis of fMRI data for Dr. Mina Cikara. At the same time, Dr. James T. Becker at the University of Pittsburgh (in partnership with CMU via the CNBC) was looking for a jack-of-all-trades for his neuropsychology research program (NRP) at the University of Pittsburgh Medical Center (UPMC). I had been doing neuroscience-related work for two years at that point and I liked it, so the opportunity was a natural fit.

I worked at NRP/UPMC for the next two years, analyzing MRI scans for studies of Alzheimer's disease and HIV-associated neurocognitive disorders. I also organized the past twenty years of data, performed ad-hoc analyses, and developed a novel method of calculating associations between MRI data and Bayesian posterior parameter estimates. But this was a soft-money position (I was hired on a grant) and I couldn't stay with them longer than those two years, so I started looking for other opportunities.

In the months leading up to that, I had serendipitously become friends with Os Keyes – the sole data analyst in the Discovery department at WMF at the time – on Twitter. We connected through our mutual love for R and social justice, and when they found out that I was looking for a job, they suggested I apply for the opening for Discovery's second data analyst because my statistics-focused skillset complemented theirs. Of course I applied to work at Wikipedia! After going through the interview process (which was the basis for the one I would later write about), I was offered the job and Os and I became a team. This was two and a half years ago.

I don't know what's next in store for me. Two and a half years in, I'm still very happy at WMF and I get to perform lots of really cool analyses. Heck, I'm even supported in learning (and encouraged to learn) new things that aren't directly related to my job description.

Cheers~

Update (2018-08-17): The thing that was missing from the original version of this post is that, as my employment at NRP/UPMC was coming to an end, I applied to several PhD programs in Statistics and Statistical Computing. The schools I applied to included the University of Washington, UC Davis, and my alma mater CMU. I was rejected from all of them, which was absolutely crushing at the time and sent me to a dark mental and emotional place. With nothing lined up, I decided to stop pursuing further education and instead started looking for jobs in the industry.