Data Analyst vs Data Scientist: Industry Perspectives

Both "Data Analyst" (DA) and "Data Scientist" (DS) are titles that vary greatly between industries and even among individual organizations within an industry. As the roles behind the titles change over time, it is natural for teams to ask themselves the following questions: Should we have distinct roles or just stick to one? How would we differentiate the roles in a way that fulfills our organization's needs and is generally consistent with similar organizations? Do we want to consider a DS to be equivalent to a Sr. DA, the only difference being the title? Answering these questions not only establishes clear responsibilities and expectations, but also enables hiring managers and recruiters to communicate clearly with potential applicants in the future (in job postings, for example).

"data scientist vs data analyst" search results

Search the Internet for "data scientist vs data analyst" and you will find plenty of people who don't know what the difference is (or whether there even is one anymore), and plenty of people who think they know the definitions and differences. You will find an abundance of opinions but very little consistency! When I asked my followers on social media what they personally think the differences are, not everyone shared the same opinion, but some interesting camps of thought emerged.

This is my effort to summarize the many replies I received. Here are the important points, recurring themes, and somewhat overlapping camps of thought:

  • Single/primary distinction: DS is a DA who can code
    • In summary: the kinds of questions that a DA can answer and the kinds of tasks a DA can work on are a subset of DS’s because GUI tools limit what can be done, but a DS – by knowing programming – can answer way more kinds of questions and work on way more kinds of tasks.
    • Leads to reproducibility, scalability
    • See discussion with Hadley Wickham
  • Single/primary distinction: statistical and machine learning (ML) modeling
  • No DAs, just two types of DSs: “Type A” vs “Type B” (refer to Doing Data Science at Twitter) came up a few times
  • Emily Robinson brought up that “Data Scientist” is now also used as an umbrella term and specialties are specified in the title as needed
  • Some big tech companies like Facebook, Spotify, and some departments within Apple are moving away from having DAs to just having DSs
  • Practical considerations for NY/SF/Austin tech scene:
    • “DS title will need a higher salary.”
    • “You will lose talent because of the DA title. It is seen as less prestigious.”
    • “You may have to work harder for diverse pool of applicants w/ DS title.”
      • “That latter comes from one company I know who’s had a harder time getting female applicants for DS positions vs DA (when they’re fairly similar responsibilities)”
  • Lucas Meyer voiced support for a classic: Drew Conway’s infamous Venn diagram
  • A coworker of mine shared that at one of his previous employers, the organization identified three data scientist personas/profiles:
    • DS, Operations provides data & insights for resourcing decisions through ad-hoc analyses, dashboards, defining KPIs, and A/B testing.
      • This is the role of a Data Scientist in Product who creates reports and dashboards for management and executives. - MP
    • DS, Product delivers data science as product (and not to be confused with Data Scientists in Product). These folks build predictive models, AIs, matchmaking systems.
      • In some orgs this might be an ML Engineer or an AI Engineer or just a Data Scientist? - MP
    • DS, Research experiments and innovates. Not everything they work on ends up in production or utilized, but they are free to be creative and take chances.
      • In some orgs this might be the Research Scientist? - MP
    • Thinking of it this way, you might envision a scenario/pipeline wherein a Research DS prototypes a new recommender system (RS) algorithm, then an Operations DS helps determine (through A/B testing and qualitative user research together with a Design/UX Researcher) whether it's worth the costs to productionize (perhaps with the input of a Business/Financial Analyst), and then a Product DS scales the RS (possibly in collaboration with a Data Engineer) and deploys it to production. - MP

Closing thoughts

I hope that for some readers this is an eye-opening moment and that they now realize there's no single distinction everyone agrees on. Everyone is coming into it with their own backgrounds, experiences, thought processes, and ideas. None of these are wrong! So if you're in a hiring position, please remember to be specific when writing a job description. You can't just write "Data Analyst" or "Data Scientist" at the top and expect everyone else to share your assumptions. That's a recipe for misunderstanding and failure.

I would like to thank everyone who responded, and especially Emily Robinson and Renee M. P. Teate. Thank you everybody for taking the time to write and, in some cases, discuss nuances in spun-off threads! If you want to explore all the replies yourself, here's the root of the thread.

I would also like to point out that this is not even representative of how data professionals perceive these roles globally. All of the responses were from English-literate people, most (if not all) of the responses were from people living and working in the U.S., and many of them are specifically people who follow me on Twitter. I know for a fact that there are so many more data professionals (data engineers have opinions on this too!) who aren't in any of those groups. These are professionals who have their own perceptions, who operate in different cultures, under different expectations all across the world, and someone out there is probably writing a similar post within their own community.

Resources for learning to visualize data with R/ggplot2

"I'm currently learning visualisation with R/ggplot2 and was wondering whether you could share tips/links/videos/books/resources that helped you in your journey"

Raya via Twitter

Tips

The only tip I'll give is that you should strive to make every chart look exactly how you want it to look and say exactly what you want it to say. You will learn in the process of doing. When it's time to visualize the data and you have an idea for a very specific look and story, don't give it up or compromise on your vision just because you don't know how to do it. Trust me, there is so much documentation out there and so many posts on Stack Overflow that you will be able to figure it out. (But also it's totally fine to get 90-95% of the way there and call it done if that last 5-10% sprint is driving you bonkers.)

Here are some charts I made in the past two and a half years at The Wikimedia Foundation:

In each case I distinctly remember wanting to express an idea in a specific way and had to learn something new or learn how to think about a problem in a new way. If you click on one of those, you will see a link to its source code on GitHub that you are welcome to look at and learn from.


Resources

I would recommend checking out the following:

…and many more! There are a lot of folks sharing their knowledge and techniques out there, and I cannot stress enough how many of them you can find by going through Mara's exhaustive catalogue of cool things made by people in R.


Also, it just dawned on me that I can totally make this a series. I've previously written blog posts as responses to the following questions:

So I've gone back and added the new "Ask Popov" tag to those. Hopefully there will be more questions because these advice posts feel good to write!

Advice for graduates applying for data science jobs

Getting into a technical field like data science is really difficult when you're fresh out of school. On the off-chance that your potential employer actually gets the hiring process right, most organizations are still going to place a considerable amount of weight on experience over schooling. Like, yeah there are certain schools that make it a lot easier to go from academia to industry, but otherwise you're dealing with the classic catch-22 situation.

Something that can help you – and what I would notice when reviewing applications – is having something original and interesting (even if just to you) to show and talk about. It doesn't have to be published original research. It doesn't have to be a thesis. It just has to show that you can:

  • Work with real data: In most academic programs, methods are taught using clean, ready-to-use data. So it's important to show that you can take some data you found somewhere and process it into something that you can glean insights from. It also gives you a chance to work with data about a topic that you personally find interesting. Possible sources of data include:
  • Explore it: Once you have a dataset that actually excites you, you should perform some EDA. Produce at least one (thoroughly labeled) visualization that shows some interesting pattern or relationship. I want to see your curiosity. I want to see an understanding that you can't just jump into model-fitting without developing some familiarity with your data first.
  • Analyze it: You're going to lose a lot of interest if you just show and talk about how you followed the steps of some tutorial verbatim. If you learn from the tutorial and then apply that methodology to a different dataset, that's basically what "experience" means. And don't try to use an overly complicated algorithm/model if the goal doesn't require it. You might get incredible accuracy classifying with deep learning, but you'll probably have a more interesting story to tell from inference with a logistic regression. Heck, at Wikimedia we use that in our anti-harassment research.
  • Present your work: It can be a neat report with an executive summary (abstract) or it can be an interactive visualization or a slide deck. Just something better than a zip archive of scripts or Jupyter notebooks.
  • Explain your work (however complex) and results in a way that can be understood: This is where the first point is really important. If you're describing your analysis of data from a topic you're familiar with and are interested in, you're going to have a much easier time explaining it to a stranger. Be prepared to talk about it to a non-technical person. Be prepared to talk about it to a technical person who may not be familiar with your particular methodology. Your interviewer may have done a lot of computational linguistics & NLP but no survival analysis, so get ready to give a brief lesson on Kaplan-Meier (K-M) curves (and vice versa).
  • Perform an analysis from start to finish: Because that's what we look for when we assign a take-home task to our candidates.
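
To make that "brief lesson on K-M curves" a little more concrete: a Kaplan-Meier curve estimates the probability of "surviving" past each observed event time by multiplying together, at every event time, the fraction of still-at-risk subjects who made it through. Below is a minimal sketch in plain Python, chosen only so it runs without any packages. The function name `kaplan_meier` and the toy data are made up for illustration; this is not a method or dataset from any real analysis.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, estimated survival probability)] at each distinct event time.

    durations: time until the event or until censoring, one value per subject
    observed:  True if the event happened at that time, False if the subject
               was censored (we only know they lasted at least that long)
    """
    # Number of events at each time (censored subjects are not counted here).
    events = Counter(t for t, e in zip(durations, observed) if e)
    # Number of subjects leaving the risk set at each time, for any reason.
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Subjects censored at time t still count as at risk at time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data: e.g. weeks until a user stops using a feature.
durations = [2, 3, 3, 5, 8, 8, 9, 12]
observed = [True, True, False, True, True, True, False, True]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

The point worth explaining to an interviewer is the censoring: the subjects censored at times 3 and 9 never trigger a drop in the curve, but they do shrink the at-risk group, which makes later drops larger. In practice you would likely reach for an established survival analysis package rather than hand-rolling this.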

A lot of times the job postings will include a number of years as a requirement, but that's not as need-to-have as you or they might think. Secretly, it's actually a nice-to-have because "experience" is mostly a proxy for "candidate has previously used real data to solve a problem in a way that can be understood and used to inform a decision-making process." If you don't have experience, you can still demonstrate that you've done what a data scientist does.

Good luck~

Acknowledgement: I would like to thank Angela Bassa (Director of Data Science at iRobot) for her input on this post. In particular, the last paragraph is based entirely on her suggestions. She also created the Data Helpers website that lists data professionals who are able to answer questions, promote, or mentor newcomers into the field.

Mostly-free resources for learning data science

In the past year or two I've had several friends approach me about learning statistics because their employer/organization was moving toward a more data-driven approach to decision making. (This brought me a lot of joy.) I firmly believe you don't actually need a fancy degree and tens of thousands of dollars in tuition debt to be able to engage with data, glean insights, and make inferences from it. And thanks to many wonderful statisticians on the Internet, there is now a plethora of freely accessible resources that enable curious minds to learn the art and science of statistics.

First, I recommend installing R and RStudio for actually using it. They're free and what I use for almost all of my statistical analyses. Most of the links in this post involve learning by doing statistics in R.

Okay, now on to learning stats…

There's Data Analysis and Statistical Inference + interactive companion course by Mine Çetinkaya-Rundel (Duke University). She has also co-written the OpenIntro Statistics book (available for free as a PDF).

Free, self-paced online courses from trustworthy institutions:

Not free online courses from trustworthy institutions:

Free books and other resources:

Book recommendations:

  • Introductory Statistics with R by Peter Dalgaard
  • Doing Data Science: Straight Talk from the Frontline by Cathy O'Neil and Rachel Schutt
  • Statistics in a Nutshell by Sarah Boslaugh
  • Principles of Uncertainty by Jay Kadane (free PDF at http://uncertainty.stat.cmu.edu/)
  • Statistical Rethinking: A Bayesian Course with Examples in R and Stan by Richard McElreath

Phew! Okay, that should be enough. Feel free to suggest more in the comments below.