What’s The Difference Between Data Science and Statistics?
If you’ve considered data science as a career, you’ve probably already heard these jokes.
— Jeremy Jarvis (@jeremyjarvis) January 30, 2014
Data Science is statistics on a Mac.
— Big Data Borat (@BigDataBorat) August 27, 2013
There is some truth to these jokes, since data science relies heavily on statistical knowledge. As the new career path on the block, data science naturally receives flak compared with established career paths. There is so much overlap between data science and statistics that even Wikipedia’s data science page has a section addressing whether data science is just rebranded statistics. Even famed data journalist Nate Silver, founder of the website FiveThirtyEight and best known for his accurate forecasts of the 2008 and 2012 US presidential elections, argued that “I think data-scientist is a sexed-up term for a statistician.”
But data science is more than just statistics. Here are 3 key differences between the two: (See also our article on skills needed by data scientists.)
Before we get into the technical inner workings of both fields, let’s look into their definitions.
Paraphrasing from Techopedia, data science is the collective process and theoretical knowledge of managing raw data and deriving meaningful information from it. The goal is to help organisations, most often businesses, make decisions based on the data and on what can be predicted or inferred from it.
Merriam-Webster defines statistics as a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.
While both deal with data, the nuance is that data science is more concerned with using techniques from multiple disciplines to extract meaningful information from data and act on it, whereas statistics is a branch of mathematics that is more concerned with collecting data and explaining its significance.
Approach to Problems
Both data science and statistics use models to solve problems. For the uninitiated, models are assumed relationships between two or more variables.
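As a rough illustration (the symbols and the fertiliser example are assumptions made for this sketch, not taken from the sources quoted in this article), one of the simplest models assumes a straight-line relationship between an input x, such as the amount of fertiliser used, and an outcome y, such as crop yield:

```latex
y = \beta_0 + \beta_1 x + \varepsilon
```

Here \beta_0 is the expected outcome when x is zero, \beta_1 is how much y changes for each extra unit of x, and \varepsilon is whatever the model leaves unexplained.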
Because data science is used to predict actions based on informed insights, it tries multiple models and looks for the mathematical model that best captures a range of data. The model with the highest prediction accuracy becomes the model used to explain the set of data. Statistics, however, does the opposite. It pre-selects a model and fits it to a data set. The model then gets modified according to the data. To quote Display: “while data science focuses on comparing many methods to create the best machine learning model, statistics instead improves a single, simple model to best suit the data.”
In simple terms, data science approaches problems by selecting models which can best be used to predict outcomes based on pre-existing data. This is the basis of artificial intelligence or AI. (See also our AI for Business Executives course.) In contrast, statistics approaches problems by selecting models which can best be used to explain pre-existing data.
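The contrast above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data is made up (a straight line plus random noise), the three candidate models are arbitrary picks for the example, and names such as `fit_line` and `nearest_neighbour` are hypothetical helpers rather than anything from a real library.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(predict, xs, ys):
    """Average squared gap between predictions and actual values."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data (an assumption for this sketch): y = 2 + 3x plus noise.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Hold out every fourth point so prediction can be checked on unseen data.
train_x = [x for i, x in enumerate(xs) if i % 4 != 0]
train_y = [y for i, y in enumerate(ys) if i % 4 != 0]
test_x = [x for i, x in enumerate(xs) if i % 4 == 0]
test_y = [y for i, y in enumerate(ys) if i % 4 == 0]

# Statistics-style: pre-select one simple model, fit it, and read off
# what its parameters say about the data.
intercept, slope = fit_line(train_x, train_y)
print(f"Each extra unit of x goes with about {slope:.2f} more units of y")

# Data-science-style: try several candidate models and keep whichever
# one predicts the held-out points with the lowest error.
train_mean = sum(train_y) / len(train_y)


def nearest_neighbour(x):
    """Predict the y value of the closest training point."""
    closest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[closest]


candidates = {
    "constant": lambda x: train_mean,
    "straight line": lambda x: intercept + slope * x,
    "nearest neighbour": nearest_neighbour,
}
scores = {
    name: mean_squared_error(model, test_x, test_y)
    for name, model in candidates.items()
}
best = min(scores, key=scores.get)
print(f"Lowest held-out error: {best}")
```

The first half cares about what the slope means, while the second half only cares about which candidate predicts unseen points best. In practice both camps borrow each other's tools, so treat this as a caricature of the two mindsets rather than a rule.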
Origins, Discipline, and Purpose
Most data science practitioners come from an engineering background, while statisticians mainly come from a mathematics background.
While the phrase ‘data science’ was coined in the 1960s as a substitute for computer science by computer scientist Peter Naur, it was only in the 1990s that it was acknowledged as a separate discipline. Practitioners only began seriously describing themselves as data scientists in the late 2000s, and it was only in 2015 that the United States appointed its first Chief Data Scientist.
Statistics, by contrast, started as a way to explain relationships before the advent of computers. To quote Atul Bhardwarj on Priceonomics:
“Statistics was primarily developed to help people deal with pre-computer data problems like testing the impact of fertiliser in agriculture, or figuring out the accuracy of an estimate from a small sample.”
From the brief history above, we know that statistics and data science came from different backgrounds. Statistics stemmed from explanatory roots, where it is more concerned with the theoretical aspects of data and testing of scientific hypotheses. Statistics as a discipline is meant to explain the world. Data science, on the other hand, came from a computer science background, making it more task-oriented. Data science as a discipline is meant to predict outcomes and use them to accomplish tasks.
That’s why AI is heavily informed by data science: it uses predictive models from data science. Earlier forms of AI made decisions based on hand-written rules rather than data. Modern AI makes decisions according to prediction models that are built from data. This also explains why data science relies on programming languages like R or Python to clean large data sets. (See also our Python vs R comparison article.)
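To make the difference between rules and data concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the purchase amounts, the labels, the hand-picked cut-off of 1000, and the helper names `rule_based`, `learn_threshold` and `data_driven` are all made up.

```python
def rule_based(amount):
    """Older style: a person hard-codes the cut-off."""
    return amount > 1000


def learn_threshold(examples):
    """Pick the cut-off that misclassifies the fewest labelled examples."""
    def errors(threshold):
        return sum((amount > threshold) != label for amount, label in examples)

    # Try each observed amount as a possible cut-off.
    return min(sorted(amount for amount, _ in examples), key=errors)


# Hypothetical past purchases: (amount, needed a manual review?)
examples = [
    (120, False), (300, False), (450, False),
    (520, True), (700, True), (1500, True),
]

threshold = learn_threshold(examples)


def data_driven(amount):
    """Newer style: the cut-off comes from the data, not from a person."""
    return amount > threshold


print(rule_based(700), data_driven(700))
```

The hand-written rule never changes unless someone edits it, while the learned cut-off moves whenever the examples change. Real prediction models are far richer than a single threshold, but the principle is the same.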
The next time someone says data science is just rebranded statistics, you know how to give the right response. To keep it short, data science is used to predict while statistics is used to infer.
Seeking a career in data science? Or are you planning to prep your data team to become data scientists? Our CDSS certification might be the answer you need. iTrain offers in-demand digital technology certifications and has trained thousands of IT teams. Contact our iTrain course consultants at +603-2733 0337 or email firstname.lastname@example.org to find out more about training courses that will help you or your organisation stay ahead of the curve.