The enterprise applications of machine learning are weaving themselves into the fabric of everyday business.
Still, the concept itself is hazily understood.
This article tackles one of the least helpful points of confusion: machine learning and data science being mistaken for each other.
Laying the Groundwork
Machine learning is a branch of artificial intelligence where, instead of writing a specific formula to produce a desired outcome, an algorithm “learns” the model through trial and error.
It uses what it learns to refine itself as new data becomes available.
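To make the contrast concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production approach: the hidden rule (y = 2x + 1), the function names, and the learning rate are all made up for the example. The first function is a formula a programmer wrote by hand; the second starts with a blank guess and adjusts it repeatedly until its predictions match the examples it was given.

```python
def predict_by_formula(x: float) -> float:
    """Hand-written rule: the programmer already knows the formula."""
    return 2.0 * x + 1.0


def learn_linear_model(xs, ys, learning_rate=0.02, epochs=5000):
    """Fit y = w * x + b by gradient descent on mean squared error.

    Nothing here knows the true rule; w and b start at zero and are
    nudged a little on every pass to shrink the prediction error.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Error of the current guess on every training example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training examples generated by the hidden rule y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [predict_by_formula(x) for x in xs]

w, b = learn_linear_model(xs, ys)  # w ends up close to 2, b close to 1
```

Real machine learning libraries use far more sophisticated models and optimizers, but the loop of guessing, measuring error, and adjusting is the same basic idea.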
Data science is an umbrella term for everything needed to extract meaningful insights from data (gathering, scrubbing and preparing, analyzing) in order to answer questions or make predictions.
It includes areas like:
- Data mining: The process of examining large amounts of data to find meaningful patterns
- Data scrubbing: Finding and correcting incomplete, unformatted, or otherwise flawed data within a database
- ETL (Extract, Transform, Load): A collective term for the process of pulling data from one database, reshaping it, and loading it into another
- Statistics: Collecting and analyzing large amounts of numerical data, particularly to establish the quantifiable likelihood of a given occurrence
- Data visualization: Presenting data in a visual format (charts, graphs, etc.) to make it easier to understand and spot patterns
- Analytics: A multidisciplinary field that revolves around the systematic analysis of data
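As a rough illustration of two items from this list, the sketch below scrubs a handful of records and then computes a simple statistic from what survives. The records, field names, and the `scrub` helper are hypothetical, and real projects would typically lean on a dedicated data library rather than hand-rolled loops.

```python
from statistics import mean

# Hypothetical raw records, as they might arrive from a source system.
raw_rows = [
    {"customer": " Alice ", "spend": "120.50"},
    {"customer": "bob", "spend": ""},        # missing value
    {"customer": "Carol", "spend": "n/a"},   # unparseable value
    {"customer": "alice", "spend": "80"},
]


def scrub(rows):
    """Return cleaned rows, dropping records whose spend cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            spend = float(row["spend"])
        except ValueError:
            continue  # skip incomplete or flawed records
        # Normalize whitespace and capitalization in the customer name.
        name = row["customer"].strip().title()
        cleaned.append({"customer": name, "spend": spend})
    return cleaned


clean_rows = scrub(raw_rows)
average_spend = mean(row["spend"] for row in clean_rows)
```

Note that no machine learning is involved: the scrubbing is a fixed rule and the analysis is a plain average.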
What Falls Under the “Data Science Umbrella”?
“Data scientists are kind of like the new Renaissance folks, because data science is inherently multidisciplinary.”
Those words from John Foreman, MailChimp’s VP of Product Management, sum up the problem with trying to draw the boundaries of data science.
It’s a vast concept, describing intent more than a specific discipline.
There are, however, four fields whose intersection is generally agreed to cover most of data science: mathematics, computer science, domain expertise, and communications.
- Mathematics: Mathematics forms the core of data science. Data scientists need to know enough math to choose and refine the models they use in analysis, especially if they plan to work in machine learning. Understanding the math behind their formulas gives them the ability to spot errors and weigh the significance of results. Also, while some data points can be read easily without a heavy math background (conversions, website views, engagement rates, etc.), others require specialized knowledge to understand. For example, time series data is very common in business intelligence but hard for casual users to interpret. Mathematical subdisciplines often studied by data scientists include:
  - Statistics (including multivariate testing, cross-validation, probability)
  - Linear algebra
- Computer science: Data science may be older than computers, but the powerful effect of the digital revolution can’t be denied. Computers let data scientists process vast amounts of data and perform incredibly complex calculations at a speed that allows data to be used within a reasonable timeframe. Some of the areas where computer science intersects with data science:
  - System design optimization
  - Cleaning/scrubbing data
  - Graph theory and distributed architectures
  - Programming databases
  - Artificial intelligence and machine learning
- Domain knowledge: Data science is a targeted practice. It’s used to generate insights about some specific topic. The data has to be contextualized before it can be put to use, and doing so effectively requires an in-depth knowledge of that topic. Today data science is being applied in nearly every domain. Perhaps some of the most interesting uses can be found in fields like business and health care.
  - Health care
    - Data-driven preventative health care
    - Disease modeling and predicting outbreaks
    - Improving diagnostic techniques
    - DNA sequencing and genomic technologies
  - Business intelligence
    - Identifying and quantifying business problems to make data-driven decisions
    - Sorting and ranking customers
    - Predictive inventory systems
    - Monitoring and refining marketing strategies
- Communications: Communications is often forgotten when discussing data science, but communication is relevant at nearly every stage of the data science process. It’s a critical link between theory and practice. Data has little value unless it can be applied to solve problems or answer questions, and it can’t be applied until someone other than the data scientist understands it. On the flip side, data scientists need to know what questions they’re trying to answer in order to choose the best analytical strategies. Though communications is often grouped with domain knowledge, it’s helpful to separate the two to emphasize its importance. Here are a few data science-oriented applications of communications:
  - Data science evangelism (spreading awareness about the uses of data science)
  - Clarifying what is needed/desired from data
  - Presenting results in a useful way
  - Data visualization (graphs, charts, models)
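Cross-validation, mentioned under Mathematics above, is a good example of the kind of math a data scientist uses to weigh the significance of results. The sketch below is a simplified illustration, not a reference implementation: the function names are hypothetical, the "model" is deliberately trivial (it predicts the average of the training data), and the sample values are made up. The data is split into k folds; each fold takes a turn as unseen test data while the model is built from the rest.

```python
from statistics import mean


def k_fold_indices(n_items, k):
    """Split the indices 0..n_items-1 into k contiguous folds."""
    fold_size, remainder = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        # The first `remainder` folds get one extra item each.
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds


def cross_validate_mean_model(values, k):
    """Average held-out error of a model that predicts the training mean.

    Assumes 2 <= k <= len(values) so every training set is non-empty.
    """
    scores = []
    for held_out in k_fold_indices(len(values), k):
        held = set(held_out)
        train = [v for i, v in enumerate(values) if i not in held]
        test = [values[i] for i in held_out]
        prediction = mean(train)  # "train" the trivial model
        # Mean absolute error on data the model has not seen.
        scores.append(mean(abs(v - prediction) for v in test))
    return mean(scores)


# Hypothetical measurements.
values = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0]
cv_error = cross_validate_mean_model(values, k=3)
```

The resulting number estimates how far off the model is likely to be on new data, which is more honest than scoring it on the same data it was built from.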
The Data Science Process
If separating data science into the above disciplines were easy, though, it wouldn’t be its own field.
In reality, each discipline is woven throughout the process, with a great deal of flexibility in how techniques are combined.
Here’s a general, very broad-scope view of the data science process and the disciplines that affect each stage.
- Data is collected and stored. Disciplines: Computer science
- Questions are asked. (What is needed from the data? What problems does the user hope to solve?) Disciplines: Communications, Domain knowledge
- Data is cleaned and prepared for analysis. Disciplines: Math, Computer science
- Data enrichment takes place. (Do you have enough data? How can it be improved?) Disciplines: Computer science, Math, Communications, Domain knowledge
- A data scientist decides which algorithms and methods of analysis will best answer the question or solve the problem. Disciplines: Math, Computer science
- Data is analyzed via artificial intelligence/machine learning, statistical modeling, or another method. Disciplines: Math, Computer science
- The results are measured and evaluated for value/merit. Disciplines: Math
- The validated results are brought to the end user. Disciplines: Communications, possibly Computer science
- The end user applies the results of data science to real-world business problems. Disciplines: Business, Communications
This list is mainly intended to demonstrate how inextricably combined the component disciplines of data science are in practice.
The data science process is never as straightforward as this; rather, it’s highly iterative. Some of these steps may be repeated many times.
Depending on the results, the scientist might even return to an earlier step and start over.
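The iterative nature of the process can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the sales figures are invented, each stage is reduced to a one-line function, and the "analysis" is just an average. The point is the loop, which goes back to the cleaning step with a stricter rule whenever the evaluation step rejects the result.

```python
from statistics import mean


def collect():
    """Collection: hard-coded, hypothetical daily sales figures."""
    return [102.0, 98.0, None, 105.0, 5000.0, 101.0]


def clean(rows, max_plausible):
    """Cleaning: drop missing values and implausibly large readings."""
    return [r for r in rows if r is not None and r <= max_plausible]


def analyze(rows):
    """Analysis: the 'model' in this sketch is simply the average."""
    return mean(rows)


def evaluate(estimate, rows, tolerance):
    """Evaluation: accept only if every point is within tolerance."""
    return all(abs(r - estimate) <= tolerance for r in rows)


def run_process():
    raw = collect()
    # Iterate: tighten the cleaning rule until the result passes evaluation.
    for max_plausible in (10_000.0, 1_000.0, 200.0):
        prepared = clean(raw, max_plausible)
        estimate = analyze(prepared)
        if evaluate(estimate, prepared, tolerance=10.0):
            return estimate, max_plausible
    raise ValueError("no cleaning rule produced a trustworthy result")


estimate, rule_used = run_process()
```

On the first pass the 5000.0 outlier survives cleaning and the evaluation fails, so the process returns to the cleaning step; the second, stricter pass succeeds.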
Where the Confusion Lies
By this point, the reasons for the confusion between data science and machine learning have likely become clear.
Machine learning is a method for doing data science more efficiently, so it’s often mistaken for a direct subdiscipline of data science.
In fact, a list of things data science can accomplish reads like a pitch for adopting machine learning.
Here are a few common data science applications to illustrate the point:
- Forecasting/predicting future values
- Classification and segmentation
- Scoring and ranking
- Making recommendations
- Pattern detection and grouping
- Detecting anomalies
- Recognition (image, text, audio, video, facial, …)
- Generating actionable insights
The reason for this overlap is that machine learning algorithms are very effective tools for sorting and classifying data.
That makes machine learning popular among data scientists, but it doesn’t have the inherent direction and sense of purpose of data science as a whole.
In simple terms: machine learning is a tool; data science is a field of practice.
Machine Learning Isn’t Necessary for Data Science…
While ML is an efficient way of performing data science, it’s not always the best solution. Sometimes it isn’t needed at all. Two notable cases when machine learning is the wrong tool for a job:
- The problem can be solved using set formulas or rules. If there’s no interpretation needed and context doesn’t change the data, a mathematical model alone can handle the matter. There’s no point in spending resources on machine learning. It might lead to faster results if there’s a large amount of data, but it won’t produce “better” results.
- There isn’t a massive amount of data involved. This is a case where machine learning does more harm than good. Machine learning requires data, and the more the better. Without a store of prepared data to train the algorithm, it can produce unreliable results. Worse, training on a small or unrepresentative sample yields biased results. When there isn’t enough relevant data on a subject to fuel machine learning, other methods of data science are better options for finding answers.
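The first case is easy to picture with an inventory example. The sketch below uses the standard textbook reorder-point formula (expected demand during the supplier lead time, plus safety stock); the function names and the numbers are hypothetical. Because the rule is fixed and fully known, there is nothing for an algorithm to learn.

```python
def reorder_point(daily_demand, lead_time_days, safety_stock):
    """Stock level at which a new order should be placed.

    Textbook formula: demand expected during the lead time, plus a buffer.
    """
    return daily_demand * lead_time_days + safety_stock


def needs_reorder(stock_on_hand, daily_demand, lead_time_days, safety_stock):
    """Apply the fixed rule: reorder once stock falls to the reorder point."""
    threshold = reorder_point(daily_demand, lead_time_days, safety_stock)
    return stock_on_hand <= threshold


# Hypothetical product: 20 units sold per day, 5-day lead time, 30-unit buffer.
threshold = reorder_point(daily_demand=20, lead_time_days=5, safety_stock=30)
low_stock = needs_reorder(120, daily_demand=20, lead_time_days=5, safety_stock=30)
```

Machine learning might still help estimate inputs such as daily demand when they shift unpredictably, but the decision rule itself is plain arithmetic.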
But It Is a Game-changing Advantage.
Despite these limitations, machine learning offers such a distinct advantage that it’s easy to see why data scientists are adopting it in such large numbers.
There are three main situations where it’s generally the best data science method:
- There’s too much data for a human expert to process. Some data is perishable. By the time a team of human analysts works through it (even using standard computing methods) it’s aged out of usefulness. Other times data is flowing into a system faster than it can be processed. Machine learning algorithms thrive on massive amounts of data. They improve by processing data, so results actually become more accurate over time.
- There is ambiguity in the ruleset. Machine learning has a long way to go before it can match the human potential for coping with uncertainty and inconsistency, but it’s made huge strides in drawing meaningful results from ambiguous data.
- Programming a specific solution isn’t practical. Sometimes the code needed to program a solution would be so large that writing it by hand would be inefficient. In these cases, machine learning can be used to streamline the analysis process.
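The claim that results become more accurate as more data is processed can be shown with a small sketch of online learning. This is an illustrative toy, assuming a made-up data stream in which the true relationship is y = 3x; the class name and learning rate are hypothetical. The model sees one observation at a time, records how wrong it was, and then adjusts itself.

```python
class OnlineLinearModel:
    """Learns y = w * x one observation at a time (stochastic gradient descent)."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x

    def update(self, x, y):
        """Measure the error on a new observation, then nudge w to reduce it."""
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        return abs(error)


# Hypothetical stream of 60 observations where the true rule is y = 3 * x.
stream = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0) * 20]

model = OnlineLinearModel()
errors = [model.update(x, y) for x, y in stream]

early_error = sum(errors[:6]) / 6    # average error on the first observations
late_error = sum(errors[-6:]) / 6    # average error on the last observations
```

Comparing `early_error` with `late_error` shows the prediction error shrinking as the stream is consumed, with no one ever writing the rule into the code.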
The Bottom Line
It’s definitely possible to do data science without incorporating machine learning.
However, the pace of data production is growing every day.
By 2020, an estimated 1.7 megabytes of data will be created every second for every living person.
Most of that will be unstructured data.
Machine learning is the best tool for dealing with that volume and quality of data, so it’s likely to be used in data science for the foreseeable future.
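To put the 1.7 megabytes per second figure in perspective, a quick back-of-the-envelope calculation converts it to a daily total per person. The only input is the figure quoted above; the variable names are illustrative, and gigabytes are taken as decimal (1 GB = 1,000 MB).

```python
MEGABYTES_PER_SECOND_PER_PERSON = 1.7  # figure quoted in this article
SECONDS_PER_DAY = 24 * 60 * 60         # 86,400 seconds

mb_per_person_per_day = MEGABYTES_PER_SECOND_PER_PERSON * SECONDS_PER_DAY
gb_per_person_per_day = mb_per_person_per_day / 1000  # decimal gigabytes

print(f"{gb_per_person_per_day:.2f} GB per person per day")
```

That works out to roughly 147 GB per person every day, which is far more than any team of analysts could review by hand.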
How well is your company taking advantage of its data? Contact Concepta to learn how we can turn your data into actionable insights!