In 2006, UK mathematician and Tesco Clubcard architect Clive Humby coined the phrase “Data is the new oil.” He said the following:
“Data is the new oil. It’s valuable, but if unrefined, it cannot be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity. So, data must be broken down, analyzed for it to have value.”
The iPhone revolution, growth of the mobile economy, and advancements in big data technology have created a perfect storm. In 2012, Harvard Business Review published an article that put data scientists on the radar. The article, “Data Scientist: The Sexiest Job of the 21st Century,” labeled this new breed of people as a hybrid of data hacker, analyst, communicator, and trusted advisor.
Organizations everywhere are now trying to become more data-driven, and machine learning techniques have helped them in this endeavor. I realize that a lot of the material out there is too technical and difficult to understand. In this series of articles, my aim is to simplify data science. I will take my cue from the Stanford book “An Introduction to Statistical Learning.”
Data science is a multi-disciplinary field. It is the intersection between the following domains:
- Business knowledge
- Statistical learning (aka machine learning)
- Computer programming
In this article, I will begin by covering principles, general processes, and types of problems in the field.
1. Data is a strategic asset
This concept is an organizational mindset. The questions to ask are “Are we using all the data assets that we are collecting and storing?” and “Are we able to extract meaningful insights from them?” For many organizations, I’m sure that the answer to both questions is no. Cloud-born companies are intrinsically data-driven, as it is in their psyche to treat data as a strategic asset. This mindset has yet to take hold in most other organizations.
2. A systematic process for knowledge extraction
A methodical process needs to be in place for extracting insights from data. This process should have clear and distinct stages with clear deliverables. The Cross Industry Standard Process for Data Mining (CRISP-DM) is one such process.
3. Sleeping with the data
Organizations need to invest in people who are passionate about data. Transforming data into insight is not alchemy, so organizations don’t need alchemists. They need evangelists who understand the value of data and who are data literate and creative. They need folks who can connect data, technology, and business.
4. Embracing uncertainty
Data science is neither a silver bullet nor a crystal ball. Like reports and KPIs, it is a decision enabler. Data science is a means to an end, not an end in itself. It does not operate in the realm of the absolute but in the realm of probabilities. Managers and decision makers need to embrace this fact. They need to embrace quantified uncertainty in their decision-making process. Such acceptance of uncertainty can only take root if the organizational culture adopts a “fail fast, learn fast” approach. It will only thrive if organizations choose a culture of experimentation.
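To make “quantified uncertainty” a little more concrete, here is a minimal Python sketch. The numbers (120 churned customers in a sample of 1,000) are made up for illustration, the function name `proportion_interval` is my own, and the interval uses the simple normal approximation, which is only one of several possible methods.

```python
import math


def proportion_interval(successes, trials, z=1.96):
    """Return (estimate, low, high) for a proportion.

    Uses the normal approximation; z=1.96 corresponds to roughly
    95 percent confidence. This is an illustrative assumption, not
    the only way to build such an interval.
    """
    estimate = successes / trials
    margin = z * math.sqrt(estimate * (1 - estimate) / trials)
    return estimate, max(0.0, estimate - margin), min(1.0, estimate + margin)


# Hypothetical sample: 120 of 1,000 customers churned.
rate, low, high = proportion_interval(successes=120, trials=1000)
print(f"Estimated churn rate: {rate:.1%} (roughly {low:.1%} to {high:.1%})")
```

The point for a decision maker is the range, not just the single number: the estimate comes with an explicit statement of how far off it could plausibly be.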
5. The BAB principle
I perceive this as the most important principle. A lot of data science literature focuses on models and algorithms, leaving the equation devoid of business context. Business-Analytics-Business (BAB) is the principle that emphasizes the business part of the equation. Putting the equation in a business context is pivotal. Define the business problem, use analytics to solve it, and then integrate the output into the business process. BAB.
Taking a cue from principle two, let me now emphasize the process part of data science. The following are the stages of a typical data science project:
1. Define business problem
Albert Einstein is often quoted as saying, “Everything should be made as simple as possible, but not simpler.” This quote is the crux of defining the business problem. Problem statements need to be developed and framed. Clear success criteria need to be established.
In my experience, business teams are too busy with their operational tasks at hand. But this doesn’t mean they don’t have challenges that need to be addressed. Brainstorming sessions, workshops, and interviews can help to uncover these challenges and develop hypotheses.
Let me illustrate this with an example. Let us assume that a telco has seen a decline in its year-on-year revenue due to a reduction in its customer base. In this scenario, the business problem may be defined as “The company needs to grow the customer base by targeting new segments and reducing customer churn.”
2. Decompose to machine learning tasks
The business problem, once defined, needs to be decomposed into machine learning tasks. Let’s elaborate on the example that we have set above. If the organization needs to grow the customer base by targeting new segments and reducing customer churn, how can we decompose it into machine learning problems? The following is an example of decomposition:
- Reduce the customer churn by x percent.
- Identify new customer segments for targeted marketing.
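The decomposition above can be written down as a small data structure, which helps keep each machine learning task tied to its business goal. This is only a sketch: the `MLTask` class and the `churned` target name are hypothetical, and the task types follow the problem types described later in this article (churn prediction as classification, customer segmentation as clustering).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MLTask:
    business_goal: str
    task_type: str         # e.g. "classification" or "clustering"
    target: Optional[str]  # None when the task has no defined target


tasks = [
    MLTask("Reduce the customer churn by x percent", "classification", "churned"),
    MLTask("Identify new customer segments for targeted marketing", "clustering", None),
]

for task in tasks:
    # A task with a defined target is supervised; one without is unsupervised.
    kind = "supervised" if task.target is not None else "unsupervised"
    print(f"{task.business_goal} -> {task.task_type} ({kind})")
```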
3. Data preparation
Once we have defined the business problem and decomposed it into machine learning problems, we need to dive deeper into the data. Data understanding should be specific to the problem at hand. It should help us develop the right strategies for analysis. Some key things to note are the source of the data, its quality, and any bias it may contain.
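As a rough illustration of what checking data quality can look like, the sketch below counts missing values per column and finds duplicated customer IDs. The records, column names, and helper functions are all invented for this example. In practice you would likely use a data frame library, but plain Python is enough to show the idea.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"customer_id": 1, "tenure_months": 12, "monthly_spend": 30.0},
    {"customer_id": 2, "tenure_months": None, "monthly_spend": 45.5},
    {"customer_id": 3, "tenure_months": 7, "monthly_spend": None},
    {"customer_id": 3, "tenure_months": 7, "monthly_spend": None},
]


def missing_counts(rows):
    """Count how many rows have a missing value in each column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)


def duplicate_keys(rows, key):
    """Return the key values that appear in more than one row."""
    seen = Counter(row[key] for row in rows)
    return sorted(value for value, n in seen.items() if n > 1)


print("Missing values per column:", missing_counts(records))
print("Duplicated customer IDs:", duplicate_keys(records, "customer_id"))
```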
4. Exploratory data analysis
A cosmonaut traverses through the unknowns of the cosmos. Similarly, a data scientist traverses through the unknowns of the patterns in the data, peeks into the intrigues of its characteristics, and formulates the unexplored. Exploratory data analysis (EDA) is an exciting task. It’s how we get to understand the data better, investigate the nuances, discover hidden patterns, develop new features, and formulate modeling strategies.
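A first EDA step is often just comparing simple summaries across groups. The sketch below, using made-up telco data and a hypothetical `tenure_summary` helper, compares customer tenure for churned and retained customers. A gap like the one it shows would be a hint that tenure is worth exploring as a feature, not proof of anything.

```python
from statistics import mean, median

# Hypothetical customers: tenure in months and whether they churned.
customers = [
    {"tenure_months": 2, "churned": True},
    {"tenure_months": 4, "churned": True},
    {"tenure_months": 30, "churned": False},
    {"tenure_months": 24, "churned": False},
    {"tenure_months": 6, "churned": True},
    {"tenure_months": 36, "churned": False},
]


def tenure_summary(rows):
    """Group tenure by churn status and summarize each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["churned"], []).append(row["tenure_months"])
    return {
        churned: {"n": len(values), "mean": mean(values), "median": median(values)}
        for churned, values in groups.items()
    }


for churned, stats in tenure_summary(customers).items():
    label = "churned" if churned else "retained"
    print(label, stats)
```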
5. Modeling
After EDA, we move on to the modeling phase. Here, based on our specific machine learning problems, we apply suitable algorithms such as regression, decision trees, and random forests.
6. Deployment and evaluation
Finally, the developed models are deployed. They are continuously monitored to observe how they behave in the real world and are recalibrated accordingly.
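One simple way to picture monitoring is a rolling accuracy check that raises a flag when recent predictions get worse. The sketch below is an assumption about how this might be done, not a description of any particular platform. The `AccuracyMonitor` class, the window size, and the threshold are all hypothetical.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions."""

    def __init__(self, window_size, threshold):
        self.outcomes = deque(maxlen=window_size)  # old outcomes drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_recalibration(self):
        # Only judge the model once the window is full.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window_size=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (1, 1), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)

print("Recent accuracy:", monitor.accuracy())
print("Recalibrate?", monitor.needs_recalibration())
```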
Typically, the modeling and deployment part is only 20 percent of the work. The remaining 80 percent is getting your hands dirty with data and exploring and understanding it.
Machine learning problem types
In general, machine learning has two kinds of tasks: supervised learning and unsupervised learning.
Supervised learning is a type of machine learning task where there is a defined target. Conceptually, a modeler will supervise the machine learning model to achieve a particular goal. Supervised learning can be further classified into two types: regression and classification.
Regression models are the workhorse of machine learning tasks. They are used to estimate or predict a numerical variable. A few questions that regression models can answer are:
- What is the estimate of the potential revenue next quarter?
- How many deals can I close next year?
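To show what a regression model does in the simplest case, here is a minimal sketch that fits a straight line by ordinary least squares and uses it to estimate next quarter’s revenue. The revenue figures are invented, and `fit_line` is a hypothetical helper written for this example. Real projects would normally rely on a statistics or machine learning library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical revenue (in millions) for the last four quarters.
quarters = [1, 2, 3, 4]
revenue = [10.0, 12.0, 15.0, 16.0]

slope, intercept = fit_line(quarters, revenue)
forecast = slope * 5 + intercept  # estimate for quarter 5
print(f"Estimated revenue next quarter: {forecast:.1f}")
```

The target here is a number (revenue), which is what makes this a regression task.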
As the name suggests, classification models classify something. They estimate which bucket something is best suited to be placed in. Classification models are frequently used in all types of applications. Here are a few examples of how classification models are used:
- Spam filtering: This is a popular implementation of a classification model. Here, every incoming email is classified as spam or not spam based on certain characteristics.
- Churn prediction: This is another important application of classification models. Churn models are used widely in telcos to classify whether a given customer will churn (i.e. cease to use the service) or not.
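The spam filtering example can be sketched with a tiny naive Bayes classifier, one of several algorithms that could be used for this task. The training messages below are invented, the `train` and `classify` functions are my own illustrative helpers, and a real filter would need far more data and better text handling.

```python
import math
from collections import Counter, defaultdict


def train(messages):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest naive Bayes log score."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labeled emails.
training_data = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday with the team", "not spam"),
]

model = train(training_data)
print(classify("claim your free prize", model))
print(classify("team meeting on monday", model))
```

Unlike regression, the output here is a bucket (spam or not spam) and not a number.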
Unsupervised learning is a class of machine learning tasks where there are no targets. Since unsupervised learning doesn’t have any specified target, the results it produces may sometimes be difficult to interpret. There are many types of unsupervised learning tasks. The key ones are:
- Clustering: Clustering is a process of grouping similar things together. Customer segmentation uses clustering methods.
- Association: Association is a method of finding products that are frequently matched with each other. Market basket analysis in retail uses the association method to bundle products together.
- Link prediction: Link prediction is used to find the connection between data items. Recommendation engines employed by Facebook, Amazon, and Netflix make heavy use of link prediction algorithms to recommend friends, items to purchase, and movies, respectively.
- Data reduction: Data reduction methods are used to simplify data sets from a lot of features to a few features. They take a large data set with many attributes and find ways to express them in terms of fewer attributes.
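Of the tasks listed above, clustering is the easiest to show in a few lines. The sketch below runs k-means, one common clustering algorithm, on a single made-up feature (monthly spend) to form three customer segments. The data, the starting centroids, and the function names are assumptions chosen for illustration.

```python
def assign(values, centroids):
    """Put each value in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for value in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
        clusters[nearest].append(value)
    return clusters


def k_means_1d(values, initial_centroids, max_iterations=20):
    """Run k-means on one-dimensional data until the centroids stop moving."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        clusters = assign(values, centroids)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        updated = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(values, centroids)


# Hypothetical monthly spend per customer.
monthly_spend = [10, 12, 11, 50, 52, 48, 95, 100, 105]
centroids, segments = k_means_1d(monthly_spend, initial_centroids=[0.0, 40.0, 80.0])

for center, members in zip(centroids, segments):
    print(f"Segment around {center:.0f}: {members}")
```

Notice that no target is supplied. The algorithm only groups similar customers, and it is up to us to interpret what each segment means.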
Machine learning tasks to models to algorithms
Once we have broken down business problems into machine learning tasks, we can use one or many algorithms to solve a given task. Typically, models are trained with several algorithms on the same data. The algorithm or set of algorithms that provides the best result is chosen for deployment.
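The idea of trying several algorithms and keeping the best one can be sketched as follows. Two deliberately simple candidates (a constant baseline and a straight-line fit) are trained on the same made-up data and compared on held-out data using mean absolute error. The candidate names, the data, and the choice of error metric are all assumptions for this example.

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the average of the training targets."""
    average = sum(ys) / len(ys)
    return lambda x: average


def fit_linear(xs, ys):
    """Straight-line model fitted by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_absolute_error(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


def select_best(candidates, train_data, validation_data):
    """Train every candidate and return the one with the lowest validation error."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train_data)
        scores[name] = mean_absolute_error(model, *validation_data)
    return min(scores, key=scores.get), scores


# Hypothetical data: (inputs, targets) for training and validation.
train_data = ([1, 2, 3, 4], [10.0, 12.0, 15.0, 16.0])
validation_data = ([5, 6], [18.0, 21.0])

best, scores = select_best(
    {"mean baseline": fit_mean, "linear": fit_linear}, train_data, validation_data
)
print("Validation errors:", scores)
print("Chosen for deployment:", best)
```

Scoring on data that was not used for training is what makes the comparison fair.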
Azure Machine Learning has more than 30 pre-built algorithms that can be used for training machine learning models. Microsoft publishes an algorithm cheat sheet to help you navigate through them.
Data science is a broad and exciting field. It is an art and a science. In this article, we have just explored the tip of the iceberg. The how’s will be futile if the why’s are not known. In the subsequent articles, we will explore the how’s of machine learning.