Binary Classification

Author: Fatih Karabiber
Ph.D. in Computer Engineering, Data Scientist


What is Binary Classification?

In machine learning, binary classification is a supervised learning task that categorizes new observations into one of two classes.

The following are a few binary classification applications, where the 0 and 1 columns are two possible classes for each observation:

| Application | Observation | 0 | 1 |
| Medical Diagnosis | Patient | Healthy | Diseased |
| Email Analysis | Email | Not Spam | Spam |
| Financial Data Analysis | Transaction | Not Fraud | Fraud |
| Marketing | Website visitor | Won't Buy | Will Buy |
| Image Classification | Image | Hotdog | Not Hotdog |

Quick example

In a medical diagnosis, a binary classifier for a specific disease could take a patient's symptoms as input features and predict whether the patient is healthy or has the disease. The possible outcomes of the diagnosis are positive and negative.

Evaluation of binary classifiers

If the model correctly predicts a diseased patient as positive, the case is called a True Positive (TP). If the model correctly predicts a healthy patient as negative, the case is a True Negative (TN). The binary classifier may also misdiagnose some patients: if a diseased patient is classified as healthy by a negative test result, the error is called a False Negative (FN); similarly, if a healthy patient is classified as diseased by a positive test result, the error is called a False Positive (FP).

We can evaluate a binary classifier based on the following parameters:

  • True Positive (TP): The patient is diseased and the model predicts "diseased"
  • False Positive (FP): The patient is healthy but the model predicts "diseased"
  • True Negative (TN): The patient is healthy and the model predicts "healthy"
  • False Negative (FN): The patient is diseased and the model predicts "healthy"

After obtaining these values, we can compute the accuracy score of the binary classifier as follows: $$ \text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} $$
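For example, with the counts we'll obtain from the test set later in this article (TP = 86, TN = 51, FP = 2, FN = 4): $$ \text{accuracy} = \frac{86 + 51}{86 + 2 + 51 + 4} = \frac{137}{143} \approx 0.958 $$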

The following confusion matrix represents the above parameters:

| | Predicted Positive | Predicted Negative |
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |

Many machine learning methods can be used for binary classification. The most common include:

  • Support Vector Machines
  • Naive Bayes
  • Nearest Neighbor
  • Decision Trees
  • Logistic Regression
  • Neural Networks

The following Python example demonstrates binary classification using logistic regression.

A Python example for binary classification

For our data, we will use the breast cancer dataset from scikit-learn. This dataset contains tumor observations and corresponding labels for whether the tumor was malignant or benign.

First, we'll import a few libraries and then load the data. When loading the data, we'll specify as_frame=True so we can work with pandas objects (see our pandas tutorial for an introduction).

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer(as_frame=True)

The dataset contains a DataFrame for the observation data and a Series for the target data.

Let's see what the first few rows of observations look like:

dataset['data'].head()
Out:
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |

5 rows × 30 columns

The output shows five observations with a column for each feature we'll use to predict malignancy.
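The preview only shows a slice of the full table. Checking the DataFrame's shape confirms the dataset's size:

print(dataset['data'].shape)  # (569, 30): 569 observations, 30 features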

Now, for the targets:

dataset['target'].head()
Out:
0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int32

The targets for the first five observations are all zero, meaning these first five tumors are malignant (in this dataset, 0 denotes malignant and 1 denotes benign). Here's how many malignant and benign tumors are in our dataset:

dataset['target'].value_counts()
Out:
1    357
0    212
Name: target, dtype: int64

So we have 357 benign tumors, denoted as 1, and 212 malignant tumors, denoted as 0: a binary classification problem.
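We can confirm this encoding directly, since load_breast_cancer stores the class names in the same order as the integer labels:

print(dataset['target_names'])  # ['malignant' 'benign'], so 0 = malignant and 1 = benign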

To perform binary classification using logistic regression with sklearn, we must accomplish the following steps.

Step 1: Define explanatory and target variables

We'll store the rows of observations in a variable X and the corresponding class of those observations (0 or 1) in a variable y.

X = dataset['data']
y = dataset['target']

Step 2: Split the dataset into training and testing sets

We'll use 75% of the data for training and 25% for testing. Setting random_state=0 ensures your results match ours.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
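As a quick sanity check, the split sizes follow from taking 25% of the 569 observations for testing:

print(X_train.shape, X_test.shape)  # (426, 30) (143, 30)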

Step 3: Normalize the data for numerical stability

Note that we normalize after splitting the data. To prevent data leakage, the scaler must be fit on the training set only; the same fitted scaler is then used to transform the test set, so no information from the test data influences the transformation.

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then reuse it to transform
# the test data so no test-set statistics leak into the model
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 4: Fit a logistic regression model to the training data

This step effectively trains the model to predict the targets from the data.
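Here we create a LogisticRegression instance with scikit-learn's default parameters and fit it to the training set:

from sklearn.linear_model import LogisticRegression

# Instantiate the model and learn its coefficients from the training data
model = LogisticRegression()
model.fit(X_train, y_train)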

Step 5: Make predictions on the testing data

With the model trained, we now ask the model to predict targets based on the test data.

predictions = model.predict(X_test)

Step 6: Calculate the accuracy score by comparing the actual and predicted values

We can now calculate how well the model performed by comparing its predictions to the true target values, which we reserved in the y_test variable.

First, we'll compute the confusion matrix to get the four parameters. Note that scikit-learn treats label 1 as the positive class, so in this dataset a True Positive is a benign tumor correctly classified as benign:

from sklearn.metrics import confusion_matrix

# Rows of the confusion matrix correspond to the true classes and columns
# to the predicted classes, so ravel() yields TN, FP, FN, TP for binary labels
cm = confusion_matrix(y_test, predictions)

TN, FP, FN, TP = cm.ravel()

print('True Positive (TP)  = ', TP)
print('False Positive (FP) = ', FP)
print('True Negative (TN)  = ', TN)
print('False Negative (FN) = ', FN)
Out:
True Positive (TP)  =  86
False Positive (FP) =  2
True Negative (TN)  =  51
False Negative (FN) =  4

With these values, we can now calculate an accuracy score:

accuracy = (TP + TN) / (TP + FP + TN + FN)

print('Accuracy of the binary classifier = {:0.3f}'.format(accuracy))
Out:
Accuracy of the binary classifier = 0.958
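Note that scikit-learn can compute the same score directly with its accuracy_score helper, which should agree with our manual calculation:

from sklearn.metrics import accuracy_score

print('Accuracy = {:0.3f}'.format(accuracy_score(y_test, predictions)))  # Accuracy = 0.958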

Other binary classifiers in the scikit-learn library

Logistic regression is just one of many classification algorithms available in scikit-learn. We'll compare several of the most common, but feel free to read more about these algorithms in the scikit-learn documentation.

We'll also use scikit-learn's accuracy, precision, and recall metrics for performance evaluation. See the scikit-learn metrics documentation if you'd like to read more about the available metrics.

Initializing each binary classifier

To quickly train each model in a loop, we'll initialize each model and store it by name in a dictionary:

models = {}

# Logistic Regression
from sklearn.linear_model import LogisticRegression
models['Logistic Regression'] = LogisticRegression()

# Support Vector Machines
from sklearn.svm import LinearSVC
models['Support Vector Machines'] = LinearSVC()

# Decision Trees
from sklearn.tree import DecisionTreeClassifier
models['Decision Trees'] = DecisionTreeClassifier()

# Random Forest
from sklearn.ensemble import RandomForestClassifier
models['Random Forest'] = RandomForestClassifier()

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
models['Naive Bayes'] = GaussianNB()

# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
models['K-Nearest Neighbor'] = KNeighborsClassifier()

Performance evaluation of each binary classifier

Now that we've initialized the models, we'll loop over each one, train it by calling .fit(), make predictions, calculate metrics, and store each result in a dictionary.

from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy, precision, recall = {}, {}, {}

for key in models.keys():
    
    # Fit the classifier
    models[key].fit(X_train, y_train)
    
    # Make predictions
    predictions = models[key].predict(X_test)
    
    # Calculate metrics (sklearn metrics expect y_true first, then y_pred)
    accuracy[key] = accuracy_score(y_test, predictions)
    precision[key] = precision_score(y_test, predictions)
    recall[key] = recall_score(y_test, predictions)

With all metrics stored, we can use pandas to view the data as a table:

import pandas as pd

df_model = pd.DataFrame(index=models.keys(), columns=['Accuracy', 'Precision', 'Recall'])
df_model['Accuracy'] = accuracy.values()
df_model['Precision'] = precision.values()
df_model['Recall'] = recall.values()

df_model
Out:
| | Accuracy | Precision | Recall |
| Logistic Regression | 0.958042 | 0.977273 | 0.955556 |
| Support Vector Machines | 0.937063 | 0.965517 | 0.933333 |
| Decision Trees | 0.902098 | 0.975000 | 0.866667 |
| Random Forest | 0.972028 | 0.988636 | 0.966667 |
| Naive Bayes | 0.937063 | 0.945055 | 0.955556 |
| K-Nearest Neighbor | 0.951049 | 0.936842 | 0.988889 |
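Since the results are in a DataFrame, they're easy to query. For example, to pull out the model with the highest accuracy:

print(df_model['Accuracy'].idxmax())  # Random Forest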

Finally, here's a quick bar chart to compare the classifiers' performance:

ax = df_model.plot.barh()
ax.legend(
    ncol=len(models.keys()), 
    bbox_to_anchor=(0, 1), 
    loc='lower left', 
    prop={'size': 14}
)
plt.tight_layout()
[Bar chart comparing Accuracy, Precision, and Recall across the six classifiers]

Since we're only using each model's default parameters, this comparison doesn't tell us which classifier is fundamentally better. To make a fair comparison, we'd first need to tune each algorithm's hyperparameters.
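As a sketch of what that tuning could look like for a single model, here's a minimal example using GridSearchCV; the parameter grid below is a hypothetical starting point, not a recommendation:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Hypothetical grid over the regularization strength C; each algorithm
# has its own key hyperparameters that would need their own grid
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)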


Meet the Authors

Fatih Karabiber
Associate Professor of Computer Engineering. Author or co-author of over 30 journal publications. Instructor of graduate and undergraduate courses. Supervisor of graduate theses. Consultant to IT companies.

Editor: Brendan Martin
Founder of LearnDataSci
Editor: Rhys
Psychometrician
