
Author: Fatih Karabiber
Ph.D. in Computer Engineering, Data Scientist

Sparse Matrix


What is a sparse matrix?

A sparse matrix is a matrix in which the number of zero elements is much higher than the number of non-zero elements. As a rule of thumb, if at least 2/3 of the total elements in a matrix are zeros, it can be called a sparse matrix. A sparse matrix representation stores only the non-zero values, which significantly reduces both the space needed to represent the data and the time needed to scan the matrix.

Many applications in data science and machine learning involve sparse matrices, such as:

Natural Language Processing: The occurrence of words in documents can be represented in a sparse matrix. The words in a document are only a small subset of the words in a language. If we have a row for every document and a column for every word, then the matrix of word counts per document has a high percentage of zeros in every column.
Recommendation Systems: A sparse matrix can be employed to represent whether any particular user has watched any movie. See our Locality Sensitive Hashing (LSH) article for an example.
Market Basket Analysis: Since the number of purchased items is tiny compared to the number of non-purchased items, a sparse matrix is used to represent all products and customers.
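
To make the Natural Language Processing case above concrete, here is a minimal sketch in plain Python. The three "documents" are hypothetical toy strings, not real data, and the whitespace tokenization is a simplifying assumption. Even with this tiny vocabulary, more than half of the word counts come out as zeros:

```python
# Hypothetical toy corpus; each string stands in for one document.
docs = [
    "the rocket reached orbit",
    "the circuit needs a resistor",
    "orbit of the moon",
]

# Vocabulary: one column per distinct word, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

# Count matrix: one row per document, one column per word.
counts = [[doc.split().count(word) for word in vocab] for doc in docs]

n_total = len(docs) * len(vocab)
n_zeros = sum(row.count(0) for row in counts)
print(f"{len(docs)} documents x {len(vocab)} words, {n_zeros}/{n_total} zeros")
```

With a realistic vocabulary of tens of thousands of words, the share of zeros would be far higher.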

Numerical example 1

Let's take the example of a movie recommendation system. There are millions of users and thousands of movies, so it's not possible for users to watch and rate all movies. This data can be represented as a matrix where the rows are the users, and the columns are the movies.

Most of the matrix elements will be empty, and these missing values are replaced with zeros. Since only a small percentage of the matrix has non-zero values, this matrix can be considered a sparse matrix. A small portion of the data is given below:

	Movie1	Movie2	Movie3	Movie4	Movie5	Movie6	Movie7
User1	0	0	0	3	0	0	4
User2	0	5	0	0	0	0	0
User3	0	0	5	0	0	4	0
User4	4	0	0	0	0	0	1
User5	0	2	0	0	3	0	0

The sparsity of this matrix can be calculated by obtaining the ratio of zero elements to total elements. For this example, sparsity is calculated as:

$$\begin{align} sparsity &= \frac {n_{zeros}}{n_{total}} \\[.5em] &= \frac{26}{35} \\[.5em] &\approx 0.743 \end{align}$$

It can be seen that the number of zeros in a sparse matrix is very high. Representing all zero values in a matrix like this would result in high memory usage, so in practice, only non-zero values of the sparse matrix are stored.

Numerical example 2

Another example is a matrix that represents the occurrence of words in documents. The term-document matrix has dimension $n \times m$, where $n$ is the number of documents and $m$ is the number of words in the language model. Most of the matrix elements will be zero because each document contains only a small fraction of those words, and only the non-zero values are important for data analysis. Besides the large amount of space used, a dense representation is also slow to process, because every element must be scanned to reach the non-zero ones.

To overcome these problems, we can use different data structures to represent a sparse matrix. One common representation format for a sparse matrix is a Coordinate list (COO), which uses three-element tuples (row, column, value) to store the coordinates and values of the non-zero elements. For example, the following table is the COO representation of the user-movie matrix from Numerical example 1, with row and column indices starting at 0:

Row	Column	Value
0	3	3
0	6	4
1	1	5
2	2	5
2	5	4
3	0	4
3	6	1
4	1	2
4	4	3
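
The coordinate list above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not how any particular library implements COO; the helper names dense_to_coo and coo_to_dense are made up for illustration:

```python
# Dense 5 x 7 matrix whose non-zero entries are listed in the table above.
X = [
    [0, 0, 0, 3, 0, 0, 4],
    [0, 5, 0, 0, 0, 0, 0],
    [0, 0, 5, 0, 0, 4, 0],
    [4, 0, 0, 0, 0, 0, 1],
    [0, 2, 0, 0, 3, 0, 0],
]

def dense_to_coo(matrix):
    """Return (row, column, value) triples for the non-zero entries."""
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]

def coo_to_dense(triples, n_rows, n_cols):
    """Rebuild the dense matrix; entries not listed default to zero."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in triples:
        dense[i][j] = value
    return dense

triples = dense_to_coo(X)
for triple in triples:
    print(triple)
```

Because the matrix shape is not part of the triples, it has to be kept alongside them to rebuild the dense form, as coo_to_dense(triples, 5, 7) does here.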

In this table, each non-zero value is stored together with its row and column indices, so every non-zero element costs three stored numbers. Let $k$ be the number of non-zero elements in a matrix of size $n \times m$. Assuming every stored number takes the same amount of space, the proportion of the space saved by this sparse representation can be calculated as follows:

$$ p = 1- \frac{3k}{nm} $$

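The space saved grows linearly with the sparsity. Writing the sparsity as $s = 1 - \frac{k}{nm}$ gives $p = 3s - 2$, so the COO format only saves space when more than 2/3 of the elements are zero, which matches the rule of thumb given earlier.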
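
As a quick check of the formula, here is a small sketch that evaluates $p$ for the 5 x 7 example, which has 9 non-zero values. The function name coo_space_saved is hypothetical, and the calculation assumes that a row index, a column index, and a value each take the same amount of space:

```python
def coo_space_saved(k, n, m):
    """Proportion of space saved by COO storage: p = 1 - 3k / (n * m)."""
    return 1 - (3 * k) / (n * m)

# The 5 x 7 example matrix with 9 non-zero values.
p_small = coo_space_saved(k=9, n=5, m=7)
print(round(p_small, 3))  # 0.229

# Break-even case: exactly one third of the 30 elements are non-zero.
p_break_even = coo_space_saved(k=10, n=5, m=6)
print(p_break_even)  # 0.0
```

For a matrix this small the saving is modest. It becomes substantial only when the matrix is much sparser.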

There are many other ways to represent a sparse matrix, such as Dictionary of keys (DOK) and List of lists (LIL). In the following section, different representation formats will be explained with Python.

Sparse Matrix in Python

The SciPy library provides the scipy.sparse package to create and manipulate sparse matrices (https://docs.scipy.org/doc/scipy/reference/sparse.html). Different representation formats and some useful functions for sparse matrices are defined in this package.

Here we will explore some basic functions.

A simple, sparse matrix

A simple, sparse matrix will be constructed to show the representation formats of a sparse matrix in Python.

import numpy as np
from scipy import sparse

    

X = np.array([[0,0,0,3,0,0,4],
              [0,5,0,0,0,0,0],
              [0,0,5,0,0,4,0],
              [4,0,0,0,0,0,1],
              [0,2,0,0,3,0,0]])
print(X)

    

Out:

[[0 0 0 3 0 0 4]
 [0 5 0 0 0 0 0]
 [0 0 5 0 0 4 0]
 [4 0 0 0 0 0 1]
 [0 2 0 0 3 0 0]]

    

There are many zeros in this matrix, so let's calculate the sparsity of the matrix:

sparsity = 1.0 - (np.count_nonzero(X) / X.size)
print('The sparsity of X is ', sparsity )

    

Out:

The sparsity of X is  0.7428571428571429

    

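We can convert this dense matrix into a sparse matrix by using the sparse.csr_matrix() function. A Compressed Sparse Row (CSR) matrix stores three arrays: the non-zero values (data), their column indices (indices), and row pointers (indptr) that mark where each row's entries begin and end in the other two arrays: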

# Convert X to a sparse matrix

S1 = sparse.csr_matrix(X)

print(f"""
Type of sparse matrix representation: {type(S1)}

Sparse Matrix:\n{S1}

Sparse Data: {S1.data}

Indices of columns: {S1.indices}

Pointers for data: {S1.indptr}
""")

    

Out:

Type of sparse matrix representation: <class 'scipy.sparse.csr.csr_matrix'>

Sparse Matrix:
  (0, 3)	3
  (0, 6)	4
  (1, 1)	5
  (2, 2)	5
  (2, 5)	4
  (3, 0)	4
  (3, 6)	1
  (4, 1)	2
  (4, 4)	3

Sparse Data: [3 4 5 5 4 4 1 2 3]

Indices of columns: [3 6 1 2 5 0 6 1 4]

Pointers for data: [0 2 3 5 7 9]
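
The pointer array is the least obvious part of this output. The entries of row i sit in positions indptr[i] up to, but not including, indptr[i + 1] of the data and indices arrays. The sketch below decodes the three arrays printed above using plain Python lists; the helper names csr_row and csr_to_dense are made up for illustration and are not part of scipy:

```python
# Arrays copied from the CSR output above.
data = [3, 4, 5, 5, 4, 4, 1, 2, 3]
indices = [3, 6, 1, 2, 5, 0, 6, 1, 4]
indptr = [0, 2, 3, 5, 7, 9]

def csr_row(i):
    """Return the (column, value) pairs stored for row i."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))

def csr_to_dense(n_cols):
    """Expand the three CSR arrays back into a dense list of lists."""
    dense = []
    for i in range(len(indptr) - 1):
        row = [0] * n_cols
        for col, value in csr_row(i):
            row[col] = value
        dense.append(row)
    return dense

print(csr_row(2))  # [(2, 5), (5, 4)]
```

Since indptr has one entry per row plus one, CSR needs fewer stored numbers than a coordinate list whenever there are more non-zero values than rows.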

    

Another efficient structure for constructing sparse matrices is the Dictionary Of Keys (DOK), where a Python dictionary maps (row, column) index pairs to the non-zero values of a sparse matrix.

In this representation, keys() is used for indices, and values() is used for values of non-zero elements:

S2 = sparse.dok_matrix(X)

print(f"""
Type of sparse matrix representation: {type(S2)}

Sparse Matrix:\n{S2}

Keys in dictionary: {S2.keys()}

Values in dictionary: {S2.values()}
""")

    

Out:

Type of sparse matrix representation: <class 'scipy.sparse.dok.dok_matrix'>

Sparse Matrix:
  (0, 3)	3
  (0, 6)	4
  (1, 1)	5
  (2, 2)	5
  (2, 5)	4
  (3, 0)	4
  (3, 6)	1
  (4, 1)	2
  (4, 4)	3

Keys in dictionary: dict_keys([(0, 3), (0, 6), (1, 1), (2, 2), (2, 5), (3, 0), (3, 6), (4, 1), (4, 4)])

Values in dictionary: dict_values([3, 4, 5, 5, 4, 4, 1, 2, 3])
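
The idea behind DOK can be sketched with an ordinary dictionary. This is a conceptual stand-in, not scipy's implementation, and the helpers set_item and get_item are hypothetical names. Dropping a key when its value is set to zero is a design choice of this sketch, made so that only non-zero values are ever stored:

```python
# Plain-dict stand-in for a DOK matrix; keys are (row, column) pairs.
dok = {}

def set_item(i, j, value):
    """Store a non-zero value, or drop the key when the value is zero."""
    if value != 0:
        dok[(i, j)] = value
    else:
        dok.pop((i, j), None)

def get_item(i, j):
    """Missing keys are implicit zeros."""
    return dok.get((i, j), 0)

set_item(0, 3, 3)
set_item(0, 6, 4)
set_item(1, 1, 5)

print(get_item(0, 6))  # stored value
print(get_item(2, 2))  # never stored, so it reads as zero

set_item(0, 6, 0)      # setting a zero removes the entry again
print(len(dok))
```

Reading or writing a single element is a dictionary lookup, which is why this layout suits building a matrix one element at a time.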

    

The last representation format shown here is a row-based list of lists sparse (LIL) matrix. It keeps two lists of lists with one inner list per row: the rows attribute stores the column indices of the non-zero elements in each row, and the data attribute stores the corresponding values.

S3 = sparse.lil_matrix(X)

print(f"""
Type of sparse matrix representation: {type(S3)}

Sparse Matrix:\n{S3}

Column indices per row: {S3.rows}

Values per row: {S3.data}
""")

Out:

Type of sparse matrix representation: <class 'scipy.sparse.lil.lil_matrix'>

Sparse Matrix:
  (0, 3)	3
  (0, 6)	4
  (1, 1)	5
  (2, 2)	5
  (2, 5)	4
  (3, 0)	4
  (3, 6)	1
  (4, 1)	2
  (4, 4)	3

Column indices per row: [list([3, 6]) list([1]) list([2, 5]) list([0, 6]) list([1, 4])]

Values per row: [list([3, 4]) list([5]) list([5, 4]) list([4, 1]) list([2, 3])]

    

The scipy.sparse package also provides a todense() method for converting a sparse matrix to a dense matrix:

# Convert the sparse matrix to a dense matrix
X = S1.todense()

print(X)

    

Out:

[[0 0 0 3 0 0 4]
 [0 5 0 0 0 0 0]
 [0 0 5 0 0 4 0]
 [4 0 0 0 0 0 1]
 [0 2 0 0 3 0 0]]

    

This can be useful when exploring data, but if your dataset is large, the dense version may not fit in memory, and the conversion can fail with a memory error.
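
Before converting, it can help to estimate how much memory the dense version would need. The sketch below is a back-of-the-envelope calculation: the function name dense_nbytes is hypothetical, the default of 8 bytes per element assumes 64-bit values, and the large user-movie shape is an invented example:

```python
def dense_nbytes(n_rows, n_cols, itemsize=8):
    """Bytes needed to hold every element of a dense matrix.

    itemsize=8 assumes 64-bit values; adjust it for other dtypes.
    """
    return n_rows * n_cols * itemsize

# The 5 x 7 example matrix is tiny.
print(dense_nbytes(5, 7))  # 280

# Hypothetical ratings matrix: 1,000,000 users x 10,000 movies.
big = dense_nbytes(1_000_000, 10_000)
print(f"{big / 1e9:.0f} GB")  # 80 GB
```

The dense size depends only on the shape, not on how many values are non-zero, which is why a matrix that is cheap to store in sparse form can still be far too large to densify.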

Sparse Matrix from a real dataset

We'll use the newsgroups dataset, available directly from sklearn, to show an example of when a sparse matrix would be used. This dataset contains thousands of news posts on 20 different topics.

A document can be represented as a term vector, where each term is a feature and the value is the number of times the corresponding term occurs in the document. In this way, a matrix can be used to represent word occurrences in all documents, where the documents are the rows and the terms are the columns. Since the number of non-zero values will be small, this matrix can be represented as a sparse matrix.

Let's import and load the dataset from sklearn:

from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', categories= ['sci.electronics', 'sci.space'])

    

Previewing an example entry in this dataset:

newsgroups_train.data[0]

    

Out:

"From: [email protected] (Chris Best)\nSubject: Re: Food Dehydrators\nOrganization: your service\nLines: 10\nDistribution: usa\nNNTP-Posting-Host: hpctdkz.col.hp.com\n\n>   Does anybody out there have one of those food dehydrators I've been seeing\n> all over late-night TV recently? I was wondering if they use forced air, heat,\n> or both. If there's heat involved, anybody know what temperature they run at?\n> My wife would like one and I'm not inclined to pay >$100.00 for a box, a fan\n> and a heater. Seems to me you should be able to throw a dehydrator together\n> for just a few bucks. Heck, the technology is only what? 1,000 years old?\n\n----------\n\nYeah, but 1000 years ago, you couldn't buy it from a guy with sprayed-on hair!\n"

    

Now we'll use CountVectorizer to vectorize the text into a term-document matrix:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(newsgroups_train.data)

# Create a term-document matrix 
word_matrix = cv.transform(newsgroups_train.data)

    

print(f'Type of the matrix is: {type(word_matrix)}\n')

print(f'Matrix:\n{word_matrix}')

    

Out:

Type of the matrix is: <class 'scipy.sparse.csr.csr_matrix'>

Matrix:
  (0, 0)	1
  (0, 1)	1
  (0, 187)	1
  (0, 188)	1
  (0, 189)	1
  (0, 3013)	1
  (0, 3387)	1
  (0, 3424)	1
  (0, 3502)	1
  (0, 3688)	2
  (0, 3794)	2
  (0, 4144)	1
  (0, 4514)	1
  (0, 4557)	1
  (0, 4650)	1
  (0, 4936)	1
  (0, 4962)	1
  (0, 5127)	1
  (0, 5225)	1
  (0, 5234)	1
  (0, 5360)	1
  (0, 5912)	1
  (0, 6180)	2
  (0, 6237)	2
  (0, 6822)	1
  :	:
  (1183, 20345)	1
  (1183, 20370)	1
  (1183, 20381)	1
  (1183, 20386)	1
  (1183, 20398)	2
  (1183, 20416)	1
  (1183, 20509)	1
  (1183, 20548)	11
  (1183, 20975)	1
  (1183, 21002)	2
  (1183, 21077)	2
  (1183, 21224)	1
  (1183, 21305)	1
  (1183, 21344)	3
  (1183, 21345)	2
  (1183, 21358)	1
  (1183, 21541)	1
  (1183, 21892)	1
  (1183, 22075)	1
  (1183, 22096)	1
  (1183, 22139)	1
  (1183, 22233)	2
  (1183, 22318)	3
  (1183, 22347)	1
  (1183, 22478)	5

    

print(f'Size of the matrix is: {word_matrix.shape}')

    

Out:

Size of the matrix is: (1184, 22577)

    

print(f'Number of the Non-zero values: {word_matrix.nnz}')

    

Out:

Number of the Non-zero values: 174709

    

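Next, we can calculate the density of the matrix, which is the fraction of elements that are non-zero. The sparsity defined earlier is one minus this value:

density = word_matrix.nnz / (word_matrix.shape[0] * word_matrix.shape[1])

print('Density value: ', density)

Out:

Density value:  0.006535778758339329

Fewer than 1% of the elements are non-zero, so the sparsity of this matrix is about 0.993.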

    

Convert the sparse matrix to a dense matrix:

D = word_matrix.todense()

print(D)

    

Out:

[[1 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 2 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

    

With the dense version we can calculate the proportion of space saved by the sparse matrix representation, using the formula $p = 1 - \frac{3k}{nm}$ from Numerical example 2:

r = 1 - (3 * np.count_nonzero(D)) / D.size

print('The proportion of saved space is ', r)

    

Out:

The proportion of saved space is  0.980392663724982

    


Meet the Authors

Fatih Karabiber Ph.D. in Computer Engineering, Data Scientist

Associate Professor of Computer Engineering. Author/co-author of over 30 journal publications. Instructor of graduate/undergraduate courses. Supervisor of Graduate thesis. Consultant to IT Companies.
