Python Sparse Matrices
A sparse matrix is a matrix with a bunch of 0s, like this.
[[0 0 3 0 0]
[0 0 0 0 1]
[1 0 0 0 0]
[0 2 0 0 1]]
Numerous methods exist to store sparse matrices in compressed formats that consume significantly less memory than typical dense^{1} matrices.
Compressed sparse matrices in Python are most commonly handled with the sparse module from scipy. Accordingly, this problem set tests your knowledge of scipy's sparse matrices.
Do you know NumPy?
scipy is built on top of NumPy. We highly recommend you learn some NumPy before learning scipy sparse matrices.
How to install scipy¶
pip insall scipy
How to import scipy's sparse module¶
from scipy import sparse
Sparse Matrix Compression Formats¶
Sparse matrices can be stored in various compressed formats, each with their own pros and cons.
Format  Details  Pros  Cons 

Dictionary Of Keys (DOK)  uses dictionary to map (row, column)pairs to the value of the elements   fast access to individual elements  good for incremental matrix construction 
 slow access to rows, cols 
COOrdinate Format (COO)  stores a list of (row, column, value) tuples   good for incremental matrix construction   slow access to elements  slow access rows, cols 
List of Lists (LIL)  stores one list per row, with each entry containing the column index and the value   good for incremental matrix construction   slow access to elements  slow access to rows, cols 
Compressed Sparse Row (CSR)  Stores three arrays:  nonzero entries  column indices of nonzero entries  cumulative count of nonzero entries, per row (prepended by 0) 
 fast row access and matrixvector multiplications   can be bad for incremental matrix construction 
Compressed Sparse Column (CSC)  Stores three arrays:  nonzero entries  row indices of nonzero entries  cumulative count of nonzero entries, per column (prepended by 0) 
 fast column access and matrixvector multiplications   can be bad for incremental matrix construction 
Compressed Sparse Row (CSR)¶
Suppose you have a sparse matrix like this.
uncompressed = np.array(
[[0, 0, 3, 0, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 2, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]]
)
To store this matrix in CSR format requires three arrays:
data
: an array containing the value of each nonzero element
In this example:[3, 1, 1, 2, 1]
indices
: an array containing the column index of each nonzero element
In this example:[2, 4, 0, 3, 4]
indptr
: an array containing the cumulative count of nonzero elements per row of the matrix, prepended by 0
In this example:[0, 1, 2, 4, 4, 5]
You can confirm this in scipy by converting uncompressed
to a csr_matrix
and inspecting its data
, indices
,
and indptr
attributes.
# Convert to CSR
csr = sparse.csr_matrix(uncompressed)
csr.data # [3, 1, 1, 2, 1]
csr.indices # [2, 4, 0, 3, 4]
csr.indptr # [0, 1, 2, 4, 4, 5]
Why is this useful?
It's incredibly efficient to reconstruct and operate on rows of this matrix because indptr
provides a direct
way to access the nonzero elements of any row. For example consider the following request:
Fetch the nonzero elements in the third row of the matrix (row index 2).
To accomplish this,

Look up the number of nonzero elements that occur before row index 2.
csr.indptr[2] # 2 (1)

0: [[0, 0, 3, 0, 0], 1: [0, 0, 0, 0, 1], 2: [1, 0, 0, 2, 0], 3: [0, 0, 0, 0, 0], 4: [0, 0, 0, 0, 1]]


Look up the number of nonzero elements that occur on or before row index 2.
csr.indptr[3] # 4 (1)

0: [[0, 0, 3, 0, 0], 1: [0, 0, 0, 0, 1], 2: [1, 0, 0, 2, 0], 3: [0, 0, 0, 0, 0], 4: [0, 0, 0, 0, 1]]


Use those values as slice indices to fetch the corresponding
data
elements.csr.data[2:4] # [1, 2]

Fetching their corresponding column indices is just as easy.
csr.indices[2:4] # [0, 3]
Compressed Sparse Column (CSC)¶
Compressed Sparse Column format is just a reflection of Compressed Sparse Row format. For example, suppose you have the following sparse matrix.
uncompressed = np.array(
[[0, 0, 3, 0, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 2, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]]
)
To store this matrix in CSC format requires three arrays:
data
: an array containing the value of each nonzero element
In this example:[1, 3, 2, 1, 1]
indices
: an array containing the row index of each nonzero element
In this example:[2, 0, 2, 1, 4]
indptr
: an array containing the cumulative count of nonzero elements per column of the matrix, prepended by 0
In this example:[0, 1, 1, 2, 3, 5]
You can confirm this in scipy by converting uncompressed
to a csc_matrix
and inspecting its data
, indices
,
and indptr
attributes.
# Convert to CSC
csc = sparse.csc_matrix(uncompressed)
csc.data # [1, 3, 2, 1, 1]
csc.indices # [2, 0, 2, 1, 4]
csc.indptr # [0, 1, 1, 2, 3, 5]
Why is this useful?
It's incredibly efficient to reconstruct and operate on columns of this matrix because indptr
provides a direct
way to access the nonzero elements of any column. For example consider the following request:
Fetch the nonzero elements in the fifth column of the matrix (column index 4).
To accomplish this,

Look up the number of nonzero elements that occur before column index 4.
csc.indptr[4] # 3

Look up the number of nonzero elements that occur on or before column index 4.
csc.indptr[5] # 5

Use those values to fetch the corresponding
data
elements.csc.data[3:5] # [1, 1]

Fetching their corresponding row indices is just as easy.
csc.indices[3:5] # [1, 4]
Basic Use¶
Instantiation¶
There are many ways to instantiate a compressed sparse matrix in scipy. Perhaps the most common technique is to use (row, col, val) triplets. For example,
from scipy import sparse
row_idxs = [0, 1, 1, 3, 5, 5, 5]
col_idxs = [2, 2, 6, 1, 0, 3, 6]
vals = [1, 1, 1, 1, 1, 1, 1]
# Instantiate a compressed sparse column matrix
csc = sparse.csc_matrix((vals, (row_idxs, col_idxs)))
csc
# <6x7 sparse matrix of type '<class 'numpy.int64'>'
# with 7 stored elements in Compressed Sparse Column format>
It's also common to convert a NumPy array to a compressed sparse matrix.
import numpy as np
from scipy import sparse
foo = np.array(
[[0, 0, 3, 0, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 2, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]]
)
# Instantiate a compressed sparse column matrix
csc = sparse.csc_matrix(foo)
csc
# <5x5 sparse matrix of type '<class 'numpy.int64'>'
# with 5 stored elements in Compressed Sparse Column format>
Caution
This is generally considered bad practice. If your goal is to build a compressed sparse matrix to save memory, building a numpy array at any stage in the process defeats this purpose. Seek a better instantiation method.
Data Access¶
Given a compressed sparse matrix, mat
,
import numpy as np
temp = np.array(
[[0, 0, 3, 0, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 2, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]]
)
mat = sparse.csc_matrix(temp)
You can access rows and columns just like a NumPy array.
# get row 3 (row index 2)
mat[2]
# <1x5 sparse matrix of type '<class 'numpy.int64'>'
# with 2 stored elements in Compressed Sparse Column format>
# get columns 24
mat[:, 3:5]
# <5x2 sparse matrix of type '<class 'numpy.int64'>'
# with 3 stored elements in Compressed Sparse Column format>
How do I print the matrix?¶
print(mat)
prints the nonzero entries of the matrix.
print(mat)
# (2, 0) 1
# (0, 2) 3
# (2, 3) 2
# (1, 4) 1
# (4, 4) 1
If you want to actually print the matrix, convert it to a NumPy array.
print(mat.todense())
# [[0 0 3 0 0]
# [0 0 0 0 1]
# [1 0 0 2 0]
# [0 0 0 0 0]
# [0 0 0 0 1]]
Beware
This can blow up your computer if the sparse matrix is very large.

A dense matrix is one in which most of the elements are nonzero. However, scipy (and others) use the term "dense" to refer to a matrix that's not in compressed format. ↩