Python Sparse Matrices

A sparse matrix is a matrix with a bunch of 0s, like this.

[[0 0 3 0 0]
 [0 0 0 0 1]
 [1 0 0 0 0]
 [0 2 0 0 1]]

Numerous methods exist to store sparse matrices in compressed formats that consume significantly less memory than typical dense¹ matrices.

Compressed sparse matrices in Python are most commonly handled with the sparse module from scipy. Accordingly, this problem set tests your knowledge of scipy's sparse matrices.

Do you know NumPy?

scipy is built on top of NumPy. We highly recommend you learn some NumPy before learning scipy sparse matrices.

How to install scipy¶

pip install scipy

How to import scipy's sparse module¶

from scipy import sparse

Sparse Matrix Compression Formats¶

Sparse matrices can be stored in various compressed formats, each with their own pros and cons.

Format	Details	Pros	Cons
Dictionary Of Keys (DOK)	uses dictionary to map (row, column)-pairs to the value of the elements	- fast access to individual elements - good for incremental matrix construction	- slow access to rows, cols
COOrdinate Format (COO)	stores a list of (row, column, value) tuples	- good for incremental matrix construction	- slow access to elements - slow access rows, cols
List of Lists (LIL)	stores one list per row, with each entry containing the column index and the value	- good for incremental matrix construction	- slow access to elements - slow access to rows, cols
Compressed Sparse Row (CSR)	Stores three arrays: - non-zero entries - column indices of non-zero entries - cumulative count of non-zero entries, per row (prepended by 0)	- fast row access and matrix-vector multiplications	- can be bad for incremental matrix construction
Compressed Sparse Column (CSC)	Stores three arrays: - non-zero entries - row indices of non-zero entries - cumulative count of non-zero entries, per column (prepended by 0)	- fast column access and matrix-vector multiplications	- can be bad for incremental matrix construction

Compressed Sparse Row (CSR)¶

Suppose you have a sparse matrix like this.

uncompressed = np.array(
    [[0, 0, 3, 0, 0],
     [0, 0, 0, 0, 1],
     [1, 0, 0, 2, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1]]
)

To store this matrix in CSR format requires three arrays:

data: an array containing the value of each non-zero element
In this example: [3, 1, 1, 2, 1]
indices: an array containing the column index of each non-zero element
In this example: [2, 4, 0, 3, 4]
indptr: an array containing the cumulative count of non-zero elements per row of the matrix, prepended by 0
In this example: [0, 1, 2, 4, 4, 5]

You can confirm this in scipy by converting uncompressed to a csr_matrix and inspecting its data, indices, and indptr attributes.

# Convert to CSR
csr = sparse.csr_matrix(uncompressed)

csr.data     # [3, 1, 1, 2, 1]
csr.indices  # [2, 4, 0, 3, 4]
csr.indptr   # [0, 1, 2, 4, 4, 5]

Why is this useful?
It's incredibly efficient to reconstruct and operate on rows of this matrix because indptr provides a direct way to access the non-zero elements of any row. For example consider the following request:

Fetch the non-zero elements in the third row of the matrix (row index 2).

To accomplish this,

Look up the number of non-zero elements that occur before row index 2.

csr.indptr[2]  # 2 (1)

0: [[0, 0, 3, 0, 0],
1:  [0, 0, 0, 0, 1],
2:  [1, 0, 0, 2, 0],
3:  [0, 0, 0, 0, 0],
4:  [0, 0, 0, 0, 1]]

Look up the number of non-zero elements that occur on or before row index 2.

csr.indptr[3]  # 4 (1)

0: [[0, 0, 3, 0, 0],
1:  [0, 0, 0, 0, 1],
2:  [1, 0, 0, 2, 0],
3:  [0, 0, 0, 0, 0],
4:  [0, 0, 0, 0, 1]]

Use those values as slice indices to fetch the corresponding data elements.
```
csr.data[2:4]  # [1, 2]
```
Fetching their corresponding column indices is just as easy.
```
csr.indices[2:4]  # [0, 3]
```

Compressed Sparse Column (CSC)¶

Compressed Sparse Column format is just a reflection of Compressed Sparse Row format. For example, suppose you have the following sparse matrix.

uncompressed = np.array(
    [[0, 0, 3, 0, 0],
     [0, 0, 0, 0, 1],
     [1, 0, 0, 2, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1]]
)

To store this matrix in CSC format requires three arrays:

data: an array containing the value of each non-zero element
In this example: [1, 3, 2, 1, 1]
indices: an array containing the row index of each non-zero element
In this example: [2, 0, 2, 1, 4]
indptr: an array containing the cumulative count of non-zero elements per column of the matrix, prepended by 0
In this example: [0, 1, 1, 2, 3, 5]

You can confirm this in scipy by converting uncompressed to a csc_matrix and inspecting its data, indices, and indptr attributes.

# Convert to CSC
csc = sparse.csc_matrix(uncompressed)

csc.data    # [1, 3, 2, 1, 1]
csc.indices # [2, 0, 2, 1, 4]
csc.indptr  # [0, 1, 1, 2, 3, 5]

Why is this useful?
It's incredibly efficient to reconstruct and operate on columns of this matrix because indptr provides a direct way to access the non-zero elements of any column. For example consider the following request:

Fetch the non-zero elements in the fifth column of the matrix (column index 4).

To accomplish this,

Look up the number of non-zero elements that occur before column index 4.
```
csc.indptr[4]  # 3
```
Look up the number of non-zero elements that occur on or before column index 4.
```
csc.indptr[5]  # 5
```
Use those values to fetch the corresponding data elements.
```
csc.data[3:5]  # [1, 1]
```
Fetching their corresponding row indices is just as easy.
```
csc.indices[3:5]  # [1, 4]
```

Basic Use¶

Instantiation¶

There are many ways to instantiate a compressed sparse matrix in scipy. Perhaps the most common technique is to use (row, col, val) triplets. For example,

from scipy import sparse

row_idxs = [0, 1, 1, 3, 5, 5, 5]
col_idxs = [2, 2, 6, 1, 0, 3, 6]
vals     = [1, 1, 1, 1, 1, 1, 1]

# Instantiate a compressed sparse column matrix
csc = sparse.csc_matrix((vals, (row_idxs, col_idxs)))

csc
# <6x7 sparse matrix of type '<class 'numpy.int64'>'
#   with 7 stored elements in Compressed Sparse Column format>

It's also common to convert a NumPy array to a compressed sparse matrix.

import numpy as np
from scipy import sparse

foo = np.array(
    [[0, 0, 3, 0, 0],
     [0, 0, 0, 0, 1],
     [1, 0, 0, 2, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1]]
)

# Instantiate a compressed sparse column matrix
csc = sparse.csc_matrix(foo)

csc
# <5x5 sparse matrix of type '<class 'numpy.int64'>'
#   with 5 stored elements in Compressed Sparse Column format>

Caution

This is generally considered bad practice. If your goal is to build a compressed sparse matrix to save memory, building a numpy array at any stage in the process defeats this purpose. Seek a better instantiation method.

Data Access¶

Given a compressed sparse matrix, mat,

import numpy as np

temp = np.array(
    [[0, 0, 3, 0, 0],
     [0, 0, 0, 0, 1],
     [1, 0, 0, 2, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1]]
)

mat = sparse.csc_matrix(temp)

You can access rows and columns just like a NumPy array.

# get row 3 (row index 2)
mat[2]
# <1x5 sparse matrix of type '<class 'numpy.int64'>'
#   with 2 stored elements in Compressed Sparse Column format>

# get columns 2-4
mat[:, 3:5]
# <5x2 sparse matrix of type '<class 'numpy.int64'>'
#   with 3 stored elements in Compressed Sparse Column format>

How do I print the matrix?¶

print(mat) prints the non-zero entries of the matrix.

print(mat)
# (2, 0)    1
# (0, 2)    3
# (2, 3)    2
# (1, 4)    1
# (4, 4)    1

If you want to actually print the matrix, convert it to a NumPy array.

print(mat.todense())
# [[0 0 3 0 0]
#  [0 0 0 0 1]
#  [1 0 0 2 0]
#  [0 0 0 0 0]
#  [0 0 0 0 1]]

Beware

This can blow up your computer if the sparse matrix is very large.

A dense matrix is one in which most of the elements are non-zero. However, scipy (and others) use the term "dense" to refer to a matrix that's not in compressed format. ↩