# Movie Distance Solution

import numpy as np
from scipy import sparse

# 1. Build csc matrix
mat = sparse.csc_matrix((ratings, (user_ids, movie_ids)))

# 2. Normalize the columns
col_lengths = np.sqrt((mat.multiply(mat)).sum(axis=0))
normalized = mat.multiply(1/col_lengths).tocsc()

# 3. Calculate the Euclidean distance between normalized movie 2 and normalized movie 4
c2 = normalized.getcol(2)
c4 = normalized.getcol(4)
diffs = c2 - c4
sqrd_diffs = diffs.multiply(diffs)
np.sqrt(sqrd_diffs.sum())
# 1.17


### Explanation

1. Build a Compressed Sparse Column (CSC) matrix using the csc_matrix((data, (row_indices, col_indices))) constructor.

import numpy as np
from scipy import sparse

mat = sparse.csc_matrix((ratings, (user_ids, movie_ids)))

# print the first 5 rows & cols
print(mat[:5, :5].todense())
# [[0 0 0 0 0]
#  [5 0 0 0 0]
#  [0 2 0 0 0]
#  [5 0 0 0 0]
#  [0 0 0 0 3]]
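
As a minimal sketch of how the constructor places values, here is the same call on a tiny, hypothetical dataset (the toy `ratings`, `user_ids`, and `movie_ids` below are made up for illustration and are not the problem's data). Each rating lands at row `user_ids[k]`, column `movie_ids[k]`, and the shape is inferred from the largest index in each dimension.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 4 ratings given by 3 users to 3 movies
ratings = np.array([5, 2, 5, 3])
user_ids = np.array([1, 2, 0, 2])   # row index of each rating
movie_ids = np.array([0, 1, 2, 2])  # column index of each rating

mat = sparse.csc_matrix((ratings, (user_ids, movie_ids)))

print(mat.shape)
# (3, 3)
print(mat.toarray())
# [[0 0 5]
#  [5 0 0]
#  [0 2 3]]
```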

2. Normalize the movie vectors (i.e. the column vectors).

col_lengths = np.sqrt((mat.multiply(mat)).sum(axis=0))
normalized = mat.multiply(1/col_lengths).tocsc()

# print the first 5 rows & cols
print(normalized[:5, :5].todense())
# [[0.         0.         0.         0.         0.        ]
#  [0.61545745 0.         0.         0.         0.        ]
#  [0.         0.26490647 0.         0.         0.        ]
#  [0.61545745 0.         0.         0.         0.        ]
#  [0.         0.         0.         0.         0.9486833 ]]


To normalize a vector, we divide each of its components by the vector's length.

• col_lengths = np.sqrt((mat.multiply(mat)).sum(axis=0)) gets the length of each column vector as a 1 x n NumPy matrix (one entry per column).
• mat.multiply(1/col_lengths) divides each column in mat by its length.
• .tocsc() converts the resulting coo matrix back to a csc matrix, which supports efficient column slicing.
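
One way to sanity-check the normalization is to confirm that every column ends up with length 1. The sketch below does this on a small, hypothetical matrix (`toy` is invented for illustration) whose column lengths are easy to verify by hand: the first column is (3, 4, 0) with length 5, the second is (0, 1, 0) with length 1.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x2 matrix with column lengths 5 and 1
toy = sparse.csc_matrix(np.array([[3.0, 0.0],
                                  [4.0, 1.0],
                                  [0.0, 0.0]]))

# Same recipe as above: square, sum down each column, take the root
col_lengths = np.sqrt(toy.multiply(toy).sum(axis=0))
normalized = toy.multiply(1 / col_lengths).tocsc()

print(normalized.toarray())
# [[0.6 0. ]
#  [0.8 1. ]
#  [0.  0. ]]

# After normalizing, every column should have length 1
new_lengths = np.sqrt(normalized.multiply(normalized).sum(axis=0))
print(np.allclose(new_lengths, 1.0))
# True
```

Note that a column with no ratings at all has length 0, so 1/col_lengths would divide by zero for that column; the toy matrix above avoids this case.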

Warning

mat.multiply(mat) performs element-wise multiplication. Be careful not to use mat * mat! Unlike with NumPy arrays, the * operator on a SciPy sparse matrix such as csc_matrix performs matrix multiplication.
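
To make the difference concrete, the following sketch compares the two operations on a small, hypothetical 2x2 matrix (the names `dense` and `sp` are illustrative only).

```python
import numpy as np
from scipy import sparse

dense = np.array([[1, 2],
                  [3, 4]])
sp = sparse.csc_matrix(dense)

# Element-wise product: each entry is squared
elementwise = sp.multiply(sp).toarray()
print(elementwise)
# [[ 1  4]
#  [ 9 16]]

# On a csc_matrix, * is the matrix product
matprod = (sp * sp).toarray()
print(matprod)
# [[ 7 10]
#  [15 22]]

# On a plain NumPy array, * is element-wise
print(dense * dense)
# [[ 1  4]
#  [ 9 16]]
```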

3. Calculate the Euclidean distance between normalized movie 2 and normalized movie 4.

c2 = normalized.getcol(2)
c4 = normalized.getcol(4)
diffs = c2 - c4
sqrd_diffs = diffs.multiply(diffs)
np.sqrt(sqrd_diffs.sum())
# 1.17
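
As a rough cross-check of the distance calculation, the sketch below repeats the same steps on two hypothetical unit-length columns (made-up values, not the movie data) and compares the result with NumPy's dense np.linalg.norm. It also illustrates a general identity for unit vectors: the squared Euclidean distance equals 2 minus twice their dot product.

```python
import numpy as np
from scipy import sparse

# Hypothetical matrix whose two columns already have length 1
cols = sparse.csc_matrix(np.array([[0.6, 0.0],
                                   [0.8, 1.0],
                                   [0.0, 0.0]]))
a = cols.getcol(0)
b = cols.getcol(1)

# Sparse version, same steps as the solution
diffs = a - b
dist = np.sqrt(diffs.multiply(diffs).sum())

# Dense cross-check
dense_dist = np.linalg.norm(a.toarray() - b.toarray())

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
dot = a.multiply(b).sum()
via_dot = np.sqrt(2 - 2 * dot)

print(np.isclose(dist, dense_dist), np.isclose(dist, via_dot))
# True True
```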