
Movie Distance Solution


import numpy as np
from scipy import sparse

# 1. Build csc matrix
mat = sparse.csc_matrix((ratings, (user_ids, movie_ids)))

# 2. Normalize the columns
col_lengths = np.sqrt((mat.multiply(mat)).sum(axis=0))
normalized = mat.multiply(1/col_lengths).tocsc()

# 3. Calculate the Euclidean distance between normalized movie 2 and normalized movie 4
c2 = normalized.getcol(2)
c4 = normalized.getcol(4)
diffs = c2 - c4
sqrd_diffs = diffs.multiply(diffs)
np.sqrt(sqrd_diffs.sum())
# 1.17

Explanation

  1. Build a Compressed Sparse Column (CSC) matrix using the csc_matrix((data, (row_indices, col_indices))) constructor.

    import numpy as np
    from scipy import sparse
    
    mat = sparse.csc_matrix((ratings, (user_ids, movie_ids)))
    
    # print the first 5 rows & cols
    print(mat[:5, :5].todense())
    # [[0 0 0 0 0]
    #  [5 0 0 0 0]
    #  [0 2 0 0 0]
    #  [5 0 0 0 0]
    #  [0 0 0 0 3]]
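
To make the constructor's arguments concrete, here is a minimal sketch on toy data. The three small arrays below are hypothetical stand-ins for the real `ratings`, `user_ids`, and `movie_ids`, which come from the problem statement.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: position i describes one rating,
# i.e. user_ids[i] gave movie_ids[i] the score ratings[i].
user_ids = np.array([0, 1, 1, 2])
movie_ids = np.array([0, 0, 2, 1])
ratings = np.array([4, 5, 3, 2])

# Rows are users, columns are movies; cells not listed are implicit zeros.
toy = sparse.csc_matrix((ratings, (user_ids, movie_ids)))

print(toy.shape)  # (3, 3), inferred from the largest row and column index
print(toy.toarray())
# [[4 0 0]
#  [5 0 3]
#  [0 2 0]]
```

If the last users or movies have no ratings at all, the inferred shape will be too small; in that case the optional `shape=(n_users, n_movies)` argument can be passed to the constructor.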
    
  2. Normalize the movie vectors (i.e. the column vectors).

    col_lengths = np.sqrt((mat.multiply(mat)).sum(axis=0))
    normalized = mat.multiply(1/col_lengths).tocsc()
    
    # print the first 5 rows & cols
    print(normalized[:5, :5].todense())
    # [[0.         0.         0.         0.         0.        ]
    #  [0.61545745 0.         0.         0.         0.        ]
    #  [0.         0.26490647 0.         0.         0.        ]
    #  [0.61545745 0.         0.         0.         0.        ]
    #  [0.         0.         0.         0.         0.9486833 ]]
    

    To normalize a vector, we divide each of its components by the vector's length.

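    • col_lengths = np.sqrt((mat.multiply(mat)).sum(axis=0)) gets the length of each column vector. The result is a 1 x n NumPy matrix (one entry per column), not a flat array.
    • mat.multiply(1/col_lengths) divides each column in mat by its length. The row of reciprocals is broadcast down the rows of mat.
    • .tocsc() converts the resulting coo matrix back to a csc matrix, so that columns can be extracted efficiently in the next step.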
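
As a sanity check, the same three operations can be run on a matrix small enough to verify by hand. This is only a sketch: the 3 x 2 matrix `toy` and the names `toy_lengths` and `toy_normalized` are hypothetical and not part of the original solution.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x2 matrix: column 0 is (3, 4, 0), column 1 is (0, 0, 2).
toy = sparse.csc_matrix(np.array([[3.0, 0.0],
                                  [4.0, 0.0],
                                  [0.0, 2.0]]))

# Length of each column: sqrt(3^2 + 4^2) = 5 and sqrt(2^2) = 2.
toy_lengths = np.sqrt(toy.multiply(toy).sum(axis=0))
print(toy_lengths)  # [[5. 2.]]

# The 1 x 2 row of reciprocals scales each column by 1/length.
toy_normalized = toy.multiply(1 / toy_lengths).tocsc()
print(toy_normalized.toarray())
# [[0.6 0. ]
#  [0.8 0. ]
#  [0.  1. ]]

# Every normalized column should now have length 1.
new_lengths = np.sqrt(toy_normalized.multiply(toy_normalized).sum(axis=0))
print(np.allclose(new_lengths, 1.0))  # True
```

Note that this recipe assumes every column has at least one nonzero entry. A movie with no ratings has length 0, and 1/col_lengths would then contain inf for that column.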

    Warning

    mat.multiply(mat) does element-wise multiplication. Be careful not to use mat * mat! Unlike with NumPy arrays, the star operator on a SciPy sparse matrix such as csc_matrix performs matrix multiplication.
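
The difference is easy to see on a small square matrix. The sketch below uses a hypothetical 2 x 2 example (`dense` and `sp` are illustration names only) and compares the two operators with their NumPy counterparts.

```python
import numpy as np
from scipy import sparse

dense = np.array([[1, 2],
                  [3, 4]])
sp = sparse.csc_matrix(dense)

# .multiply() is element-wise: every entry is squared.
print(sp.multiply(sp).toarray())
# [[ 1  4]
#  [ 9 16]]

# The star operator on a csc_matrix is a matrix product (like dense @ dense).
print((sp * sp).toarray())
# [[ 7 10]
#  [15 22]]

# The star operator on a NumPy array is element-wise.
print(dense * dense)
# [[ 1  4]
#  [ 9 16]]
```

On the non-square ratings matrix, mat * mat would not silently give a wrong answer; it would fail with a dimension mismatch, because the inner dimensions of the product do not agree.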

  3. Calculate the Euclidean distance between normalized movie 2 and normalized movie 4.

    c2 = normalized.getcol(2)
    c4 = normalized.getcol(4)
    diffs = c2 - c4
    sqrd_diffs = diffs.multiply(diffs)
    np.sqrt(sqrd_diffs.sum())
    # 1.17
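
One way to gain confidence in the sparse recipe is to cross-check it against a dense computation on a small example. The sketch below assumes a hypothetical matrix `unit` whose two columns already have length 1; it is not the ratings data from the problem.

```python
import numpy as np
from scipy import sparse

# Hypothetical unit-length columns: (0.6, 0.8, 0) and (0, 0.6, 0.8).
unit = sparse.csc_matrix(np.array([[0.6, 0.0],
                                   [0.8, 0.6],
                                   [0.0, 0.8]]))
a = unit.getcol(0)
b = unit.getcol(1)

# Same recipe as above: square the differences, sum them, take the root.
diffs = a - b
dist = np.sqrt(diffs.multiply(diffs).sum())

# Dense cross-check with NumPy's built-in Euclidean norm.
a_dense = a.toarray().ravel()
b_dense = b.toarray().ravel()
print(np.isclose(dist, np.linalg.norm(a_dense - b_dense)))  # True
```

The dense version is fine for a check like this, but on a large ratings matrix it is usually preferable to stay sparse, as the solution does, so that the zeros are never materialized.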
    
