Header Ads

using scipy connected_components correctly

Scipy provides various sparse matrix representations to work with sparse data with high dimentionality.
Some times we need to find clusters in sparse matrix based on similar trributes. we can model this using record to record sparse matrix with node i,j contains non zero similarity score.

          D1,D2,D3,D4,D5
graph = [
D1...  [ 5,  4 ,  4,  0 ,  0 ],
D2...  [ 4,  5 ,  3 , 0 ,  0 ],
D3...  [ 4,  3,   5,  0,   0],
D4...  [0,   0 ,  0,  5,   1],
D5...  [0,   0,   0,  1,   5]
...  ]
Here we have 
similarity(D1,D2)=4
similarity(D1,D3)=4
similarity(D2,D3)=3
...
...


from scipy.sparse import csr_matrix
graph=[[ 5,  4 ,  4,  0 ,  0 ],
[ 4,  5 ,  3 , 0 ,  0 ],
[ 4,  3,   5,  0,   0],
[0,   0 ,  0,  5,   1],
[0,   0,   0,  1,   5]]
graph=csr_matrix(graph)
print(repr(graph))

out: 

<5x5 sparse matrix of type '<class 'numpy.int64'>
'with 13 stored elements in Compressed Sparse Row format>


suppose now i have a requirement to take only similarity higher than a threshold we can do

graph[graph<threshold]=0
connected_components(graph)

out:

(1, array([0, 0, 0, 0, 0], dtype=int32))


which is incorrect because now graph contains contains more values than previous graph


print(graph.toarray())

out:

array([[5, 4, 4, 0, 0],

       [4, 5, 3, 0, 0],

       [4, 3, 5, 0, 0],

       [0, 0, 0, 5, 0],

       [0, 0, 0, 0, 5]])

print(repr(graph))

out:

<5x5 class="" matrix="" numpy.int64="" of="" sparse="" type="">'
with 25 stored elements in Compressed Sparse Row format>

here all zeros we got into csr matrix because we did graph[graph<threshold]=0 and as per documentation of csgraph. it considers zero weight edges as valid edges.To get correct graph we need to remove all zeroes from the csr_matrix using eliminate_zeros().
graph.eliminate_zeros()
print(repr(graph))
out:
<5x5 matrix="" numpy.int64="" of="" sparse="" type="">'with 11 stored elements in Compressed Sparse Row format>

connected_components(graph)
out:

(3, array([0, 0, 0, 1, 2], dtype=int32))

So after performing any thresholding operation on sparse matrix we need to call eliminate_zeros() before calling connected_components() or any other csgraph based functions if we don't want  extra zero weight edges in graph. 

No comments

Powered by Blogger.