I am trying to reduce the size of a spatial data set by clustering the points and finding the center point of each cluster. I followed this article (which uses DBSCAN), and it helped, but now that the data set has grown I can no longer proceed because of memory errors. So I switched to the next best thing, HDBSCAN, but I am getting some strange results. First, I am using the following:

clusterer = hdbscan.HDBSCAN(min_samples=1, min_cluster_size=25, algorithm='prims_balltree', metric='haversine')

This produces clusters, but when I dig into them, some are practically the same, e.g. two clusters made up of similar geo-locations. I would have expected these to form a single cluster.

Second, to resolve the above problem, I tried using cluster_selection_epsilon=0.1/6371 so that geo-locations within 100 m of each other end up in the same cluster.

clusterer = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=10, metric='haversine', cluster_selection_epsilon=0.1/6371)
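
For context, here is a minimal sketch of the unit conversion behind that epsilon value. It assumes a mean Earth radius of 6371 km and that the coordinates are (lat, lon) in radians, which is what the haversine metric in scikit-learn expects. The helper names and the two sample points are hypothetical and are not part of hdbscan.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, matching the 6371 used above


def km_to_radians(km):
    """Convert a ground distance in km to an angular distance in radians."""
    return km / EARTH_RADIUS_KM


def haversine_radians(p, q):
    """Great-circle angular distance between two (lat, lon) points given in radians."""
    (lat1, lon1), (lat2, lon2) = p, q
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


# Two made-up points a little under 100 m apart along a meridian (degrees).
p_deg = (52.0000, 13.0000)
q_deg = (52.0008, 13.0000)

# The haversine formula works on radians, so convert before measuring.
p = tuple(math.radians(v) for v in p_deg)
q = tuple(math.radians(v) for v in q_deg)

eps = km_to_radians(0.1)             # same value as 0.1 / 6371
dist = haversine_radians(p, q)       # angular distance in radians
dist_m = dist * EARTH_RADIUS_KM * 1000

print(f"eps = {eps:.3e} rad, distance = {dist:.3e} rad ({dist_m:.1f} m)")
```

If the coordinates were passed in degrees rather than radians, the same epsilon would correspond to a very different ground distance, so this is one thing worth ruling out.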

But then I get one big cluster with over a hundred thousand points, and when plotting it on folium I found that those points are not all within 100 m of each other; rather, they are separate clusters of points that are 100 m apart. I am probably not using min_cluster_size correctly with the haversine metric. Can someone explain what's happening? How can I achieve my goal of clustering similar geo-locations and narrowing each cluster down to one center point?
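
To make the last step concrete, here is a rough sketch of reducing each cluster to one center point. It averages the points as 3D unit vectors and converts the mean back to latitude and longitude. This is just one possible definition of "center". The function name, coordinates, and labels are made up for illustration; with hdbscan the labels would come from `clusterer.labels_`, where -1 marks noise.

```python
import math
from collections import defaultdict


def cluster_centers(coords_deg, labels):
    """Return {label: (lat, lon)} in degrees, skipping noise points (label -1)."""
    # Per label: running sums of x, y, z on the unit sphere, plus a point count.
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for (lat, lon), label in zip(coords_deg, labels):
        if label == -1:
            continue  # noise is not assigned to any cluster
        phi, lam = math.radians(lat), math.radians(lon)
        s = sums[label]
        s[0] += math.cos(phi) * math.cos(lam)
        s[1] += math.cos(phi) * math.sin(lam)
        s[2] += math.sin(phi)
        s[3] += 1

    centers = {}
    for label, (x, y, z, n) in sums.items():
        x, y, z = x / n, y / n, z / n
        # Convert the mean vector back to latitude/longitude in degrees.
        lat = math.degrees(math.atan2(z, math.hypot(x, y)))
        lon = math.degrees(math.atan2(y, x))
        centers[label] = (lat, lon)
    return centers


# Hypothetical data: two nearby points, one isolated cluster point, one noise point.
coords = [(52.0000, 13.0000), (52.0002, 13.0002), (48.0000, 2.0000), (10.0, 10.0)]
labels = [0, 0, 1, -1]
centers = cluster_centers(coords, labels)
print(centers)
```

For clusters only a few hundred meters across, a plain mean of latitude and longitude would give almost the same result. The vector average mainly matters for clusters near the poles or the antimeridian.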
