I am trying to reduce the size of a spatial data set by clustering the points and finding the center point of each cluster. I followed this article (which uses
DBSCAN) and it helped, except that now that the data set has grown I can no longer proceed because of memory errors. So I switched to the next best thing,
HDBSCAN, but I am getting some strange results.
First, I am using the following:
clusterer = hdbscan.HDBSCAN(min_samples=1, min_cluster_size=25, algorithm='prims_balltree', metric='haversine')
This does produce clusters, but when I dig into them, some are practically the same, e.g. two clusters made up of nearly identical geo-locations. I expected these to be a single cluster.
Second, to resolve the above problem, I tried using cluster_selection_epsilon=0.1/6371 to put geo-locations within 100 m of each other in the same cluster.
clusterer = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=10, metric='haversine',cluster_selection_epsilon=0.1/6371)
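For reference, the epsilon arithmetic above can be sketched as follows. This is only an illustration, assuming a mean Earth radius of 6371 km and coordinates given in degrees that are converted to radians before any haversine calculation. The helper names km_to_radians and haversine_radians are hypothetical and are not part of hdbscan.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def km_to_radians(distance_km):
    """Convert a surface distance in km to an angle in radians."""
    return distance_km / EARTH_RADIUS_KM


def haversine_radians(lat1, lon1, lat2, lon2):
    """Great-circle distance in radians between two points given in degrees."""
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


# 100 m expressed in radians, the same value as 0.1/6371
epsilon = km_to_radians(0.1)
```

Comparing haversine_radians(...) for two of the plotted points against epsilon is one way to check whether a given pair really falls inside the 100 m threshold.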
But then I get one big cluster with over a hundred thousand points, and when plotting it on folium I found that those points are not all within
100 m of each other; rather, they are separate clusters of points that are 100 m apart.
I am probably not using the min_cluster_size in terms of
Can someone explain what is happening? How can I achieve my goal of clustering similar geo-locations and narrowing each cluster down to one center point?
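To make the last step concrete, here is a minimal sketch of reducing each cluster to one center point. It assumes a list of (lat, lon) pairs in degrees and a parallel list of integer labels in which -1 marks noise, as in HDBSCAN's labels_ attribute. The function name cluster_centers is hypothetical, and averaging unit vectors on the sphere is just one possible definition of "center".

```python
import math
from collections import defaultdict


def cluster_centers(points_deg, labels):
    """Return {label: (lat, lon)} with one center per cluster, in degrees."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    for (lat, lon), label in zip(points_deg, labels):
        if label == -1:  # skip noise points
            continue
        phi, lam = math.radians(lat), math.radians(lon)
        # accumulate the point as a 3D unit vector
        s = sums[label]
        s[0] += math.cos(phi) * math.cos(lam)
        s[1] += math.cos(phi) * math.sin(lam)
        s[2] += math.sin(phi)

    centers = {}
    for label, (x, y, z) in sums.items():
        # convert the summed vector back to latitude/longitude
        lat = math.degrees(math.atan2(z, math.hypot(x, y)))
        lon = math.degrees(math.atan2(y, x))
        centers[label] = (lat, lon)
    return centers
```

Working with unit vectors rather than averaging raw latitudes and longitudes avoids problems near the antimeridian, at the cost of a little extra trigonometry.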