
Return The Furthermost Outlier In Kmeans Clustering?

Is there any easy way to return the furthermost outlier after sklearn kmeans clustering? Essentially I want to make a list of the biggest outliers for a load of clusters. Unfortuna

Solution 1:

K-means is not well suited for "outlier" detection.

K-means tends to put an outlier into a one-element cluster of its own. The outlier then sits exactly on its cluster center, so its distance to the center is zero and it will not be detected.
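To illustrate that effect, here is a minimal sketch on made-up data: two tight blobs plus one far-away point, clustered with one more cluster than there are blobs. The data, the choice of `n_clusters=3`, and the variable names are all assumptions for the demo, not something from the question.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data (assumption): two blobs and one extreme point at index 100.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2)),
    [[100.0, 100.0]],
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance from every point to the center of its own cluster.
dist = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)

# Size of the cluster that the extreme point ended up in.
outlier_cluster = model.labels_[100]
outlier_cluster_size = int(np.sum(model.labels_ == outlier_cluster))
outlier_distance = float(dist[100])
```

On data like this the extreme point typically ends up alone in its cluster, so ranking by distance to the center puts it at the bottom rather than the top.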

K-means is not robust to outliers in your data: because each center is a mean, a single extreme point can pull it away from the bulk of its cluster. You may actually want to remove outliers prior to using k-means.

Use something like kNN, LOF, or LoOP instead.
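As a rough sketch of the LOF route, scikit-learn ships `LocalOutlierFactor`. The synthetic data and `n_neighbors=20` below are assumptions for illustration; you would tune them for your own data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic data (assumption): two tight blobs and one isolated point at index 100.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),
    [[2.5, 10.0]],
])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 marks outliers, 1 marks inliers
scores = -lof.negative_outlier_factor_   # larger means more anomalous

# Row indices ordered from most to least anomalous.
ranking = np.argsort(scores)[::-1]
top_outlier = X[ranking[0]]
```

`ranking` gives you the "list of biggest outliers" directly, without depending on how the clusters happened to be formed.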

Solution 2:

Sascha basically gives it away in the comments, but if X denotes your data and model is the fitted KMeans instance, you can sort the rows of X by the distance to their assigned cluster centers. The order is ascending, so the furthest point comes last:

X[np.argsort(np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1))]
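If what you want is the furthest point in each cluster rather than one global ordering, the same distance vector can be grouped by label. A minimal sketch, assuming synthetic two-blob data; `furthest` and `overall` are names made up for this example:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data (assumption): two well-separated blobs.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from every point to the center of its own cluster.
dist = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)

# For each cluster, the row index of the member furthest from the center.
furthest = {}
for k in range(model.n_clusters):
    members = np.flatnonzero(model.labels_ == k)
    furthest[k] = int(members[np.argmax(dist[members])])

# Row index of the single furthest point overall.
overall = int(np.argmax(dist))
```

`X[furthest[k]]` is then the biggest outlier of cluster `k`, and `X[overall]` is the furthermost point across all clusters.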

Alternatively, since each point is assigned to the cluster whose center minimizes the Euclidean distance to it, and fit_transform returns the distance from every point to every center, you can fit and sort in one step through

X[np.argsort(np.min(KMeans(n_clusters=2).fit_transform(X), axis=1))]
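A quick way to convince yourself that the two approaches agree is to compute both on the same fitted model. This is only a sketch on synthetic data (an assumption); it also keeps the model around, which the one-liner above does not, and reverses the order so the furthest point comes first.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data (assumption): two well-separated blobs.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
all_dist = model.fit_transform(X)   # shape (n_samples, n_clusters)
nearest = all_dist.min(axis=1)      # distance to the closest center

# Same quantity computed from the labels, as in the first snippet.
own = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)
same = bool(np.allclose(nearest, own))

# Row indices ordered from furthest to closest.
order = np.argsort(nearest)[::-1]
```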
