Return The Furthermost Outlier In Kmeans Clustering?

March 09, 2024

Is there any easy way to return the furthermost outlier after sklearn kmeans clustering? Essentially I want to make a list of the biggest outliers for a load of clusters. Unfortuna

Solution 1:

K-means is not well suited for "outlier" detection.

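K-means tends to put an outlier into a one-element cluster of its own. The outlier then sits at essentially zero distance from its cluster center, the smallest distance possible, so ranking points by distance to their center will not detect it.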
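A minimal sketch of that effect, on synthetic data that is purely an assumption for illustration (one tight blob plus a single extreme point, with `n_clusters=2`). It is not taken from the question's data, and how strongly the effect shows up depends on the data and on the number of clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data (assumption): 50 points in one blob plus one extreme point, stored last.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), [[100.0, 100.0]]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from every point to the center of its assigned cluster.
dist = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)

outlier_label = model.labels_[-1]
cluster_size = int(np.sum(model.labels_ == outlier_label))
print("points in the outlier's cluster:", cluster_size)
print("outlier's distance to its center:", dist[-1])
print("index of the point furthest from its center:", int(np.argmax(dist)))
```

On data like this the extreme point ends up alone in its cluster, so the point reported as "furthest from its center" is an ordinary member of the blob rather than the actual outlier.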

K-means is not robust enough when there are outliers in your data. You may actually want to remove outliers prior to using k-means.

Use something like kNN distances, LOF (Local Outlier Factor) or LoOP (Local Outlier Probabilities) instead.
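As a rough sketch of the LOF route, scikit-learn ships `LocalOutlierFactor` in `sklearn.neighbors`. The synthetic data, `n_neighbors=20` and the choice of reporting the top 5 points are assumptions made for illustration only:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic data (assumption): 100 blob points plus one far-away point at index 100.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), [[10.0, 10.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)

# negative_outlier_factor_ is near -1 for inliers and much lower for outliers,
# so negating it gives a score where larger means "more outlying".
scores = -lof.negative_outlier_factor_

top = np.argsort(scores)[::-1][:5]  # indices of the 5 most outlying points
print(X[top])
```

Unlike the distance-to-center ranking, this score does not depend on a clustering, so a point cannot hide by getting a cluster of its own.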

Solution 2:

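Sascha basically gives it away in the comments: if X denotes your data (as a NumPy array) and model is the fitted instance of KMeans, you can sort the rows of X by the distance to their assigned cluster centers with the line below. The sort is ascending, so the points furthest from their centers come last.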

X[np.argsort(np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1))]
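A self-contained version of the same idea might look like the following sketch. The `make_blobs` data, `n_clusters=3` and the `top_k` cutoff are placeholders chosen for illustration; the order is reversed so that the furthest points come first:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Placeholder data (assumption): three synthetic blobs.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance from each point to the center of the cluster it was assigned to.
dist = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)

# argsort is ascending, so reverse it to put the furthest points first.
order = np.argsort(dist)[::-1]

top_k = 5  # hypothetical cutoff
furthest_points = X[order[:top_k]]
print(furthest_points)
print(dist[order[:top_k]])
```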

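Alternatively, fit_transform returns the distance from each point to every cluster center, and each point is assigned to the cluster whose center minimizes the Euclidean distance to it. Taking the row-wise minimum therefore gives the distance to the assigned center, so you can fit and sort in one step through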

X[np.argsort(np.min(KMeans(n_clusters=2).fit_transform(X), axis=1))]
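Since the question asks for the biggest outlier of each of several clusters, one possible sketch is to group the distances by label and take the maximum within each group. The data and the `furthest` dictionary are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Placeholder data (assumption): three synthetic blobs.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

model = KMeans(n_clusters=3, n_init=10, random_state=0)
dist_to_all = model.fit_transform(X)  # shape (n_samples, n_clusters)

# Map each cluster label to the index of its member furthest from the center.
furthest = {}
for label in range(model.n_clusters):
    members = np.flatnonzero(model.labels_ == label)
    # Column `label` holds the distances to this cluster's center.
    furthest[label] = int(members[np.argmax(dist_to_all[members, label])])

for label, idx in furthest.items():
    print(label, idx, X[idx], dist_to_all[idx, label])
```

Keep the caveat from Solution 1 in mind: a point that k-means isolated in its own cluster will show up here with a distance of about zero.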
