Clusterer
Module for clustering diff chunks based on their embeddings.
This module provides functionality to group related code changes together based on their semantic similarity, using vector embeddings and clustering algorithms. The clustering process helps identify related changes that should be committed together.
Key components:

- DiffClusterer: Main class that implements clustering algorithms for diff chunks
- ClusteringParams: Type definition for parameters used by clustering algorithms

The module supports multiple clustering methods:

1. Agglomerative (hierarchical) clustering: Builds a hierarchy of clusters based on distances between embeddings, using a distance threshold to determine final cluster boundaries
2. DBSCAN: Density-based clustering that groups points in high-density regions, treating low-density points as noise/outliers
logger
module-attribute
logger = getLogger(__name__)
ClusteringParams
Bases: TypedDict
Type definition for clustering algorithm parameters.
These parameters configure the behavior of the clustering algorithms:
For agglomerative clustering:

- n_clusters: Optional limit on number of clusters (None means no limit)
- distance_threshold: Maximum distance for clusters to be merged (lower = more clusters)
- metric: Distance metric to use (e.g., "precomputed" for precomputed distance matrix)
- linkage: Strategy for calculating distances between clusters ("average", "single", etc.)

For DBSCAN:

- eps: Maximum distance between points in the same neighborhood
- min_samples: Minimum points required to form a dense region
- metric: Distance metric to use
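The split between the two parameter groups can be sketched as follows. `ClusteringParamsSketch`, `select_params`, and the example values are hypothetical stand-ins written for illustration. Declaring the dictionary with `total=False` is also an assumption, made so that each algorithm can omit the other's keys.

```python
from __future__ import annotations

from typing import TypedDict


class ClusteringParamsSketch(TypedDict, total=False):
    """Illustrative stand-in for ClusteringParams (not the real class)."""

    n_clusters: int | None
    distance_threshold: float | None
    metric: str
    linkage: str
    eps: float
    min_samples: int


# Keys each algorithm understands, following the lists above.
AGGLOMERATIVE_KEYS = ("n_clusters", "distance_threshold", "metric", "linkage")
DBSCAN_KEYS = ("eps", "min_samples", "metric")


def select_params(params: ClusteringParamsSketch, method: str) -> dict[str, object]:
    """Keep only the keys that apply to the chosen method (hypothetical helper)."""
    keys = AGGLOMERATIVE_KEYS if method == "agglomerative" else DBSCAN_KEYS
    return {key: params[key] for key in keys if key in params}


# Example values are assumptions, not the module's defaults.
params: ClusteringParamsSketch = {
    "n_clusters": None,
    "distance_threshold": 0.5,
    "metric": "precomputed",
    "linkage": "average",
    "eps": 0.3,
    "min_samples": 2,
}

agglomerative_params = select_params(params, "agglomerative")
dbscan_params = select_params(params, "dbscan")
```

Filtering this way keeps `metric` shared between both algorithms while preventing, for example, `eps` from being passed to an agglomerative estimator that does not accept it.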
Source code in src/codemap/git/semantic_grouping/clusterer.py
n_clusters
instance-attribute
n_clusters: int | None
distance_threshold
instance-attribute
distance_threshold: float | None
metric
instance-attribute
metric: str
linkage
instance-attribute
linkage: str
eps
instance-attribute
eps: float
min_samples
instance-attribute
min_samples: int
T
module-attribute
T = TypeVar('T')
DiffClusterer
Clusters diff chunks based on their semantic embeddings.
This class provides methods to group related code changes by their semantic similarity, using vector embeddings and standard clustering algorithms from scikit-learn.
Clustering helps identify code changes that are related to each other and should be grouped in the same commit, even if they appear in different files.
The class supports multiple clustering algorithms:

1. Agglomerative clustering: Hierarchical clustering that's good for finding natural groupings without needing to specify the exact number of clusters
2. DBSCAN: Density-based clustering that can identify outliers and works well with irregularly shaped clusters
Source code in src/codemap/git/semantic_grouping/clusterer.py
__init__
__init__(
config_loader: ConfigLoader, **kwargs: object
) -> None
Initialize the clusterer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config_loader | ConfigLoader | ConfigLoader to use for configuration (follows DI pattern) | required |
**kwargs | object | Additional parameters for the clustering algorithm. For agglomerative: distance_threshold, linkage, etc. For DBSCAN: eps, min_samples, etc. | {} |
Raises:
Type | Description |
---|---|
ImportError | If scikit-learn is not installed |
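One common way to surface a missing optional dependency as an `ImportError` with a helpful message is to import it lazily inside a small helper. The sketch below shows that pattern in general terms. `require_module` and its message text are hypothetical, and the source does not confirm that the constructor is written this way.

```python
import importlib
from types import ModuleType


def require_module(name: str, install_hint: str) -> ModuleType:
    """Import `name`, re-raising ImportError with an actionable message."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        msg = f"{name} is required for clustering. Install it with: {install_hint}"
        raise ImportError(msg) from exc


# Demonstrate the failure path with a module name that should not exist.
try:
    require_module("definitely_not_installed_pkg", "pip install definitely-not-installed-pkg")
except ImportError as exc:
    error_message = str(exc)

print(error_message)
```

A constructor following this pattern might call `require_module("sklearn.cluster", "pip install scikit-learn")` and keep the imported classes as instance attributes, so the rest of the class never imports scikit-learn at module load time.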
Source code in src/codemap/git/semantic_grouping/clusterer.py
config
instance-attribute
config = clustering
method
instance-attribute
method = method
kwargs
instance-attribute
kwargs = kwargs
AgglomerativeClustering
instance-attribute
AgglomerativeClustering = AgglomerativeClustering
DBSCAN
instance-attribute
DBSCAN = DBSCAN
cosine_similarity
instance-attribute
cosine_similarity = cosine_similarity
cluster
Cluster chunks based on their embeddings.
Process:
1. Extracts chunks and embeddings from input tuples
2. Computes a similarity matrix using cosine similarity
3. Converts similarity to distance matrix (1 - similarity)
4. Applies clustering algorithm based on the chosen method
5. Organizes chunks into clusters based on labels
6. Handles special cases like noise points in DBSCAN
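Steps 1 to 3 can be sketched with plain NumPy. The class itself stores scikit-learn's `cosine_similarity`, so this is a swapped-in, hand-rolled equivalent for illustration only. `cosine_distance_matrix`, the zero-vector guard, the clipping, and the string stand-ins for `DiffChunk` objects are all assumptions.

```python
import numpy as np


def cosine_distance_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine distance matrix (1 - cosine similarity)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against zero vectors (an assumption; the source does not say
    # how they are handled).
    unit = embeddings / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T
    distance = 1.0 - similarity
    # Floating-point error can produce tiny negative values, which
    # precomputed-distance clustering does not accept.
    distance = np.clip(distance, 0.0, 2.0)
    np.fill_diagonal(distance, 0.0)
    return distance


# Hypothetical (chunk, embedding) pairs; strings stand in for DiffChunk objects.
chunk_embeddings = [
    ("chunk-a", np.array([1.0, 0.0])),
    ("chunk-b", np.array([2.0, 0.0])),  # same direction as chunk-a
    ("chunk-c", np.array([0.0, 3.0])),  # orthogonal to chunk-a
]

# Step 1: split the tuples into chunks and a 2-D embedding array.
chunks = [chunk for chunk, _ in chunk_embeddings]
embeddings = np.vstack([embedding for _, embedding in chunk_embeddings])

# Steps 2-3: cosine similarity, then distance = 1 - similarity.
distances = cosine_distance_matrix(embeddings)
print(distances)
```

Because cosine distance ignores vector length, `chunk-a` and `chunk-b` end up at distance 0 even though their norms differ, while the orthogonal `chunk-c` sits at distance 1 from both. The resulting matrix is what a clustering estimator configured with `metric="precomputed"` would consume in step 4.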
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_embeddings | list[tuple[DiffChunk, ndarray]] | List of (chunk, embedding) tuples where each embedding is a numpy array representing the semantic vector of a code chunk | required |
Returns:
Type | Description |
---|---|
list[list[DiffChunk]] | List of lists, where each inner list contains chunks in the same cluster. With DBSCAN, noise points (label -1) are returned as individual single-item clusters. |
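The mapping from cluster labels to the returned list of lists, including the single-item treatment of noise points, might look like the sketch below. `group_by_label` is a hypothetical helper, and placing noise clusters after the regular ones is an assumption about ordering that the source does not specify.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Sequence, TypeVar

T = TypeVar("T")


def group_by_label(items: Sequence[T], labels: Sequence[int]) -> list[list[T]]:
    """Group items by cluster label; each noise item (-1) becomes its own cluster."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")

    clusters: dict[int, list[T]] = defaultdict(list)
    noise: list[list[T]] = []
    for item, label in zip(items, labels):
        if label == -1:
            # Noise points are kept, not dropped: one single-item cluster each.
            noise.append([item])
        else:
            clusters[label].append(item)

    # Regular clusters in label order, followed by the noise singletons.
    return [clusters[label] for label in sorted(clusters)] + noise


# Strings stand in for DiffChunk objects; labels mimic a DBSCAN result.
grouped = group_by_label(["a", "b", "c", "d", "e"], [0, 0, 1, 1, -1])
print(grouped)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Keeping noise points as singletons means every input chunk appears in exactly one output cluster, so no change is silently left out of the grouping.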
Examples:
>>> embedder = DiffEmbedder()
>>> chunk_embeddings = embedder.embed_chunks(diff_chunks)
>>> clusterer = DiffClusterer(config_loader, method="agglomerative", distance_threshold=0.5)
>>> clusters = clusterer.cluster(chunk_embeddings)
>>> for i, cluster in enumerate(clusters):
...     print(f"Cluster {i} has {len(cluster)} chunks")
Source code in src/codemap/git/semantic_grouping/clusterer.py