Clusterer

Module for clustering diff chunks based on their embeddings.

This module provides functionality to group related code changes together based on their semantic similarity, using vector embeddings and clustering algorithms. The clustering process helps identify related changes that should be committed together.

Key components:
- DiffClusterer: Main class that implements clustering algorithms for diff chunks
- ClusteringParams: Type definition for parameters used by clustering algorithms

The module supports multiple clustering methods:
1. Agglomerative (hierarchical) clustering: Builds a hierarchy of clusters based on distances between embeddings, using a distance threshold to determine final cluster boundaries
2. DBSCAN: Density-based clustering that groups points in high-density regions, treating low-density points as noise/outliers

logger module-attribute

logger = getLogger(__name__)

ClusteringParams

Bases: TypedDict

Type definition for clustering algorithm parameters.

These parameters configure the behavior of the clustering algorithms:

For agglomerative clustering:
- n_clusters: Optional limit on number of clusters (None means no limit)
- distance_threshold: Maximum distance for clusters to be merged (lower = more clusters)
- metric: Distance metric to use (e.g., "precomputed" for precomputed distance matrix)
- linkage: Strategy for calculating distances between clusters ("average", "single", etc.)

For DBSCAN:
- eps: Maximum distance between points in the same neighborhood
- min_samples: Minimum points required to form a dense region
- metric: Distance metric to use
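Because the type is declared with total=False, every key is optional, so a dictionary can carry only the parameters relevant to one algorithm. The sketch below redeclares the type locally so that it runs on its own (using Optional instead of the `|` syntax for compatibility with older Python versions); the values are illustrative assumptions, not project defaults.

```python
from typing import Optional, TypedDict


class ClusteringParams(TypedDict, total=False):
    """Local copy of the type for this example; all keys are optional."""

    n_clusters: Optional[int]
    distance_threshold: Optional[float]
    metric: str
    linkage: str
    eps: float
    min_samples: int


# Only the agglomerative keys are set (values are illustrative).
agglomerative_params: ClusteringParams = {
    "n_clusters": None,
    "distance_threshold": 0.5,
    "metric": "precomputed",
    "linkage": "average",
}

# Only the DBSCAN keys are set (values are illustrative).
dbscan_params: ClusteringParams = {
    "eps": 0.3,
    "min_samples": 2,
    "metric": "precomputed",
}

print(sorted(agglomerative_params))
print(sorted(dbscan_params))
```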

Source code in src/codemap/git/semantic_grouping/clusterer.py
class ClusteringParams(TypedDict, total=False):
	"""
	Type definition for clustering algorithm parameters.

	These parameters configure the behavior of the clustering algorithms:

	For agglomerative clustering:
	- n_clusters: Optional limit on number of clusters (None means no limit)
	- distance_threshold: Maximum distance for clusters to be merged (lower = more clusters)
	- metric: Distance metric to use (e.g., "precomputed" for precomputed distance matrix)
	- linkage: Strategy for calculating distances between clusters ("average", "single", etc.)

	For DBSCAN:
	- eps: Maximum distance between points in the same neighborhood
	- min_samples: Minimum points required to form a dense region
	- metric: Distance metric to use

	"""

	n_clusters: int | None
	distance_threshold: float | None
	metric: str
	linkage: str
	eps: float
	min_samples: int

n_clusters instance-attribute

n_clusters: int | None

distance_threshold instance-attribute

distance_threshold: float | None

metric instance-attribute

metric: str

linkage instance-attribute

linkage: str

eps instance-attribute

eps: float

min_samples instance-attribute

min_samples: int

T module-attribute

T = TypeVar('T')

DiffClusterer

Clusters diff chunks based on their semantic embeddings.

This class provides methods to group related code changes by their semantic similarity, using vector embeddings and standard clustering algorithms from scikit-learn.

Clustering helps identify code changes that are related to each other and should be grouped in the same commit, even if they appear in different files.

The class supports multiple clustering algorithms:
1. Agglomerative clustering: Hierarchical clustering that's good for finding natural groupings without needing to specify the exact number of clusters
2. DBSCAN: Density-based clustering that can identify outliers and works well with irregularly shaped clusters

Source code in src/codemap/git/semantic_grouping/clusterer.py
class DiffClusterer:
	"""
	Clusters diff chunks based on their semantic embeddings.

	This class provides methods to group related code changes by their semantic similarity,
	using vector embeddings and standard clustering algorithms from scikit-learn.

	Clustering helps identify code changes that are related to each other and should be
	grouped in the same commit, even if they appear in different files.

	The class supports multiple clustering algorithms:
	1. Agglomerative clustering: Hierarchical clustering that's good for finding natural
	   groupings without needing to specify the exact number of clusters
	2. DBSCAN: Density-based clustering that can identify outliers and works well with
	   irregularly shaped clusters

	"""

	def __init__(self, config_loader: "ConfigLoader", **kwargs: object) -> None:
		"""
		Initialize the clusterer.

		Args:
		    config_loader: ConfigLoader to use for configuration (follows DI pattern)
		    **kwargs: Additional parameters for the clustering algorithm:
		        - For agglomerative: distance_threshold, linkage, etc.
		        - For DBSCAN: eps, min_samples, etc.

		Raises:
		    ImportError: If scikit-learn is not installed

		"""
		self.config = config_loader.get.embedding.clustering
		self.method = self.config.method
		self.kwargs = kwargs

		# Import here to avoid making sklearn a hard dependency
		try:
			from sklearn.cluster import DBSCAN, AgglomerativeClustering
			from sklearn.metrics.pairwise import cosine_similarity

			self.AgglomerativeClustering = AgglomerativeClustering
			self.DBSCAN = DBSCAN
			self.cosine_similarity = cosine_similarity
		except ImportError as e:
			logger.exception("Failed to import scikit-learn. Please install it with: uv add scikit-learn")
			msg = "scikit-learn is required for clustering"
			raise ImportError(msg) from e

	def cluster(self, chunk_embeddings: list[tuple[DiffChunk, np.ndarray]]) -> list[list[DiffChunk]]:
		"""
		Cluster chunks based on their embeddings.

		Process:
		    1. Extracts chunks and embeddings from input tuples
		    2. Computes a similarity matrix using cosine similarity
		    3. Converts similarity to distance matrix (1 - similarity)
		    4. Applies clustering algorithm based on the chosen method
		    5. Organizes chunks into clusters based on labels
		    6. Handles special cases like noise points in DBSCAN

		Args:
		    chunk_embeddings: List of (chunk, embedding) tuples where each embedding
		        is a numpy array representing the semantic vector of a code chunk

		Returns:
		    List of lists, where each inner list contains chunks in the same cluster.
		    With DBSCAN, noise points (label -1) are returned as individual single-item clusters.

		Examples:
		    >>> embedder = DiffEmbedder()
		    >>> chunk_embeddings = embedder.embed_chunks(diff_chunks)
		    >>> # config_loader is an existing ConfigLoader; the clustering method comes from its config
		    >>> clusterer = DiffClusterer(config_loader, distance_threshold=0.5)
		    >>> clusters = clusterer.cluster(chunk_embeddings)
		    >>> for i, cluster in enumerate(clusters):
		    ...     print(f"Cluster {i} has {len(cluster)} chunks")

		"""
		if not chunk_embeddings:
			return []

		# Extract chunks and embeddings
		chunks = [ce[0] for ce in chunk_embeddings]
		embeddings = np.array([ce[1] for ce in chunk_embeddings])

		# Compute similarity matrix (1 - cosine distance)
		similarity_matrix = self.cosine_similarity(embeddings)

		# Convert to distance matrix (1 - similarity)
		distance_matrix = 1 - similarity_matrix

		# Apply clustering
		if self.method == "agglomerative":
			# Default parameters if not provided
			params = {
				"n_clusters": None,
				"distance_threshold": self.config.agglomerative.distance_threshold,
				"metric": self.config.agglomerative.metric,
				"linkage": self.config.agglomerative.linkage,
			}
			params.update(self.kwargs)

			clustering = self.AgglomerativeClustering(**params)
			labels = clustering.fit_predict(distance_matrix)

		elif self.method == "dbscan":
			# Default parameters if not provided
			params = {
				"eps": self.config.dbscan.eps,
				"min_samples": self.config.dbscan.min_samples,
				"metric": self.config.dbscan.metric,
			}
			params.update(self.kwargs)

			clustering = self.DBSCAN(**params)
			labels = clustering.fit_predict(distance_matrix)

		else:
			msg = f"Unsupported clustering method: {self.method}"
			raise ValueError(msg)

		# Group chunks by cluster label
		clusters: dict[int, list[DiffChunk]] = {}
		for i, label in enumerate(labels):
			# Convert numpy integer to Python int
			label_key = int(label)
			if label_key not in clusters:
				clusters[label_key] = []
			clusters[label_key].append(chunks[i])

		# Convert to list of lists and handle noise points (-1 label in DBSCAN)
		result: list[list[DiffChunk]] = []
		for label, cluster_chunks in sorted(clusters.items()):
			if label != -1:  # Regular cluster
				result.append(cluster_chunks)
			else:  # Noise points - each forms its own cluster
				result.extend([[chunk] for chunk in cluster_chunks])

		return result

__init__

__init__(
	config_loader: ConfigLoader, **kwargs: object
) -> None

Initialize the clusterer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config_loader | ConfigLoader | ConfigLoader to use for configuration (follows DI pattern) | required |
| **kwargs | object | Additional parameters for the clustering algorithm. For agglomerative: distance_threshold, linkage, etc. For DBSCAN: eps, min_samples, etc. | {} |
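The source listing for cluster shows these keyword arguments being applied on top of the configured defaults with params.update(self.kwargs), so a caller-supplied value wins over the configuration. A minimal sketch of that precedence follows; merge_params is a hypothetical helper and the default values are made up for illustration.

```python
from typing import Any


def merge_params(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of defaults updated with overrides; overrides win on clashes."""
    params = dict(defaults)  # copy so the caller's defaults are not mutated
    params.update(overrides)
    return params


# Illustrative values only; the real defaults come from the loaded configuration.
dbscan_defaults = {"eps": 0.3, "min_samples": 2, "metric": "precomputed"}
params = merge_params(dbscan_defaults, {"eps": 0.2})
print(params)
```

Since the merged dictionary is unpacked straight into the scikit-learn estimator, any key the estimator does not accept (for example `method`) would raise a TypeError when the estimator is constructed.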

Raises:

| Type | Description |
| --- | --- |
| ImportError | If scikit-learn is not installed |
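The constructor defers the scikit-learn import so that the package is only needed when clustering is actually used. The same pattern can be sketched generically as below; load_optional is a hypothetical helper written for this example and is not part of the module.

```python
import importlib
import logging
from types import ModuleType

logger = logging.getLogger(__name__)


def load_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency at call time, with a clearer error on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Log the original traceback, then re-raise with a user-facing message.
        logger.exception("Failed to import %s", module_name)
        msg = f"{module_name} is required for {feature}"
        raise ImportError(msg) from e


# A standard-library module stands in for the optional dependency here.
json_module = load_optional("json", "parsing")
print(json_module.__name__)
```

Chaining with `from e` keeps the original import failure visible in the traceback while the new message tells the user which feature needs the package.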

Source code in src/codemap/git/semantic_grouping/clusterer.py
def __init__(self, config_loader: "ConfigLoader", **kwargs: object) -> None:
	"""
	Initialize the clusterer.

	Args:
	    config_loader: ConfigLoader to use for configuration (follows DI pattern)
	    **kwargs: Additional parameters for the clustering algorithm:
	        - For agglomerative: distance_threshold, linkage, etc.
	        - For DBSCAN: eps, min_samples, etc.

	Raises:
	    ImportError: If scikit-learn is not installed

	"""
	self.config = config_loader.get.embedding.clustering
	self.method = self.config.method
	self.kwargs = kwargs

	# Import here to avoid making sklearn a hard dependency
	try:
		from sklearn.cluster import DBSCAN, AgglomerativeClustering
		from sklearn.metrics.pairwise import cosine_similarity

		self.AgglomerativeClustering = AgglomerativeClustering
		self.DBSCAN = DBSCAN
		self.cosine_similarity = cosine_similarity
	except ImportError as e:
		logger.exception("Failed to import scikit-learn. Please install it with: uv add scikit-learn")
		msg = "scikit-learn is required for clustering"
		raise ImportError(msg) from e

config instance-attribute

config = clustering

method instance-attribute

method = method

kwargs instance-attribute

kwargs = kwargs

AgglomerativeClustering instance-attribute

AgglomerativeClustering = AgglomerativeClustering

DBSCAN instance-attribute

DBSCAN = DBSCAN

cosine_similarity instance-attribute

cosine_similarity = cosine_similarity

cluster

cluster(
	chunk_embeddings: list[tuple[DiffChunk, ndarray]],
) -> list[list[DiffChunk]]

Cluster chunks based on their embeddings.

Process:
1. Extracts chunks and embeddings from input tuples
2. Computes a similarity matrix using cosine similarity
3. Converts similarity to distance matrix (1 - similarity)
4. Applies clustering algorithm based on the chosen method
5. Organizes chunks into clusters based on labels
6. Handles special cases like noise points in DBSCAN
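Steps 2 and 3 can be worked through by hand. The sketch below is a pure-Python stand-in for scikit-learn's cosine_similarity (the method itself uses the scikit-learn function on a numpy array); cosine_distance_matrix is a hypothetical name, the vectors are assumed to be non-zero, and the max(0.0, ...) clamp is an addition to absorb rounding error.

```python
import math


def cosine_distance_matrix(vectors: list[list[float]]) -> list[list[float]]:
    """Pairwise cosine distance (1 - cosine similarity) for non-zero vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    matrix = []
    for i, a in enumerate(vectors):
        row = []
        for j, b in enumerate(vectors):
            dot = sum(x * y for x, y in zip(a, b))
            similarity = dot / (norms[i] * norms[j])
            # Clamp at zero so rounding error never yields a negative distance.
            row.append(max(0.0, 1.0 - similarity))
        matrix.append(row)
    return matrix


# Same direction -> 0.0, orthogonal -> 1.0, opposite direction -> 2.0
d = cosine_distance_matrix([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
for row in d:
    print(row)
```

Cosine distance therefore ranges from 0 to 2, which is the scale on which distance_threshold and eps are interpreted.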

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| chunk_embeddings | list[tuple[DiffChunk, ndarray]] | List of (chunk, embedding) tuples where each embedding is a numpy array representing the semantic vector of a code chunk | required |

Returns:

| Type | Description |
| --- | --- |
| list[list[DiffChunk]] | List of lists, where each inner list contains chunks in the same cluster. With DBSCAN, noise points (label -1) are returned as individual single-item clusters. |
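The grouping of labels into the returned list can be sketched in isolation. The function below mirrors the logic in the source listing, with plain strings standing in for DiffChunk objects; group_by_label is a hypothetical name used only here.

```python
def group_by_label(items: list[str], labels: list[int]) -> list[list[str]]:
    """Group items by cluster label; each noise item (-1) becomes its own cluster."""
    clusters: dict[int, list[str]] = {}
    for item, label in zip(items, labels):
        clusters.setdefault(int(label), []).append(item)

    result: list[list[str]] = []
    for label, members in sorted(clusters.items()):
        if label == -1:
            # Noise points: one single-item cluster per point.
            result.extend([member] for member in members)
        else:
            result.append(members)
    return result


grouped = group_by_label(["a", "b", "c", "d", "e"], [0, -1, 1, 0, -1])
print(grouped)  # [['b'], ['e'], ['a', 'd'], ['c']]
```

Because labels are processed in sorted order and -1 sorts first, the single-item noise clusters appear at the front of the result.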

Examples:

>>> embedder = DiffEmbedder()
>>> chunk_embeddings = embedder.embed_chunks(diff_chunks)
>>> # config_loader is an existing ConfigLoader; the clustering method comes from its config
>>> clusterer = DiffClusterer(config_loader, distance_threshold=0.5)
>>> clusters = clusterer.cluster(chunk_embeddings)
>>> for i, cluster in enumerate(clusters):
...     print(f"Cluster {i} has {len(cluster)} chunks")

Source code in src/codemap/git/semantic_grouping/clusterer.py
def cluster(self, chunk_embeddings: list[tuple[DiffChunk, np.ndarray]]) -> list[list[DiffChunk]]:
	"""
	Cluster chunks based on their embeddings.

	Process:
	    1. Extracts chunks and embeddings from input tuples
	    2. Computes a similarity matrix using cosine similarity
	    3. Converts similarity to distance matrix (1 - similarity)
	    4. Applies clustering algorithm based on the chosen method
	    5. Organizes chunks into clusters based on labels
	    6. Handles special cases like noise points in DBSCAN

	Args:
	    chunk_embeddings: List of (chunk, embedding) tuples where each embedding
	        is a numpy array representing the semantic vector of a code chunk

	Returns:
	    List of lists, where each inner list contains chunks in the same cluster.
	    With DBSCAN, noise points (label -1) are returned as individual single-item clusters.

	Examples:
	    >>> embedder = DiffEmbedder()
	    >>> chunk_embeddings = embedder.embed_chunks(diff_chunks)
	    >>> # config_loader is an existing ConfigLoader; the clustering method comes from its config
	    >>> clusterer = DiffClusterer(config_loader, distance_threshold=0.5)
	    >>> clusters = clusterer.cluster(chunk_embeddings)
	    >>> for i, cluster in enumerate(clusters):
	    ...     print(f"Cluster {i} has {len(cluster)} chunks")

	"""
	if not chunk_embeddings:
		return []

	# Extract chunks and embeddings
	chunks = [ce[0] for ce in chunk_embeddings]
	embeddings = np.array([ce[1] for ce in chunk_embeddings])

	# Compute similarity matrix (1 - cosine distance)
	similarity_matrix = self.cosine_similarity(embeddings)

	# Convert to distance matrix (1 - similarity)
	distance_matrix = 1 - similarity_matrix

	# Apply clustering
	if self.method == "agglomerative":
		# Default parameters if not provided
		params = {
			"n_clusters": None,
			"distance_threshold": self.config.agglomerative.distance_threshold,
			"metric": self.config.agglomerative.metric,
			"linkage": self.config.agglomerative.linkage,
		}
		params.update(self.kwargs)

		clustering = self.AgglomerativeClustering(**params)
		labels = clustering.fit_predict(distance_matrix)

	elif self.method == "dbscan":
		# Default parameters if not provided
		params = {
			"eps": self.config.dbscan.eps,
			"min_samples": self.config.dbscan.min_samples,
			"metric": self.config.dbscan.metric,
		}
		params.update(self.kwargs)

		clustering = self.DBSCAN(**params)
		labels = clustering.fit_predict(distance_matrix)

	else:
		msg = f"Unsupported clustering method: {self.method}"
		raise ValueError(msg)

	# Group chunks by cluster label
	clusters: dict[int, list[DiffChunk]] = {}
	for i, label in enumerate(labels):
		# Convert numpy integer to Python int
		label_key = int(label)
		if label_key not in clusters:
			clusters[label_key] = []
		clusters[label_key].append(chunks[i])

	# Convert to list of lists and handle noise points (-1 label in DBSCAN)
	result: list[list[DiffChunk]] = []
	for label, cluster_chunks in sorted(clusters.items()):
		if label != -1:  # Regular cluster
			result.append(cluster_chunks)
		else:  # Noise points - each forms its own cluster
			result.extend([[chunk] for chunk in cluster_chunks])

	return result