openclean.cluster.index module

Index structure for value clusters.

class openclean.cluster.index.ClusterIndex

Bases: object

Index structure to maintain a set of clusters. Implements a prefix tree.

add(cluster: openclean.cluster.base.Cluster) bool

Add the given cluster to the index. Returns True if the cluster was added as a new cluster (i.e., it did not exist in the index before) and False otherwise.

Parameters

cluster (openclean.cluster.base.Cluster) – Cluster of data values.

Return type

bool
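The return value of add can be illustrated with a minimal sketch. It is not the openclean implementation: SketchClusterIndex and the _END sentinel are hypothetical names, a collections.Counter stands in for openclean.cluster.base.Cluster, and sorting the (value, count) pairs to obtain a canonical path through the prefix tree is an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple

_END = object()  # hypothetical sentinel marking the end of a stored cluster


class SketchClusterIndex:
    """Hypothetical stand-in for ClusterIndex, built on nested dictionaries."""

    def __init__(self) -> None:
        self.root: Dict = {}

    def add(self, cluster: Counter) -> bool:
        # Assumption: sort the (value, count) pairs so that equal clusters
        # always follow the same path through the prefix tree.
        path: List[Tuple[str, int]] = sorted(cluster.items())
        node = self.root
        for entry in path:
            node = node.setdefault(entry, {})
        # The cluster is new only if no earlier cluster ended at this node.
        if _END in node:
            return False
        node[_END] = True
        return True


index = SketchClusterIndex()
first = index.add(Counter({"NYC": 3, "New York": 5}))
duplicate = index.add(Counter({"New York": 5, "NYC": 3}))
prefix = index.add(Counter({"NYC": 3}))
print(first, duplicate, prefix)  # True False True
```

In this sketch a cluster that is only a prefix of a stored cluster still counts as new, because membership is decided by the end marker and not by the existence of the path.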

class openclean.cluster.index.Node(key: str, count: int)

Bases: object

Node in the cluster index.

add(values: List[Tuple[str, int]], pos: int) bool

Add the values in the given list starting from pos to the children of this node.

Returns True if the cluster was added to the index as a new cluster.

Parameters
  • values (list of tuples of string and count) – List of values and their frequencies in the cluster that is being added to the cluster index.

  • pos (int) – Index position in the list of the value that is added as a child of this node.
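The recursion over values and pos might look like the following sketch. It only assumes the documented constructor arguments (key, count) and the add(values, pos) signature; the class name SketchNode, the children dictionary, and the is_cluster_end flag are hypothetical and may differ from the actual Node class.

```python
from typing import Dict, List, Tuple


class SketchNode:
    """Hypothetical node: one (key, count) pair plus its child nodes."""

    def __init__(self, key: str, count: int) -> None:
        self.key = key
        self.count = count
        # Assumed internals, not part of the documented API.
        self.children: Dict[Tuple[str, int], "SketchNode"] = {}
        self.is_cluster_end = False

    def add(self, values: List[Tuple[str, int]], pos: int) -> bool:
        if pos == len(values):
            # All values are consumed. The cluster is new if this node
            # was not marked as the end of a cluster before.
            is_new = not self.is_cluster_end
            self.is_cluster_end = True
            return is_new
        key, count = values[pos]
        child = self.children.get((key, count))
        if child is None:
            # No matching child yet: create one for the value at pos.
            child = SketchNode(key, count)
            self.children[(key, count)] = child
        # Continue with the next value in the list.
        return child.add(values, pos + 1)


root = SketchNode(key="", count=0)
values = [("NYC", 3), ("New York", 5)]
added_first = root.add(values, 0)
added_again = root.add(values, 0)
added_suffix = root.add(values, 1)
```

Here the second call walks the existing path and returns False, while starting at pos 1 inserts only the remaining value as a separate, new cluster.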