FINDING THE REPRESENTATIVE IN A CLUSTER USING CORRELATION CLUSTERING

Correlation clustering is a widely used technique in data mining. The clusters contain objects that are typically similar to each other and different from the objects in other groups. An interesting task is to find, for each group, the member that is most similar to the others. These objects can be called representatives. In this paper, a possible way to find these representatives is shown, and software to test the method is also provided.


Correlation clustering
Clustering is a widely used tool of unsupervised learning. Its task is to group objects so that the objects in one group (cluster) are similar and the objects in different groups are dissimilar. Such a grouping defines an equivalence relation. The similarity is usually based on the distance between the objects. However, sometimes only categorical data are given, for which distance is meaningless. For example: what is the distance between a cat and a dog? In this case, a tolerance relation is needed. Two objects are treated as similar if this relation holds for them, and as dissimilar if it does not. Naturally, this relation is reflexive because every object is similar to itself. It is also symmetric: if A is similar to B, then B is similar to A. Transitivity, however, does not necessarily hold. A human and a mouse, for example, are similar due to their inner structure; this is the reason why mice are used in drug testing. An object that is similar to a human in some other respect, however, need not be similar to a mouse.
Let V be a set of objects and $T \subset V \times V$ the tolerance relation. The result of correlation clustering is a partition. This partition can be defined as a function $p: V \to \mathbb{N}$ that assigns to each object an integer, which is its cluster identification number (ID). The objects A and B are in the same group if $p(A) = p(B)$. The following two cases are treated as conflicts for two arbitrary objects A and B:
• $(A, B) \in T$ but $p(A) \neq p(B)$ (similar objects in different clusters);
• $(A, B) \notin T$ but $p(A) = p(B)$ (dissimilar objects in the same cluster).
The cost function f is the number of these disagreements. The value of the function f is the distance between the tolerance relation T and the equivalence relation defined by the partition. Solving a correlation clustering problem is equivalent to minimizing its cost function. The partition is called perfect if the cost function value is 0. It is easy to show that for an arbitrary tolerance relation a perfect partition does not necessarily exist. Fig. 1 shows a very simple example highlighting this issue: it presents all the possible partitions of the objects, where rectangles indicate the clusters.
The thick lines denote the pairs that are counted in the cost function. In the upper row the value of the cost function is 1 (in each case), while in the two other cases it is 2 and 3, respectively.
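The cost function described above can be illustrated with a short Python sketch. The three-object relation used here (A similar to B, B similar to C, A and C dissimilar) is a hypothetical instance chosen for illustration and is not necessarily the one drawn in Fig. 1; the names `cost`, `similar` and `all_partitions` are assumptions of this sketch.

```python
from itertools import combinations


def cost(objects, similar, partition):
    """Count the pairs on which the partition disagrees with the relation.

    similar(a, b) -> bool plays the role of the tolerance relation T;
    partition maps each object to its cluster ID.
    """
    conflicts = 0
    for a, b in combinations(objects, 2):
        same_cluster = partition[a] == partition[b]
        # Conflict: similar but separated, or dissimilar but grouped together.
        if similar(a, b) != same_cluster:
            conflicts += 1
    return conflicts


# Hypothetical instance: A~B and B~C hold, A~C does not (no transitivity).
objects = ["A", "B", "C"]
similar_pairs = {frozenset("AB"), frozenset("BC")}


def similar(a, b):
    return frozenset((a, b)) in similar_pairs


# All five partitions of three objects, written as object -> cluster ID.
all_partitions = {
    "AB|C": {"A": 0, "B": 0, "C": 1},
    "A|BC": {"A": 0, "B": 1, "C": 1},
    "ABC": {"A": 0, "B": 0, "C": 0},
    "A|B|C": {"A": 0, "B": 1, "C": 2},
    "AC|B": {"A": 0, "B": 1, "C": 0},
}
costs = {name: cost(objects, similar, p) for name, p in all_partitions.items()}
print(costs)
```

For this instance no partition reaches cost 0, so no perfect partition exists: three of the five partitions have cost 1, the all-singletons partition has cost 2, and grouping A with C costs 3.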
Despite its many applications, correlation clustering has a disadvantage: it is an NP-hard problem (NP: nondeterministic polynomial time; its decision version is NP-complete), so finding the partition with the minimal cost function value is computationally very demanding. The number of possible partitions also grows extremely fast with the number of objects; it is given by the Bell number [11]. In general, even for a few dozen objects, the optimal partition cannot be determined in reasonable time. However, a quasi-optimal solution can be sufficient in practical cases. Such a solution can be obtained with search algorithms. In this paper the authors used a genetic search algorithm [12]. This is a simple, well-known algorithm [13], [14], which can provide a rather good solution. Naturally, other search algorithms can also be used.
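To give a feeling for how quickly the number of partitions grows, the Bell numbers can be computed with the Bell triangle. The following is a minimal sketch; the function name `bell_numbers` is an assumption of this example and is not part of the authors' software.

```python
def bell_numbers(n_max):
    """Return the list [B_0, ..., B_n_max] computed with the Bell triangle."""
    bells = [1]          # B_0 = 1
    row = [1]            # first row of the Bell triangle
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left to its left neighbour.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


print(bell_numbers(10))
# [1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975]
```

Ten objects can already be partitioned in 115975 different ways, which indicates why an exhaustive search becomes impractical for a few dozen objects.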

Representative
The clusters obtained from correlation clustering typically contain similar objects. In many cases, it is interesting to find the member that is most similar to the others. This object can be treated as the representative because it can represent the whole group. If a decision about a certain group of objects has to be made, then in many cases it is useful to consider only the representative member. This can decrease the resource requirements because only one object per cluster has to be considered.
Imagine that a product needs to be sold, for example a toy to a group of children. Almost every group of youngsters has at least one member whose decision has the most influence on the group's life. In this case one child needs to be found and convinced to buy the toy. The rest of the group will follow them afterwards.
Of course, if a group has more than one 'leader', then any one of these possible leaders can be chosen.
If finding the representative is to be formulated mathematically, the following is a possible way: a member is called a representative if it is similar to the most members and different from the fewest members of the same cluster. So, for each member U, four values are stored:
• α: the number of elements that are similar to U and are in the same cluster;
• β: the number of elements that are different from U and are in the same cluster;
• γ: the number of elements that are similar to U and are in different clusters;
• δ: the number of elements that are different from U and are in different clusters.
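The four counts listed above can be gathered in a single pass over the objects. The sketch below is only an illustration under stated assumptions: the relation is passed as a function returning +1 (similar), -1 (different) or 0 (neither), U itself is not counted, and the names `member_counts`, `relation` and `partition` are hypothetical.

```python
def member_counts(u, objects, relation, partition):
    """Return (alpha, beta, gamma, delta) for member u.

    relation(a, b) returns +1 if a and b are similar, -1 if they are
    different and 0 if neither holds (assumption of this sketch);
    partition maps each object to its cluster ID.
    """
    alpha = beta = gamma = delta = 0
    for x in objects:
        if x == u:
            continue  # assumption: a member is not compared with itself
        t = relation(u, x)
        same_cluster = partition[x] == partition[u]
        if t > 0 and same_cluster:
            alpha += 1
        elif t < 0 and same_cluster:
            beta += 1
        elif t > 0:
            gamma += 1
        elif t < 0:
            delta += 1
    return alpha, beta, gamma, delta


# Hypothetical data: five objects, two clusters, every pair not listed
# as similar is treated as different.
objects = [1, 2, 3, 4, 5]
partition = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
similar_pairs = {frozenset((1, 2)), frozenset((1, 4))}


def relation(a, b):
    return 1 if frozenset((a, b)) in similar_pairs else -1


counts = member_counts(1, objects, relation, partition)
print(counts)  # (1, 1, 1, 1)
```

Object 1 is similar to 2 (same cluster) and 4 (other cluster), and different from 3 (same cluster) and 5 (other cluster), so each count is 1.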
To illustrate the idea of this paper, a proper similarity relation that can easily be visualized is needed. This relation is based on the Euclidean distance of the objects (d). Two thresholds were defined, one for similarity (S) and one for difference (D). For each pair of objects A, B, the similarity relation (T) can be given in the following way:
$$T(A, B) = \begin{cases} \text{similar}, & \text{if } d(A, B) \le S, \\ \text{different}, & \text{if } d(A, B) > D, \\ \text{neither}, & \text{otherwise.} \end{cases} \qquad (2)$$
In the accompanying figure, the smaller circle denotes the similarity threshold and the larger one the difference threshold.
There are two possible ways of defining the representative of a cluster c:
• only the cluster c is considered (first method);
• every cluster is considered (second method).
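The threshold-based relation of this section can be sketched in a few lines of Python. This is an illustration only: the treatment of distances exactly equal to a threshold, the encoding +1 / -1 / 0 and the names `make_relation` and `points` are assumptions of this sketch. The threshold values 50 and 90 are the ones used later in the paper.

```python
import math


def make_relation(points, s, d):
    """Build a relation from point coordinates and two thresholds.

    points maps an object name to its coordinates; s is the similarity
    threshold and d the difference threshold, with s <= d.
    """
    if s > d:
        raise ValueError("the similarity threshold must not exceed the difference threshold")

    def relation(a, b):
        dist = math.dist(points[a], points[b])  # Euclidean distance
        if dist <= s:
            return 1    # similar
        if dist > d:
            return -1   # different
        return 0        # between the two thresholds: neither

    return relation


# Hypothetical points on a line, measured from A.
points = {"A": (0, 0), "B": (30, 0), "C": (70, 0), "E": (0, 100)}
relation = make_relation(points, s=50, d=90)
print(relation("A", "B"), relation("A", "C"), relation("A", "E"))  # 1 0 -1
```

Pairs whose distance falls between the two thresholds contribute to neither the similar nor the different counts.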
In the first method, a member is considered the representative if its $r_1$ value, a fraction computed from the member's counts within cluster c, is maximal. In the second method, a member is considered the representative if its $r_2$ value, a fraction that also takes the other clusters into account, is maximal. If two objects have the same $r_2$ value, then the δ value decides. Of course, this is only one possible way to define the representative members. Other similar methods can also be used.
The first method can be used when the members of the other groups do not matter. For example, let us assume that the objects are patients. Here, the similarity is based on having some common symptoms. If the patient who is most similar to the others needs to be found, then the patients from the other groups are irrelevant. For instance, if the task is to find a new possible way to cure a certain disease that a group of patients has, then it can be useful to test it on the representative patient first. In this case, the other patients are not relevant because they have different symptoms.
The second method can be used when the members of the other groups matter. Let us assume that the objects are members of political parties. The similarity here can be based on political views. Two politicians can be treated as similar if they share the same ideas and different if they have different opinions. The leader of a party is expected to be similar to the others in the same party but different from the members of the other parties.
Another good, if somewhat extreme, example is when the objects are members of organized crime families. Two gangsters are similar if they like each other and different if they do not. The boss of a family should be liked within the family but disliked in the other families.
Fig. 3 and Fig. 4 show the difference between the two methods. In Fig. 3 the first method was used. Member A is the representative of cluster 1 because there are seven objects that are similar to A and none that are different from A (α = 7, β = 0, γ = 2, δ = 2, v = 2, w = 2, u = 1). So its $r_1$ value is maximal.
In Fig. 4 the second method was used. Here, member F is the representative of cluster 1. Its $r_1$ value is less than that of member A because it has only 6 similar objects. However, its $r_2$ value is higher than that of member A because no objects in different clusters are similar to F, while member A has 2 objects (I, J) that are similar to it and are in cluster 2 (α = 6, β = 0, γ = 0, δ = 2, v = 2, w = 2, u = 1).

The developed software
The authors of this article wrote software that helps visualize the method. The software can be downloaded from [15]. The graphical user interface of the software can be seen in Fig. 5. First, the user gives the number of points, and then the points are generated in a two-dimensional interval, which is also given by the user. (These options can be set on the left panel of the user interface.) The tolerance relation is based on the Euclidean distance of the objects, as described in the previous section at (2). After generating the input points, the software finds a quasi-optimal partition using a genetic search algorithm. The pseudo-code of the algorithm used in the software can be seen in Code 1.
Code 1. The pseudo-code of the algorithm
Fig. 6 to Fig. 8 show the output of the software for 50 points. The similarity threshold was set to 50 and the difference threshold to 90. The weights v and w were set to 2, and u was set to 1.
Fig. 6 shows the clusters generated by the genetic algorithm. Fig. 7 presents the output of the first method. The representative is denoted by the plus sign. In almost every cluster the representative lies close to the centroid. This was expected because the similarity was based on the Euclidean distance of the objects. Fig. 8 presents the output of the second method. The most important difference can be observed in the cluster denoted by the star sign. In the second method, the representative is near the edge of the figure, so it is the object of the cluster that is farthest from (most different from) the other clusters.
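The search step mentioned in this section can be illustrated with a small genetic search over cluster labelings. The following Python sketch is not the authors' Code 1 and makes no claim about how their software works; the operators (truncation selection, uniform crossover, random relabeling, elitism), the parameter values and all names are assumptions chosen for brevity.

```python
import random
from itertools import combinations


def cost(labels, relation):
    """Number of pairs on which the labeling conflicts with the relation."""
    conflicts = 0
    for i, j in combinations(range(len(labels)), 2):
        t = relation(i, j)  # +1 similar, -1 different, 0 neither
        same = labels[i] == labels[j]
        if (t > 0 and not same) or (t < 0 and same):
            conflicts += 1
    return conflicts


def genetic_search(n, relation, pop_size=40, generations=200,
                   mutation_rate=0.05, seed=0):
    """Return (labels, cost) of the best labeling found for n objects."""
    rng = random.Random(seed)
    # Start from the two trivial partitions plus random labelings.
    population = [list(range(n)), [0] * n]
    population += [[rng.randrange(n) for _ in range(n)]
                   for _ in range(pop_size - 2)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: cost(p, relation))
        parents = ranked[:pop_size // 2]   # truncation selection
        children = [ranked[0][:]]          # elitism: keep the current best
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Uniform crossover: each label comes from one of the parents.
            child = [rng.choice(genes) for genes in zip(mother, father)]
            for k in range(n):
                if rng.random() < mutation_rate:
                    child[k] = rng.randrange(n)  # mutation: random relabeling
            children.append(child)
        population = children
    best = min(population, key=lambda p: cost(p, relation))
    return best, cost(best, relation)


# Hypothetical one-dimensional data forming two well-separated groups.
positions = [0, 1, 2, 10, 11, 12]


def relation(i, j):
    gap = abs(positions[i] - positions[j])
    return 1 if gap <= 3 else (-1 if gap > 6 else 0)


labels, found_cost = genetic_search(len(positions), relation)
print(labels, found_cost)
```

Because the best labeling of each generation is carried over unchanged, the cost of the returned partition can never be worse than that of the two trivial partitions in the initial population. Uniform crossover on raw labels is a deliberately naive choice, since the same cluster may carry different labels in the two parents.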

Conclusion
Correlation clustering is a very effective method in many fields. In this paper the authors showed a possible way to select, for each cluster, one object that is the most similar to the other objects in the same group. This member is called the representative because it represents the others. The authors provided software that visualizes the method. This program uses only random two-dimensional points. It would be interesting to test the method on real-life data. In the paper, a genetic search algorithm was used. In the future, other algorithms could be implemented. It could be worth checking how the choice of algorithm affects the representatives.