Software

MeSoOnTV: Tracking and Analyzing TV Content on the Web through Social and Ontological Knowledge

Developers: C. Schifanella, L. Vignaroli, R. Teraoni Prioletti, A. Antonini, R.G. Pensa, M.L. Sapino

Official MeSoOnTV web page.



AID-ISA: Additional Information-Driven Bi-clustering of Gene Expression Data

Developers: A. Visconti, F. Cordero, R.G. Pensa

Official AID-ISA web page.



iHiCC: Incremental Flat and Hierarchical Co-Clustering

Developers: D. Ienco, R.G. Pensa, R. Meo

As data accumulate over time, datasets can become intractably large. An incremental solution mitigates this issue because it partitions the problem. We propose an incremental version of our hierarchical co-clustering algorithm: it starts from an intermediate solution computed on the previous version of the data and updates the co-clustering results considering only the newly added block of data. This speeds up the computation with respect to the original approach, which would recompute the result on the whole dataset. In addition, the incremental algorithm returns approximately the same answer as the original version, at a fraction of the computational cost.
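The incremental idea can be illustrated with a minimal sketch. This is not the authors' hierarchical algorithm: it assumes a simplified centroid-based view of the row clusters (the function name `incremental_update` and all variables are hypothetical), and only shows how newly arrived rows can be absorbed into an existing solution instead of re-clustering the whole accumulated matrix:

```python
import numpy as np

def incremental_update(centroids, counts, new_rows):
    """Illustrative incremental step: assign each newly arrived row to
    the nearest existing row cluster and refresh that cluster's
    centroid with a running mean, instead of re-clustering the whole
    accumulated matrix from scratch."""
    centroids = np.asarray(centroids, dtype=float).copy()
    counts = np.asarray(counts, dtype=float).copy()
    labels = []
    for row in np.asarray(new_rows, dtype=float):
        # nearest existing cluster by Euclidean distance
        k = int(np.argmin(np.linalg.norm(centroids - row, axis=1)))
        counts[k] += 1
        centroids[k] += (row - centroids[k]) / counts[k]  # running mean
        labels.append(k)
    return labels, centroids, counts

labels, cents, counts = incremental_update(
    [[0.0, 0.0], [10.0, 10.0]], [1, 1], [[0.0, 1.0], [9.0, 10.0]])
print(labels)  # [0, 1]
```

Only the touched centroids are refreshed, which is the source of the speed-up over recomputing on the full dataset.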

Reference:

R.G. Pensa, D. Ienco, R. Meo. Hierarchical Co-Clustering: Off-line and Incremental Approaches. Data Min. Knowl. Discov. Springer. 2012. Published Online.

Availability:

Upon request at:

CoStar: Parameter-less Co-Clustering for Star-structured Heterogeneous Data

Developers: D. Ienco, C. Robardet, R.G. Pensa, R. Meo

Data represented with multiple features coming from heterogeneous domains are increasingly common in real-world applications. Such data represent objects of a certain type, connected to other types of data (the features), so that the overall data schema forms a star structure of inter-relationships. Co-clustering these data usually involves the specification of many parameters, such as the number of clusters for the object dimension and for each feature domain. We present a novel co-clustering algorithm for heterogeneous star-structured data that is parameter-less: it requires neither the number of row clusters nor the number of column clusters for the given feature spaces. Our approach optimizes Goodman-Kruskal's tau, a measure of cross-association in contingency tables that evaluates the strength of the relationship between two categorical variables. We extend tau to evaluate co-clustering solutions and, in particular, apply it in a higher-dimensional setting. We propose the algorithm CoStar, which optimizes tau by a local search approach.
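As an illustration of the objective, Goodman-Kruskal's tau for a plain two-variable contingency table can be computed as below. This is only the textbook measure (the function name is hypothetical); CoStar's extension to co-clustering solutions in a higher-dimensional setting is more involved:

```python
import numpy as np

def goodman_kruskal_tau(table):
    """Goodman-Kruskal's tau: proportional reduction in the error of
    predicting the column variable once the row variable is known."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                        # joint probabilities
    row = p.sum(axis=1)                    # row marginals
    col = p.sum(axis=0)                    # column marginals
    err_col = 1.0 - np.sum(col ** 2)       # prediction error, rows unknown
    nz = row > 0                           # skip empty rows
    err_given_row = 1.0 - np.sum((p[nz] ** 2) / row[nz, None])
    return (err_col - err_given_row) / err_col

# Perfect association (the row determines the column): tau = 1
print(goodman_kruskal_tau([[10, 0], [0, 10]]))  # 1.0
# Statistical independence (the row tells nothing): tau = 0
print(goodman_kruskal_tau([[5, 5], [5, 5]]))    # 0.0
```

Tau is asymmetric: it measures how well the row variable predicts the column variable, which is why it suits the evaluation of a row partition against a column partition.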

Reference:

D. Ienco, C. Robardet, R.G. Pensa, R. Meo. Parameter-Less Co-Clustering for Star-Structured Heterogeneous Data. Data Min. Knowl. Discov. Vol. 26(2) 2013. pp 217-254. Springer.

Availability:

Upon request at:

HiCC: Hierarchical Co-Clustering

Developers: D. Ienco, R.G. Pensa, R. Meo

Clustering high-dimensional data is challenging: classic metrics fail to identify real similarities between objects, and the huge number of features makes cluster interpretation hard. To tackle these problems, several co-clustering approaches have been proposed that compute a partition of the objects and a partition of the features simultaneously. Unfortunately, these approaches identify only a predefined number of flat co-clusters. It is more useful to arrange the clusters in a hierarchical fashion, because the hierarchy provides insights into the clusters. We propose a novel hierarchical co-clustering algorithm that builds two coupled hierarchies, one on the objects and one on the features, thus providing insights on both. Our approach does not require a pre-specified number of clusters and produces compact hierarchies because it performs n-ary splits, where n is determined automatically.

Reference:

D. Ienco, R.G. Pensa, R. Meo. Parameter-free Hierarchical Co-Clustering by n-Ary Splits. In Proceedings of the 20th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases ECML PKDD 2009. September 7-11, 2009, Bled, Slovenia. LNCS 5781, pp 580-595. © Springer.

Availability:

Upon request at:

DILCA: Distance Learning for Categorical Attributes

Developers: D. Ienco, R.G. Pensa, R. Meo

Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of a categorical attribute, since the values are not ordered. DILCA allows one to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute Ai can be determined by the way in which the values of the other attributes Aj are distributed among the dataset objects: if they are similarly distributed in the groups of objects corresponding to the distinct values of Ai, a small distance is obtained. We also propose a solution to the critical issue of choosing the context attributes Aj. Our framework can be embedded, for instance, in a hierarchical clustering algorithm.
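A minimal sketch of the context-based intuition follows. It uses a single context attribute and a plain Euclidean distance between conditional distributions; the full DILCA framework additionally selects the context attributes Aj and combines several of them (the function name `context_distance` and the toy records are hypothetical):

```python
import numpy as np

def context_distance(data, target, context):
    """Distance between the values of categorical attribute `target`,
    derived from how the values of one context attribute `context`
    are distributed conditionally on each target value. Returns a
    dict mapping ordered value pairs to the Euclidean distance
    between the conditional distributions P(context | target)."""
    t_vals = sorted({r[target] for r in data})
    c_vals = sorted({r[context] for r in data})
    cond = np.zeros((len(t_vals), len(c_vals)))
    for r in data:
        cond[t_vals.index(r[target]), c_vals.index(r[context])] += 1
    cond /= cond.sum(axis=1, keepdims=True)   # P(context | target value)
    return {(t_vals[i], t_vals[j]): float(np.linalg.norm(cond[i] - cond[j]))
            for i in range(len(t_vals)) for j in range(i + 1, len(t_vals))}

# "red" and "green" co-occur with the same shapes, so their distance is 0;
# "blue" always occurs with a different shape, so it is maximally distant.
data = ([{"color": "red", "shape": "round"}] * 4
        + [{"color": "blue", "shape": "square"}] * 4
        + [{"color": "green", "shape": "round"}] * 4)
print(context_distance(data, "color", "shape"))
```

Two values end up close exactly when the rest of the dataset "looks the same" around them, which is the intuition stated above.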

References:

D. Ienco, R.G. Pensa, R. Meo. From Context to Distance: Learning Dissimilarity for Categorical Data Clustering. ACM Trans. on Knowledge Discovery from Data. Vol. 6(1) 2012. pp 1:1-1:25. ACM Press.

D. Ienco, R.G. Pensa, R. Meo. Context-based Distance Learning for Categorical Data Clustering. In Proceedings of the 8th International Symposium on Intelligent Data Analysis IDA 2009. August 31 - September 2, 2009, Lyon, France. LNCS 5772, pp 83-94. © Springer.

Availability:

Upon request at:
Within Weka 3 at: Weka svn repository

CoCoClust: Constrained Co-clustering via Sum-Squared Residue Minimization

Developers: R.G. Pensa, J-F. Boulicaut, F. Cordero, M. Atzori, D. Ienco

In the generic setting of objects × attributes matrix data analysis, co-clustering is an interesting unsupervised data mining method. A co-clustering task provides a bi-partition made of co-clusters: each co-cluster is a group of objects associated with a group of attributes, and these associations can support expert interpretations. Many constrained clustering algorithms have been proposed to exploit domain knowledge and to improve partition relevance in the one-dimensional clustering case (e.g., using must-link and cannot-link constraints on one of the two dimensions). Here, we consider constrained co-clustering not only with extended must-link and cannot-link constraints (i.e., both objects and attributes can be involved), but also with interval constraints that enforce properties of co-clusters over ordered domains. We provide an iterative co-clustering algorithm that exploits user-defined constraints while minimizing two different residues: Hartigan's and Cheng-Church's.
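For illustration, the two residues named above can be computed on a single co-cluster (submatrix) as follows. This sketch shows only the objective values, not the constrained optimization loop itself (function names are hypothetical):

```python
import numpy as np

def hartigan_ssr(block):
    """Hartigan's sum-squared residue: squared deviations of the
    co-cluster entries from the co-cluster mean."""
    a = np.asarray(block, dtype=float)
    return float(np.sum((a - a.mean()) ** 2))

def cheng_church_ssr(block):
    """Cheng-Church sum-squared residue: each entry minus its row mean
    and its column mean, plus the co-cluster mean. Zero means the rows
    differ only by additive shifts (a coherent co-cluster)."""
    a = np.asarray(block, dtype=float)
    res = (a - a.mean(axis=1, keepdims=True)
             - a.mean(axis=0, keepdims=True) + a.mean())
    return float(np.sum(res ** 2))

# An additively coherent block: Cheng-Church residue is 0, while
# Hartigan's is not (entries still deviate from the block mean).
block = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(cheng_church_ssr(block))  # 0.0
print(hartigan_ssr(block))      # 17.5
```

The two residues capture different notions of co-cluster homogeneity: Hartigan's favors constant blocks, Cheng-Church's also accepts blocks whose rows (or columns) are shifted copies of each other.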

References:

R.G. Pensa, J-F. Boulicaut, F. Cordero, M. Atzori. Co-clustering Numerical Data under User-defined Constraints. Statistical Analysis and Data Mining, Vol. 3(1) 2010. pp 38-55. © Wiley-Blackwell.

R.G. Pensa, J-F. Boulicaut. Constrained Co-clustering of Gene Expression Data. Proceedings of the 2008 SIAM International Conference on Data Mining SDM'08, April 24-26, 2008, Atlanta, GA, USA, pp. 25-36.

Availability:

Upon request at:

GOClust: Gene Ontology driven Co-clustering of Gene Expression Data

Developers: A. Visconti, D. Ienco, F. Cordero, R.G. Pensa

Official GOClust web page.