Software

SeqBox: RNAseq/ChIPseq reproducible analysis on a consumer game computer.

Developers: Marco Beccuti, Raffaele A. Calogero and Francesca Cordero.

SeqBox was developed to facilitate the use of computationally demanding applications in NGS data analysis. SeqBox embeds demanding computing tasks (e.g. short-read mapping) in isolated docker containers. This approach provides multiple advantages: (i) users do not need to install all the software on their local server; (ii) results generated by different containers can be organized in pipelines; (iii) reproducible research is guaranteed by the possibility of sharing the docker images used for the analysis. SeqBox comes with the R engine, docker4seq, its graphical interface, 4SeqGUI, and all docker images installed on the NUC6I7KYK, an Intel mini-computer equipped with 32GB RAM and a 250GB/500GB internal SSD.

SeqBox.

HashClone: a new tool to quantify the minimal residual disease in B-cell lymphoma from deep sequencing data.

Developers: Marco Beccuti, Francesca Cordero and Greta Romano.

The HashClone strategy is composed of three steps: the first and second steps implement an alignment-free prediction method that identifies a set of putative clones belonging to the repertoire of the patient under study. In the third step, the IGH variable region, diversity region, and joining region are identified by aligning the rearrangements against the international ImMunoGenetics information system database. Moreover, a graphical user interface for HashClone execution and for visualizing clonality over time facilitates use of the tool and interpretation of the results.

HashClone.

PGS: Peculiar Genes Selection.

Developers: Federica Martina, Marco Beccuti, Gianfranco Balbo, Francesca Cordero.

We present a new feature selection method based on three steps to detect class-specific biomarkers in high-dimensional data sets. The first step detects the differentially expressed genes according to the experimental conditions tested in the experimental design, the second step filters out the features with low discriminative power, and the third step detects the class-specific features and defines the final biomarker as the union of the class-specific features. Using the proposed feature selection procedure, the classification performance of a Support Vector Machine on the imbalanced data set reaches 82%, whereas other methods do not exceed 73%. The Gene Ontology enrichments performed on the signatures selected with the proposed pipeline confirm the biological relevance of our methodology. The package PGS is available for R users.
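The three steps described above can be illustrated with a minimal Python sketch. It is not the PGS implementation: the statistics used in each step, the function name `select_signature`, and the thresholds `de_min`, `score_min` and `margin` are all assumptions chosen only to make the flow of the procedure concrete.

```python
from statistics import mean, pstdev


def select_signature(expr, labels, de_min=1.0, score_min=2.0, margin=1.0):
    """Toy three-step selection; every criterion here is a stand-in (assumption).

    expr:   dict mapping gene name -> list of expression values, one per sample
    labels: list of class labels, one per sample
    """
    classes = sorted(set(labels))
    per_class = {c: set() for c in classes}

    for gene, values in expr.items():
        by_class = {c: [v for v, lab in zip(values, labels) if lab == c]
                    for c in classes}
        means = {c: mean(vs) for c, vs in by_class.items()}

        # Step 1 (stand-in for a differential expression test):
        # keep genes whose class means differ by at least de_min.
        spread = max(means.values()) - min(means.values())
        if spread < de_min:
            continue

        # Step 2 (stand-in for a discriminative-power filter):
        # drop genes whose between-class spread is small relative to
        # the largest within-class standard deviation.
        within = max(pstdev(vs) for vs in by_class.values())
        if within > 0 and spread / within < score_min:
            continue

        # Step 3: a gene is specific to class c if its mean in c is above
        # (or below) the mean of every other class by at least `margin`.
        for c in classes:
            others = [means[o] for o in classes if o != c]
            up = all(means[c] - m >= margin for m in others)
            down = all(m - means[c] >= margin for m in others)
            if up or down:
                per_class[c].add(gene)

    # The final signature is the union of the class-specific gene sets.
    signature = set().union(*per_class.values())
    return per_class, signature


labels = ["A", "A", "B", "B", "C", "C"]
expr = {
    "g1": [5.0, 5.2, 1.0, 1.1, 0.9, 1.0],  # high only in class A
    "g2": [1.0, 1.1, 1.0, 0.9, 1.0, 1.1],  # flat across classes
    "g3": [0.0, 4.0, 3.0, 7.0, 1.0, 5.0],  # differs, but very noisy
    "g4": [2.0, 2.1, 2.0, 1.9, 6.0, 6.2],  # high only in class C
}
per_class, signature = select_signature(expr, labels)
print(per_class)
print(sorted(signature))
```

In this toy data set, `g2` is removed in the first step, `g3` in the second, and the remaining genes are assigned to the single class in which they stand out.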

PGS tool and datasets.

Chimera: a Bioconductor package for secondary analysis of fusion products.

Developers: Raffaele A. Calogero, Matteo Carrara, Marco Beccuti, Francesca Cordero.

Chimera is a Bioconductor package that organizes, annotates, analyses and validates fusions reported by different fusion detection tools; the current implementation can deal with output from bellerophontes, chimeraScan, deFuse, fusionCatcher, FusionFinder, FusionHunter, FusionMap, mapSplice, Rsubread, tophat-fusion, and STAR. The core of Chimera is a fusion data structure that can store fusion events detected with any of the above-mentioned tools. Fusions are then easily manipulated with standard R functions or through the set of functionalities specifically developed in Chimera with the aim of supporting the user in managing fusions and discriminating false positives.

Official Chimera web page.

HashFilter: a tool for supporting read de-convolution.

Developers: Francesca Cordero, Marco Beccuti.

HashFilter is a C++ tool implementing an innovative read de-convolution algorithm based on a hash table. It was used to obtain the physical, genetic and functional sequence assembly of the barley genome in the project Advancing the Barley Genome (CRIS NUMBER: 0218967).
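The description above does not detail the algorithm, so the following Python sketch is only a generic illustration of how a hash table can support read assignment, not HashFilter's method. It assumes that de-convolution means assigning each read to the source sequence it most likely came from, and it uses k-mer voting as a stand-in technique; the names `build_kmer_index` and `assign_read`, the source names and the value of `k` are hypothetical.

```python
from collections import Counter, defaultdict


def build_kmer_index(sources, k):
    """Hash table mapping each k-mer to the set of sources that contain it."""
    index = defaultdict(set)
    for name, seq in sources.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def assign_read(read, index, k):
    """Return the source sharing the most k-mers with the read.

    Returns None when no k-mer matches or when the top two sources tie.
    """
    votes = Counter()
    for i in range(len(read) - k + 1):
        # .get avoids inserting unseen k-mers into the defaultdict
        for name in index.get(read[i:i + k], ()):
            votes[name] += 1
    if not votes:
        return None
    top = votes.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # ambiguous assignment
    return top[0][0]


# Hypothetical source sequences and reads, for illustration only.
sources = {"src1": "ACGTACGTGGA", "src2": "TTTTCCCCAAAG"}
index = build_kmer_index(sources, k=4)
assignments = [assign_read(r, index, k=4) for r in ("CGTACG", "TTCCCC", "GGGGGG")]
print(assignments)
```

Each lookup is a constant-time hash probe on average, which is the property that makes hash tables attractive for processing large numbers of reads.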

Official HashFilter web page.

CoCoClust: Constrained Co-clustering via Sum-Squared Residue Minimization

Developers: Ruggero G. Pensa, J-F. Boulicaut, Francesca Cordero, Maurizio Atzori, Dino Ienco.

In the generic setting of objects × attributes matrix data analysis, co-clustering appears as an interesting unsupervised data mining method. A co-clustering task provides a bi-partition made of co-clusters: each co-cluster is a group of objects associated with a group of attributes, and these associations can support expert interpretations. Many constrained clustering algorithms have been proposed to exploit domain knowledge and to improve partition relevancy in the mono-dimensional clustering case (e.g. using must-link and cannot-link constraints on one of the two dimensions). Here, we consider constrained co-clustering not only for extended must-link and cannot-link constraints (i.e. both objects and attributes can be involved), but also for interval constraints that enforce properties of co-clusters when considering ordered domains. We provide an iterative co-clustering algorithm which exploits user-defined constraints while minimizing two different residues: Hartigan's and Cheng-Church's.
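To make the two residues concrete, the sketch below computes the sum-squared residue of a single co-cluster under both definitions as they are commonly stated: Hartigan's residue of an entry is its deviation from the co-cluster mean, while Cheng-Church's residue also subtracts the row and column means within the co-cluster. This is only the objective evaluated on one fixed co-cluster, not the constrained iterative algorithm; the function name `sum_squared_residues` and the example matrix are illustrative assumptions.

```python
def sum_squared_residues(matrix, rows, cols):
    """Sum-squared residue of the co-cluster (rows, cols) under both definitions.

    Hartigan:      h_ij = a_ij - a_IJ
    Cheng-Church:  h_ij = a_ij - a_iJ - a_Ij + a_IJ
    where a_IJ is the co-cluster mean, a_iJ the mean of row i over the
    co-cluster columns, and a_Ij the mean of column j over the co-cluster rows.
    """
    block = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)

    block_mean = sum(sum(row) for row in block) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in block]
    col_means = [sum(row[j] for row in block) / n_rows for j in range(n_cols)]

    hartigan = sum((block[i][j] - block_mean) ** 2
                   for i in range(n_rows) for j in range(n_cols))
    cheng_church = sum((block[i][j] - row_means[i] - col_means[j] + block_mean) ** 2
                       for i in range(n_rows) for j in range(n_cols))
    return hartigan, cheng_church


# Rows 0 and 1 follow the same additive pattern (row 1 = row 0 + 1).
matrix = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [9.0, 9.0, 9.0],
]
hartigan, cheng_church = sum_squared_residues(matrix, rows=[0, 1], cols=[0, 1, 2])
print(hartigan, cheng_church)
```

The example shows how the two residues differ: a co-cluster whose rows share an additive trend has zero Cheng-Church residue, while its Hartigan residue is positive because the values are not constant.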

Upon request at: pensa@di.unito.it

GOClust: Gene Ontology driven Co-clustering of Gene Expression Data

Developers: Alessio Visconti, Dino Ienco, Francesca Cordero, Ruggero G. Pensa.

Official GOClust web page.