LEONE MICHELE
Cycle: XXXIII
Section: Computer Science and Engineering
Tutor: TANCA LETIZIA
Advisor: MASSEROLI MARCO
Major Research topic: Identification, semantic annotation and comparison of chromatin states in multiple biological conditions
Abstract:
Inside the cell, the information necessary for the development and proper functioning of most living organisms is contained in the DNA, which is usually associated with proteins. DNA sequences of about 150 base pairs are wrapped around special proteins, termed histones, to form nucleosomes, the basic building blocks of chromatin. In recent decades, researchers have started cataloguing chromatin proteins and their modifications. This has led to the identification of a number of chromatin modifications, or "marks", and to the discovery of many regulatory elements throughout the genome. Chromatin, once considered a simple scaffold for packaging DNA into each cell, is now regarded as a dynamic component of genome organization that takes part in many functions of genome regulation. Many studies have aimed to simplify chromatin complexity by dividing it into a certain number of chromatin states that capture known classes of genomic elements. In the first approaches, researchers probed where modifications occur on the genome, mainly looking for regions containing a particular modification mark, or a combination of two or three marks. More recently, researchers have taken a more systematic approach: in a few specific, well-known genomes, they identify dozens of marks across the genome, computationally find their recurring combinations and group them into states. In the near future, chromatin-state mapping is expected to reveal many key aspects of genome functions and to help in understanding the biomolecular mechanisms that regulate these functions.
Project goal
The purpose of the research project is to define an efficient computational method that integrates the valuable and numerous, but heterogeneous, data publicly available in well-known big data repositories, and processes them to make them comparable and homogeneously characterized, in order to infer a broad catalogue of chromatin states in multiple human tissues in normal and disease conditions. This is a novel approach compared to previous works that generated catalogues of chromatin states for a few cell types, using only well-categorized data specifically produced for that purpose through costly and time-consuming experiments. The final goal of the project is to comparatively evaluate the extracted information in order to identify genomic regions with common or specific chromatin states in different tissues, as well as genomic regions whose chromatin states are altered in the presence of various pathologies compared to the healthy condition of the same tissue. Such findings can provide novel insights into chromatin functions and their alterations in disease conditions.
Chromatin state annotation using combinations of chromatin modifications has emerged as a powerful approach for genome annotation and detection of regulatory activity, as well as for interpreting disease-association studies. To this end, it is necessary to identify a set of histone modifications that can accurately characterize chromatin states and cover a wide range of tissues and conditions. In light of this, we leveraged the large GEO repository, which contains a large amount of raw and processed experimental data on chromatin modifications (histone marks), uploaded by different research groups worldwide. To overcome the heterogeneity of their processing, we took advantage of their homogeneously processed version provided by the Cistrome database. Heterogeneity in the textual descriptions of their biological conditions is resolved by extracting controlled semantic terms from established biomedical ontologies, using the OnASSiS Bioconductor software package; in particular, each available data sample is semantically categorized based on the biological tissue and the pathological or healthy condition it represents. Data samples of common sets of histone marks for each semantic category are extracted from the Cistrome datasets using the GenoMetric Query Language (GMQL), a high-level, declarative query language for genomic big data developed by the Genomic Computing group at Politecnico di Milano, which also allows combining replicate data samples for each histone mark in each category.
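The selection step described above can be illustrated with a minimal Python sketch. It is not the GMQL query used in the project: the sample records, the mark names (H3K4me3, H3K27ac) and the function names are hypothetical, chosen only to show how samples may be grouped by semantic category (tissue and condition), how categories lacking a required mark can be discarded, and how replicates of the same mark stay together.

```python
from collections import defaultdict

# Hypothetical sample metadata: (sample_id, tissue, condition, histone_mark).
SAMPLES = [
    ("s1", "liver", "healthy", "H3K4me3"),
    ("s2", "liver", "healthy", "H3K27ac"),
    ("s3", "liver", "healthy", "H3K4me3"),    # replicate for the same mark as s1
    ("s4", "liver", "carcinoma", "H3K4me3"),  # category lacks H3K27ac
    ("s5", "breast", "healthy", "H3K4me3"),
    ("s6", "breast", "healthy", "H3K27ac"),
]


def group_by_category(samples):
    """Map (tissue, condition) -> mark -> list of replicate sample ids."""
    groups = defaultdict(lambda: defaultdict(list))
    for sample_id, tissue, condition, mark in samples:
        groups[(tissue, condition)][mark].append(sample_id)
    return groups


def categories_with_marks(groups, required_marks):
    """Keep only categories with at least one sample for every required mark."""
    required = set(required_marks)
    return {
        category: {mark: sorted(by_mark[mark]) for mark in sorted(required)}
        for category, by_mark in groups.items()
        if required <= set(by_mark)
    }


selected = categories_with_marks(
    group_by_category(SAMPLES), ["H3K4me3", "H3K27ac"]
)
```

In this toy input the liver/carcinoma category is dropped because it has no H3K27ac sample, while the two H3K4me3 replicates of healthy liver remain grouped so that they could be combined in a later step.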
Finally, to identify chromatin states for each semantically categorized tissue and pathological or healthy condition, we applied to these data samples of chromatin modifications a multivariate Hidden Markov Model approach that explicitly models the presence or absence of each chromatin mark at a genome position. For this purpose we used the ChromHMM tool, which can integrate multiple chromatin datasets, such as data on various histone modifications, to discover the major recurring combinatorial and spatial patterns of marks. The resulting model is then used to systematically annotate all intervals of the genome in each extracted semantic category, i.e. each tissue and pathological or healthy condition.
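The idea of a multivariate Hidden Markov Model over presence/absence of marks can be sketched in a few lines of Python. This is a toy illustration, not ChromHMM itself: all parameters are hypothetical and fixed by hand rather than learned from data, each state emits every mark as an independent Bernoulli variable, and Viterbi decoding is used here only for brevity as one possible way to assign a state to each genomic bin.

```python
import math


def log_emission(observed, probs):
    """Log-probability of a binary mark vector under independent Bernoulli emissions."""
    return sum(
        math.log(p if present else 1.0 - p)
        for present, p in zip(observed, probs)
    )


def viterbi(observations, start, trans, emit):
    """Most likely state path for a sequence of binary mark vectors."""
    n_states = len(start)
    scores = [
        math.log(start[s]) + log_emission(observations[0], emit[s])
        for s in range(n_states)
    ]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev = scores
        pointers, scores = [], []
        for s in range(n_states):
            best = max(
                range(n_states),
                key=lambda r: prev[r] + math.log(trans[r][s]),
            )
            pointers.append(best)
            scores.append(
                prev[best] + math.log(trans[best][s]) + log_emission(obs, emit[s])
            )
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy model, all numbers hypothetical: 2 states, 2 marks.
START = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.1, 0.9]]
EMIT = [
    [0.9, 0.8],    # state 0: both marks usually present
    [0.1, 0.05],   # state 1: both marks usually absent
]

# One binary vector per genomic bin (1 = mark present in that bin).
BINS = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 0)]
states = viterbi(BINS, START, TRANS, EMIT)
print(states)  # [0, 0, 0, 1, 1, 1]
```

With these toy parameters the first three bins are assigned to the "marks present" state and the last three to the "marks absent" state; the third bin, where only one mark is present, stays in state 0 because switching states is penalized by the transition probabilities.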
The developed computational method, applied to the publicly available data, generates a catalogue of chromatin states across the full genome for a large number of semantically categorized biological tissues and disease or healthy conditions. In the next phase of the project, genomic profiles of chromatin states in different conditions will be compared, and their specific variations across conditions will be extracted and biologically evaluated. Furthermore, preliminary results will be improved in several respects, such as using a finer characterization of chromatin, considering a higher number of chromatin states, or using multiple histone marks to improve the accuracy of the derived information.
References
- Kornberg RD. Chromatin structure: a repeating unit of histones and DNA. Science, 1974; 184: 868–871.
- Heintzman ND, et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature, 2009; 459: 108–112.
- Mendenhall EM, Bernstein BE. Chromatin state maps: new technologies, new insights. Curr Opin Genet Dev. 2008; 18(2): 109–115.
- Ernst J, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 2011; 473: 43–49.
- Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. NCBI GEO: archive for functional genomics data sets - update. Nucleic Acids Res. 2013; 41(Database issue): D991-D995.
- Mei S, Qin Q, Wu Q, Sun H, Zheng R, Zang C, Zhu M, Wu J, Shi X, Taing L, Liu T, Brown M, Meyer CA, Liu XS. Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse. Nucleic Acids Res. 2017; 45(D1): D658-D662.
- Masseroli M, Pinoli P, Venco F, Kaitoua A, Jalili V, Palluzzi F, Muller H, Ceri S. GenoMetric Query Language: A novel approach to large-scale genomic data management. Bioinformatics 2015; 31(12): 1881-1888.
- Ernst J, Kellis M. ChromHMM: automating chromatin state discovery and characterization. Nat Methods 2012; 9: 215-216.