LEONE MICHELE | Cycle: XXXIII
Section: Computer Science and Engineering
Tutor: TANCA LETIZIA
Advisor: MASSEROLI MARCO
Major Research topic: Identification, semantic annotation and comparison of chromatin states in multiple biological conditions
Abstract:
The information necessary for the development and proper functioning of most living organisms is contained in DNA, which is usually associated with proteins. DNA sequences of about 150 base pairs are wrapped around special proteins, termed histones, to form nucleosomes, the basic building blocks of chromatin (van Steensel, 2011). Histone modifications alter chromatin condensation and thereby the exposure of DNA to binding by transcription factors (TFs) and many other proteins, leading to changes in gene expression. In the last decades, researchers have started cataloguing chromatin proteins and their modifications. This has led to the identification of several chromatin modifications, or “marks”, and to the discovery of many regulatory elements throughout the genome (Baker, 2011). Many studies have aimed to simplify chromatin complexity by dividing it into a limited number of chromatin states that capture known classes of genomic elements. These states can therefore regulate transcription in each cell type under specific conditions and are highly correlated with a multi-level set of functional genomic elements. They usually include known classes of genomic features, such as promoters, enhancers, and transcribed, repressed, and repetitive regions (de Pretis and Pelizzola, 2014). In the first approaches to discovering these states, researchers probed where modifications occur on the genome, mainly looking for regions in which a specific mark, or a combination of a few marks, was present at higher frequency (Heintzman et al., 2007). More recently, researchers have begun to take a more systematic approach: identifying multiple marks in a specific region or throughout the genome, computationally finding where their combinations occur, and grouping these combinations into states.
The research project aims to extend the concept of chromatin states by considering histone modifications, transcription factors and all the different types of genomic features (e.g., CpG islands, partially methylated domains, transposable elements). The goal is a framework that, starting from a set of functional elements, identifies the corresponding samples available in the most important web resources, integrates information about tissue type and possible pathological conditions by extracting controlled semantic terms from biomedical ontologies, and finds combinations of these genomic features. Once chromatin states have been identified, the method supports a data-driven analysis: bi-clustering of regions and samples, identification of genome clusters, and gene-set enrichment analysis to associate these clusters with gene functional categories. An alternative is a semantically driven analysis, which first identifies the samples that carry information on a pathological condition and on the corresponding healthy condition, and then applies differential HMM analysis and gene-set enrichment analysis to identify the associated gene functional categories.
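As an illustration of the gene-set enrichment step mentioned above, the following minimal sketch computes a one-sided hypergeometric p-value for the overlap between the genes of one cluster and one functional category. The choice of the hypergeometric test, the function names and the toy gene identifiers are assumptions made here for illustration; the project description does not state which enrichment statistic is used.

```python
from math import comb


def hypergeom_tail(n_background, n_in_category, n_in_cluster, n_overlap):
    """Return P(X >= n_overlap) for a hypergeometric variable X.

    X counts how many of the n_in_cluster genes fall in the category when
    they are drawn without replacement from n_background genes, of which
    n_in_category belong to the category.
    """
    max_overlap = min(n_in_category, n_in_cluster)
    total = comb(n_background, n_in_cluster)
    tail = sum(
        comb(n_in_category, k)
        * comb(n_background - n_in_category, n_in_cluster - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def enrichment_pvalue(cluster_genes, category_genes, background_genes):
    """One-sided enrichment p-value of a category within a cluster."""
    background = set(background_genes)
    # Genes outside the background are ignored in both sets.
    cluster = set(cluster_genes) & background
    category = set(category_genes) & background
    overlap = len(cluster & category)
    return hypergeom_tail(len(background), len(category), len(cluster), overlap)


# Toy data (hypothetical gene identifiers, not taken from the project).
background = [f"g{i}" for i in range(1, 21)]   # 20 annotated genes
category = {"g1", "g2", "g3", "g4", "g5"}      # one functional category
cluster = {"g1", "g2", "g3", "g10"}            # genes near one region cluster

p_value = enrichment_pvalue(cluster, category, background)
print(f"enrichment p-value: {p_value:.4f}")
```

When many functional categories are tested against many clusters, the resulting p-values would normally be corrected for multiple testing before interpretation.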
Methods:
To identify a set of histone modifications that can accurately characterize chromatin states and cover a wide range of tissues and conditions, we leveraged the large GEO repository, which contains a large amount of raw and processed data for each experiment, uploaded by research groups worldwide. Large-scale analysis is complicated by heterogeneity in data processing across studies and, most importantly, in the metadata describing each experiment. When submitting data to the GEO repository, scientists enter experiment descriptions in a spreadsheet where they can provide unstructured information and create arbitrary fields that need not adhere to any predefined dictionary. The validity of the metadata is not checked at any point during the upload process; thus, the metadata associated with gene expression data usually do not match standard class/relation identifiers from specialized biomedical ontologies. The resulting free-text experiment descriptions suffer from redundancy, inconsistency, and incompleteness (Zaveri et al., 2019). To overcome the heterogeneity in data processing, our first idea was to use the homogeneously processed version of the data provided by the Cistrome database (Mei et al., 2017), extracting all significant information from the available metadata documents with the OnASSiS Bioconductor software package, a Natural Language Processing tool based on Named Entity Recognition (NER). However, this procedure proved time-consuming and prone to producing a significant number of false positives during metadata acquisition.
Consequently, we provide a novel formulation of the metadata integration problem as a machine translation (MT) problem, which has a number of benefits over both a NER-based approach (since there is no requirement for annotating input training sequences) and a multi-label classification-based approach (since the same model architecture can be used regardless of the target attributes to be extracted). We provide experimental evidence demonstrating the effectiveness of transformer-based translation models over simpler attention-based seq2seq models and over the classification-based approach using a similar transformer architecture. Experiments are performed in both homogeneous and heterogeneous training/testing environments, indicating the ability of the seq2seq model to impute values often unobserved in the input, and the efficacy of the approach for real data integration applications.
All the possible combinations of genomic features were then grouped into “states” using hidden Markov models (HMMs). Data samples of common sets of histone marks for each semantic category are extracted from the Cistrome datasets using the GenoMetric Query Language (GMQL) (Masseroli et al., 2015), a high-level, declarative query language for genomic big data developed by the Genomic Computing group at Politecnico di Milano, which also allows combining replicate data samples for each histone mark in each category. Finally, to segment the genome as a function of the occurrence of functional elements and to identify co-occurrence patterns of epigenetic features, we used HMMs, which are well suited to discovering unobserved ‘hidden’ states from multiple ‘observed’ inputs in their spatial genomic context.
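The segmentation idea can be sketched with a small Viterbi decoder. The sketch below assumes that the genome is divided into fixed-size bins, that each mark is binarized to present/absent per bin, and that marks are emitted independently given the hidden state (a multivariate Bernoulli model). These modelling choices, the two-state toy parameters and the mark names are assumptions made for illustration, not the configuration used in the project, and parameter learning is not shown.

```python
from math import log


def log_emission(observation, mark_probs):
    """Log-probability of one binarized bin given per-mark presence probabilities."""
    return sum(log(p if present else 1.0 - p)
               for present, p in zip(observation, mark_probs))


def viterbi(observations, initial, transition, emission):
    """Most likely hidden-state path for a sequence of binarized bins.

    All probabilities must lie strictly between 0 and 1 so that logs are finite.
    """
    n_states = len(initial)
    # score[k]: best log-probability of any path ending in state k at the current bin
    score = [log(initial[k]) + log_emission(observations[0], emission[k])
             for k in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor state for arriving in state j at this bin.
            best_prev = max(range(n_states),
                            key=lambda k: score[k] + log(transition[k][j]))
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log(transition[best_prev][j])
                             + log_emission(obs, emission[j]))
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy model (hypothetical values): columns are (H3K4me3, H3K27me3) presence.
initial = [0.5, 0.5]
transition = [[0.9, 0.1],   # state 0 tends to persist across adjacent bins
              [0.1, 0.9]]   # state 1 tends to persist across adjacent bins
emission = [[0.9, 0.1],     # state 0: first mark usually present
            [0.1, 0.9]]     # state 1: second mark usually present
bins = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 1)]

path = viterbi(bins, initial, transition, emission)
print(path)
```

In this toy input the decoded path stays in state 0 for the first two bins and switches to state 1 from the third bin on, which is the kind of per-bin state assignment that a chromatin-state segmentation produces.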
Results:
Applied to publicly available data, the developed computational method generates a catalogue of genome-wide chromatin states for a large number of semantically categorized biological tissues and disease/healthy conditions. A large number of genomic profiles of chromatin states in different conditions were compared, and their condition-specific variations were extracted. After promising results, the pipeline was enhanced in several respects to improve the accuracy of the derived information: a finer characterization of chromatin, a higher number of chromatin states, multiple histone marks, and the addition of transcription factors in the generation of the chromatin states. The next step will be to incorporate this procedure into the GMQL system to obtain an automated data integration solution that integrates, replicates, and analyses large volumes of data, yielding a highly reproducible method that allows visualizing the results of the enrichment analysis performed on the clusters of chromatin state regions generated by the HMM.
- Baker, M. (2011) Making sense of chromatin states. Nat Methods 8, 717–722.
- de Pretis, S., Pelizzola, M. (2014) Computational and experimental methods to decipher the epigenetic code. Frontiers in Genetics 5, 335.
- Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Tjong Kim Sang, E. F., De Meulder, F. (2003) Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. CoNLL-2003.
- Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R. (2013) Evaluating Entity Linking with Wikipedia. Artificial Intelligence 194, 130–150.
- Heintzman, N. D. et al. (2007) Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature Genet. 39, 311–318.
- Lv J, Liu H, Huang Z, Su J, He H, Xiu Y, Zhang Y, Wu Q. Long non-coding RNA identification over mouse brain development by integrative modeling of chromatin and genomic features. Nucleic Acids Res. 2013; 41:10044–10061.
- Masseroli M, Pinoli P, Venco F, Kaitoua A, Jalili V, Paluzzi F, Muller H, Ceri S. GenoMetric Query Language: A novel approach to large-scale genomic data management. Bioinformatics 2015; 31(12): 1881–1888.
- Mei S, Qin Q, Wu Q, Sun H, Zheng R, Zang C, Zhu M, Wu J, Shi X, Taing L, Liu T, Brown M, Meyer CA, Liu XS. Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse. Nucleic Acids Res. 2017; 45(D1): D658–D662.
- Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L. (2018) Deep contextualized word representations. NAACL.
- van Steensel, B. (2011) Chromatin: constructing the big picture. EMBO J. 30, 1885–1895.
- Yadav, V., Bethard, S. (2018) A Survey on Recent Advances in Named Entity Recognition from Deep Learning models. In Proceedings of the 27th International Conference on Computational Linguistics, 2145–2158.
- Wamstad JA, Alexander JM, Truty RM, Shrikumar A, Li F, Eilertson KE, Ding H, Wylie JN, Pico AR, Capra JA, Erwin G, Kattman SJ, Keller GM, Srivastava D, Levine SS, Pollard KS, et al. Dynamic and coordinated epigenetic regulation of developmental transitions in the cardiac lineage. Cell. 2012; 151:206–220.
- Zaveri, A., Hu, W., Dumontier, M. (2019) MetaCrowd: crowdsourcing biomedical metadata quality assessment. Human Computation 6(1), 98–112.