Publications
2025
STAR-GO: Improving Protein Function Prediction by Learning to Hierarchically Integrate Ontology-Informed Semantic Embeddings
M. E. Akça, Gökçe Uludoğan, Arzucan Özgür and İ. M. Baytaş
arXiv preprint
Accurate prediction of protein function is essential for elucidating molecular mechanisms and advancing biological and therapeutic discovery. Yet experimental annotation lags far behind the rapid growth of protein sequence data. Computational approaches address this gap by associating proteins with Gene Ontology (GO) terms, which encode functional knowledge through hierarchical relations and textual definitions. However, existing models often emphasize one modality over the other, limiting their ability to generalize, particularly to unseen or newly introduced GO terms, which frequently arise as the ontology evolves and render previously trained models outdated. We present STAR-GO, a Transformer-based framework that jointly models the semantic and structural characteristics of GO terms to enhance zero-shot protein function prediction. STAR-GO integrates textual definitions with ontology graph structure to learn unified GO representations, which are processed in hierarchical order to propagate information from general to specific terms. These representations are then aligned with protein sequence embeddings to capture sequence-function relationships. STAR-GO achieves state-of-the-art performance and superior zero-shot generalization, demonstrating the utility of integrating semantics and structure for robust and adaptable protein function prediction.
BibTeX
@article{akca2025stargo,
title={STAR-GO: Improving Protein Function Prediction by Learning to Hierarchically Integrate Ontology-Informed Semantic Embeddings},
author={Akça, M. E. and Uludoğan, Gökçe and Özgür, Arzucan and Baytaş, İ. M.},
journal={arXiv preprint arXiv:2512.05245},
year={2025}
}
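The hierarchy-aware encoding described in the abstract can be pictured with a small sketch: textual GO-term embeddings are updated in topological (general-to-specific) order by fusing each term with its parents, and proteins are scored against the resulting term representations. This is a minimal illustration in PyTorch, not the paper's architecture; the fusion layer, mean parent pooling, and dot-product scoring are all assumptions.

# Illustrative sketch only: hierarchy-aware GO embeddings aligned with
# protein embeddings. Layers, dimensions, and scoring are assumptions.
import torch
import torch.nn as nn

class HierarchicalGOEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Fuses a term's textual-definition embedding with the mean of
        # its parents' (already updated) representations.
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, text_emb, topo_order, parents):
        # text_emb: (num_terms, dim) embeddings of GO textual definitions
        # topo_order: term indices sorted from general to specific
        # parents: dict mapping a term index to its parent indices
        h = text_emb.clone()
        for t in topo_order:
            if parents.get(t):  # root terms keep their text embedding
                parent_ctx = h[parents[t]].mean(dim=0)
                h[t] = self.fuse(torch.cat([text_emb[t], parent_ctx]))
        return h

def score_proteins(protein_emb, go_emb):
    # Align protein sequence embeddings with GO representations; a dot
    # product stands in for the paper's alignment mechanism.
    return torch.sigmoid(protein_emb @ go_emb.T)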
PUMA: Discovery of Protein Units via Mutation-Aware Merging
Burak Suyunu, Özdeniz Dolu, Ibukunoluwa A. Olaosebikan, Hacer K. Bristow and Arzucan Özgür
arXiv preprint
Proteins are the essential drivers of biological processes. At the molecular level, they are chains of amino acids that can be viewed through a linguistic lens where the twenty standard residues serve as an alphabet combining to form a complex language, referred to as the language of life. To understand this language, we must first identify its fundamental units. Analogous to words, these units are hypothesized to represent an intermediate layer between single residues and larger domains. Crucially, just as protein diversity arises from evolution, these units should inherently reflect evolutionary relationships. We introduce PUMA (Protein Units via Mutation-Aware Merging) to discover these evolutionarily meaningful units. PUMA employs an iterative merging algorithm guided by substitution matrices to identify protein units and organize them into families linked by plausible mutations. This process creates a hierarchical genealogy where parent units and their mutational variants coexist, simultaneously producing a unit vocabulary and the genealogical structure connecting them. We validate that PUMA families are biologically meaningful; mutations within a PUMA family correlate with clinically benign variants and with high-scoring mutations in high-throughput assays. Furthermore, these units align with the contextual preferences of protein language models and map to known functional annotations. PUMA's genealogical framework provides evolutionarily grounded units, offering a structured approach for understanding the language of life.
BibTeX
@article{suyunu2025puma,
title={PUMA: Discovery of Protein Units via Mutation-Aware Merging},
author={Suyunu, Burak and Dolu, Özdeniz and Olaosebikan, Ibukunoluwa A. and Bristow, Hacer K. and Özgür, Arzucan},
journal={arXiv preprint arXiv:2503.08838},
year={2025}
}
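The mutation-aware merging loop described above can be sketched as a BPE-style procedure: each round, the most frequent adjacent pair becomes a new unit, and other candidate units reachable from it by a single high-scoring substitution (per a matrix such as BLOSUM62) are recorded as its family. A toy Python sketch under those assumptions; the scoring threshold, family criterion, and variable names are illustrative, not PUMA's exact algorithm.

# Toy sketch of mutation-aware merging; not the published algorithm.
from collections import Counter

def pair_counts(seqs):
    counts = Counter()
    for seq in seqs:                      # seq: list of current units
        counts.update(zip(seq, seq[1:]))
    return counts

def same_family(a, b, blosum, threshold=1):
    # Two candidate units belong to one family if they have equal length
    # and differ at exactly one residue by a plausible (high-scoring)
    # substitution according to the matrix.
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    if len(diffs) != 1:
        return False
    return blosum.get(diffs[0], blosum.get(diffs[0][::-1], -99)) >= threshold

def merge_round(seqs, blosum):
    counts = pair_counts(seqs)
    (u, v), _ = counts.most_common(1)[0]
    parent = u + v                        # new unit (family parent)
    variants = [a + b for (a, b) in counts
                if (a, b) != (u, v) and same_family(a + b, parent, blosum)]
    merged = []
    for seq in seqs:                      # re-segment with the new unit
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (u, v):
                out.append(parent); i += 2
            else:
                out.append(seq[i]); i += 1
        merged.append(out)
    return merged, parent, variants

# Usage: seqs = [list("MKTAYIAK"), ...]; repeating merge_round grows a
# unit vocabulary plus a genealogy of parent units and their variants.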
GNNMutation: a heterogeneous graph-based framework for cancer detection
Nuriye Özlem Özcan Şimşek, Arzucan Özgür and Fikret S. Gürgen
BMC Bioinformatics
Background: When genes are translated into proteins, mutations in the gene sequence can lead to changes in protein structure and function, as well as in the interactions between proteins. These changes can disrupt cell function and contribute to the development of tumors. In this study, we introduce a novel approach based on graph neural networks that jointly considers genetic mutations and protein interactions for cancer prediction. We use DNA mutations in whole exome sequencing data and construct a heterogeneous graph in which patients and proteins are represented as nodes and protein-protein interactions as edges. Furthermore, patient nodes are connected to protein nodes based on mutations in the patient's DNA. Each patient node is represented by a feature vector derived from the mutations in specific genes. The feature values are calculated using a weighting scheme inspired by information retrieval, where whole genomes are treated as documents and mutations as words within these documents. The weighting of each gene, determined by its mutations, reflects its contribution to disease development. The patient nodes are updated by both mutations and protein interactions within our novel heterogeneous graph structure. Since each mutation affects disease development differently, we process the input graph with attention-based graph neural networks.
Results: We compiled a dataset from the UK Biobank consisting of patients with a cancer diagnosis as the case group and those without a cancer diagnosis as the control group. We evaluated our approach on the four most common cancer types, breast, prostate, lung, and colon cancer, and showed that the proposed framework effectively discriminates between the case and control groups.
Conclusions: The results indicate that our proposed graph structure and node updating strategy improve cancer classification performance. Additionally, we extended our system with an explainer that identifies a list of causal genes that are influential in the model's cancer diagnosis predictions. Notably, some of these genes have already been studied in cancer research, demonstrating the system's ability to recognize causal genes for the selected cancer types and make predictions based on them.
BibTeX
@article{simsek2025gnnmutation,
title={GNNMutation: a heterogeneous graph-based framework for cancer detection},
author={Şimşek, Nuriye Özlem Özcan and Özgür, Arzucan and Gürgen, Fikret S.},
journal={BMC Bioinformatics},
volume={26},
pages={153},
year={2025},
publisher={BioMed Central}
}
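The graph construction and the information-retrieval-style weighting lend themselves to a short sketch: patients become "documents" whose "words" are mutated genes, a TF-IDF-style scheme supplies the patient feature vectors, and a heterogeneous graph links patients to proteins and proteins to each other. The sketch below uses scikit-learn and PyTorch Geometric's HeteroData; the toy gene lists, edge indices, and placeholder protein features are assumptions, and TF-IDF only approximates the paper's weighting scheme.

# Sketch of the data setup: TF-IDF-style gene weights and a
# heterogeneous patient-protein graph. Toy inputs are assumptions.
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from torch_geometric.data import HeteroData

# Each patient's "document" lists the genes carrying mutations.
patient_docs = ["BRCA2 TP53 TP53", "EGFR KRAS", "TP53 APC"]
tfidf = TfidfVectorizer().fit(patient_docs)
patient_x = torch.tensor(tfidf.transform(patient_docs).toarray(),
                         dtype=torch.float)

data = HeteroData()
data['patient'].x = patient_x
data['protein'].x = torch.eye(4)            # placeholder protein features
# Protein-protein interaction edges (source row, target row).
data['protein', 'interacts', 'protein'].edge_index = torch.tensor(
    [[0, 1, 2], [1, 2, 3]])
# Patient-protein edges wherever the patient's DNA mutates that gene.
data['patient', 'mutates', 'protein'].edge_index = torch.tensor(
    [[0, 0, 1, 2], [0, 1, 2, 0]])
# An attention-based GNN (e.g., GATConv modules lifted with
# torch_geometric.nn.to_hetero) would then update patient nodes for
# case/control classification.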
2024
Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods
Burak Suyunu, Enes Taylan and Arzucan Özgür
2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constraints. This study evaluates three prominent tokenization approaches, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, across varying vocabulary sizes (400–6400), analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws. Our comprehensive analysis reveals distinct behavioral patterns among these tokenizers, with vocabulary size significantly influencing their performance. BPE demonstrates better contextual specialization and marginally better domain boundary preservation at smaller vocabularies, while SentencePiece achieves better encoding efficiency, leading to lower fertility scores. WordPiece offers a balanced compromise between these characteristics. However, all tokenizers show limitations in maintaining protein domain integrity, particularly as vocabulary size increases. Analysis of linguistic law adherence shows partial compliance with Zipf's and brevity laws but notable deviations from Menzerath's law, suggesting that protein sequences may follow organizational principles distinct from those of natural languages. These findings highlight the limitations of applying traditional NLP tokenization methods to protein sequences and emphasize the need for specialized tokenization strategies that better account for the unique characteristics of proteins. Our work contributes to the ongoing dialogue between bioinformatics and natural language processing, offering insights for the future development of protein-specific tokenization approaches.
BibTeX
@inproceedings{suyunu2024linguistic,
title={Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods},
author={Suyunu, Burak and Taylan, Enes and Özgür, Arzucan},
booktitle={2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
pages={4489--4496},
year={2024},
organization={IEEE Computer Society}
}
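The experimental setup implied by the abstract is easy to reproduce in miniature: train a subword tokenizer on protein sequences at a given vocabulary size and measure its encoding efficiency. A sketch using the Hugging Face tokenizers library; the toy sequences, the vocabulary sizes shown, and the choice to define fertility as tokens per residue are assumptions, and WordPiece and SentencePiece would be trained analogously.

# Sketch: BPE on protein sequences plus a simple fertility score.
from tokenizers import Tokenizer, models, trainers

def train_bpe(seqs, vocab_size):
    # Train a byte-pair-encoding tokenizer directly on raw sequences;
    # no pre-tokenizer is needed since proteins have no whitespace.
    tok = Tokenizer(models.BPE())
    tok.train_from_iterator(seqs, trainers.BpeTrainer(vocab_size=vocab_size))
    return tok

def fertility(tok, seqs):
    # Tokens emitted per residue; lower values mean a more efficient
    # (more compressive) encoding.
    n_tokens = sum(len(tok.encode(s).tokens) for s in seqs)
    return n_tokens / sum(len(s) for s in seqs)

seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSDNGPQNQRNAPRITFGGP"]
for vocab_size in (400, 1600, 6400):
    tok = train_bpe(seqs, vocab_size)
    print(vocab_size, round(fertility(tok, seqs), 3))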
Generative language models on nucleotide sequences of human genes
Musa Nuri İhtiyar and Arzucan Özgür
Scientific Reports
Language models, especially transformer-based ones, have achieved remarkable success in natural language processing; models such as BERT for natural language understanding and GPT-3 for natural language generation are prominent examples. If DNA sequences are viewed as text written with a four-letter alphabet representing the nucleotides, they are structurally similar to natural languages. This similarity has led to the development of discriminative language models such as DNABERT in DNA-related bioinformatics. To our knowledge, however, the generative side of the coin is still largely unexplored. We therefore focus on developing an autoregressive generative language model, in the spirit of GPT-3, for DNA sequences. Since working with whole DNA sequences is challenging without extensive computational resources, we conduct our study at a smaller scale and focus on nucleotide sequences of human genes, i.e., unique parts of DNA with specific functions, rather than whole DNA. This choice does not significantly change the structure of the problem, as both DNA and genes can be treated as 1D sequences of four different nucleotides without losing much information and without oversimplification. We systematically study this almost entirely unexplored problem and observe that recurrent neural networks (RNNs) perform best, while simple techniques such as N-grams are also promising. We also report what we learned about working with generative models on a language we do not understand, unlike natural languages, and note the importance of evaluating with real-world tasks rather than only classical metrics such as perplexity. Finally, we examine whether the data-hungry nature of these models can be mitigated by choosing a language with a minimal vocabulary size, four in this case owing to the four nucleotide types, on the premise that such a language might make the problem easier; we find that it does not substantially change the amount of data required.
BibTeX
@article{ihtiyar2024generative,
title={Generative language models on nucleotide sequences of human genes},
author={İhtiyar, Musa Nuri and Özgür, Arzucan},
journal={Scientific Reports},
volume={14},
number={1},
pages={22204},
year={2024},
publisher={Nature Publishing Group UK London}
}
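Of the baselines the abstract mentions, the N-gram model is the simplest to sketch: estimate next-nucleotide counts for each length-(n-1) context and sample autoregressively over the four-letter alphabet. A toy Python sketch; the order n, the uniform back-off for unseen contexts, and the ATG seed are illustrative choices, and the paper's RNNs would replace the count table with a learned next-nucleotide distribution.

# Toy autoregressive N-gram model over the nucleotide alphabet.
import random
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def train_ngram(seqs, n=4):
    # Count which nucleotide follows each length-(n-1) context.
    counts = defaultdict(Counter)
    for seq in seqs:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n - 1]][seq[i + n - 1]] += 1
    return counts

def generate(counts, n=4, length=60, seed="ATG"):
    seq = seed
    while len(seq) < length:
        dist = counts.get(seq[-(n - 1):])
        if not dist:                  # unseen context: back off to uniform
            seq += random.choice(ALPHABET)
            continue
        nucs, weights = zip(*dist.items())
        seq += random.choices(nucs, weights=weights)[0]
    return seq

genes = ["ATGGCGTACGTTAGCTAAGCGT", "ATGCCGTAGCTAGCTTACGATC"]
model = train_ngram(genes, n=4)
print(generate(model, n=4))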