Presentations & Posters

2025

Beyond Tokenization: Evolutionarily Grounded Units for Protein Language Understanding

Talk

Burak Suyunu, Özdeniz Dolu and Arzucan Özgür

The 18th International Symposium on Health Informatics and Bioinformatics (HIBIT 2025)

Date: October 2025

Location: Istanbul Medipol University, Türkiye

TBD

Protein Language Understanding, Tokenization, Evolutionary Biology, Natural Language Processing

GNNMutation: a heterogeneous graph-based framework for cancer detection

Poster

Nuriye Özlem Özcan Şimşek, Arzucan Özgür and Fikret S. Gürgen

33rd Conference on Intelligent Systems for Molecular Biology (ISMB) / 24th European Conference on Computational Biology (ECCB 2025)

Date: July 2025

Location: Liverpool, United Kingdom

TBD

Cancer Detection, Graph Neural Networks, Heterogeneous Graphs, Genomics

2024

Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

Paper Presentation

Burak Suyunu, Enes Taylan and Arzucan Özgür

2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Date: December 2024

Location: Lisbon, Portugal

Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constraints. This study evaluates three prominent tokenization approaches, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, across varying vocabulary sizes (400–6400), analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws. Our comprehensive analysis reveals distinct behavioral patterns among these tokenizers, with vocabulary size significantly influencing their performance. BPE demonstrates better contextual specialization and marginally better domain boundary preservation at smaller vocabularies, while SentencePiece achieves better encoding efficiency, leading to lower fertility scores. WordPiece offers a balanced compromise between these characteristics. However, all tokenizers show limitations in maintaining protein domain integrity, particularly as vocabulary size increases. Analysis of linguistic law adherence shows partial compliance with Zipf's and Brevity laws but notable deviations from Menzerath's law, suggesting that protein sequences may follow distinct organizational principles from natural languages. These findings highlight the limitations of applying traditional NLP tokenization methods to protein sequences and emphasize the need for developing specialized tokenization strategies that better account for the unique characteristics of proteins. Our work contributes to the ongoing dialogue between bioinformatics and natural language processing, offering insights for future development of protein-specific tokenization approaches.
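As a rough illustration of the fertility measure mentioned in the abstract, the sketch below compares two toy vocabularies on two made-up sequences. It assumes fertility means the average number of tokens needed to encode one sequence, and it uses a simple greedy longest-match segmenter as a stand-in; it is not the BPE, WordPiece, or SentencePiece implementations evaluated in the study. The sequences, vocabularies, and function names are hypothetical.

```python
def tokenize(sequence, vocab):
    """Greedy longest-match segmentation (a stand-in, not BPE/WordPiece/SentencePiece).

    Single residues are always accepted, so every sequence can be segmented.
    """
    max_len = max(len(token) for token in vocab)
    tokens, i = [], 0
    while i < len(sequence):
        # Try the longest candidate first, then fall back to shorter ones.
        for size in range(min(max_len, len(sequence) - i), 0, -1):
            piece = sequence[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def fertility(sequences, vocab):
    """Average number of tokens per sequence (assumed definition)."""
    total_tokens = sum(len(tokenize(seq, vocab)) for seq in sequences)
    return total_tokens / len(sequences)


# Hypothetical toy data, chosen only to make the effect visible.
sequences = ["MKTAYIAK", "MKTLLAK"]
small_vocab = {"MK", "AK"}
large_vocab = {"MK", "AK", "MKT", "YI", "LL"}

print(fertility(sequences, small_vocab))
print(fertility(sequences, large_vocab))
```

In this toy setting the larger vocabulary covers longer fragments, so each sequence is encoded with fewer tokens and the fertility score drops.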


Protein Sequence, Language Processing, Subword Tokenization, Byte-Pair Encoding, Linguistic Laws

Finding Protein Language Units using Evolutionary Protein-Oriented Subword Tokenization Methods

Poster

Burak Suyunu and Arzucan Özgür

23rd European Conference on Computational Biology (ECCB 2024)

Date: September 2024

Location: Turku, Finland

TBD

Protein Sequence, Tokenization, Evolutionary Biology, Language Processing

Protein Function Prediction using Graph Neural Networks with Substructure Hierarchy

Poster

Gökçe Uludoğan, Elif Özkırımlı and Arzucan Özgür

23rd European Conference on Computational Biology (ECCB 2024)

Date: September 2024

Location: Turku, Finland

TBD

Protein Function Prediction, Graph Neural Networks, Substructure Hierarchy, Bioinformatics