Paper Accepted at IEEE BIBM 2024! 🎉

November 30, 2024

Paper Conference

We're excited to share that our paper, "Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods", by Burak Suyunu, Enes Taylan, and Arzucan Özgür, has been accepted for publication at the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

The study investigates how widely used NLP tokenization methods (BPE, WordPiece, and SentencePiece) perform on protein sequences, analyzing their effects on domain boundary preservation, encoding efficiency, and compliance with linguistic laws such as Zipf's law, the brevity law, and Menzerath's law. The results reveal both the promise and the limitations of transferring language models to protein-level representations.
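For readers unfamiliar with what checking a linguistic law on tokens can look like, here is a minimal sketch for Zipf's law. It is not the paper's code or method. The function names `chunk` and `zipf_slope` are hypothetical, and the fixed-length chunking is only a toy stand-in for the output of a real subword tokenizer such as BPE. The idea is that token frequency under Zipf's law falls roughly as a power of rank, so the log-log rank-frequency curve should be close to a straight line with a slope near -1.

```python
import math
from collections import Counter


def chunk(sequence, k=2):
    """Toy tokenizer (assumption): split a sequence into non-overlapping k-mers.

    A real study would use a trained subword tokenizer instead.
    """
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Ranks start at 1 for the most frequent token. A slope close to -1
    is the classic Zipfian pattern.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Made-up amino-acid string, for illustration only.
    toy_protein = "MKTAYIAKQRMKTAYIAKMKTAMKQR"
    print(zipf_slope(chunk(toy_protein, k=2)))
```

On a real corpus, the same slope could be computed for each tokenizer's output and compared across methods. The brevity law and Menzerath's law would need their own measurements.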

This work bridges the gap between computational linguistics and structural bioinformatics, emphasizing the need for protein-specific tokenization methods.

The paper will be presented at IEEE BIBM 2024. Stay tuned for the full presentation!

The GitHub repository and preprint will follow soon.