Haoyu He (何灏宇)


Tübingen AI Center

Tübingen, Germany

Hi! I am a third-year PhD student in the Autonomous Vision Group (AVG), supervised by Prof. Andreas Geiger at the University of Tübingen and the Tübingen AI Center.

My research interests are driven by my research vision: to build computationally efficient intelligent agents that liberate human labor. Currently, I am working on diffusion language models, motivated by two reflections on current LLMs: (1) the redundantly long reasoning traces of current LLMs may be caused by noise accumulation in autoregressive generation, and (2) compute-bound diffusion LMs have the potential to be much faster at inference than memory-bound autoregressive models given sufficient GPUs. Therefore, let’s make diffusion language models great!

Besides, I am interested in shifting the paradigm away from the linearization assumption, where everything in the input is flattened into a sequence, toward models that can exploit the ubiquitous hierarchies in data effectively, for example, concept models and hierarchical models.

My life is somewhat sports-centric. I am an avid cycling enthusiast 🚴🚵, and I also play tennis 🎾 and ski ⛷️.

My CV is here

news

Oct 31, 2024 Our paper “NN4SysBench: Characterizing Neural Network Verification for Computer Systems” is accepted to NeurIPS 2024!
Jul 10, 2024 Our paper “HDT: Hierarchical Document Transformer” is accepted to COLM 2024!

selected publications

  1. HDT: Hierarchical Document Transformer
    Haoyu He, Markus Flicke, Jan Buchmann, and 2 more authors
    In First Conference on Language Modeling, 2024
  2. NN4SysBench: Characterizing Neural Network Verification for Computer Systems
    Shuyi Lin, Haoyu He, Tianhao Wei, and 5 more authors
    In The Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024
  3. Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing
    Haoyu He, Xingjian Shi, Jonas Mueller, and 3 more authors
    In Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, Nov 2021