Saurav Muralidharan

Senior Research Scientist, NVIDIA
Email: mail [at]

Twitter | LinkedIn | CV

I am a scientist at NVIDIA Research, working in the Deep Learning Efficiency Research (DLER) team. My work focuses on improving the runtime performance and efficiency of deep neural networks, especially large language models (LLMs), using techniques like model compression (sparsity, low-rank factorization, distillation, etc.) and neural architecture search (NAS).

Prior to joining NVIDIA, I completed my Ph.D. in Computer Science at the University of Utah under the guidance of Prof. Mary Hall. While at Utah, I worked on machine learning-based techniques to improve the performance, portability, and energy efficiency of GPU programs.

Recent Publications

[ Google Scholar | DBLP ]

Flextron: Many-in-One Flexible Large Language Model
R. Cai, S. Muralidharan, G. Heinrich, H. Yin, Z. Wang, J. Kautz, P. Molchanov
ICML 2024 (Oral). [pdf]

HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity
Y. N. Wu, P. Tsai, S. Muralidharan, A. Parashar, V. Sze, J. Emer
arXiv 2305.12718 (2023). [pdf]

Uniform Sparsity in Deep Neural Networks
S. Muralidharan
Sixth Conference on Machine Learning and Systems (MLSys 2023). [pdf]

Efficient Sparsely Activated Transformers
S. Latifi, S. Muralidharan, M. Garland
arXiv 2208.14580 (2022). [pdf]

Open-Source Software

[ GitHub Profile ]

Condensa is a framework for programmable model compression in Python. It comes with a set of built-in compression operators which may be used to compose complex compression schemes targeting specific combinations of DNN architecture, hardware platform, and optimization objective. To recover any accuracy lost during compression, Condensa uses a constrained optimization formulation of model compression and employs an Augmented Lagrangian-based algorithm as the optimizer.
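
To give a feel for the idea, here is a minimal sketch of a generic augmented Lagrangian loop on a toy problem. It is not Condensa's API: the function names, the quadratic stand-in loss, the top-k pruning operator, and all hyperparameters are assumptions chosen for illustration. The loop minimizes a loss f(w) subject to the constraint w = theta, where theta must have at most k nonzero entries.

```python
def prune_top_k(values, k):
    """Hypothetical compression operator: keep the k largest-magnitude entries."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def compress_augmented_lagrangian(target, k, mu=0.1, mu_growth=1.2,
                                  outer_iters=20, inner_steps=50, lr=0.1):
    """Toy loss f(w) = 0.5 * ||w - target||^2, constraint w = theta (k-sparse)."""
    w = list(target)            # start from the "pretrained" weights
    theta = prune_top_k(w, k)   # initial compressed copy
    lam = [0.0] * len(w)        # Lagrange multiplier estimates
    for _ in range(outer_iters):
        # Step 1: gradient descent on w for the augmented Lagrangian
        #   f(w) - lam.(w - theta) + (mu / 2) * ||w - theta||^2
        for _ in range(inner_steps):
            grad = [(wi - ti) - li + mu * (wi - thi)
                    for wi, ti, li, thi in zip(w, target, lam, theta)]
            w = [wi - lr * g for wi, g in zip(w, grad)]
        # Step 2: re-compress the multiplier-shifted weights
        theta = prune_top_k([wi - li / mu for wi, li in zip(w, lam)], k)
        # Step 3: update the multipliers, then tighten the penalty
        lam = [li - mu * (wi - thi) for li, wi, thi in zip(lam, w, theta)]
        mu *= mu_growth
    return w, theta


# Illustrative weights only; not taken from any real model.
target = [3.0, -0.1, 0.5, -2.0, 0.05]
w, theta = compress_augmented_lagrangian(target, k=2)
```

As the penalty mu grows and the multipliers accumulate, the dense weights w are pulled toward the sparse copy theta, so the constraint is approximately satisfied at the end. In a real setting the inner step would be ordinary network training rather than a closed-form quadratic.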

Tensor methods generalize matrix algebraic operations to higher orders, and can help deep neural networks better preserve and leverage local structure. TensorLy-Torch is a PyTorch library that builds on top of TensorLy and provides out-of-the-box tensor layers. It comes with batteries included and tries to make it as easy as possible to use tensor methods within your deep networks.

Nitro Autotuning Framework

Nitro is a programmer-directed code variant tuning framework, jointly developed by the University of Utah and NVIDIA Research. It utilizes machine learning-based classification to automatically find the best implementation (variant) of a computation for a given input. Nitro provides C++ and Python interfaces for programmers to specify variants, input dataset features, and constraints.
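
The general idea of classification-based variant selection can be sketched in a few lines of Python. This is not Nitro's interface: the class name, the feature function, the nearest-neighbor classifier, and the hand-written training labels below are all hypothetical and stand in for labels that a real tuner would obtain by timing each variant on training inputs.

```python
def insertion_sort(xs):
    """Variant 1: tends to suit short or nearly sorted inputs."""
    out = list(xs)
    for i in range(1, len(out)):
        item = out[i]
        j = i - 1
        while j >= 0 and out[j] > item:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = item
    return out


def library_sort(xs):
    """Variant 2: general-purpose built-in sort."""
    return sorted(xs)


def extract_features(xs):
    """Hypothetical input features: scaled length and fraction of descents."""
    n = len(xs)
    descents = sum(1 for a, b in zip(xs, xs[1:]) if a > b)
    return (n / 1000.0, descents / max(n - 1, 1))


class VariantSelector:
    """Picks a variant with a 1-nearest-neighbor classifier over input features."""

    def __init__(self, variants):
        self.variants = variants   # name -> callable
        self.examples = []         # (feature vector, best variant name)

    def train(self, feature_vec, best_variant):
        self.examples.append((feature_vec, best_variant))

    def select(self, feature_vec):
        def sq_dist(example):
            return sum((a - b) ** 2 for a, b in zip(example[0], feature_vec))
        return min(self.examples, key=sq_dist)[1]

    def __call__(self, xs):
        return self.variants[self.select(extract_features(xs))](xs)


selector = VariantSelector({"insertion_sort": insertion_sort,
                            "library_sort": library_sort})
# Made-up labels for illustration; a real tuner would measure these.
selector.train((0.01, 0.0), "insertion_sort")
selector.train((1.0, 0.5), "library_sort")

result = selector([3, 1, 2])   # dispatches to whichever variant the classifier picks
```

Every variant must compute the same result, so a wrong prediction costs only performance, never correctness.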

Professional Service