Sheng Zha

I build foundation models, open-source AI frameworks, and the teams behind them.

Head of Model Architecture & Training Advancement at Amazon AGI. VP of Apache MXNet. Founder of GluonNLP. Algorithm-systems co-designer.

About

I lead the research and development of foundation models at Amazon AGI, where I focus on the co-evolution of algorithms and systems—the critical intersection that makes AI more capable, efficient, and accessible. My team built the models behind Amazon Nova, Amazon Q, Titan, and the distributed training infrastructure powering Amazon Bedrock and SageMaker HyperPod.

I built this team from zero, starting in 2018 with a focus on distributed training and shared representations. What started as a small group of tech leaders and hackers grew into the engine behind foundation models serving millions through AWS.

Before that, I shaped the open-source AI ecosystem as VP and PMC Chair of Apache MXNet, where I co-authored the Gluon interface. I founded GluonNLP—the first toolkit to reproduce BERT with record-setting training speeds. I served on the ONNX Steering Committee and co-founded the Python Data API Standards Consortium. I believe accessible tools and open standards are essential for an AI future that benefits everyone.

I hold an MS in Computer Science from the University of Maryland and a BS from Shanghai Jiao Tong University.


What I’ve Built

Amazon Nova & Foundation Model Stack

Amazon AGI, 2024–present

Leading model architecture and training advancement for Amazon's next-generation foundation models. Driving the co-design of algorithms and systems to reduce the cost of intelligence across the stack.

Foundation Models for AWS AI Services

AWS, 2018–2024

Built the team and models from zero. Developed and deployed foundation models underpinning Amazon Q (CodeWhisperer), Titan, Lex, Comprehend, and Kendra. Led LLM pre-training, RLHF, and large-scale distributed training.

Distributed Training Infrastructure

AWS, 2018–2024

Contributed core technology to the scalable, fault-resilient training infrastructure behind Amazon Bedrock and SageMaker HyperPod, designing systems for large-scale training with efficient resource utilization.

Gluon Interface for Apache MXNet

Apache MXNet, 2016–2018

Co-authored the Gluon API—an imperative, Pythonic interface for deep learning that became the standard for MXNet. Made deep learning more accessible to researchers and engineers.

GluonNLP

Founded, 2018

Created a deep-learning NLP toolkit for the Gluon interface. First to reproduce BERT with record-setting training speeds, accelerating NLP research across the community.

ML Platform for Fraud Detection

Amazon TRMS, 2013–2015

Designed a horizontally scalable machine learning platform and graph-based ML solutions for fraud and abuse detection. Built high-availability key-value stores with an expressive transformation DSL for real-time feature engineering.


Selected Publications

Dive into Deep Learning for Natural Language Processing

Tutorial · EMNLP

Differentially Private Pre-training with Limited Data

Privacy-preserving foundation model training

Sequence-Level Training for Large Language Models

LLM training methodology

Sparse Mixture of Expert Models for Code Completion

Efficient model architectures


Talks & Interviews

Deep Learning Lectures

JSALT 2018 · Johns Hopkins University

Dive into Deep Learning for NLP

Conference Tutorial

More talks and podcast appearances coming soon.


Open Source