Performance Analysis Engineer Intern (Summer 2026)

Astera Labs Early Career

Overview

Posted

4 weeks ago

Internship Type

Summer

Remote Status

On-site

Location

San Jose, CA, US

Education Level

Bachelor, Master, or PhD

Education Status

Not specified

Field of Study

Computer Engineering, Electrical, Electronics and Communications Engineering

About the Role

We are seeking a Performance Analysis Engineer Intern to drive system-level performance optimization across large-scale AI training and inference environments. In this role, you will analyze, profile, and optimize distributed workloads running on high-density accelerator clusters, working across the full stack, from ML frameworks and communication libraries to network fabrics and hardware architecture.

You will play a critical role in ensuring that next-generation AI workloads achieve near-peak hardware efficiency, while directly influencing software architecture, infrastructure design, and future silicon and networking roadmaps.

Job Duties

Cluster-Scale Performance Profiling

Execute and profile state-of-the-art training and inference workloads (e.g., LLMs, diffusion models) across large-scale accelerator clusters.
Identify and resolve bottlenecks across compute, memory bandwidth, and interconnect latency that impact end-to-end Job Completion Time (JCT).

Collective Library Optimization

Tune and optimize distributed communication backends such as NCCL, RCCL, and MPI.
Improve the efficiency of collective operations, including All-Reduce, All-to-All, Reduce-Scatter, and Broadcast, to minimize synchronization overhead.

Network Fabric Analysis

Conduct deep-dive analysis of network performance, diagnosing issues such as packet loss, congestion, head-of-line blocking, and tail latency.
Partner with infrastructure teams to improve network behavior under real-world AI workloads.

Advanced Load Balancing & Traffic Optimization

Design and implement intelligent load-balancing strategies and traffic-shaping algorithms.
Prevent network and compute “hot spots” in high-density AI clusters and improve workload fairness and throughput.

PyTorch Stack Optimization

Leverage advanced PyTorch capabilities, including DistributedDataParallel (DDP), Fully Sharded Data Parallel (FSDP), and torch.compile.
Optimize execution graphs, runtime traces, and memory usage for maximum hardware efficiency.

GPU & Accelerator Utilization

Apply best practices in kernel fusion, mixed-precision execution (FP16/FP8/INT8), and memory management.
Reduce idle “bubble” time and drive sustained peak FLOPS utilization during training and inference.

Performance Modeling & Benchmarking

Build automated benchmarking suites and performance regression tests.
Develop quantitative models to predict how architectural changes (e.g., attention mechanisms, batch sizes, parallelism strategies) scale across different cluster topologies.

Hardware–Software Co-Design

Collaborate closely with systems, infrastructure, and silicon teams to translate performance findings into actionable requirements.
Influence the design of next-generation AI accelerators, NICs, and interconnects.

Requirements & Qualifications

Education

Bachelor’s, Master’s, or PhD in Computer Engineering, Electrical Engineering, or a related field.

Hands-on experience optimizing distributed ML workloads across multi-node accelerator clusters.
Strong understanding of data parallelism, model parallelism, and pipeline parallelism.
Deep knowledge of GPU or accelerator architectures, including compute units, memory hierarchies, and interconnects (PCIe, NVLink, or equivalents).
Experience working with NCCL, RCCL, MPI, or similar collective communication frameworks.
Strong understanding of high-performance networking technologies (Ethernet, InfiniBand, RoCE) and their impact on distributed workloads.

PyTorch & ML Systems Proficiency

Advanced experience with PyTorch, including distributed training internals and execution tracing.
Ability to diagnose and optimize framework-level and runtime bottlenecks.

Comfortable debugging issues across software, firmware, and hardware boundaries.
Strong proficiency in Python and C/C++.
Experience building performance analysis tools, automation, and benchmarking frameworks.
Ability to clearly communicate complex performance findings to cross-functional teams.
Comfortable working in fast-moving, ambiguous environments.

We know that creativity and innovation happen more often when teams include diverse ideas, backgrounds, and experiences, and we actively encourage everyone with relevant experience to apply, including people of color, LGBTQ+ and non-binary people, veterans, parents, and individuals with disabilities.