Astera Labs (NASDAQ: ALAB) provides rack-scale AI infrastructure through purpose-built connectivity solutions. By collaborating with hyperscalers and ecosystem partners, Astera Labs enables organizations to unlock the full potential of modern AI. Astera Labs’ Intelligent Connectivity Platform integrates CXL®, Ethernet, NVLink, PCIe®, and UALink™ semiconductor-based technologies with the company’s COSMOS software suite to unify diverse components into cohesive, flexible systems that deliver end-to-end scale-up and scale-out connectivity. The company’s custom connectivity solutions business complements its standards-based portfolio, enabling customers to deploy tailored architectures to meet their unique infrastructure requirements. Discover more at www.asteralabs.com.

About the Role

We are seeking a Performance Analysis Engineer to drive system-level performance optimization across large-scale AI training and inference environments. In this role, you will analyze, profile, and optimize distributed workloads running on high-density accelerator clusters, working across the full stack, from ML frameworks and communication libraries to network fabrics and hardware architecture.

You will play a critical role in ensuring that next-generation AI workloads achieve near-peak hardware efficiency, while directly influencing software architecture, infrastructure design, and future silicon and networking roadmaps.

Job Duties

Cluster-Scale Performance Profiling

  • Execute and profile state-of-the-art training and inference workloads (e.g., LLMs, diffusion models) across large-scale accelerator clusters.

  • Identify and resolve bottlenecks across compute, memory bandwidth, and interconnect latency that impact end-to-end Job Completion Time (JCT).

Collective Library Optimization

  • Tune and optimize distributed communication backends such as NCCL, RCCL, and MPI.

  • Improve the efficiency of collective operations, including All-Reduce, All-to-All, Reduce-Scatter, and Broadcast, to minimize synchronization overhead.

Network Fabric Analysis

  • Conduct deep-dive analysis of network performance, diagnosing issues such as packet loss, congestion, head-of-line blocking, and tail latency.

  • Partner with infrastructure teams to improve network behavior under real-world AI workloads.

Advanced Load Balancing & Traffic Optimization

  • Design and implement intelligent load-balancing strategies and traffic-shaping algorithms.

  • Prevent network and compute “hot spots” in high-density AI clusters and improve workload fairness and throughput.

PyTorch Stack Optimization

  • Leverage advanced PyTorch capabilities including DistributedDataParallel (DDP), Fully Sharded Data Parallel (FSDP), and torch.compile.

  • Optimize execution graphs, runtime traces, and memory usage for maximum hardware efficiency.

GPU & Accelerator Utilization

  • Apply best practices in kernel fusion, mixed-precision execution (FP16/FP8/INT8), and memory management.

  • Reduce idle “bubble” time and drive sustained peak FLOPS utilization during training and inference.

Performance Modeling & Benchmarking

  • Build automated benchmarking suites and performance regression tests.

  • Develop quantitative models to predict how architectural changes (e.g., attention mechanisms, batch sizes, parallelism strategies) scale across different cluster topologies.

Hardware–Software Co-Design

  • Collaborate closely with systems, infrastructure, and silicon teams to translate performance findings into actionable requirements.

  • Influence the design of next-generation AI accelerators, NICs, and interconnects.


Requirements & Qualifications

  • Education:
    Bachelor’s, Master’s, or PhD in Computer Engineering, Electrical Engineering, or a related field.

  • Hands-on experience optimizing distributed ML workloads across multi-node accelerator clusters.
  • Strong understanding of data parallelism, model parallelism, and pipeline parallelism.
  • Deep knowledge of GPU or accelerator architectures, including compute units, memory hierarchies, and interconnects (PCIe, NVLink, or equivalents).
  • Experience working with NCCL, RCCL, MPI, or similar collective communication frameworks.
  • Strong understanding of high-performance networking technologies (Ethernet, InfiniBand, RoCE) and their impact on distributed workloads.
  • PyTorch and ML systems proficiency: advanced experience with PyTorch, including distributed training internals and execution tracing.
  • Ability to diagnose and optimize framework-level and runtime bottlenecks.
  • Comfortable debugging issues across software, firmware, and hardware boundaries.
  • Strong proficiency in Python and C/C++.
  • Experience building performance analysis tools, automation, and benchmarking frameworks.
  • Ability to clearly communicate complex performance findings to cross-functional teams.
  • Comfortable working in fast-moving, ambiguous environments.

We know that creativity and innovation happen more often when teams include diverse ideas, backgrounds, and experiences, and we actively encourage everyone with relevant experience to apply, including people of color, LGBTQ+ and non-binary people, veterans, parents, and individuals with disabilities.