// ai.infrastructure.engineer

Vasif Vahab
AI Infrastructure &
Senior System Engineer

I build and operate the infrastructure that powers AI. From bare-metal GPU clusters to high-speed InfiniBand fabrics, I set up the Slurm schedulers, Kubernetes control planes, RoCE storage backends, and CUDA environments that let AI teams train and deploy models at scale. I currently run NVIDIA DGX, HGX H200, and A100 clusters in production, with Dragonfly, Elasticsearch, and Lustre storage under the hood.

$ whoami
vasif.vahab — ai.infra.engineer
$ uptime --career
10+ years | Abu Dhabi, UAE
$ cat /gpu/clusters
DGX · HGX H200 · A100 · DGX Spark · DGX Station
$ cat /hpc/stack
Slurm · K8s · Dragonfly · Lustre · InfiniBand · RoCE
$ cat /nvidia/drivers
CUDA · Fabric Manager · Mellanox DOCA · Container Toolkit
$
// career history

Experience

AVRIOC Technologies L.L.C 📍 Abu Dhabi
Senior System Engineer — AI Infrastructure
Feb 2020 — Present
  • AI GPU Cluster Build-Out: End-to-end deployment of NVIDIA DGX, HGX H200, and A100 GPU clusters, from rack & stack in the datacenter through OS install, NVIDIA driver, CUDA Toolkit, NVIDIA Container Toolkit, and Fabric Manager configuration
  • Slurm HPC Clusters: Set up and managed Slurm clusters for AI/ML workload scheduling, GPU partitioning, and resource allocation across multi-node DGX/HGX environments
  • Kubernetes for AI: Deployed and maintained production Kubernetes clusters for containerized AI workloads, enabling scalable and reproducible ML environments
  • Dragonfly + Redis Sentinel: Built a Dragonfly cluster managed by Redis Sentinel for high-throughput caching and data distribution across GPU nodes
  • Elasticsearch Cluster: Deployed and operated Elasticsearch clusters for AI workload logging, analytics, and observability
  • MySQL Galera Cluster: Set up a MySQL Galera cluster for high-availability database services supporting AI platform operations
  • High-Speed Storage Fabrics: Built and configured InfiniBand storage, RoCE (RDMA over Converged Ethernet) storage, and a Lustre filesystem with a kernel build for high-throughput AI data pipelines
  • Mellanox DOCA & Networking: Installed and configured Mellanox DOCA drivers for high-performance InfiniBand and RoCE networking on GPU nodes
  • NVIDIA Software Stack: Managed the full NVIDIA software stack: driver installation, CUDA Toolkit, NVIDIA Container Toolkit (including the NVIDIA container runtime), and Fabric Manager for NVLink/NVSwitch topology
  • Multi-GPU Platform Experience: Hands-on with NVIDIA H200, A100, DGX Station, DGX Spark, and DGX systems — hardware setup through production operations
  • AI Model Integration: Deployed and integrated AI models using the Zeroclaw platform for production inference and model serving
  • VictoriaMetrics: Implemented VictoriaMetrics for high-performance monitoring and metrics collection across AI cluster infrastructure
  • HPE Cray XD670: Set up and configured HPE Cray XD670 servers in the datacenter: rack mounting, BIOS/firmware, and OS deployment
  • MAAS 3.6.1: Deployed MAAS (Metal as a Service) for automated bare-metal server provisioning and lifecycle management
  • NetBox DCIM/IPAM: Set up NetBox for datacenter inventory: IP tracking, rack assignments, VLANs, and asset management
  • CI/CD Pipelines: Built CI/CD pipelines with GitLab + Jenkins for Laravel-based applications, automating build, test, and deploy workflows
  • AWS VPC Architecture: Designed and managed VPC and subnet architecture with public/private isolation, NAT gateways, and security groups
  • Hybrid Cloud Networking: Set up VPN tunnels and Transit Gateway (TGW) for hybrid cloud workflows connecting on-prem to AWS, including TGW routing and cross-account connectivity
  • AWS Edge Security: Configured WAF for edge security, SSL termination at ALB with ACM certificate management
  • AWS Core Services: Managed S3, ALB, RDS, EC2, and Route 53 DNS across production environments
  • Wazuh Cluster: Deployed a Wazuh cluster for distributed security monitoring, threat detection, and compliance
  • IT Tools Deployed for the Department: Implemented Ansible AWX, BookStack, Excalidraw, GitLab, Grafana, MAAS, Paperless-ngx, Prometheus, Stirling PDF, Wazuh, Xibo, and Zabbix for department IT operations
  • Infrastructure Automation: Automated infrastructure provisioning using Ansible, enhancing consistency and minimizing manual errors
  • Privileged Access Management: Managed CyberArk for secure privileged access controls
  • Virtualization: Executed VMware vCenter operations for VM management and performance optimization
  • Linux Server Operations: Oversaw Linux server deployment, optimization, and automation across the infrastructure
LC WELL DMCC 📍 Dubai
Linux Administrator
Nov 2018 — Feb 2020
  • Managed the setup and administration of Linux and macOS systems, optimizing performance across platforms
  • Enhanced scalability and reliability by administering cloud solutions and backup systems on AWS, GCP, and Linode
  • Supported open-source applications and services
  • Integrated and maintained network services including NFS, DNS, FTP, and Samba
  • Improved database performance with the implementation of load balancing strategies using HAProxy and Galera Cluster
  • Effectively managed firewall configurations (iptables, UFW) along with network devices including routers and switches
  • Ensured data integrity and availability through the implementation of proactive backup and recovery solutions
  • Improved efficiency by developing shell scripts for automation of application deployments and system tasks
  • Formulated comprehensive monitoring, migration, and site scripting strategies
  • Provided expert technical support to end-users for applications, creating streamlined workflows
// technical stack

Skills & Expertise

🧠 AI Infrastructure & GPU Operations

NVIDIA DGX · HGX H200 · A100 · DGX Station · DGX Spark · CUDA Toolkit · NVIDIA Container Toolkit · Fabric Manager · NVIDIA Driver · Zeroclaw · AI Model Integration

⚙️ HPC Clusters & Scheduling

Slurm · Kubernetes · Dragonfly + Redis Sentinel · Elasticsearch Cluster · MySQL Galera Cluster · GPU Partitioning · Workload Scheduling

💾 High-Speed Storage & Networking

InfiniBand · RoCE Storage · Lustre Filesystem · Lustre Kernel Build · Mellanox DOCA · NVLink / NVSwitch · RDMA

🖥️ Datacenter & Bare-Metal

HPE Cray XD670 · MAAS 3.6.1 · NetBox DCIM/IPAM · Bare-Metal Provisioning · Rack & Stack · BIOS / Firmware

☁️ AWS Cloud Infrastructure

VPC & Subnets · Transit Gateway · VPN Tunnels · TGW Routing · WAF · SSL Termination · ALB · ACM · S3 · RDS · EC2 · Route 53

🔧 IT Tools & Platforms Deployed

Ansible AWX · BookStack · Excalidraw · GitLab · Grafana · MAAS · Paperless-ngx · Prometheus · Stirling PDF · Wazuh · Xibo · Zabbix

🔒 Security & Monitoring

Wazuh Cluster · CyberArk · VictoriaMetrics · Prometheus · Grafana · ELK Stack · Zabbix

🤖 Automation & CI/CD

Ansible · Terraform · Jenkins · GitLab CI/CD · Laravel Deploy · Shell Scripting

🐧 OS & Virtualization

Linux · Windows · macOS · VMware vCenter · Docker · NVIDIA DGX OS

10+ Years Experience
5 GPU Platforms (DGX/HGX/H200/A100)
6+ AI Clusters in Production
12 IT Tools Deployed
12+ AWS Services

// get in touch

Contact & Languages

English
Hindi
Malayalam
🔗 LinkedIn Profile ▶️ YouTube — Linux