ai.infrastructure.engineer

Vasif Vahab
AI Infrastructure & Systems Engineer

Over a decade of experience in enterprise infrastructure, now focused on AI operations: running production GPU clusters, HPC schedulers, and high-speed storage fabrics that power machine learning at scale. Covers the full stack that keeps AI training and inference workloads running without downtime, from bare-metal NVIDIA DGX/HGX deployments to Slurm, Kubernetes, and Dragonfly clusters, with InfiniBand, RoCE, and Lustre under the hood. Currently working with custom AI models including Gemma 4, LLaMA 3, Mistral, and Qwen 2.5, deployed and optimized across GPU clusters for production inference.

10+ Years Experience

GPU Platforms

AI Clusters Live

🔗 LinkedIn ▶️ YouTube

vasif@dgx-cluster ~

$ neofetch --role

Senior System Engineer — AI Infrastructure

$ sinfo --clusters

slurm k8s dragonfly elasticsearch mysql-galera wazuh

$ cat /gpu/inventory

DGX · HGX H200 · A100 · DGX Spark · DGX Station

$ cat /storage/fabric

InfiniBand · RoCE · Lustre · Mellanox DOCA

$ cat /nvidia/stack

CUDA · Fabric Manager · Container Toolkit · Driver

$ cat /cloud/aws

VPC · TGW · VPN · WAF · ALB · S3 · RDS · EC2

career history

Experience

AVRIOC Technologies L.L.C 📍 Abu Dhabi, UAE

Senior System Engineer — AI Infrastructure

Feb 2020 — Present

🧠 AI GPU Cluster Operations — End-to-end lifecycle of NVIDIA DGX, HGX H200, and A100 clusters, covering rack and stack, OS deployment, and installation of the NVIDIA driver, CUDA Toolkit, NVIDIA Container Toolkit, and Fabric Manager

⚙️ Slurm HPC Clusters — Workload scheduling, GPU partitioning, and resource allocation across multi-node DGX/HGX environments

☸️ Kubernetes for AI — Production K8s clusters for containerized AI workloads with scalable, reproducible ML environments

🐉 Dragonfly + Redis Sentinel — High-throughput caching and data distribution across GPU nodes with Sentinel-managed failover

🔍 Elasticsearch Cluster — AI workload logging, analytics, and observability at scale

🗄️ MySQL Galera Cluster — High-availability database services supporting AI platform operations

💾 High-Speed Storage Fabrics — InfiniBand storage, RoCE (RDMA), and Lustre filesystem with custom kernel builds for AI data pipelines

🌐 Mellanox DOCA & Networking — High-performance InfiniBand and RoCE networking on GPU nodes

🖥️ NVIDIA Software Stack — Full stack management: driver, CUDA toolkit, container toolkit, Fabric Manager, NVLink/NVSwitch topology

🤖 Multi-GPU Platforms — Hands-on production experience with H200, A100, DGX Station, DGX Spark, and DGX systems

🦀 AI Model Integration — Zeroclaw platform for production inference and model serving of Gemma 4, LLaMA 3, Mistral, and Qwen 2.5

📊 VictoriaMetrics — High-performance monitoring and metrics collection across AI cluster infrastructure

🖥️ HPE Cray XD670 — Datacenter server setup: rack mounting, BIOS/firmware, OS deployment

🔧 MAAS 3.6.1 — Automated bare-metal server provisioning and lifecycle management

📋 NetBox DCIM/IPAM — Datacenter inventory: IP tracking, rack assignments, VLANs, asset management

🔄 CI/CD Pipelines — GitLab + Jenkins for Laravel-based applications, automating build, test, and deploy

☁️ AWS VPC Architecture — VPC and subnet design with public/private isolation, NAT gateways, security groups

🌐 Hybrid Cloud Networking — VPN tunnels and Transit Gateway (TGW) for on-prem to AWS connectivity, including TGW routing and cross-account connectivity

🛡️ AWS Edge Security — WAF configuration, SSL termination at ALB with ACM certificate management

📦 AWS Core Services — S3, ALB, RDS, EC2, Route 53 DNS across production environments

🔒 Wazuh Cluster — Distributed security monitoring, threat detection, and compliance

🛠️ IT Tools Platform — Deployed 12 tools for department operations: Ansible AWX, BookStack, Excalidraw, GitLab, Grafana, MAAS, Paperless-ngx, Prometheus, Stirling PDF, Wazuh, Xibo, Zabbix

🤖 Infrastructure Automation — Ansible-based provisioning for consistency and reduced manual overhead

🔑 Privileged Access — CyberArk for secure privileged access controls

🐧 Virtualization & Linux — VMware vCenter operations, Linux server deployment and optimization

LC WELL DMCC 📍 Dubai, UAE

Linux Administrator

Nov 2018 — Feb 2020

🐧 Linux & macOS Admin — System setup, administration, and performance optimization

☁️ Cloud Operations — Scalability and backup solutions on AWS, GCP, and Linode

🌐 Network Services — NFS, DNS, FTP, Samba integration and protocol management

⚖️ Load Balancing — HAProxy and Galera Cluster for database performance

🛡️ Firewall & Security — iptables, UFW, router and switch management

🔄 Automation & Monitoring — Shell scripts for deployments, backup/recovery, and site migrations

technical stack

Skills & Expertise

🧠

AI Infrastructure & GPU Operations

NVIDIA DGX · HGX H200 · A100 · DGX Station · DGX Spark · CUDA Toolkit · NVIDIA Container Toolkit · Fabric Manager · NVIDIA Driver · Zeroclaw · Gemma 4 · LLaMA 3 · Mistral · Qwen 2.5

⚙️

HPC Clusters & Scheduling

Slurm · Kubernetes · Dragonfly · Redis Sentinel · Elasticsearch · MySQL Galera · GPU Partitioning

💾

High-Speed Storage & Networking

InfiniBand · RoCE Storage · Lustre Filesystem · Lustre Kernel Build · Mellanox DOCA · NVLink / NVSwitch · RDMA

🖥️

Datacenter & Bare-Metal

HPE Cray XD670 · MAAS 3.6.1 · NetBox DCIM · Bare-Metal Provisioning · Rack & Stack

☁️

AWS Cloud Infrastructure

VPC & Subnets · Transit Gateway · VPN Tunnels · TGW Routing · WAF · SSL / ACM · ALB · S3 · RDS · EC2 · Route 53

🛠️

IT Tools & Platforms

Ansible AWX · BookStack · Excalidraw · GitLab · Grafana · MAAS · Paperless-ngx · Prometheus · Stirling PDF · Wazuh · Xibo · Zabbix

🔒

Security & Monitoring

Wazuh Cluster · CyberArk · VictoriaMetrics · Prometheus · Grafana · ELK Stack · Zabbix

🤖

Automation & CI/CD

Ansible · Terraform · Jenkins · GitLab CI/CD · Laravel Deploy · Shell Scripting

🐧

OS & Virtualization

Linux · Windows · macOS · VMware vCenter · Docker · NVIDIA DGX OS

10+ Years Experience

GPU Platforms

AI Clusters Live

IT Tools Deployed

AWS Services

🌐 Languages

📬 Connect
