Vasif Vahab
ai.infrastructure.engineer

AI Infrastructure & Systems Engineer

Over a decade of experience in enterprise infrastructure, now focused on AI operations: running production GPU clusters, HPC schedulers, and high-speed storage fabrics that power machine learning at scale. The work spans the full stack that keeps AI training and inference workloads running without downtime, from bare-metal NVIDIA DGX/HGX deployments to Slurm, Kubernetes, and Dragonfly clusters, with InfiniBand, RoCE, and Lustre under the hood. Currently working with custom AI models including Gemma 3, Llama 3, Mistral, Qwen 2.5, DeepSeek, and Phi-4, along with NVIDIA NIM, deployed and optimized across GPU clusters for production inference.

10+ Years Experience
5 GPU Platforms
6+ AI Clusters Live
vasif@dgx-cluster ~
$ neofetch --role
Senior System Engineer — AI Infrastructure
$ sinfo --clusters
slurm k8s dragonfly elasticsearch mysql-galera wazuh
$ cat /gpu/inventory
DGX · HGX H200 · A100 · DGX Spark · DGX Station
$ cat /storage/fabric
InfiniBand · RoCE · Lustre · Mellanox DOCA
$ cat /nvidia/stack
CUDA · Fabric Manager · Container Toolkit · Driver
$ cat /cloud/aws
VPC · TGW · VPN · WAF · ALB · S3 · RDS · EC2
$
career history
Experience
AVRIOC Technologies L.L.C 📍 Abu Dhabi, UAE
Senior System Engineer — AI Infrastructure
Feb 2020 — Present
🧠 AI GPU Cluster Operations — End-to-end lifecycle of NVIDIA DGX, HGX H200, and A100 clusters, including rack and stack, OS deployment, and installation of the NVIDIA driver, CUDA Toolkit, NVIDIA Container Toolkit, and Fabric Manager
⚙️ Slurm HPC Clusters — Workload scheduling, GPU partitioning, and resource allocation across multi-node DGX/HGX environments
☸️ Kubernetes for AI — Production K8s clusters for containerized AI workloads with scalable, reproducible ML environments
🐉 Dragonfly + Redis Sentinel — High-throughput caching and data distribution across GPU nodes with Sentinel-managed failover
🔍 Elasticsearch Cluster — AI workload logging, analytics, and observability at scale
🗄️ MySQL Galera Cluster — High-availability database services supporting AI platform operations
💾 High-Speed Storage Fabrics — InfiniBand storage, RoCE (RDMA), and Lustre filesystem with custom kernel builds for AI data pipelines
🌐 Mellanox DOCA & Networking — High-performance InfiniBand and RoCE networking on GPU nodes
🖥️ NVIDIA Software Stack — Full stack management: driver, CUDA toolkit, container toolkit, Fabric Manager, NVLink/NVSwitch topology
🤖 Multi-GPU Platforms — Hands-on production experience with H200, A100, DGX Station, DGX Spark, and DGX systems
🦀 AI Model Integration — Zeroclaw platform for production inference and model serving with Gemma 3, Llama 3, Mistral, Qwen 2.5, DeepSeek, Phi-4, and NVIDIA NIM
📊 VictoriaMetrics — High-performance monitoring and metrics collection across AI cluster infrastructure
🖥️ HPE Cray XD670 — Datacenter server setup: rack mounting, BIOS/firmware, OS deployment
🔧 MAAS 3.6.1 — Automated bare-metal server provisioning and lifecycle management
📋 NetBox DCIM/IPAM — Datacenter inventory: IP tracking, rack assignments, VLANs, asset management
🔄 CI/CD Pipelines — GitLab + Jenkins for Laravel-based applications, automating build, test, and deploy
☁️ AWS VPC Architecture — VPC and subnet design with public/private isolation, NAT gateways, security groups
🌐 Hybrid Cloud Networking — VPN tunnels and Transit Gateway (TGW) for on-prem to AWS connectivity, including TGW routing across accounts
🛡️ AWS Edge Security — WAF configuration, SSL termination at ALB with ACM certificate management
📦 AWS Core Services — S3, ALB, RDS, EC2, Route 53 DNS across production environments
🔒 Wazuh Cluster — Distributed security monitoring, threat detection, and compliance
🛠️ IT Tools Platform — Deployed 12 tools for department operations: Ansible AWX, BookStack, Excalidraw, GitLab, Grafana, MAAS, Paperless-ngx, Prometheus, Stirling PDF, Wazuh, Xibo, Zabbix
🤖 Infrastructure Automation — Ansible-based provisioning for consistency and reduced manual overhead
🔑 Privileged Access — CyberArk for secure privileged access controls
🐧 Virtualization & Linux — VMware vCenter operations, Linux server deployment and optimization
LC WELL DMCC 📍 Dubai, UAE
Linux Administrator
Nov 2018 — Feb 2020
🐧 Linux & macOS Admin — System setup, administration, and performance optimization
☁️ Cloud Operations — Scalability and backup solutions on AWS, GCP, and Linode
🌐 Network Services — NFS, DNS, FTP, Samba integration and protocol management
⚖️ Load Balancing — HAProxy and Galera Cluster for database performance
🛡️ Firewall & Security — iptables, UFW, router and switch management
🔄 Automation & Monitoring — Shell scripts for deployments, backup/recovery, and site migrations
technical stack
Skills & Expertise
🧠 AI Infrastructure & GPU Operations

NVIDIA DGX · HGX H200 · A100 · DGX Station · DGX Spark · CUDA Toolkit · NVIDIA Container Toolkit · Fabric Manager · NVIDIA Driver · Zeroclaw · Gemma 3 · Llama 3 · Mistral · Qwen 2.5 · DeepSeek · Phi-4 · NVIDIA NIM
⚙️ HPC Clusters & Scheduling

Slurm · Kubernetes · Dragonfly · Redis Sentinel · Elasticsearch · MySQL Galera · GPU Partitioning
💾 High-Speed Storage & Networking

InfiniBand · RoCE Storage · Lustre Filesystem · Lustre Kernel Build · Mellanox DOCA · NVLink / NVSwitch · RDMA
🖥️ Datacenter & Bare-Metal

HPE Cray XD670 · MAAS 3.6.1 · NetBox DCIM · Bare-Metal Provisioning · Rack & Stack
☁️ AWS Cloud Infrastructure

VPC & Subnets · Transit Gateway · VPN Tunnels · TGW Routing · WAF · SSL / ACM · ALB · S3 · RDS · EC2 · Route 53
🛠️ IT Tools & Platforms

Ansible AWX · BookStack · Excalidraw · GitLab · Grafana · MAAS · Paperless-ngx · Prometheus · Stirling PDF · Wazuh · Xibo · Zabbix
🔒 Security & Monitoring

Wazuh Cluster · CyberArk · VictoriaMetrics · Prometheus · Grafana · ELK Stack · Zabbix
🤖 Automation & CI/CD

Ansible · Terraform · Jenkins · GitLab CI/CD · Laravel Deploy · Shell Scripting
🐧 OS & Virtualization

Linux · Windows · macOS · VMware vCenter · Docker · NVIDIA DGX OS
12 IT Tools Deployed
8+ AWS Services
get in touch
Contact & Languages

🌐 Languages

English: Fluent
Hindi: Fluent
Malayalam: Native