Vasif Vahab — Senior System Engineer & AI Infrastructure Specialist

// career history

Experience

AVRIOC Technologies L.L.C 📍 Abu Dhabi

Senior System Engineer — AI Infrastructure

Feb 2020 — Present

AI GPU Cluster Build-Out: End-to-end deployment of NVIDIA DGX, HGX H200, and A100 GPU clusters — from rack & stack in the datacenter to OS install, NVIDIA driver, CUDA toolkit, CUDA container toolkit, and Fabric Manager configuration
Slurm HPC Clusters: Set up and managed Slurm clusters for AI/ML workload scheduling, GPU partitioning, and resource allocation across multi-node DGX/HGX environments
Kubernetes for AI: Deployed and maintained production Kubernetes clusters for containerized AI workloads, enabling scalable and reproducible ML environments
Dragonfly + Redis Sentinel: Built Dragonfly cluster managed by Redis Sentinel for high-throughput caching and data distribution across GPU nodes
Elasticsearch Cluster: Deployed and operated Elasticsearch clusters for AI workload logging, analytics, and observability
MySQL Galera Cluster: Set up MySQL Galera cluster for high-availability database services supporting AI platform operations
High-Speed Storage Fabrics: Built and configured InfiniBand storage, RoCE (RDMA over Converged Ethernet) storage, and a Lustre filesystem (including the kernel build) for high-throughput AI data pipelines
Mellanox DOCA & Networking: Installed and configured Mellanox DOCA drivers for high-performance InfiniBand and RoCE networking on GPU nodes
NVIDIA Software Stack: Managed full NVIDIA software stack — driver installation, CUDA toolkit, CUDA container toolkit, Fabric Manager for NVLink/NVSwitch topology, and NVIDIA container runtime
Multi-GPU Platform Experience: Hands-on with NVIDIA H200, A100, DGX Station, DGX Spark, and DGX systems — hardware setup through production operations
AI Model Integration: Deployed and integrated AI models using Zeroclaw platform for production inference and model serving
VictoriaMetrics: Implemented VictoriaMetrics for high-performance monitoring and metrics collection across AI cluster infrastructure
HPE Cray XD670: Set up and configured HPE Cray XD670 servers in datacenter — rack mounting, BIOS/firmware, OS deployment
MAAS 3.6.1: Deployed MAAS (Metal as a Service) for automated bare-metal server provisioning and lifecycle management
NetBox DCIM/IPAM: Set up NetBox for datacenter inventory: IP tracking, rack assignments, VLANs, and asset management
CI/CD Pipelines: Built CI/CD pipelines with GitLab + Jenkins for Laravel-based applications, automating build, test, and deploy workflows
AWS VPC Architecture: Designed and managed VPC and subnet architecture with public/private isolation, NAT gateways, and security groups
Hybrid Cloud Networking: Set up VPN tunnels and Transit Gateway (TGW) for hybrid cloud workflows connecting on-prem to AWS, including TGW routing and cross-account connectivity
AWS Edge Security: Configured WAF for edge security, SSL termination at ALB with ACM certificate management
AWS Core Services: Managed S3, ALB, RDS, EC2, and Route 53 DNS across production environments
Wazuh Cluster: Deployed Wazuh cluster for distributed security monitoring, threat detection, and compliance
IT Tools Deployed for Dept: Implemented Ansible AWX, BookStack, Excalidraw, GitLab, Grafana, MAAS, Paperless-ngx, Prometheus, Stirling PDF, Wazuh, Xibo, and Zabbix for department IT operations
Infrastructure Automation: Automated infrastructure provisioning using Ansible, enhancing consistency and minimizing manual errors
Privileged Access Management: Managed CyberArk for secure privileged access controls
Virtualization: Executed VMware vCenter operations for VM management and performance optimization
Linux Server Operations: Oversaw Linux server deployment, optimization, and automation across the infrastructure

LC WELL DMCC 📍 Dubai

Linux Administrator

Nov 2018 — Feb 2020

Set up and administered Linux and macOS systems, tuning performance across both platforms
Administered cloud services and backup systems on AWS, GCP, and Linode to improve scalability and reliability
Supported open-source applications and services
Integrated network services including NFS, DNS, FTP, and Samba
Improved database performance through load balancing with HAProxy and Galera Cluster
Managed firewall configurations (iptables, UFW) and network devices including routers and switches
Implemented proactive backup and recovery solutions to protect data integrity and availability
Developed shell scripts to automate application deployments and system tasks
Developed monitoring, migration, and site scripting strategies
Provided technical support to end users for applications and streamlined their workflows

// technical stack

Skills & Expertise

🧠 AI Infrastructure & GPU Operations

NVIDIA DGX, HGX, H200, A100, DGX Station, DGX Spark, CUDA Toolkit, CUDA Container Toolkit, Fabric Manager, NVIDIA Driver, Zeroclaw, AI Model Integration

⚙️ HPC Clusters & Scheduling

Slurm, Kubernetes, Dragonfly + Redis Sentinel, Elasticsearch Cluster, MySQL Galera Cluster, GPU Partitioning, Workload Scheduling

💾 High-Speed Storage & Networking

InfiniBand, RoCE Storage, Lustre Filesystem, Lustre Kernel Build, Mellanox DOCA, NVLink / NVSwitch, RDMA

🖥️ Datacenter & Bare-Metal

HPE Cray XD670, MAAS 3.6.1, NetBox DCIM/IPAM, Bare-Metal Provisioning, Rack & Stack, BIOS / Firmware

☁️ AWS Cloud Infrastructure

VPC & Subnets, Transit Gateway, VPN Tunnels, TGW Routing, WAF, SSL Termination, ALB, ACM, S3, RDS, EC2, Route 53

🔧 IT Tools & Platforms Deployed

Ansible AWX, BookStack, Excalidraw, GitLab, Grafana, MAAS, Paperless-ngx, Prometheus, Stirling PDF, Wazuh, Xibo, Zabbix

🔒 Security & Monitoring

Wazuh Cluster, CyberArk, VictoriaMetrics, Prometheus, Grafana, ELK Stack, Zabbix

🤖 Automation & CI/CD

Ansible, Terraform, Jenkins, GitLab CI/CD, Laravel Deploy, Shell Scripting

🐧 OS & Virtualization

Linux, Windows, macOS, VMware vCenter, Docker, NVIDIA DGX OS

10+ Years Experience

GPU Platforms (DGX/HGX/H200/A100)

AI Clusters in Production

IT Tools Deployed

12+ AWS Services
