🧠
AI GPU Cluster Operations — End-to-end lifecycle of NVIDIA DGX, HGX H200, and A100 clusters, including rack and stack, OS deployment, NVIDIA driver, CUDA Toolkit, NVIDIA Container Toolkit, and Fabric Manager
⚙️
Slurm HPC Clusters — Workload scheduling, GPU partitioning, and resource allocation across multi-node DGX/HGX environments
☸️
Kubernetes for AI — Production K8s clusters for containerized AI workloads with scalable, reproducible ML environments
🐉
Dragonfly + Redis Sentinel — High-throughput caching and data distribution across GPU nodes with Sentinel-managed failover
🔍
Elasticsearch Cluster — AI workload logging, analytics, and observability at scale
🗄️
MySQL Galera Cluster — High-availability database services supporting AI platform operations
💾
High-Speed Storage Fabrics — InfiniBand storage, RoCE (RDMA over Converged Ethernet), and Lustre filesystem with custom kernel builds for AI data pipelines
🌐
Mellanox DOCA & Networking — High-performance InfiniBand and RoCE networking on GPU nodes
🖥️
NVIDIA Software Stack — Full stack management: driver, CUDA Toolkit, NVIDIA Container Toolkit, Fabric Manager, NVLink/NVSwitch topology
🤖
Multi-GPU Platforms — Hands-on production experience with H200, A100, DGX Station, DGX Spark, and DGX systems
🦀
AI Model Integration — Zeroclaw platform for production inference and model serving: Gemma 3, Llama 3, Mistral, Qwen 2.5, DeepSeek, Phi-4, NVIDIA NIM
📊
VictoriaMetrics — High-performance monitoring and metrics collection across AI cluster infrastructure
🖥️
HPE Cray XD670 — Datacenter server setup: rack mounting, BIOS/firmware, OS deployment
🔧
MAAS 3.6.1 — Automated bare-metal server provisioning and lifecycle management
📋
NetBox DCIM/IPAM — Datacenter inventory: IP tracking, rack assignments, VLANs, asset management
🔄
CI/CD Pipelines — GitLab + Jenkins for Laravel-based applications, automating build, test, and deploy
☁️
AWS VPC Architecture — VPC and subnet design with public/private isolation, NAT gateways, security groups
🌐
Hybrid Cloud Networking — VPN tunnels and Transit Gateway (TGW) for on-premises-to-AWS and cross-account connectivity, with TGW routing
🛡️
AWS Edge Security — WAF configuration, SSL/TLS termination at the ALB with ACM certificate management
📦
AWS Core Services — S3, ALB, RDS, EC2, Route 53 DNS across production environments
🔒
Wazuh Cluster — Distributed security monitoring, threat detection, and compliance
🛠️
IT Tools Platform — Deployed 12 tools for department operations: Ansible AWX, BookStack, Excalidraw, GitLab, Grafana, MAAS, Paperless-ngx, Prometheus, Stirling PDF, Wazuh, Xibo, Zabbix
🤖
Infrastructure Automation — Ansible-based provisioning for consistency and reduced manual overhead
🔑
Privileged Access — CyberArk for secure privileged access controls
🐧
Virtualization & Linux — VMware vCenter operations, Linux server deployment and optimization