Herman Wong

Lewiston, ID

DevOps Engineer with 7+ years in IT, coming from the Ops side. I focus on eliminating manual work through infrastructure automation: turning multi-week manual efforts into repeatable, version-controlled pipelines you can run with a single click.

Most of my recent work has been Terraform and Ansible on AWS. I cut Splunk cluster deployments from 153 hours of manual work to under 5, built CI/CD pipelines that deploy full environments from a single button push in roughly 45 minutes, and got an AWS-to-Azure VPN that had been stuck for three weeks running in two days by capturing both ends in Terraform. I test infrastructure code locally in containers (Ansible Molecule) so configuration errors surface before anything touches a cloud account.

I also integrate AI tooling (GitHub Copilot, Claude) into infrastructure workflows, not as a gimmick but to speed up the write-test-fix loop on Ansible roles and Terraform modules.

Key Areas of Expertise:

  • Cloud Infrastructure: AWS (EC2, VPC, EKS, Transit Gateway, IAM, S3, RDS, KMS), Azure (VPN Gateway, Entra ID), multi-cloud connectivity
  • Infrastructure as Code: Terraform, Terragrunt, Ansible (roles, collections, Molecule testing), Packer AMI/container builds
  • CI/CD & Automation: GitLab CI, GitHub Actions, container image pipelines with Trivy scanning, automated deployment orchestration
  • Monitoring & Observability: Splunk HA clusters, Elasticsearch, Prometheus, Grafana, vRealize Operations
  • Scripting: Python, PowerShell, Bash, Golang (practical modifications to existing tools)
  • AI-Augmented Workflows: Claude Code, GitHub Copilot, MCP integrations (Firecrawl, Chrome DevTools), orchestrator-agent patterns for token-aware automation
Note:

This website is a static site hosted in an AWS S3 bucket behind a CloudFront CDN. Changes are made in VS Code, pushed to GitHub, and then deployed to AWS by a simple CI/CD pipeline in GitHub Actions.



Experience

DevOps Engineer

Coalfire
  • Splunk HA Cluster Automation: Splunk HA cluster deployments required 153 hours of manual work per environment and only senior engineers could attempt them. Replaced fragile shell-script bootstraps with reusable Terraform configs and Ansible roles, cutting deployments to under 5 hours and making them accessible to junior engineers.
  • Multi-Cloud VPN: AWS-to-Azure site-to-site VPN had been stuck for three weeks under ClickOps. Captured both tunnel endpoints in Terraform, enabled tunnel logging to surface the encryption mismatch, and had it running in two days. Both ends now reviewable in a single diff.
  • GitLab CI/CD Pipelines: Single button push deploys Active Directory domains, PKI services, firewalls, and clustered applications (GitLab, Elasticsearch, Splunk) in roughly 45 minutes.
  • AWS-to-Azure Log Pipeline: The native Sentinel connector dropped everything except timestamp and message body, so engineers could not distinguish Lambda logs from RDS logs without reading message bodies. Built a Lambda transform that prefixes each record with account ID, log group, and stream name, and a Sentinel-side KQL transform that parses those fields back out at ingestion time.
  • AI-Assisted Development: Integrated GitHub Copilot Agent Mode with Ansible Molecule so the AI runs tests, reads failures, and fixes issues in its own loop. Significantly cut debug time on complex Ansible roles.
  • Ansible Molecule Local Testing: Set up Molecule with Docker containers so engineers could test infrastructure code locally instead of waiting on EC2 instances. Caught configuration errors before anything touched a cloud account.
  • Container Image Pipeline: Builds hardened images from the IronBank ubi9-minimal base with Trivy scanning and OpenSCAP RHEL 9 STIG checks. Runs on both GitHub Actions and an air-gapped GitLab instance, with pushes to ECR.
  • RegScale GRC Platform Deployment: Built the full cloud infrastructure in Terraform for a governance and compliance application: EKS with fine-grained service account permissions via workload identity, multi-AZ SQL Server with managed encryption, multi-AZ persistent storage, and an AWS-managed ingress controller deployed by Helm.
May 2022 - Jan 2026
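The AWS-to-Azure log pipeline bullet describes a Lambda that tags each CloudWatch Logs record with its source before forwarding. A minimal Python sketch of that transform, assuming the Lambda is invoked by a CloudWatch Logs subscription filter (which delivers a base64-encoded, gzip-compressed JSON payload under `awslogs.data`) and assuming a hypothetical pipe-delimited prefix format; the actual delimiter and the forwarding step to Sentinel are not specified here and are left out:

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> dict:
    """Decode a CloudWatch Logs subscription payload (base64 -> gzip -> JSON)."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def prefix_records(payload: dict) -> list[str]:
    """Prefix each log message with account ID, log group, and stream name.

    The '|' separator is a hypothetical choice so a KQL transform on the
    Sentinel side could split the fields back out at ingestion time.
    """
    account = payload["owner"]
    group = payload["logGroup"]
    stream = payload["logStream"]
    return [f"{account}|{group}|{stream}|{e['message']}" for e in payload["logEvents"]]


def handler(event, context):
    payload = decode_subscription_event(event)
    # CloudWatch sends CONTROL_MESSAGE payloads to test the subscription; skip them.
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    # Forwarding these records to the Azure ingestion endpoint is omitted.
    return prefix_records(payload)
```

Putting the original message last means a message that happens to contain the delimiter only affects the final field; the three source fields stay positional.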

VMware Systems Administrator

General Dynamics Information Technology - SPAWAR (Government Client)
  • VDI Recompose Performance: Diagnosed VDI recompose operations taking 3-5 hours and traced the slowdown to the View Agent defaulting to a slow activation path. A registry fix cut recompose time to 30 minutes per pool (an 83-90% reduction).
  • 4-Month VDI Outage Recovery: VDI pools could not be managed or recomposed for 4 months. Traced the failure to a database ID mismatch on the View Connection Server and corrected it, restoring full pool management.
  • VxRail Upgrade Recovery: Hyperconverged infrastructure upgrades had stalled for a month across 6 production enclaves despite vendor support. Identified a certificate trust issue from a recent CA rotation, plus missing hosts file entries. Completed upgrades on 5 of 6 enclaves within one week.
  • C2PC VDI Integration: Wrote a PowerShell script that resets a stale SQL database after every VDI pool recompose, registered as a Horizon View post-synchronization script. Deployed across 6 production enclaves to eliminate recurring post-recompose failures in the C2PC military command-and-control software.
  • PowerShell DSC Patch Orchestration: Wrote DSC configurations to orchestrate Windows patching across 6 production environments in a single day, sequencing primary VMs before secondaries.
  • Centralized Monitoring: Deployed vRealize Operations and Log Insight into environments that previously had no shared visibility, so engineers stopped logging into individual systems to chase logs. Content Packs cover AD, Exchange, SQL, Windows, Horizon View, and Linux.
  • Adobe Flash Killswitch Discovery: Production Horizon View admin portals were going to become inaccessible on January 12, 2021 due to a time-based killswitch built into Adobe Flash. Discovered the issue proactively by advancing the system date on a management VM, found and documented the mms.cfg allowlist configuration that bypasses the killswitch, and led the team in deploying the fix before the cutoff date.
April 2021 - April 2022

Systems Administrator

Zachary Piper Solutions - SPAWAR (Government Client)
  • WSUS Performance Optimization: Applied WSUS metadata cleanup and Microsoft-recommended optimizations across 7 air-gapped production environments. Cut monthly patch export/import from 100+ hours to 8 hours and fixed chronic delta import failures.
  • Exchange DAG Maintenance Automation: Wrote PowerShell scripts to start and stop Database Availability Group maintenance mode for Exchange mailbox servers, including service health checks before major Cumulative Update upgrades. Prevents email loss by gracefully failing over nodes prior to patching.
October 2020 - April 2021

PC Technician

Abbtech - US Army Corps of Engineers IT
  • Enterprise System Health Monitoring: Wrote multi-threaded PowerShell tools to scan 600+ computers in parallel for SCCM client health, BitLocker status, patch compliance, and Group Policy freshness. Health audits that previously took days completed in hours.
  • WPF Diagnostic Dashboard: Rebuilt a slow single-threaded VBScript diagnostic tool as a PowerShell WPF GUI using runspace pools for parallel CIM queries (DCOM, since PowerShell Remoting was disabled). Real-time status of 50+ system properties at 3-5x the speed of the legacy tool.
  • BITS-Based Software Deployment: Wrote a PowerShell script using BITS for fault-tolerant transfers of 20-30 GB CAD software (AutoCAD, Revit) to remote workers during COVID. Asynchronous background transfers survived VPN disconnections, so users did not need to bring laptops on-site.
  • SCCM Enterprise Incident Investigation: Diagnosed an enterprise-wide issue where SCCM clients were uninstalling themselves across hundreds of machines. Traced the cause to a recently enabled SCCM add-on and restored patch delivery capability.
April 2019 - September 2020

Helpdesk Technician

Hawaii Tech Support
  • Network Discovery (Auvik): Deployed Auvik network monitoring across MSP client environments. Discovered undocumented switches, firewalls, and servers still running factory-default credentials. Built baseline inventories for environments that had none.
  • OS Deployment Automation (MDT/WDS): Set up an MDT/WDS imaging server with PXE boot automation. New computers went from bare metal to fully configured with standard software in 30 minutes instead of 3+ hours of hands-on work.
  • Remote Endpoint Remediation: Installed RMM agents remotely using Webroot shell execution and PsExec on domain-joined machines, eliminating client site visits for management software deployment.
  • Email Deliverability: Client emails kept getting flagged as spam. Set up SPF, DKIM, and DMARC records to fix email authentication.
February 2018 - February 2019

Projects

Apache Kafka Cluster with Observability Stack

  • Built an automated deployment for a full Kafka messaging environment in containers: coordination cluster, message brokers, a Kafka UI, a monitoring stack (Prometheus + Grafana), an Nginx reverse proxy, and a Python load generator. Whole stack tested locally without touching a cloud account.
  • End-to-end encryption across the cluster with a built-in certificate authority that issues per-node keys and trust stores, all distributed by Ansible. Five Grafana dashboards deployed by configuration (no manual import) showing broker throughput, consumer lag, and coordinator health, with alerts for broker availability and under-replicated partitions.
  • Nginx reverse proxy fronts everything: Kafka UI at root, Grafana at /grafana/, with HTTPS and basic auth on all routes. Python load generator produces traffic to multiple topics at different polling delays, so the dashboards show meaningful consumer lag profiles immediately after the cluster comes up.
  • Developed with GitHub Copilot Agent Mode in a self-driving test loop. The AI runs tests, reads failures, and fixes the playbooks without human intervention between iterations.
April 2026

driftctl Fork: Terraform Drift Detection Tool (Golang)

  • Forked snyk/driftctl and reworked it in Go to fix three real limitations: it never actually detected drift (only listed orphan resources), it hit AWS rate limits on large accounts, and it had no awareness of resources managed by CloudFormation.
  • Replaced 103 separate AWS API calls with a single AWS Config query covering 132 resource types. Swapping per-service enumerators for one SQL-style query cuts maintenance and eliminates the rate-limit risk. Runs with standard read-only AWS permissions.
  • Added a real Terraform plan integration so the tool compares live AWS state against what Terraform thinks should exist. That's actual attribute-level drift detection (e.g., a security group rule changed outside Terraform), not just unmanaged-vs-managed.
  • Added categorization so CloudFormation-managed resources show up separately from genuinely unmanaged ones. Default AWS resources and built-in service roles get filtered out, so the output is signal rather than noise.
  • Used Claude Code and GitHub Copilot to navigate the unfamiliar Go codebase. Demonstrates AI-assisted development on a real open-source project rather than greenfield code.
April 2026
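The drift-detection and categorization ideas in the driftctl bullets can be illustrated with a short Python sketch (the fork itself is Go, so this is not its code). It assumes Terraform's machine-readable plan output from `terraform show -json <planfile>`, whose top-level `resource_drift` array lists changes made outside Terraform, and inventory records shaped roughly like AWS Config advanced-query results (`resourceId`, optional `resourceName`, `tags` as key/value objects). The `aws:cloudformation:stack-name` tag is the one CloudFormation applies to stack resources; the default-resource name prefixes are hypothetical.

```python
import json

# Tag CloudFormation applies to the resources it creates.
CFN_STACK_TAG = "aws:cloudformation:stack-name"

# Hypothetical filter: names treated as AWS defaults or built-in service roles.
DEFAULT_PREFIXES = ("AWSServiceRole", "default")


def drifted_attributes(plan_json: str) -> dict[str, list[str]]:
    """Map each drifted Terraform address to its changed top-level attributes."""
    plan = json.loads(plan_json)
    drift = {}
    for rc in plan.get("resource_drift", []):
        before = rc["change"].get("before") or {}
        after = rc["change"].get("after") or {}
        keys = before.keys() | after.keys()
        drift[rc["address"]] = sorted(k for k in keys if before.get(k) != after.get(k))
    return drift


def categorize(resource: dict, terraform_ids: set[str]) -> str:
    """Bucket one inventory record so real unmanaged resources stand out."""
    tags = {t["key"]: t["value"] for t in resource.get("tags", [])}
    name = resource.get("resourceName") or resource["resourceId"]
    if resource["resourceId"] in terraform_ids:
        return "terraform"
    if CFN_STACK_TAG in tags:
        return "cloudformation"
    if name.startswith(DEFAULT_PREFIXES):
        return "default"
    return "unmanaged"
```

Diffing only top-level attributes keeps the sketch short; a real tool would recurse into nested blocks such as individual security group rules. The `terraform_ids` set would come from Terraform state, which is also an assumption of this sketch.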

Elastic Cloud on Kubernetes on AWS EKS

  • Built a full Kubernetes-hosted Elasticsearch stack on AWS EKS, deployed via Terragrunt with Spot instances across two availability zones for 60-80% compute savings. Service accounts get fine-grained AWS permissions through workload identity, so the cluster runs with zero static credentials.
  • ArgoCD manages the entire stack through an App of Apps pattern across 7 sync waves. A shared Kustomize base/overlay structure handles the differences between a local Rancher Desktop cluster and AWS EKS without duplicating manifests. Automated self-heal keeps the cluster in sync with Git.
  • Full Elastic stack (Elasticsearch, Kibana, Fleet Server, Elastic Agent) deployed through the ECK Operator. Elastic Agent runs as a DaemonSet collecting system and Kubernetes metrics from every node.
  • Single sign-on through Keycloak across Elasticsearch, Kibana, ArgoCD, and Kiali. Solved an EKS networking quirk by routing public and internal sign-on calls over separate paths (public load balancer URL for the front channel, internal Kubernetes DNS for the back channel).
  • Service mesh (Istio + Kiali) for traffic observability between services. Network policies across 3 namespaces implement default-deny with explicit allow rules, so traffic between services has to be opted in rather than implied.
  • AWS load balancer with managed TLS certificates and restricted access. External DNS creates DNS records automatically from Ingress annotations, and cert-manager issues internal TLS certs for components that never leave the cluster.
February 2026

AI Job Search Automation with Orchestrator-Agent Pattern

  • Built an AI-assisted job search tool using Claude. An orchestrator coordinates isolated AI subagents so each one starts with clean context instead of drowning in accumulated job description text. The AI equivalent of breaking a long script into focused functions.
  • Searches Greenhouse, Lever, Ashby, Workday, and Hiring Cafe. A scoring framework ranks postings 0-10 and filters out disqualifying ones (wrong stack, senior-only titles, non-remote requirements) before they ever hit my queue.
  • Generates tailored resumes and cover letters from a single source CV. Strict rules prevent the AI from inventing experience that isn't actually there. Output is a validated DOCX, not free-form text.
  • Config is modular: YAML for inclusions/exclusions, markdown for scoring rules, CSV for company monitoring targets. Easy to swap out for different roles or tech stacks.
February 2026
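The orchestrator-agent pattern and 0-10 scoring described above can be sketched in plain Python. This is a hypothetical illustration: `run_subagent` stands in for a real model call (for example through the Anthropic Python SDK), and the disqualifier keywords and score cutoff are invented for the example. The property being shown is isolation: each subagent call receives a prompt built from one posting only, never the accumulated text of earlier ones.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical disqualifiers; the real rules live in YAML/markdown config.
DISQUALIFIERS = ("on-site only", "principal", "staff engineer")
MIN_SCORE = 7  # hypothetical cutoff on the 0-10 scale


@dataclass
class Posting:
    company: str
    title: str
    description: str


def disqualified(p: Posting) -> bool:
    text = f"{p.title} {p.description}".lower()
    return any(term in text for term in DISQUALIFIERS)


def orchestrate(postings: list[Posting],
                run_subagent: Callable[[str], int]) -> list[tuple[int, Posting]]:
    """Score each posting in a fresh, isolated call; return keepers, best first."""
    ranked = []
    for p in postings:
        if disqualified(p):  # cheap keyword filter before spending any tokens
            continue
        # The prompt contains ONLY this posting, so no context accumulates.
        prompt = f"Score 0-10 for fit.\nTitle: {p.title}\n\n{p.description}"
        score = max(0, min(10, run_subagent(prompt)))  # clamp to the scale
        if score >= MIN_SCORE:
            ranked.append((score, p))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked
```

Running the cheap keyword filter in the orchestrator, before any subagent is invoked, is what keeps disqualified postings from costing tokens at all.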

Multi-Distribution Container Build Pipeline

  • GitHub Actions pipeline that builds 20 container images in parallel across 4 Linux distributions (Alpine, Red Hat UBI9, Ubuntu, Amazon Linux 2023) and 5 DevOps tools (Ansible, Terraform, Packer, Python, Go).
  • Every build is scanned for known vulnerabilities, secret leaks, and misconfigurations, and produces a software bill of materials so you can trace what's actually inside each image.
  • All images run as non-root by default. Shared package requirements files keep package lists from being duplicated across 20 Dockerfiles.
  • Local builds work on Apple Silicon (ARM64) so the develop-test-scan loop doesn't depend on CI.
February 2026

AWS Cloud Resume Challenge

  • Static resume site on AWS: S3 for hosting, CloudFront for HTTPS, and Route 53 for DNS. The page you're reading right now.
  • Visitor counter built with API Gateway, Lambda (Python), and DynamoDB. The Lambda uses atomic database updates so the count stays accurate even with concurrent visitors, and the API has rate limiting so a runaway crawler can't run up a bill.
  • GitHub Actions pipeline pushes content changes to S3 and invalidates the CloudFront cache on every push to main. Originally used long-lived AWS keys; migrated to GitHub's OIDC federation in 2025 so no AWS credentials are stored in GitHub anymore.
  • Whole infrastructure migrated from CloudFormation (via AWS SAM) to Terraform in 2025 without downtime by importing existing resources into Terraform state and replacing the counter stack with new Terraform-managed resources.
March 2022
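The visitor counter's atomic updates map naturally onto a DynamoDB `UpdateItem` call with an `ADD` expression, which DynamoDB applies server-side, so concurrent requests never overwrite each other the way a read-modify-write would. A minimal sketch of such a handler, assuming a hypothetical table named `visitor-count` with a string partition key `id` (not the site's actual names); the table object is injectable so the logic can be exercised without AWS credentials:

```python
import json
import os

# Hypothetical names; the real table and key may differ.
TABLE_NAME = os.environ.get("TABLE_NAME", "visitor-count")
COUNTER_KEY = {"id": "resume"}


def increment(table) -> int:
    """Atomically add 1 to the counter and return the new value.

    ADD creates the item/attribute (starting from 0) if it does not exist.
    The #v alias sidesteps any clash with DynamoDB reserved words.
    """
    resp = table.update_item(
        Key=COUNTER_KEY,
        UpdateExpression="ADD #v :inc",
        ExpressionAttributeNames={"#v": "visits"},
        ExpressionAttributeValues={":inc": 1},
        ReturnValues="UPDATED_NEW",
    )
    # boto3's resource API returns numbers as Decimal; convert for JSON.
    return int(resp["Attributes"]["visits"])


def lambda_handler(event, context, table=None):
    if table is None:
        import boto3  # imported lazily so the logic is testable without boto3
        table = boto3.resource("dynamodb").Table(TABLE_NAME)
    count = increment(table)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"count": count}),
    }
```

Rate limiting lives in API Gateway throttling settings rather than in the function, so the handler stays a single write.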

Skills

Operating Systems & Environments

Cloud & DevOps
  • AWS Services: EC2, S3, VPC, DynamoDB, RDS, Route 53, IAM, SSM, EKS, KMS, ASG, ELB, Kinesis Firehose, SQS, Transit Gateway, CloudFormation, Lambda, CloudWatch
  • Azure Services: VPN Gateway, Site-to-Site VPN, Entra ID, Virtual Networks, Resource Manager, Azure Monitor, Azure Policy, Azure Arc
  • Multi-Cloud Integration: Cross-cloud connectivity (AWS-to-Azure VPN), hybrid architecture
  • CI/CD Platforms: GitLab CI, GitHub Actions, Git workflows, ArgoCD for GitOps
  • Container Orchestration: Docker, Kubernetes, AWS EKS, workload identity (IRSA) for service account permissions, External DNS, AWS Load Balancer Controller, Helm package management
  • EKS Components: VPC-CNI networking, CoreDNS, kube-proxy, OIDC provider integration, multi-namespace isolation, EKS add-ons management
  • Persistent Storage: EFS integration with Kubernetes, multi-AZ mount targets, CSI drivers, StatefulSets
  • IaC Tools: Terraform, Terragrunt, AWS CloudFormation, Packer for image automation
  • Configuration Management: Ansible playbooks, roles, and collections for enterprise deployments
  • Infrastructure Testing: Ansible Molecule for container-based testing, automated validation pipelines
  • Automation Frameworks: PowerShell DSC, Selenium for web automation, Python scripting
  • Monitoring & Observability: Splunk HA Cluster, Elasticsearch, Prometheus, Grafana, CloudWatch, RDS Performance Insights

Infrastructure
  • Windows Core Services: Active Directory, GPO, DNS, PKI, PowerShell, Exchange DAG, WSUS, MDT/WDS, BITS
  • Windows High Availability: Windows Failover Clustering, File Server Clusters
  • Virtualization Platforms: vCenter, vSphere, VxRail, vSAN, Nutanix
  • VMware VDI: Horizon View, Unified Access Gateway (UAG), App Volumes
  • VMware Monitoring: vRealize Operations Manager, vRealize Log Insight
  • VMware Automation: PowerCLI
  • Linux: Bash, cron, LVM, systemd
  • Server Management: WinRM, SNMP, iDRAC/iLO
  • Email Security: SPF, DKIM, DMARC

AI-Assisted Development
  • AI Tools & Integration: Claude API, Claude Code CLI, GitHub Copilot, Anthropic Python SDK
  • Model Context Protocol (MCP): Firecrawl MCP for web scraping, Chrome DevTools MCP for browser automation, built custom MCP servers
  • AI Architecture Patterns: Orchestrator-agent pattern for breaking up token-heavy workloads, so each subagent gets clean context instead of accumulating noise
  • Practical Usage: AI pair programming for Terraform and Ansible, writing structured AI instructions for consistent and parseable output, building repeatable AI workflows with the same rigor as infrastructure code (version-controlled configs, modular design, testable components)

Certifications


Education

Kapiolani Community College

Associate of Applied Science
Information Technology
December 2017