Work
I build fault-tolerant, self-healing platforms that are easy to learn, use, deploy, operate and contribute to.
- Build a Kubernetes environment to replace existing AWS ECS clusters.
- Design and implement an API gateway based on Envoy/Gloo.
- Create a developer-experience-first pipeline in ArgoCD to simplify end-to-end delivery.
Site Reliability Engineer
Oct 2022 - Oct 2024
I build fault-tolerant, self-healing platforms that are easy to learn, use, deploy, operate and contribute to.
- Take ownership of the core infrastructure and platform that Domain runs on.
- Design and build a Kubernetes-based platform and workflow to support business requirements.
- Create building blocks that shift security (Orca, CrowdStrike), quality (SonarQube) and observability (ELK) left into the platform and pipeline.
- Deliver capability uplift to product teams, then establish sufficient documentation for teams to self-serve.
- Uplift CI/CD and developer experience across Jenkins, GitHub Actions and ArgoCD, and migrate teams to a GitOps workflow.
Site Reliability Engineer
May 2021 - Oct 2022
I optimized, improved and maintained hybrid Heroku/AWS application environments.
- Developed subject-matter expertise in Cloudflare by understanding stakeholder needs and views.
- Containerized continuous delivery pipelines for a suite of Ruby and Node.js applications.
- Helped the team build out all infrastructure as managed code using Terraform.
- Collaborated with peers on the development of new tools and services used by all engineers.
- Integrated and migrated monitoring solutions into meaningful observability measurements to support stakeholder decision-making.
- Added custom instrumentation to application code to improve observability.
I led the project to containerize the EC2-based environment for better availability.
- Built new NGINX (EC2), PHP/Laravel (EC2) and MySQL (RDS) systems in AWS, from code repository to fully automated test/staging/prod environments.
- Implemented a Packer pipeline to build immutable AMIs and update Auto Scaling groups, using CodeDeploy to provide blue/green deployments.
- Rebuilt the build/deploy pipeline, reducing mean time to deliver from 2 hours to 10 minutes.
- Migrated CloudWatch Logs to SumoLogic and built SumoLogic log- and metrics-based dashboards to provide operational and business visibility.
- Worked with the security team to obtain SOC 2 Type 2/ISO 27001.
- Reduced AWS costs by 10% in 6 months and improved efficiency.
I built new .NET-based systems in AWS as infrastructure as code, deployed through a CI/CD pipeline using Azure DevOps, Octopus Deploy, Atlantis and Terraform.
- Provided end-to-end performance visualization and real-time identification of security issues in production via SumoLogic, Site24x7, PRTG and CloudWatch, reducing response time from 45 minutes to 15 minutes.
- Implemented AWS automation that cut the production update window by 50% and reduced human dependencies by 90%.
- Converted existing manual deployments to reproducible idempotent processes in the CD pipeline.
- Performed vulnerability scans against internal systems at different levels using Cloud Conformity, Rapid7 and Amazon Inspector, and remediated the reported findings.
- Planned and managed the migration of the on-premises Atlassian stack to Atlassian Cloud.
- Connected Okta SAML/SSO with all internal/external services.
TP IT Solutions PTY LTD
Technical Support Engineer
Apr 2016 - May 2019
I supported day-to-day operations of the company and all the customers by proactively identifying problems and analyzing future needs.
- Performed all systems administration duties, including configuration and maintenance of business services on Windows and Linux, in both physical and VMware environments.
- Automated provisioning of enterprise infrastructure and application environments through GitLab, Terraform, Ansible and PowerShell across multi-stack instances.
- Managed and validated auto-restored backups with PowerShell scripting.
- Designed and implemented highly available AWS solutions for clients using EC2 Auto Scaling, ELB, Amazon RDS and S3.
Sichuan S.Y.S Software System Co., Ltd
Linux Engineer
Nov 2013 - Feb 2016
I managed over 1200 RHEL virtual machines on a VMware vSphere infrastructure of more than 250 servers across three data centres.
- Served as dedicated project manager for the Linux automation platform deployment, leading a team of five technical staff.
- Provided Level 2 and Level 3 support for business-critical systems.
- Planned a disaster recovery strategy based on a three-data-centre standard.
- Analyzed multi-dimensional system issues, including performance tuning, kernel panic troubleshooting, application troubleshooting, resource allocation and system security enhancement.
I was responsible for the management, strategy and execution of IT infrastructure at the Talend Beijing office.
- Administered over 300 mixed CentOS/Ubuntu virtual machines on a VMware vSphere infrastructure of more than 20 servers.
- Participated in testing and operating the OTRS IT management system, improving SLAs for internal staff through better support and tracking.
- Experienced in agile development and configuration management technologies such as Kickstart, Cobbler and Ansible.
- Managed and maintained VMware vSphere clusters.
Chongqing Dongxiang Medical Apparatus and Instruments Co., Ltd
IT Engineer
Aug 2008 - Apr 2009
As the sole engineer, I managed all Linux servers and the IT environment.