Ensure the cloud platform and services are designed to support and align with our overall business strategies and priorities. Activities will include building out service capabilities to match SLA requirements, capacity modeling for scale and cost, and security management.
Monitor, manage and operate our cloud service. Scale the service with the required monitoring and alerting capabilities, and develop incident management, security and compliance processes.
Manage the Service Operations team to operate with a culture of business and customer centricity by maintaining SLA for each service, including incident response, problem management, and service upgrades.
Develop and drive, as the primary owner, the communication strategy for internal and external stakeholders (including customers) to convey service health, tracking against SLAs, current and historical incidents, and upcoming events or upgrades.
Ensure all technical procedures are documented, reviewed and updated, and actively contribute to the maintenance of operational standards and policies.
Collaborate with the internal R&D team to automate infrastructure services and system administration tasks wherever possible and implement a monitoring strategy to provide rapid feedback and diagnostics in the event of a service disruption.
Build relationships with other departments to make sure we provide services with high availability and superior performance for all our customers.
Provide technical leadership, coaching and mentoring for the team you build, fostering a culture of accountability, innovation and team building.
At least 5 years of relevant industry experience in maintaining a high availability production environment
At least 3 years of experience with service operations and extensive knowledge of cloud infrastructure planning and operations, design and deployment, as well as system life cycle management in supporting a SaaS infrastructure
At least 3 years of team management experience in a cloud operations or customer support environment
Solid understanding of networking, VPCs, and monitoring and alerting frameworks and tools
Substantial experience in operating a high-availability cloud infrastructure
Experience with cloud platforms like Azure or AWS
Experience running distributed systems deployed across multiple geographies worldwide
Knowledge of security practices, tooling and automation
Experience with monitoring tools such as Datadog, New Relic, Grafana or Prometheus
Experience with automation tools such as Ansible or Terraform
Advanced knowledge of at least one scripting language such as Python or PowerShell
Experience with CI/CD tools like Jenkins, Octopus or VSTS
Some experience with relational database systems and SQL