A global cybersecurity company is looking for a Site Reliability Engineer!
Ensure cloud platform and services are designed to support and align with our overall business strategies and priorities. Activities will include building out service capabilities to match SLA requirements, capacity modeling for scale and cost, and security management.
Monitor, manage, and operate our cloud service. Scale the service with the required monitoring and alerting capabilities, and develop incident management, security, and compliance processes.
Work closely with R&D to make sure new features are reliable, easily deployable, and support the requirements of the service in terms of scale and security.
Establish a regular operational feedback cycle into our engineering teams.
Manage the Service Operations team to operate with a culture of business and customer centricity by maintaining the Varonis SLA for each service, including incident response, problem management, and service upgrades.
Develop and drive, as the primary owner, the communication strategy for internal and external stakeholders (including customers) to convey service health, tracking against SLAs, current and historical incidents, and upcoming events or upgrades.
Ensure all technical procedures are documented, reviewed, and updated, and actively contribute to the maintenance of operational standards and policies.
Collaborate with the Varonis Support team to understand and improve user experience, performance, incident response, and the serviceability of our offerings.
Collaborate with the internal R&D team to automate infrastructure services and system administration tasks wherever possible and implement a monitoring strategy to provide rapid feedback and diagnostics in the event of a service disruption.
Create relationships with other departments, including Marketing, Product Management, Engineering, and Customer Success, to make sure we provide services with high availability and superior performance for all our customers.
Provide technical leadership, coaching and mentoring for the team you build, fostering a culture of accountability, innovation and team building.
At least 4 years of relevant industry experience in maintaining a high availability production environment
At least 3 years of experience with service operations and extensive knowledge of cloud infrastructure planning and operations, design and deployment, as well as system life cycle management in supporting a SaaS infrastructure
Solid understanding of networking, VPCs, and monitoring and alerting frameworks and tools
Substantial experience in operating a high-availability cloud infrastructure
Experience with cloud platforms like Azure or AWS
Experience running distributed systems deployed across multiple geographies
Knowledge of security practices, tooling and automation
Experience with monitoring tools such as Datadog, New Relic, Grafana, and Prometheus
Experience with automation tools such as Ansible and Terraform
Advanced knowledge of at least one scripting language such as Python or PowerShell
Experience with CI/CD tools like Jenkins, Octopus or VSTS
Some experience with relational database systems and SQL