Changing the world through digital experiences is what Adobe’s all about. We give everyone from emerging artists to global brands everything they need to design and deliver exceptional digital experiences! We’re passionate about empowering people to create beautiful and powerful images, videos, and apps, and transform how companies interact with customers across every screen.
We’re on a mission to hire the very best and are committed to creating exceptional employee experiences where everyone is respected and has access to equal opportunity.
We realize that new ideas can come from everywhere in the organization, and we know the next big idea could be yours!
The Challenge
Are you comfortable with dev, comfortable with ops, and looking for a job that doesn’t have DevOps in the title? Do you have an intimate understanding of the operational challenges of running services at scale, and are you also committed to overcoming those challenges with software instead of manpower?
Adobe is looking for a senior-level Site Reliability Engineer (SRE) who knows how to balance going fast and going big with operating safely.
Our mission is to progress, protect, and provide for the software and systems behind all of Marketo, an Adobe company, with an ever-watchful eye on system availability, latency, performance, and capacity.
What you’ll do
Proactively engage with product and engineering to drive and improve the whole lifecycle of operational readiness, from inception and design through deployment, operation, and refinement.
Write software layers, scripts, deployment frameworks, tracers, monitors, and self-healing / auto-remediation tools, and automate processes.
Build and maintain software modules for use and re-use in cloud and on-premises systems automation.
Maintain business continuity by identifying and driving opportunities to design highly resilient, human-free systems.
Assist our software engineering team in ensuring that accurate monitoring and metrics are built into applications before they go to production.
Maintain up-to-date documentation on deployments, processes, and standard operating procedures / run-books.
When complex issues arise despite the self-healing and automation you have built, get involved in troubleshooting and root-cause analysis across the stack: hardware, software, database, network, and so on.
Participate in a shared on-call schedule, run on a follow-the-sun model and managed across SRE and Engineering.
What you need to succeed
Bachelor's degree in CS, IS or similar; graduate degree a plus.
5+ years of experience designing for and dealing with a large production environment.
Experience developing, running, and / or consuming cloud technologies such as AWS, Azure, and Google Cloud Platform, and related tooling: Terraform, configuration management, etc.
Recent large-scale experience developing, running, and / or consuming on-premises platforms and related tooling: VMware, Ansible, Chef or Puppet, configuration management, etc.
Programming (Python and Bash are our preferred scripting / shell languages) and automation skills.
Problem-solving and systems engineering exposure in Linux production environments. Experience with Linux, Internet protocols, and large-scale operations.
Experience with CI/CD tooling: Jenkins, Spinnaker, GitLab runners, Azure DevOps, etc.
Experience with designing, deploying and maintaining monitoring solutions such as Splunk, Prometheus, Check MK, etc.
Great communication, interpersonal, and teamwork skills.
Ability to work independently and own problem statements end-to-end.
Experience with relational databases such as MySQL and Postgres, and document stores such as MongoDB.
Experience deploying applications in containers using Docker and Kubernetes.
Strong intuition about system design, robustness, and scalability.
Decent experience with Windows.