Job Description
As a member of this team, your responsibilities will include monitoring our platforms’ availability and performance and performing deep triage of incidents.
You will also lead initiatives and evangelize site reliability best practices that improve our availability while ensuring optimum delivery. You will contribute to Schwab’s journey to mature our Continuous Delivery processes by establishing automated processes, streamlining build and deploy methodology, and improving our software development practices.
The ideal candidate will have 5+ years of experience on a site reliability, DevOps, and production support team. This position requires strong customer service skills to establish and enhance positive relationships with peers and business partners.
- Must demonstrate strong problem-solving skills and the willingness to learn new tools/technologies.
- Current or previous experience as a Schwab employee or contractor is a firm must-have.
- Strong expertise in triaging product or system issues and in debugging, tracking, and resolving them by analyzing their sources and their impact on hardware, network, or service operations and quality.
- Strong experience leading production releases and deployments across environments and cloud platforms.
- Solid understanding of CI/CD best practices using tools like Bamboo, Jenkins, GitLab, Harness, and Nexus Repository.
- Work with Build/Run teams to review current practices, recommend Continuous Delivery best practices to adopt, and help enhance build, deploy, configuration management, and release engineering activities.
- Develop and maintain configuration and release scripts. Provide release planning services and installation script development for new applications where necessary.
- Develop/improve and deploy source code branching methodologies and associated automation.
- Hands-on experience developing scripts to automate builds and application deployments using one or more scripting languages such as PowerShell and Bash.
- Experience using application and synthetic monitoring tools such as Splunk, AppDynamics, or Dynatrace, and creating monitoring dashboards.
- Experience working with cloud computing solutions on platforms such as PCF, AWS, OpenShift, or GCP.
- Hands-on experience with release and incident management using tools like Remedy, ServiceNow, or Jira.
- Develop and improve release processes and update release standards documentation.
- Peer review change tickets to ensure accuracy of application change activities and mitigate risk.
- Basic understanding of networking and security best practices.
- Hands-on experience in shell scripting and in maintaining applications/services on Linux.
- Basic understanding of databases such as MongoDB, Oracle, and SQL.
- Excellent communication skills and a strong teamwork ethic; able to lead the team when required, take accountability for a portfolio of applications, and follow issues through from start to finish.
- Strong interpersonal, analytical, problem-solving, and organizational skills. Ability to work independently as a contributing member of a fast-moving, focused team. Good verbal and written communication skills. Ability to thrive in a flexible and fast-paced environment across multiple time zones and locations.
- Solid understanding of Agile methodologies and tools.
- Actively participate in mentoring junior team members.
- Bachelor's degree in Computer Science or related discipline.
Key Requirements:
- DevOps – Create CI/CD frameworks, manage version control and build automation, deploy applications, and monitor performance, using tools such as Bamboo, Jenkins, GitLab, Harness, and Nexus.
- Alerting and monitoring – Use synthetic monitoring tools like Splunk, ELK, IBM Tivoli, AppDynamics, or Dynatrace, preferably Splunk.
- Release management – Plan, schedule, test, deploy, and control software releases through Jira or other release management tools like Remedy.
- Automation – Develop scripts that reduce repetitive or manual steps, using one or more scripting languages such as Bash/shell.
- Triage production issues, perform impact analysis, communicate with stakeholders, and create action items to ensure platform stability. Address Jira service tickets in a timely manner.