Senior Site Reliability Engineer

TEKsystems

Posted Tuesday, October 28, 2025

Posting ID: JP-005635268

Atlanta, GA

Duration: 3-month W-2 contract-to-hire

Location: 4 days onsite and 1 day remote; Raleigh or Charlotte, NC, or Atlanta, GA

Top Skills' Details

1. 7+ years of SRE experience. The hiring manager is more focused on SRE practice (being able to bring production knowledge to meet SLOs) and less focused on DevOps, which is more of a nice-to-have.

2. Solid scripting knowledge and experience in any of the following: Python, Go, Bash, JavaScript, or Shell (candidates do not need to be proficient in all of these, just very knowledgeable in at least one).

3. Main tech stack: Dynatrace, Datadog, ELK. Ansible experience is a nice-to-have, as they are starting to use it.

4. SRE certifications are required, as is a bachelor's degree (see the education requirement in the job description).

*** Fintech experience is a very nice to have ***

Description

This customer is currently establishing an SRE practice/platform. The goal is to build solid observability for the platform and evaluate the tool stack they plan to implement. They have a code team in place now and are looking to augment that team with a staff engineer. They need someone with industry knowledge in SRE: working with vendors, hands-on technical experience, providing technical guidance to more junior team members, and helping lead migrations from one tool to another.

Key Functions/Duties of Position:

• Define and track reliability and observability OKRs, including Service Level Objectives (SLOs) and Service Level Indicators (SLIs).

• Implement robust monitoring and alerting systems to proactively monitor health, identify potential issues, analyze system performance, and facilitate quick response to incidents.

• Implement AIOps functionality to enable auto-response, self-healing, and anomaly trend analysis.

• Drive the development and implementation of automation solutions to remove “toil”, streamline processes, reduce manual interventions, and enhance the overall efficiency of the product engineering and SRE teams.

• Identify and address performance bottlenecks in applications and infrastructure to improve efficiency and user experience.

• Work closely with incident management to quickly address and resolve system outages or performance issues to minimize downtime and impact on users.

• Collaborate actively with development and operations teams to implement observability and resiliency requirements in order to ensure smooth deployment and operation of software systems.

• Lead the coordination with product, development, infrastructure, and architecture teams to conduct capacity planning, ensuring that systems can handle current and future demand; anticipate growth and scalability requirements.

• Improve reliability by identifying and addressing gaps in our architecture, services, and tooling.

• Modernize the disaster recovery program for both on-premises and cloud-based Berkley solutions.

• Provide technical leadership and mentorship to other engineers, fostering a culture of learning and continuous improvement.

Education Requirement

• Bachelor's degree in Computer Science, Information Technology, or a related field (or a combination of education and equivalent experience).

Technical ability
- Long history of triaging complex issues on bridge calls, not just recent experience
- Experience profiling Java/.NET code
- Ability to understand basic networking, including the ability to decipher packet captures
- Linux/AIX exposure
- Exposure to enterprise-level technology stacks
- Experience training others in some capacity
- Experience using observability platforms/tools such as Datadog, Splunk, and Dynatrace

The individuals we are seeking will make an impact within their company and will be irreplaceable. We are looking for enterprise leaders, not just cogs in the wheel.

Enterprise Technology Exposure: Brush up on AIX and RHEL/Linux environments. Even basic familiarity can demonstrate initiative.
Java Proficiency: Since Java is central to the role, consider reviewing Java fundamentals and how they relate to application performance and troubleshooting.
SQL and OpenShift: Prepare to speak to your experience or learning efforts with SQL queries and OpenShift container orchestration.
Networking Fundamentals: Review core networking concepts such as TCP/IP, DNS, firewalls, and packet analysis. Be ready to interpret packet captures.
Domain Expertise: Identify and articulate a technical area where you have depth—whether it's observability, infrastructure automation, or container orchestration.
Troubleshooting Autonomy: Practice walking through complex issues independently. Be ready to demonstrate how you would lead or contribute meaningfully on bridge calls.

Additional Recommendations:

Application Performance Monitoring: Be ready to discuss your experience with tools like AppDynamics, Datadog, or similar platforms.
Bridge Call Experience: Share examples of how you've handled high-pressure troubleshooting scenarios, especially in production environments.
Code Profiling: If applicable, mention any experience profiling Java or .NET applications to identify performance bottlenecks.
Training and Mentorship: Highlight any experience training others or contributing to team knowledge-sharing.
Enterprise Impact: Position yourself as someone who drives change and adds unique value—not just a contributor, but a leader.
Career Stability: If applicable, emphasize your consistent employment history and resilience in dynamic environments.

Experience Level

Expert Level

Compensation: $85

Contact Information

Recruiter: Justin Atkins

Phone: (770) 522-1133

Email: juatkins@teksystems.com

The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.

Hybrid

Operations

Coordinating

Information Technology

Automation

Mentorship

New Product Development

.NET Framework

Python (Programming Language)

Java (Programming Language)

SQL (Programming Language)

Scalability

Scripting

Computer Science

Resilience

Linux

Troubleshooting (Problem Solving)

DevOps

Bash (Scripting Language)

Datadog

Observability

Ansible

JavaScript (Programming Language)

Firewall

Service Level

Tooling

Software Systems

Incident Management

IT Capacity Management

Technical Leadership

IBM AIX

Financial Technology (FinTech)

OpenShift

Product Engineering

TCP/IP

Trend Analysis

Application Performance Management

Disaster Recovery

Red Hat Enterprise Linux

Go (Programming Language)

Application Monitoring

Packet Analyzer

Appdynamics

Infrastructure Automation

Service Level Objectives

Dynatrace

AIOps (Artificial Intelligence For IT Operations)
