The Observability & Monitoring Lead is responsible for overseeing monitoring platforms, analyzing system performance trends, and ensuring the reliability and availability of enterprise systems. This role focuses on improving monitoring strategies, optimizing alerting mechanisms, and collaborating with engineering, SRE, and infrastructure teams to proactively detect and resolve system issues. The position also supports automation, AI-driven observability initiatives, and operational excellence across monitoring processes and tools.

Working Hours: 3:00 PM to 1:00 AM IST (until 1:30 PM CST)

Primary Responsibilities

Trend Analysis & Problem Identification:

  • Identify recurring incident patterns, anomalies, and signs of alert fatigue that may indicate deeper systemic issues.
  • Collaborate with L2/L3 teams to review telemetry data and recommend improvements to alert thresholds, rules, and policies.
  • Provide insights that support proactive issue prevention, noise reduction, and overall monitoring refinement.

Platform Management & Optimization:

  • Develop, update, and maintain dashboards that reflect real-time system health, performance metrics, and service behavior.
  • Support the ongoing adoption and optimization of Dynatrace, enhancing dashboarding and visualization capabilities for cloud and on-prem observability.
  • Assist in routine platform checks, ensuring monitoring tools remain accurate, stable, and aligned with business and operational requirements.

Leadership & Collaboration:

  • Organize the team's work, including planning, task breakdown, and clear prioritization.
  • Provide structured, timely updates to leadership on progress, risks, blockers, team capacity, and delivery timelines.
  • Work closely with application teams, SRE groups, and infrastructure operations during incident triage, investigations, and routine monitoring reviews.
  • Ensure clear, timely, and effective communication with stakeholders during service-impacting events, providing status updates and context as needed.
  • Ensure adherence to engineering best practices, drive operational excellence, and maintain accountability for team delivery outcomes.

Operational Excellence:

  • Support platform stability and availability through adherence to lifecycle maintenance, patching schedules, and vulnerability management processes.
  • Contribute to the improvement of monitoring workflows, alert routing logic, runbook effectiveness, and incident management practices.

Innovation & AI Enablement:

  • Assist in exploring and adopting AI-driven capabilities that improve observability, automate root-cause identification, and reduce manual effort.
  • Contribute to internal knowledge sharing by documenting best practices, playbooks, AI reference materials, and usage guidelines (e.g., Copilot tips).

Collaboration & Leadership Support:

  • Partner with cross-functional teams to align monitoring practices with evolving business needs and operational priorities.
  • Drive end-to-end delivery of monitoring initiatives, from requirements gathering and planning through execution oversight and delivery validation.
  • Coordinate cross-team dependencies, ensure timelines are met, and proactively remove blockers for the team.
  • Provide subject-matter support for ITSM processes including incident, problem, and change management discussions.

Required Qualifications

  • Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent experience.
  • 6+ years in Site Reliability Engineering or Observability/Monitoring engineering roles.
  • 5+ years of hands-on experience with monitoring/observability tools: New Relic, SolarWinds, WhatsUp Gold (WUG).
  • 4+ years of scripting experience (JavaScript, Java, PowerShell, or others).
  • 2+ years with Azure (architecture fundamentals, observability in cloud-native and lift-and-shift contexts).
  • 4+ years scripting with Python and Bash or PowerShell for automation.
  • Experience troubleshooting complex distributed applications, leading/participating in war rooms, and performing code-level impact analysis (reading logs/stack traces and correlating with deployments and infrastructure changes).
  • Solid understanding of observability best practices (metrics, logs, traces), ITSM processes, and alert hygiene.
  • An “automate any task” mindset.
  • Maintain associated documentation for audit and certification requirements.
  • Ensure platform stability, availability, and compliance through proactive vulnerability management and lifecycle maintenance.
  • Drive process improvements for monitoring workflows and incident management.
  • Participate in troubleshooting, capacity planning, and performance analysis activities.
  • Research new monitoring requirements and write code when required.
  • Expertise in setting up monitoring policies, rules, and templates, and in writing scripts to meet monitoring requirements.
  • Excellent problem-solving, communication, and cross-team collaboration skills.

Core Competencies

  • Technical Expertise
  • Analytical Thinking
  • Innovation & Continuous Improvement
  • Collaboration & Influence
  • Customer Focus