Monitoring & Observability Engineer

Maxar Technologies
Longmont, CO
Added: March 26, 2024
Maxar Intelligence is a provider of secure, precise, geospatial intelligence. We deliver disruptive value to government and commercial customers to help them monitor, understand and navigate our changing planet. Our unique approach combines decades of deep mission understanding and a proven commercial and defense foundation to deploy solutions and deliver insights with unrivaled speed, scale and cost effectiveness. We are hiring immediately for a Monitoring & Observability Engineer in our Longmont, CO office.

Life With Us:

There is a reason we receive awards like Best Place to Work, Best Employer, Top Workplace and Candidate Experience winner. Our strength is in our people. Each team member makes a unique contribution to our collective mission, and we recognize that with best-in-class offerings like:

Work-life balance to include flexible working opportunities and generous time off.

Family-first benefits like adoption reimbursement, pet insurance and mental health resources.

Educational perks such as student loan repayment, paid certifications and tuition reimbursement.

Career growth opportunities including an internal mobility program and leadership training and development.

Diversified healthcare and investment options including multiple medical plans and 401(k).

What you will do day-to-day with your colleagues:

We are looking for a full-time Monitoring & Observability Engineer (MOE) to gain deeper insights into complex systems and cloud-native environments. This role is part of our Mission, Access, Reliability and Support team, which ensures that Maxar's services have reliability and uptime appropriate to customers' needs and a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance.

The MOE brings the mindset and engineering approaches needed to understand “the what” and “the why.” They will build monitoring solutions and visibility into operational problems to achieve customer value and satisfaction. Their focus is to drive observability and monitoring for existing systems and to provide the system insight needed to resolve application and infrastructure issues. The successful candidate has the breadth of knowledge to discover solutions to complex problems across the entire technology stack.

Drive service acceptance by adopting new processes into operations, developing new monitoring to expose risks, and automating repeatable actions

Partner with service and product owners to establish KPIs to identify trends and achieve better outcomes

Provide deep troubleshooting for production issues

Build and/or use tools to correlate disparate data sets in an efficient and automated way to help teams quickly identify root-cause issues and to understand how different problems relate to each other

Engage with service owners to maximize a team’s ability to quickly identify and remediate the root causes of performance issues and to recover from service interruptions

Participate actively and critically in retrospectives for incidents that had broad impact and/or are leading indicators of potential site issues

Define standards for monitoring, reliability, and performance

Design and architect operational solutions for managing applications and infrastructure

Coordinate with Technical Services Operation Center (TSOC) to support Major Incidents, large-scale deployments, and SecOps user support

Minimum Qualifications:

U.S. citizenship, with the willingness and ability to obtain a SECRET clearance and pass a Counterintelligence Scope Polygraph

Ability to obtain a DoD Directive 8140.01 compliant certification within six months of date of hire (e.g. Security+ CE, SSCP, CASP+ CE, CISSP, CCNP Security)

Bachelor’s degree in Computer Engineering/Science or equivalent experience

Minimum of 5 years of software engineering or related experience

Advanced knowledge of Unix/Linux systems, with a high comfort level at the command line

Proficient with at least one programming language (e.g., Python, Ruby, Java)

Familiarity with infrastructure as code and the AWS cloud platform

A knack for troubleshooting tough problems, backed by a high level of ownership and curiosity

Ability and willingness to share on-call responsibilities

Preferred skills:

U.S. Citizenship with active DoD SECRET clearance and Counterintelligence Scope Polygraph

Current DoD Directive 8140.01 compliant certification (e.g. Security+ CE, SSCP, CASP+ CE, CISSP, CCNP Security)

Working knowledge of K8s, Docker, Helm and automated deployment via pipeline (e.g. Concourse or Jenkins)

Familiarity with distributed version control systems such as Git

Experience with Grafana or similar monitoring technologies

Experience with Root Cause Analysis (RCA)

Experience with Scaled Agile Framework

Willingness to step in as a leader to address ongoing incidents and problems and to provide guidance to others in order to drive to a resolution

Ability to prioritize work effectively and encourage best practices in others

Meticulous and cautious, with an ability to identify and consider all risks and balance them against performing the task efficiently

Understanding of Incident and Problem Management

Positive, flexible, and personable; adaptive to change

Good understanding of networking fundamentals

A "make it happen" attitude

Organized, with an ability to document and communicate ongoing work tasks and projects

Comfortable giving, receiving, and implementing feedback in a highly collaborative environment

Ability to learn rapidly in a fast-paced environment while being extremely curious about how things work

Maxar Technologies values diversity in the workplace and is an equal opportunity/affirmative action employer. All qualified applicants will receive consideration for employment without regard to sex, gender identity, sexual orientation, race, color, religion, national origin, disability, protected veteran status, age, or any other characteristic protected by law.