Site Reliability Engineer
LotusFlare is a provider of cloud-native SaaS products based in the heart of Silicon Valley. Founded by the team that helped Facebook reach over one billion users, LotusFlare set out to make affordable mobile communications available to everyone on Earth.
Today, LotusFlare focuses on designing, building, and continuously evolving a digital commerce and monetization platform that delivers valuable outcomes for enterprises. Our platform, Digital Network Operator® (DNO™) Cloud, is licensed to telecommunications service providers and supports millions of customers globally.
LotusFlare has also designed and built the leading eSIM travel product, Nomad. Nomad provides global travelers with high-speed, affordable data connectivity in over 190 countries. Nomad is available as an iOS or Android app or via getnomad.app.
Job Description and Responsibilities:
We are looking for a Site Reliability Engineer to serve as the diagnostic backbone of our engineering organization. As platform traffic continues to scale rapidly, this role will focus on building and evolving a robust Observability Pipeline that goes far beyond basic monitoring. You will work at the intersection of infrastructure and application code, ensuring data-driven clarity during system degradation and incidents.
Key Responsibilities:
Operational Empathy & Developer Enablement
Partner closely with feature and product engineers to embed observability into the development lifecycle. Translate complex logs and telemetry into clear, actionable Grafana dashboards that help teams understand system behavior and blast radius.
Full-Stack Observability & Forensics
Lead distributed tracing initiatives by correlating frontend exceptions (Sentry) with backend logs and traces (VictoriaLogs/OpenSearch), enabling a seamless end-to-end (“north-to-south”) view of system health.
Telemetry Gap Identification & Instrumentation
Proactively identify blind spots in logging, metrics, and traces. Implement custom instrumentation across service layers to capture high-cardinality data while maintaining system performance.
Incident Response Automation
Design and build Python-based automation tools to reduce “Time to Truth” during incidents by automating log aggregation, telemetry correlation, and diagnostic reporting.
Service Level Ownership
Advocate for meaningful reliability metrics by defining and refining SLIs and SLOs that truly reflect user experience and satisfaction, balancing rapid feature delivery with long-term system stability.
Job Requirements:
Technical Skills & Experience:
Expert-level experience with OpenSearch and VictoriaLogs, including indexing strategies and optimization for high-volume log ingestion and querying.
Strong hands-on expertise with Grafana, building intuitive dashboards that clearly communicate system behavior and incident patterns.
Python: Advanced proficiency for building automation scripts, diagnostic tooling, and observability “glue code.”
TypeScript & Lua: Working familiarity required. You should be comfortable reading and understanding service codebases to trace request flows end-to-end (deep expertise not required on day one).
Experience with Sentry, including performance monitoring and profiling capabilities to proactively identify regressions and bottlenecks.
Additional Expectations:
Strong analytical and problem-solving mindset with a passion for system reliability and visibility
Ability to collaborate effectively with application engineers, platform teams, and incident responders
Clear communication skills to translate complex system data into actionable insights for diverse stakeholders
Benefits we have for you:
- Competitive salary package
- Paid lunch (in the office)
- Training and workshops
- Top-of-the-class engineers to learn from and work with
About us:
At LotusFlare, we attract and keep amazing people by offering two key things:
- Purposeful Work: Every team member sees how their efforts make a tangible, positive difference for our customers and partners.
- Growth Opportunities: We provide the chance to develop professionally while mastering cutting-edge practices in cloud-native enterprise software.
From the beginning, our mission has been to simplify technology to create better experiences for customers. Using an “experience down” approach, which prioritizes the customer's journey at every stage of development, our Digital Network Operator™ Cloud empowers communication service providers to achieve valuable business outcomes. DNO Cloud enables communication service providers to innovate freely, reduce operational costs, monetize network assets, engage customers on all digital channels, drive customer acquisition, and increase retention.
With headquarters in Santa Clara, California, and five major offices worldwide, LotusFlare serves Deutsche Telekom, T-Mobile, A1, Globe Telecom, Liberty Latin America, Singtel, and other leading enterprises around the world.
Website: www.lotusflare.com
LinkedIn: https://www.linkedin.com/company/lotusflare
Instagram: https://www.instagram.com/lifeatlotusflare/
X: https://twitter.com/lotus_flare
- Department: Server Engineering
- Role: Site Reliability Architect
- Locations: Pune, India