🚀 Overview
Join TechX as we continue expanding our AI infrastructure team and delivering impactful GenAI-powered products for enterprise and industry clients.
We are looking for an experienced Platform Engineer to build and operate the core infrastructure that powers the safe, reliable, and efficient delivery of our GenAI solutions. This role is at the heart of how we scale AI applications in production environments — ensuring observability, automation, cost control, and compliance for our large language model (LLM) operations.
⚡ Note: This is not a prompt engineering or model tuning role. Instead, you will architect and manage the infrastructure that enables AI teams to operate Gemini Pro/Flash models at scale.
🔌 LLM Gateway & Abstraction
Design platform components that abstract LLM (e.g., Gemini) APIs into a consistent, testable, and production-ready interface.
Handle retries, latency tracking, fallback switching, and configuration routing logic.
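To give candidates a feel for this part of the role, here is a minimal sketch of retry-with-fallback routing. Everything in it is illustrative: the model names, the `ModelRoute` type, and the `call_fn` transport are assumptions, not our production stack.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelRoute:
    """One routable model target with its own retry budget (illustrative)."""
    name: str                 # e.g. "gemini-pro" -- a placeholder name
    max_retries: int = 3
    base_delay: float = 0.5   # seconds, for exponential backoff

def call_with_fallback(prompt: str,
                       routes: list[ModelRoute],
                       call_fn: Callable[[str, str], str]) -> str:
    """Try each route in order; retry transient failures with
    exponential backoff plus jitter before falling back."""
    last_error = None
    for route in routes:
        for attempt in range(route.max_retries):
            start = time.monotonic()
            try:
                response = call_fn(route.name, prompt)
                latency = time.monotonic() - start  # latency-tracking hook
                print(f"{route.name}: ok in {latency:.3f}s")
                return response
            except Exception as err:  # real code would catch specific API errors
                last_error = err
                time.sleep(route.base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError(f"all routes exhausted: {last_error!r}")

# Hypothetical usage: prefer a primary model, fall back to a cheaper one.
# routes = [ModelRoute("gemini-pro"), ModelRoute("gemini-flash", max_retries=2)]
# answer = call_with_fallback("Summarize ...", routes, call_fn=my_gemini_client)
```

The point of the abstraction is that application teams call one interface and never talk to a vendor SDK directly.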
🧬 Prompt & Configuration Versioning
Manage prompt and parameter versions across deployments.
Track version statuses (active, canary, deprecated), maintain changelogs, and ensure rollback safety.
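One plausible shape for such a registry is sketched below; the statuses match the list above, but the field names and rollback policy are illustrative assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    ACTIVE = "active"
    CANARY = "canary"
    DEPRECATED = "deprecated"

@dataclass
class PromptVersion:
    version: str        # e.g. "v42" (placeholder)
    template: str
    status: Status
    changelog: str = ""

@dataclass
class PromptRegistry:
    versions: dict[str, PromptVersion] = field(default_factory=dict)
    history: list[str] = field(default_factory=list)  # promotion order

    def promote(self, v: PromptVersion) -> None:
        """Make v active; the previous active version becomes the rollback target."""
        for existing in self.versions.values():
            if existing.status is Status.ACTIVE:
                existing.status = Status.DEPRECATED
        v.status = Status.ACTIVE
        self.versions[v.version] = v
        self.history.append(v.version)

    def rollback(self) -> PromptVersion:
        """Reactivate the previous version in promotion order ('last known good')."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        bad = self.history.pop()
        self.versions[bad].status = Status.DEPRECATED
        prev = self.versions[self.history[-1]]
        prev.status = Status.ACTIVE
        return prev
```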
📊 Observability & Cost Tracking
Define structured logs and metrics for Gemini interactions.
Monitor latency, feedback scores, token usage, and cost estimates.
Develop dashboards and alerts to catch performance regressions or anomalies.
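A hedged sketch of what one such structured record could look like; the field names are shorthand for this posting, and the per-1K-token rates are placeholders rather than real Gemini pricing.

```python
import json
import logging
import time

logger = logging.getLogger("llm.gateway")

# Placeholder per-1K-token rates; real cost figures come from billing data.
COST_PER_1K = {"prompt": 0.00125, "completion": 0.005}

def log_llm_call(model: str, prompt_version: str, latency_s: float,
                 prompt_tokens: int, completion_tokens: int,
                 feedback_score: float | None = None) -> dict:
    """Emit one structured record per model call; dashboards and alerts
    aggregate over these fields to spot regressions and anomalies."""
    record = {
        "ts": time.time(),
        "model": model,
        "prompt_version": prompt_version,
        "latency_ms": round(latency_s * 1000, 1),
        "tokens": {"prompt": prompt_tokens, "completion": completion_tokens},
        "cost_usd_est": round(
            prompt_tokens / 1000 * COST_PER_1K["prompt"]
            + completion_tokens / 1000 * COST_PER_1K["completion"], 6),
        "feedback_score": feedback_score,
    }
    logger.info(json.dumps(record))
    return record
```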
🛡️ Reliability & Automated Rollback
Implement health scoring, statistical deviation logic, and automated rollback mechanisms.
Maintain robust audit logs, cooldown strategies, and “last known good” states.
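As a rough illustration of the deviation logic, a z-score check over recent latencies with a cooldown window might look like the following; the threshold, sample window, and cooldown period are assumed values.

```python
import statistics
import time

class HealthMonitor:
    """Flag a deployment when latency deviates sharply from its own baseline,
    with a cooldown so a single spike cannot trigger repeated rollbacks."""

    def __init__(self, z_threshold: float = 3.0, cooldown_s: float = 300.0):
        self.samples: list[float] = []   # real code would use a sliding window
        self.z_threshold = z_threshold
        self.cooldown_s = cooldown_s
        self.last_rollback = 0.0

    def record(self, latency_ms: float) -> bool:
        """Return True when an automated rollback should fire."""
        self.samples.append(latency_ms)
        if len(self.samples) < 30:       # wait for a usable baseline
            return False
        baseline = self.samples[:-1]
        mean = statistics.fmean(baseline)
        stdev = statistics.stdev(baseline) or 1e-9  # guard against zero spread
        z = (latency_ms - mean) / stdev
        in_cooldown = time.monotonic() - self.last_rollback < self.cooldown_s
        if z > self.z_threshold and not in_cooldown:
            self.last_rollback = time.monotonic()
            return True                  # caller reverts to the last known good
        return False
```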
🔐 Security & Deployment
Manage API keys and configuration securely using GCP-native tools (Secret Manager, IAM).
Enforce log redaction and PII masking.
Design version-aware deployment hooks and readiness checks.
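For a flavor of the tooling: reading a key from Secret Manager uses the standard `google-cloud-secret-manager` client, while the masking shown after it is a deliberately coarse sketch; production redaction would lean on a vetted DLP pipeline, not two regexes.

```python
import re
from google.cloud import secretmanager  # pip install google-cloud-secret-manager

def fetch_api_key(project_id: str, secret_id: str) -> str:
    """Read the latest secret version instead of baking keys into config."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

# Coarse illustrative patterns only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Mask obvious PII before a prompt or response reaches the logs."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))
```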
🌟 Nice to Have
Experience working with OpenAI, Claude, or AWS Bedrock (in addition to Gemini).
Experience designing model abstraction layers or runtime LLM routing.
Exposure to token cost modeling or billing/reporting APIs for LLMs.
Familiarity with AI security best practices in cloud environments.
🤝 Who You'll Work With
Work closely with Prompt Engineers to monitor version health and feedback.
Partner with AI Architects to optimize Gemini performance and integration.
Coordinate with Product & Operations for cost reporting, SLAs, and system health.
Engage with the DevOps (AWS) Team for hybrid observability and CI/CD processes.
📋 What You Bring
4–6+ years in backend engineering, platform engineering, or SRE roles.
Prior experience deploying and monitoring AI/ML workloads (GCP preferred; multi-cloud a plus).
Bonus: Direct hands-on usage of Gemini APIs or managing LLM configurations in production.
💼 Why Join TechX
Take ownership of Gemini observability and integration at scale.
Lead the GCP / Gemini-first strategy while collaborating across hybrid cloud environments.
Be part of a forward-thinking team, building mission-critical GenAI platforms for regulated industries.
Competitive salary, modern engineering culture, and career growth opportunities.