🚀 Overview
Join TechX as we continue expanding our AI infrastructure team and delivering impactful GenAI-powered products for enterprise and industry clients.
We are looking for an experienced Platform Engineer to build and operate the core infrastructure that powers the safe, reliable, and efficient delivery of our GenAI solutions. This role is at the heart of how we scale AI applications in production environments — ensuring observability, automation, cost control, and compliance for our large language model (LLM) operations.
⚡ Note: This is not a prompt engineering or model tuning role. Instead, you will architect and manage the infrastructure that enables AI teams to operate Gemini Pro/Flash models at scale.
🔌 LLM Gateway & Abstraction
Design platform components that abstract LLM (e.g., Gemini) APIs into a consistent, testable, and production-ready interface.
Handle retries, latency tracking, fallback switching, and configuration routing logic.
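To give candidates a feel for this part of the role, here is a minimal sketch of retry-with-fallback routing. Everything in it is illustrative: the model names, the `ModelRoute` type, and the `call_fn` transport are assumptions, not our production stack.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelRoute:
    """One routable model target with its own retry budget (illustrative)."""
    name: str                 # e.g. "gemini-pro" -- a placeholder name
    max_retries: int = 3
    base_delay: float = 0.5   # seconds, for exponential backoff

def call_with_fallback(prompt: str,
                       routes: list[ModelRoute],
                       call_fn: Callable[[str, str], str]) -> str:
    """Try each route in order; retry transient failures with
    exponential backoff plus jitter before falling back."""
    last_error = None
    for route in routes:
        for attempt in range(route.max_retries):
            start = time.monotonic()
            try:
                response = call_fn(route.name, prompt)
                latency = time.monotonic() - start  # latency-tracking hook
                print(f"{route.name}: ok in {latency:.3f}s")
                return response
            except Exception as err:  # real code would catch specific API errors
                last_error = err
                time.sleep(route.base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError(f"all routes exhausted: {last_error!r}")

# Hypothetical usage: prefer a primary model, fall back to a cheaper one.
# routes = [ModelRoute("gemini-pro"), ModelRoute("gemini-flash", max_retries=2)]
# answer = call_with_fallback("Summarize ...", routes, call_fn=my_gemini_client)
```

The point of the abstraction is that application teams call one interface and never talk to a vendor SDK directly.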
🧬 Prompt & Configuration Versioning
Manage prompt and parameter versions across deployments.
Track version statuses (active, canary, deprecated), maintain changelogs, and ensure rollback safety.
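One plausible shape for such a registry is sketched below; the statuses match the list above, but the field names and rollback policy are illustrative assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    ACTIVE = "active"
    CANARY = "canary"
    DEPRECATED = "deprecated"

@dataclass
class PromptVersion:
    version: str        # e.g. "v42" (placeholder)
    template: str
    status: Status
    changelog: str = ""

@dataclass
class PromptRegistry:
    versions: dict[str, PromptVersion] = field(default_factory=dict)
    history: list[str] = field(default_factory=list)  # promotion order

    def promote(self, v: PromptVersion) -> None:
        """Make v active; the previous active version becomes the rollback target."""
        for existing in self.versions.values():
            if existing.status is Status.ACTIVE:
                existing.status = Status.DEPRECATED
        v.status = Status.ACTIVE
        self.versions[v.version] = v
        self.history.append(v.version)

    def rollback(self) -> PromptVersion:
        """Reactivate the previous version in promotion order ('last known good')."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        bad = self.history.pop()
        self.versions[bad].status = Status.DEPRECATED
        prev = self.versions[self.history[-1]]
        prev.status = Status.ACTIVE
        return prev
```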
📊 Observability & Cost Tracking
Define structured logs and metrics for Gemini interactions.
Monitor latency, feedback scores, token usage, and cost estimates.
Develop dashboards and alerts to catch performance regressions or anomalies.
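A hedged sketch of what one such structured record could look like; the field names are shorthand for this posting, and the per-1K-token rates are placeholders rather than real Gemini pricing.

```python
import json
import logging
import time

logger = logging.getLogger("llm.gateway")

# Placeholder per-1K-token rates; real cost figures come from billing data.
COST_PER_1K = {"prompt": 0.00125, "completion": 0.005}

def log_llm_call(model: str, prompt_version: str, latency_s: float,
                 prompt_tokens: int, completion_tokens: int,
                 feedback_score: float | None = None) -> dict:
    """Emit one structured record per model call; dashboards and alerts
    aggregate over these fields to spot regressions and anomalies."""
    record = {
        "ts": time.time(),
        "model": model,
        "prompt_version": prompt_version,
        "latency_ms": round(latency_s * 1000, 1),
        "tokens": {"prompt": prompt_tokens, "completion": completion_tokens},
        "cost_usd_est": round(
            prompt_tokens / 1000 * COST_PER_1K["prompt"]
            + completion_tokens / 1000 * COST_PER_1K["completion"], 6),
        "feedback_score": feedback_score,
    }
    logger.info(json.dumps(record))
    return record
```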
🛡️ Reliability & Automated Rollback
Implement health scoring, statistical deviation logic, and automated rollback mechanisms.
Maintain robust audit logs, cooldown strategies, and “last known good” states.
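As a rough illustration of the deviation logic, a z-score check over recent latencies with a cooldown window might look like the following; the threshold, sample window, and cooldown period are assumed values.

```python
import statistics
import time

class HealthMonitor:
    """Flag a deployment when latency deviates sharply from its own baseline,
    with a cooldown so a single spike cannot trigger repeated rollbacks."""

    def __init__(self, z_threshold: float = 3.0, cooldown_s: float = 300.0):
        self.samples: list[float] = []   # real code would use a sliding window
        self.z_threshold = z_threshold
        self.cooldown_s = cooldown_s
        self.last_rollback = 0.0

    def record(self, latency_ms: float) -> bool:
        """Return True when an automated rollback should fire."""
        self.samples.append(latency_ms)
        if len(self.samples) < 30:       # wait for a usable baseline
            return False
        baseline = self.samples[:-1]
        mean = statistics.fmean(baseline)
        stdev = statistics.stdev(baseline) or 1e-9  # guard against zero spread
        z = (latency_ms - mean) / stdev
        in_cooldown = time.monotonic() - self.last_rollback < self.cooldown_s
        if z > self.z_threshold and not in_cooldown:
            self.last_rollback = time.monotonic()
            return True                  # caller reverts to the last known good
        return False
```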
🔐 Security & Deployment
Manage API keys and configuration securely using GCP-native tools (Secret Manager, IAM).
Enforce log redaction and PII masking.
Design version-aware deployment hooks and readiness checks.
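For a flavor of the tooling: reading a key from Secret Manager uses the standard `google-cloud-secret-manager` client, while the masking shown after it is a deliberately coarse sketch; production redaction would lean on a vetted DLP pipeline, not two regexes.

```python
import re
from google.cloud import secretmanager  # pip install google-cloud-secret-manager

def fetch_api_key(project_id: str, secret_id: str) -> str:
    """Read the latest secret version instead of baking keys into config."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

# Coarse illustrative patterns only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Mask obvious PII before a prompt or response reaches the logs."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))
```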
🌟 Nice to Have
Experience working with OpenAI, Claude, or AWS Bedrock (in addition to Gemini).
Experience designing model abstraction layers or runtime LLM routing.
Exposure to token cost modeling or billing/reporting APIs for LLMs.
Familiarity with AI security best practices in cloud environments.
🤝 Who You'll Work With
Work closely with Prompt Engineers to monitor version health and feedback.
Partner with AI Architects to optimize Gemini performance and integration.
Coordinate with Product & Operations for cost reporting, SLAs, and system health.
Engage with the DevOps (AWS) Team for hybrid observability and CI/CD processes.
📋 What You Bring
4–6+ years in backend engineering, platform engineering, or SRE roles.
Prior experience deploying and monitoring AI/ML workloads (GCP preferred; multi-cloud a plus).
Bonus: Direct hands-on usage of Gemini APIs or managing LLM configurations in production.
💼 Why Join TechX
Take ownership of Gemini observability and integration at scale.
Lead the GCP / Gemini-first strategy while collaborating across hybrid cloud environments.
Be part of a forward-thinking team, building mission-critical GenAI platforms for regulated industries.
Competitive salary, modern engineering culture, and career growth opportunities.