Software Developer - Backend, Platform & Integrations
Simbian.AI
Building core platform infrastructure for a multi-tenant agentic SOC product: agent harness, data service layer, observability, and enterprise auth.
- Agent harness — Building the cybersecurity agent harness: permissions, skills, context management, and tool orchestration that coordinate the agent's reasoning with the data service layer. Handles LLM provider onboarding, quota, and rate limiting.
- Data service layer — Built the vendor-agnostic data layer that is the agent's hands and eyes: alert ingestion across 20+ integrations (CrowdStrike, Splunk, Palo Alto), entity enrichment, log search, and response actions across the full investigation lifecycle. Customer-extensible via MCP, with per-integration failure handling, retry, and rate limiting.
- Integration observability — Integrations are the critical path of every investigation, yet failures used to collapse into opaque errors. Surfaced first-class exceptions and a standardised error taxonomy shared across all integrations, so customers see exactly why something failed and oncall, notifications, and alerting run off one consistent signal.
- Platform observability — OpenSearch and Grafana-on-Prometheus give internal teams a live read across the whole system, spanning LLM and investigation failures, API latency, and resource health, with alerts escalating to oncall through Teams and Zenduty. On top, layered customer-facing analytics on Apache Doris for per-customer insight into investigations, agents, and integrations.
- Agent observability — Instrumented the agent with Langfuse exporting traces to ClickHouse (analytics migrating to Apache Doris); built dashboards for cost, token usage, and prompt analytics.
- Enterprise auth & access control — Shipped SAML SSO, email OTP login, hierarchical RBAC with customisable roles, AuditLogs with sensitive-data redaction, and real-time i18n middleware (AWS Translate).
- Database reliability — Cut Postgres CPU load by 30% and improved p95/p99 latency ~20x by diagnosing a misordered composite index through production query-plan analysis.
- User comms & case management — Built case management across Teams, Jira, and Slack: auditable ticket-based cases, reminders, escalations, action triggers, and inline user context gathering.
Python, Django, PostgreSQL, TypeScript, React, Apache Doris, Langfuse, ClickHouse, Kafka, OpenTelemetry, OpenSearch, pgvector