Description

Currently a Staff Engineer at Aiven and a former SRE at Datadog and Criteo, I specialize in the scalability and resilience of very high-throughput data infrastructure. I help CTOs and engineering teams work through their distributed-architecture challenges. I work part-time (evenings, weekends, asynchronously).

🎯 My services

• Architecture Audit & Diagnosis: Review of your architecture, practices, and infrastructure. Delivery of an action plan (Target Architecture, Quick-wins) to improve reliability and scale.

• Advisory / Fractional Staff: Regular asynchronous support. Review of your RFCs, validation of technology choices, mentoring, and support on complex decisions.

📊 Track Record

• Aiven: Re-architected an orchestrator for 2,000 Kafka clusters (alerts cut 5× across 100+ regions). Built an on-demand monitoring and billing pipeline (eBPF/Vector/ClickHouse) processing traffic from 150k+ servers.

• Datadog: Deployed 200+ Kafka/Cassandra clusters. Built the internal Kubernetes framework (CDK8s), replacing Helm, and migrated the entire stateful infrastructure with no downtime.

• Criteo: Cross-DC Kafka replication (petabytes/day with zero downtime). Industrialized 40k servers with Chef for multi-tenant container orchestration systems. Designed and implemented an automated diagnosis system for container crashes.

🛠 TECH STACK

• Kafka, Kubernetes (CDK8s, Helm), Terraform, Datadog, Prometheus, AWS, GCP.

• Go, Python, Java, Bash.

• Reliability, Production Readiness, SLI/SLO, Capacity Planning, Chaos Engineering.

Languages

French: Native or bilingual
English: Fluent

Workplace preferences

Remote only

Primarily works remotely

Aiven
Staff Software Engineer
March 2022 - Today (4 years and 3 months)
Lyon, France
Technical Leadership & Cross-Org Influence
- Acted as technical advisor for the streaming organization, aligning engineering execution with product and business OKRs. Unblocked stalled initiatives, guided teams through complex delivery challenges, and provided architectural direction on company-wide initiatives. Partnered with leadership to translate product strategy into actionable technical goals suited to distributed systems.

Scalable Billing & High-Throughput Data Systems
- Drove the design and rollout of a pay-as-you-consume billing platform, reducing revenue leakage. Built a multi-cloud network monitoring pipeline (Vector, Kafka, ClickHouse) classifying traffic of 100k+ servers with 60-second resolution and sustaining millions of daily events.

Site Reliability & Operational Resilience
- Redefined the SRE operating model by introducing a domain-oriented SME structure, improving parallelism, reducing context switching, and strengthening collaboration across product and infrastructure.

Performance Optimization, Resiliency & Cost Efficiency
- Engineered optimizations across internal platforms and customer-facing workloads. Re-architected a Kafka-backed scheduling system, cutting alerts by 5× across 100+ regions and minimizing downtime for thousands of clusters.

- Doubled network throughput for customer workloads by exploiting AWS EBS internals with LVM, unlocking performance gains without additional cost.

Codebase Modernization & Velocity
- Re-designed core data placement logic into a modular, testable architecture. Increased coverage 3×, accelerated feature delivery, and enabled faster onboarding for new engineers, improving team velocity and ownership of critical systems.

Core Stack & Practices: Python, Kafka, Prometheus, Grafana, Zookeeper, ClickHouse, Vector, OpenSearch, AWS, GCP, Distributed systems, Observability, On-call, Incident Management, SLI/SLO, Performance tuning.
Team Leadership, Apache Kafka, Python, Software Architecture, Distributed Architecture
Datadog
Site Reliability Engineer
DIGITAL AND IT
January 2020 - March 2022 (2 years and 2 months)
Lyon, France
Built a new Kubernetes framework with CDK8s and created tooling that enabled a smooth migration from Helm with zero downtime; adopted company-wide to manage hundreds of services.

Cut infrastructure deployment time from 3–4 hours to under 10 minutes by rolling out Terraform-based continuous deployment integrated with GitLab CI.

Migrated 40+ applications from a global Chef-managed Kafka cluster to dedicated, Kubernetes-hosted Kafka clusters with no downtime, using custom mirroring strategies tailored to business SLAs.

Deployed and managed 200+ Kafka, Cassandra, and Postgres clusters across four global datacenters, all orchestrated via Kubernetes to improve reliability and consistency.

Standardized deployment and coding practices across dozens of services by introducing shared libraries, reducing configuration drift and simplifying maintenance.

Implemented authentication for all Go and Python services interacting with Kafka and Cassandra, closing critical security gaps across the platform.

Automated 8 recurring team operations using a Temporal-based workflow engine, freeing engineers to focus on higher-value projects.

Established the first embedded SRE model in the Alerting platform, scaling reliability practices across teams and improving on-call outcomes.

Provided expertise in observability, capacity planning, and deployment strategies, shaping reliability culture within the Alerting org and reducing incidents linked to misconfigured rollouts.
Observability, Site Reliability Engineering, Golang, Apache Kafka, Incident Management
Criteo
Site Reliability Engineer
DIGITAL AND IT
March 2018 - January 2020 (1 year and 10 months)
Paris, France
Implemented an in-house solution for cross-datacenter replication of Kafka clusters, handling petabytes of data per day.

Designed Spark-based algorithms that evaluate the storage cost of fields in data schemas, allowing our team to reduce our biggest dataset by 15%.

Deployed dozens of Kafka clusters with Chef, hundreds of topics, and thousands of Protobuf schemas on a daily basis while ensuring no downtime.

Automated and industrialized an orchestrated platform based on Mesos and Consul.

Managed configuration for ~40k physical machines across 8 DCs with Chef.

Reduced “time-to-diagnose” workload by developing a service that autonomously diagnoses application and infrastructure failures, based on Prometheus, ES, and Grafana.

Worked in depth on resource isolation (mainly CPU and network) to improve fairness and efficiency and to reduce noisy-neighbor issues on the Criteo platform.

Tuned Linux kernel parameters and mechanisms to improve performance for critical latency-sensitive applications.
Docker, Site Reliability Engineering, Automation, Monitoring, Apache Kafka