I want EWE to train a Drone Pilot

DroneDetect

Curriculum-based reinforcement learning for autonomous shepherd drones — with a live multiplayer demo.

Unity 6 · ML-Agents 4 · PPO + GAIL · Standalone Multiplayer · OAuth2 PKCE

Benedikt Linn · Code University Berlin · 2026-05-18

Problem

  • The drone acts as an active protector of a sheep flock
  • Threat: wolves — either rule-based (sprint + target lock) or human-controlled (multiplayer)
  • Defensive action: a scarer — light cone + sound that builds wolf fear and triggers a panic flight
  • The drone carries 3 HP; colliding with a wolf costs one HP, followed by 1 s of invincibility frames (i-frames)
  • Constraints: 3-minute demo round · 5-minute training episode

→ Classic multi-agent / asymmetric-roles setup with physically realistic drone dynamics (AR-Drone 2.0 digital twin).

Pipeline Architecture

Phase 1: DroneCSI (flight foundation) → Phase 4: DroneShepherd (shepherding) → Standalone Deploy (app.linn.games/shepherd)

Phase | Algorithm | Curriculum | Demo mode
1: DroneCSI | PPO | 5 levels (parcours_difficulty + wind_strength) | -
4: DroneShepherd | PPO + GAIL | Init from DroneCSI.onnx (transfer) | Human demos (4 Hz recording)

Observation & Action Space

14 observations · 4 continuous actions

Index | Observation | Range
0 | Altitude / 10 | [0, 1]
1–3 | Velocity (vx, vy, vz) / 5 | [-1, 1]
4–6 | Euler rotation / 180 | [-1, 1]
7–9 | Angular velocity / 5 | [-1, 1]
10–12 | Direction-to-target (local, normalized) | unit vector
13 | Distance-to-target / 50 | [0, 1]

Identical structure across both phases → DroneCSI.onnx is a drop-in curriculum start for phase 4.
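
A minimal sketch of how the 14-value vector from the table could be assembled. The function name, the clamping to the listed ranges, and the assumption that Euler angles arrive as signed degrees are illustrative and not taken from the project code.

```python
import math


def clamp(x, lo, hi):
    """Clamp x into [lo, hi]."""
    return max(lo, min(hi, x))


def build_observation(altitude, velocity, euler_deg, angular_velocity,
                      to_target_local):
    """Return the 14 observations in the index order of the slide table.

    to_target_local is the un-normalized vector from drone to target in
    the drone's local frame (hypothetical input, assumed here).
    """
    distance = math.sqrt(sum(c * c for c in to_target_local))
    # Unit direction; fall back to zeros when the drone sits on the target.
    if distance > 1e-6:
        direction = [c / distance for c in to_target_local]
    else:
        direction = [0.0, 0.0, 0.0]

    obs = [clamp(altitude / 10.0, 0.0, 1.0)]                      # index 0
    obs += [clamp(v / 5.0, -1.0, 1.0) for v in velocity]          # 1-3
    obs += [clamp(e / 180.0, -1.0, 1.0) for e in euler_deg]       # 4-6
    obs += [clamp(w / 5.0, -1.0, 1.0) for w in angular_velocity]  # 7-9
    obs += direction                                              # 10-12
    obs.append(clamp(distance / 50.0, 0.0, 1.0))                  # 13
    return obs
```

Keeping the order and scaling identical in both phases is what would let the phase-1 network weights be reused without remapping inputs.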

Phase 1: DroneCSI — PPO Setup


behaviors:
  DroneCSI:
    trainer_type: ppo
    hyperparameters:
      batch_size: 512        # mini-batch per PPO update
      buffer_size: 10240     # full rollout before update (20 × batch_size)
      learning_rate: 3.0e-4
      beta: 5.0e-3           # entropy bonus → exploration
      epsilon: 0.2           # PPO clipping
      learning_rate_schedule: linear
    network_settings:
      hidden_units: 256
      num_layers: 3          # deeper MLP → richer dynamics representation
      normalize: true
    max_steps: 1_000_000
    time_horizon: 256        # length of return bootstrap
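
The arithmetic behind two of these settings can be checked with a short sketch. It assumes that a linear schedule decays the learning rate proportionally to the remaining steps and bottoms out at a floor of 0.0; the exact floor used by the trainer may differ, and the function names are illustrative.

```python
BATCH_SIZE = 512
BUFFER_SIZE = 10240
LEARNING_RATE = 3.0e-4
MAX_STEPS = 1_000_000


def minibatches_per_pass(buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE):
    """Number of mini-batches in one pass over a full buffer."""
    if buffer_size % batch_size != 0:
        raise ValueError("buffer_size should be a multiple of batch_size")
    return buffer_size // batch_size


def linear_learning_rate(step, initial=LEARNING_RATE, max_steps=MAX_STEPS,
                         floor=0.0):
    """Linearly decayed learning rate (floor value is an assumption)."""
    fraction_left = max(0.0, 1.0 - step / max_steps)
    return max(floor, initial * fraction_left)
```

With the values above, one pass over the buffer covers 20 mini-batches, and halfway through training the learning rate would sit at 1.5e-4.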
  

Phase 1: Curriculum Progression

Two orthogonal difficulty axes, both reward-triggered:

Lesson | Parcours | Wind | Reward threshold
0 | easy | calm | 1.5 → next
1 | medium | 0.3 | 2.5 → next
2 | hard | 0.6 | 3.5 → next
3 | complex | 0.85 | 5.0 → next
4 ✓ | full | full storm | terminal

→ All 5 levels cleared in 150k steps · ~3 h wall time on BigOne (RTX 4090).
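
The reward-triggered progression can be sketched as a small state machine. This is an illustration only: the lesson values are copied from the table, while the use of a mean reward as the trigger, the >= comparison, and all names are assumptions.

```python
# Lesson table from the slide; "calm" and "full storm" are kept as labels
# because the slide gives no numeric value for them.
LESSONS = [
    {"parcours": "easy", "wind": "calm", "threshold": 1.5},
    {"parcours": "medium", "wind": 0.3, "threshold": 2.5},
    {"parcours": "hard", "wind": 0.6, "threshold": 3.5},
    {"parcours": "complex", "wind": 0.85, "threshold": 5.0},
    {"parcours": "full", "wind": "full storm", "threshold": None},  # terminal
]


def next_lesson(current, mean_reward):
    """Advance one lesson when the mean reward reaches the threshold."""
    threshold = LESSONS[current]["threshold"]
    if threshold is not None and mean_reward >= threshold:
        return current + 1
    return current


def run_curriculum(mean_rewards):
    """Feed a sequence of mean rewards and return the final lesson index."""
    lesson = 0
    for reward in mean_rewards:
        lesson = next_lesson(lesson, reward)
    return lesson
```

Each lesson advances at most one step per evaluation, so a single unusually high reward cannot skip lessons.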

Phase 1: Training Results

Cumulative reward: 35.25 (↑ +39.5 vs. random)
Episode length: 297 (↑ from 158, 88% longer)
Curriculum level: 4 / 4 (✓ all reached)

Phase 4: Reward Shaping

Spec: "The more fear in the wolves, the more reward. Each surviving sheep is a multiplier. 5-minute episode. Penalty for sheep killed."

Signal | Value | Trigger
Δ wolf fear (per step) | +4 × dFear | Wolf accumulates fear inside the scarer cone
Wolf panic event | +3.0 | Once per panic trigger
Sheep alive | +0.0003 × N_sheep | Every FixedUpdate
Sheep killed | −3.0 | Immediately on catch
Drone crash (emergency) | −1.0 | EndEpisode
Drone destroyed (3 HP lost) | −1.0 + EndRound | 3× wolf collision (1 s i-frames)
Episode end: sheep saved | +5 × N_survived | EndRound()
Perfect defense | × 1.5 multiplier | All sheep survived
Fast-win bonus | +0.05 / remaining second | Only on perfect defense
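
As a rough illustration of the per-step part of this table, the signals could be summed as below. The coefficients come from the table; adding all signals into one scalar per physics step, and the function and parameter names, are assumptions.

```python
def step_reward(d_fear, panic_events, sheep_alive, sheep_killed,
                drone_crashed=False):
    """Per-step shaping reward built from the slide's coefficients.

    d_fear       -- fear gained by wolves during this step
    panic_events -- number of panic triggers fired this step
    sheep_alive  -- sheep currently alive
    sheep_killed -- sheep caught during this step
    """
    reward = 4.0 * d_fear             # fear delta
    reward += 3.0 * panic_events      # one-off panic bonus
    reward += 0.0003 * sheep_alive    # small keep-alive trickle
    reward -= 3.0 * sheep_killed      # immediate penalty on catch
    if drone_crashed:
        reward -= 1.0                 # crash or destruction penalty
    return reward
```

For example, a step with 0.1 fear gain, one panic event and ten living sheep yields 0.4 + 3.0 + 0.003 = 3.403, while losing a single sheep outweighs roughly 10,000 steps of the keep-alive term per sheep.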

Reward-Hacking Prevention

Risk: the agent could learn to maximise "keep wolves afraid" via overly aggressive play — sacrificing sheep as collateral.

Mitigation:

  • survivalRatio = sheepSaved / initialSheepCount
  • Episode-end bonus is scaled by Lerp(0.2, 1.0, survivalRatio)
  • → losing every sheep: the end-bonus is cut by 80% (factor 0.2), even if fear was high
  • → saving all sheep: full reward + 1.5× multiplier

// Survival-ratio shaping: pushes agent away from
// "trade sheep for fear-spam" exploits.
AddReward(endBonus * Mathf.Lerp(0.2f, 1f, survivalRatio));
EndEpisode();
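
The same scaling, worked through in a few lines. How endBonus itself is composed is not shown on the slide; the sketch below assumes it is 5 × N_survived, multiplied by 1.5 and topped up with the fast-win bonus only on a perfect defense. All names are illustrative.

```python
def lerp(a, b, t):
    """Linear interpolation with t clamped to [0, 1], like Mathf.Lerp."""
    t = max(0.0, min(1.0, t))
    return a + (b - a) * t


def end_bonus(sheep_saved, initial_sheep, seconds_left):
    """Assumed composition of the episode-end bonus."""
    bonus = 5.0 * sheep_saved
    if sheep_saved == initial_sheep:
        # Perfect defense: 1.5x multiplier plus fast-win bonus.
        bonus = bonus * 1.5 + 0.05 * seconds_left
    return bonus


def shaped_end_reward(sheep_saved, initial_sheep, seconds_left):
    """End bonus scaled by the survival ratio, as in the C# snippet."""
    if initial_sheep <= 0:
        raise ValueError("initial_sheep must be positive")
    survival_ratio = sheep_saved / initial_sheep
    return end_bonus(sheep_saved, initial_sheep, seconds_left) * lerp(
        0.2, 1.0, survival_ratio)
```

Saving 5 of 10 sheep gives 25 × 0.6 = 15, and a perfect defense with 60 s left gives (50 × 1.5 + 3) × 1.0 = 78. Under this assumed composition a total loss already yields zero, so the 0.2 floor only matters if endBonus contains terms that do not depend on surviving sheep.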
  

Phase 4: GAIL Imitation Learning

Pure reward shaping leaves exploration too slow → we inject human demonstrations.


reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
  gail:
    strength: 0.5             # weighted against extrinsic
    gamma: 0.99
    demo_path: Assets/Demonstrations/DroneSessions/
    use_actions: true         # GAIL discriminates on (state, action)
    use_vail: false
  
  • Demo recording: via DemonstrationRecorder, 4 Hz position logging
  • Replay pipeline: existing multiplayer sessions → GAIL demos via the app.linn.games shepherd.sessions API
  • Bootstrap: DroneCSI.onnx as initial policy (transfer)
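
To give a feel for what the two strength values mean, here is a simplified sketch of a GAIL-style reward mixed with the extrinsic reward. It uses the common surrogate -log(1 - D(s, a)) and a plain weighted sum; the trainer's internal handling of multiple reward signals is more involved, so this is an approximation under those assumptions, with illustrative names.

```python
import math

EXTRINSIC_STRENGTH = 1.0   # from the config above
GAIL_STRENGTH = 0.5        # from the config above
EPSILON = 1e-7             # numerical guard, value assumed


def gail_reward(discriminator_estimate):
    """Imitation reward from the discriminator's output in [0, 1].

    A value near 1 means the (state, action) pair looks like a human demo.
    """
    d = max(0.0, min(1.0, discriminator_estimate))
    return -math.log(1.0 - d * (1.0 - EPSILON))


def combined_reward(extrinsic, discriminator_estimate):
    """Weighted sum of environment reward and imitation reward."""
    return (EXTRINSIC_STRENGTH * extrinsic
            + GAIL_STRENGTH * gail_reward(discriminator_estimate))
```

With a strength of 0.5 the demonstrations nudge exploration toward human-like play while the environment reward stays the dominant term.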

Gameplay Layer · 2026-05-18

Reward signals need a working game loop. Today's iteration:

Feature | Impl. | Why
Downward light-cone scarer | SpotLight 90° + SphereCollider trigger | Visual clarity + reliable fear transmission
Wolf fear for AI bots | WolfFear.AddFear() guard removed | Demo ran silent before; bots now react to the scarer
Drone HP system | 3 HP · 1 s i-frames · round-end signal | Before: drone went silent on death, round kept ticking
Wolf hunting drive | Speed 7 + 1.5× sprint < 10 m · target lock + stuck timer | Pre-fix: wolf zig-zagged, never caught a sheep
End-screen outcomes | "Drone destroyed – wolf wins" vs. "All sheep saved" | Clear demo UX instead of a silent timeout
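
Two of these rules are easy to state precisely in code. The sketch below models the HP system with invincibility frames and reads "Speed 7 + 1.5× sprint < 10 m" as a 1.5× speed-up once the target is closer than 10 m; that reading, and every class and parameter name, is an assumption rather than the Unity implementation.

```python
class DroneHealth:
    """3 HP with a 1 s invincibility window after each hit."""

    def __init__(self, hp=3, iframe_seconds=1.0):
        self.hp = hp
        self.iframe_seconds = iframe_seconds
        self._invulnerable_until = float("-inf")

    def on_wolf_collision(self, now):
        """Register a collision at time `now`; return True if it cost HP."""
        if self.hp <= 0 or now < self._invulnerable_until:
            return False
        self.hp -= 1
        self._invulnerable_until = now + self.iframe_seconds
        return True

    @property
    def destroyed(self):
        """True once all HP are gone, which should end the round."""
        return self.hp <= 0


def wolf_speed(distance_to_target, base=7.0, sprint_factor=1.5,
               sprint_range=10.0):
    """Base speed, multiplied by the sprint factor inside sprint range."""
    if distance_to_target < sprint_range:
        return base * sprint_factor
    return base
```

The i-frame window prevents a single prolonged contact from draining all three HP within a few physics steps.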

Deploy Topology: Standalone (Pivot)

⚠ WebGL dropped: Unity 6 + URP = 72 GB shader variants, 3× build failures. New model: standalone clients.

BigOne · Training & Build

  • RTX 4090 + 64 GB RAM
  • Unity 6 + ML-Agents training
  • Linux standalone build (IL2CPP)
  • Windows + macOS via local Mono cross-compile

u-server · Production

  • app.linn.games (Laravel 12 + Filament)
  • OAuth2 PKCE via Passport 13 (NEW)
  • Reverb WebSocket for multiplayer sessions
  • Event recording (4 Hz position tracks)
  • ZIP download via /api/shepherd/builds

No Docker Swarm needed · no GPU on u-server · training on BigOne; PPO 1M steps also fits on an 8 GB vGPU.

Multiplayer Login · 2026-05-18 fix

A user-friendly login — no one needs to copy-paste CLI tokens:

  1. Unity standalone button "Login with app.linn.games" → opens the system browser
  2. Browser lands on /oauth/authorize?response_type=code&client_id=…&code_challenge=… (PKCE / RFC 7636)
  3. If not signed in → 302 → /login, Fortify-2FA capable
  4. After login → consent card "Approve / Deny"
  5. Redirect to http://127.0.0.1:51742/callback?code=… via loopback (RFC 8252)
  6. Unity exchanges the code for an access token, stores it in PlayerPrefs
  7. MatchmakeManager flips to online mode; RevbClient connects to shepherd.{code}

→ This was a 500 error this morning — two layers fixed: the php-cli storage volume mount was missing (passport:keys vanished), and Passport 13 ships no default consent view (inline closure registered).
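
Steps 2 and 5 of the flow above can be sketched in Python to show the PKCE mechanics from RFC 7636. The Unity client is not shown in the slides, so this is an illustration only: the S256 challenge method, the state parameter, and the client_id value are assumptions; the authorize path and loopback redirect URI are taken from the list.

```python
import base64
import hashlib
import secrets
from urllib.parse import parse_qs, urlencode, urlparse


def challenge_for(verifier):
    """S256 code challenge: base64url(SHA-256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair():
    """Return (code_verifier, code_challenge); verifier is 86 chars."""
    verifier = secrets.token_urlsafe(64)
    return verifier, challenge_for(verifier)


def build_authorize_url(base_url, client_id, redirect_uri, challenge, state):
    """Build the URL that the system browser is sent to (step 2)."""
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,            # hypothetical value in usage
        "redirect_uri": redirect_uri,
        "code_challenge": challenge,
        "code_challenge_method": "S256",   # assumed, not shown on the slide
        "state": state,                    # assumed CSRF guard
    })
    return f"{base_url}/oauth/authorize?{query}"


def extract_code(callback_url, expected_state):
    """Pull the authorization code out of the loopback callback (step 5)."""
    params = parse_qs(urlparse(callback_url).query)
    if params.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch")
    return params["code"][0]
```

The verifier never leaves the client until the token exchange in step 6, which is why a code intercepted on the loopback redirect is useless on its own.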

Live Downloads

Three platform builds currently in production, all from the shepherd-v0.3.0 release:

🐧 Linux x64 · 371 MB · IL2CPP · download .zip
🪟 Windows x64 · 60 MB · Mono · download .zip
🍎 macOS arm64 · 62 MB · Mono · download .zip

Auto-sync via gh workflow run "Shepherd · Upload builds from release" — new versions land on prod within a minute. JSON API: /api/shepherd/builds

Live-Demo Walkthrough

  1. Grab a standalone client (see previous slide)
  2. Click "Login with app.linn.games" → system browser → JWT stored in PlayerPrefs
  3. Pick a role: wolf 🐺 or drone 🚁
  4. Session code from backend / auto-join open session — OFFLINE mode for solo play
  5. 3-minute demo round (180 s) or 5-minute training round (300 s)
  6. HUD: timer · sheep status · wolf-fear bar · drone HP
  7. Round end: "Drone destroyed – wolf wins" or "All sheep saved" + automatic .demo upload

Drone runs DroneCSI.onnx inference in demo mode; keyboard heuristic captures GAIL demos in training mode.

Status & Roadmap

Done

  • Phase 1: DroneCSI 150k steps, 35.25 reward
  • Phase 4: reward shaping + anti-exploit
  • GAIL demo-upload API + DemoUploader.cs
  • Standalone build pipeline (Linux/Win/Mac)
  • Demo sync for training (JWT pull)
  • Multiplayer session management (Reverb WebSocket)
  • OAuth2 PKCE login (Passport 13) · today
  • Gameplay layer: HP, scarer cone, wolf-hunting AI · today
  • Auto-deploy via shepherd-upload-builds workflow

In Progress

  • Phase 4 training run (1.5M steps planned)
  • Multiplayer PvP tournament mode

Next

  • Recurrent net + LSTM for longer-horizon strategies
  • Integration with a real drone SDK (DJI / Autel)
  • SDF heatmap in the dashboard (where the AI defends well/poorly)

Q & A

app.linn.games/shepherd  ·  Builds API

github.com/nileneb/DroneDetect  ·  github.com/nileneb/app.linn.games