Troubleshooting

Technical reference for diagnosing and resolving operational issues in Spooky HTTP/3 to HTTP/2 gateway deployments.

Configuration Errors

Invalid Configuration Schema

Error Messages:

Invalid version: expected '1', found '2'
Invalid protocol: expected 'http3', found 'http2'
Invalid log level: debug-verbose
Invalid load balancing type: 'weighted' for upstream 'api'

Root Causes: - Configuration schema version mismatch - Unsupported protocol specification - Invalid log level (valid: whisper, haunt, spooky, scream, poltergeist, silence, trace, debug, info, warn, error, off) - Unsupported load balancing algorithm (valid: random, round-robin, round_robin, rr, consistent-hash, consistent_hash, ch)

Diagnostic Commands:

# Validate YAML syntax
python3 -c "import yaml; yaml.safe_load(open('config.yaml'))" 2>&1

# Check configuration structure
grep -E "^version:|^listen:|^upstream:" config.yaml

# Verify log level
grep "log:" -A 2 config.yaml

# Check load balancing configuration
grep "load_balancing:" -A 2 config.yaml

Resolution: - Set version: 1 in configuration file - Use protocol: http3 for listen configuration - Correct log levels according to valid options - Update load balancing type to supported algorithms

Listen Configuration Errors

Error Messages:

Listen address is empty
Invalid listen port: 0 (must be between 1 and 65535)
Invalid listen port: 70000 (must be between 1 and 65535)
Failed to bind UDP socket

Root Causes: - Missing or empty listen address - Port number outside valid range (1-65535) - Port already in use by another process - Insufficient privileges for privileged ports (<1024)

Diagnostic Commands:

# Check port availability (UDP)
sudo ss -ulnp | grep :443

# Identify process using port
sudo lsof -i UDP:443

# Check socket permissions
sudo setcap -v 'cap_net_bind_service=+ep' /usr/local/bin/spooky

# Verify listen configuration
grep -A 5 "^listen:" config.yaml

Resolution:

# Grant capability for privileged port binding
sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/spooky

# Or bind to non-privileged port
sed -i 's/port: 443/port: 8443/' config.yaml

# Kill conflicting process
sudo fuser -k 443/udp

Upstream Pool Configuration Errors

Error Messages:

No upstreams configured
Upstream name is empty
Upstream 'api' has no backends configured
Upstream 'api' must have either 'host' or 'path_prefix' route matcher
Route path_prefix cannot be empty for upstream 'api'
Route path_prefix must start with '/' for upstream 'api': api/v1

Root Causes: - Empty upstream map in configuration - Missing route matching criteria (no host or path_prefix) - Invalid path prefix format (must start with /) - Empty backend list for upstream pool

Diagnostic Commands:

# List configured upstreams
grep "^upstream:" -A 50 config.yaml | grep -E "^  [a-z]"

# Check route configuration
yq '.upstream[].route' config.yaml

# Validate path prefixes
grep "path_prefix:" config.yaml

Resolution:

# Correct upstream configuration
upstream:
  api:
    route:
      host: "api.example.com"      # Host-based routing
      path_prefix: "/api"          # Must start with /
    load_balancing:
      type: round-robin
    backends:
      - id: backend1
        address: "10.0.1.10:8080"
        weight: 100

Backend Configuration Errors

Error Messages:

Backend ID is empty in upstream 'api'
Backend address is empty for backend 'backend1' in upstream 'api'
Backend address '10.0.1.10' in upstream 'api' must be in host:port format
Backend 'backend1' in upstream 'api' has invalid weight (0)
Health check interval is invalid (0) for backend 'backend1' in upstream 'api'
Health check timeout is invalid (0) for backend 'backend1' in upstream 'api'
Health check failure threshold is invalid (0) for backend 'backend1' in upstream 'api'
Health check success threshold is invalid (0) for backend 'backend1' in upstream 'api'
Health check cooldown is invalid (0) for backend 'backend1' in upstream 'api'

Root Causes: - Missing or malformed backend address (must be host:port) - Zero values for weight or health check parameters - Invalid health check configuration

Diagnostic Commands:

# Validate backend addresses
grep "address:" config.yaml | grep -v ":"

# Check health check configuration
yq '.upstream[].backends[].health_check' config.yaml

# Verify backend weights
yq '.upstream[].backends[].weight' config.yaml

Resolution:

# Correct backend configuration
backends:
  - id: backend1
    address: "10.0.1.10:8080"     # Must include port
    weight: 100                    # Must be > 0
    health_check:
      path: "/health"
      interval: 5000               # Must be > 0 (milliseconds)
      timeout_ms: 2000             # Must be > 0
      failure_threshold: 3         # Must be > 0
      success_threshold: 2         # Must be > 0
      cooldown_ms: 10000           # Must be > 0

TLS Certificate Problems

Certificate File Access Errors

Error Messages:

TLS certificate file does not exist: /etc/spooky/certs/server.crt
TLS private key file does not exist: /etc/spooky/certs/server.key
Cannot read TLS certificate file '/etc/spooky/certs/server.crt': Permission denied
Cannot read TLS private key file '/etc/spooky/certs/server.key': Permission denied
Failed to load certificate '/etc/spooky/certs/server.crt': No such file or directory
Failed to load key '/etc/spooky/certs/server.key': error:02001002:system library:fopen:No such file or directory

Root Causes: - Certificate or key file path does not exist - Insufficient file permissions for Spooky process - Invalid PEM format - File ownership prevents access

Diagnostic Commands:

# Verify file existence and permissions
ls -la /etc/spooky/certs/server.{crt,key}

# Check file ownership
stat /etc/spooky/certs/server.crt

# Test read access
sudo -u spooky cat /etc/spooky/certs/server.crt > /dev/null

# Validate PEM format
openssl x509 -in /etc/spooky/certs/server.crt -text -noout
openssl rsa -in /etc/spooky/certs/server.key -check -noout

Resolution:

# Fix file permissions
sudo chown spooky:spooky /etc/spooky/certs/server.{crt,key}
sudo chmod 644 /etc/spooky/certs/server.crt
sudo chmod 600 /etc/spooky/certs/server.key

# Verify certificate chain
openssl verify -CAfile ca.crt /etc/spooky/certs/server.crt

# Test certificate-key pair match
diff <(openssl x509 -in server.crt -noout -modulus | openssl md5) \
     <(openssl rsa -in server.key -noout -modulus | openssl md5)

TLS Handshake Failures

Error Messages:

TLS configuration error during request processing: handshake failure
Failed to load certificate: invalid certificate format
QUIC recv failed: TlsFail

Root Causes: - Certificate-key mismatch - Expired certificate - Incomplete certificate chain - Unsupported TLS version - Client does not support required ALPN protocols (h3, h3-29)

Diagnostic Commands:

# Check certificate expiration
openssl x509 -in server.crt -noout -dates

# Verify certificate chain
openssl s_client -connect localhost:443 -showcerts < /dev/null

# Check ALPN negotiation (requires curl with HTTP/3 support)
curl --http3 -v https://localhost:443 2>&1 | grep -i alpn

# Monitor TLS handshake packets
sudo tcpdump -i any -n udp port 443 -X | grep -A 20 "Initial"

Resolution:

# Regenerate certificate with proper SAN
openssl req -new -x509 -days 365 -key server.key -out server.crt \
  -subj "/CN=example.com" \
  -addext "subjectAltName=DNS:example.com,DNS:*.example.com"

# Ensure certificate chain is complete
cat server.crt intermediate.crt > fullchain.crt

# Update configuration
sed -i 's|cert: .*|cert: /etc/spooky/certs/fullchain.crt|' config.yaml

QUIC Connection Issues

Connection ID Mismatch

Error Messages:

Wrong QUIC HEADER
Non-Initial packet for unknown connection, ignoring
Dropping packet for unknown connection from 192.168.1.10:52341 (DCID: a3f2...)

Root Causes: - Client using stale connection ID after server restart - Connection ID collision or corruption - NAT rebinding without proper migration support - Packet reordering or duplication

Diagnostic Commands:

# Monitor connection IDs in logs
journalctl -u spooky -f | grep -E "DCID|SCID"

# Check active QUIC connections
ss -u -a | grep :443

# Capture QUIC packets for analysis
sudo tcpdump -i any -w quic.pcap udp port 443
tshark -r quic.pcap -Y quic

# Count connection errors
journalctl -u spooky --since "1 hour ago" | grep -c "Wrong QUIC HEADER"

Resolution: - Issue is typically transient; clients will establish new connections - Ensure set_disable_active_migration(true) is set in QUIC config - Check for network middleboxes modifying UDP payloads - Increase max_idle_timeout if connections drop prematurely

Version Negotiation Failures

Error Messages:

Version negotiation failed: buffer too short
Failed to send version negotiation: Network unreachable

Root Causes: - Client requesting unsupported QUIC version - MTU constraints preventing version negotiation packet transmission - Network path blocking UDP packets - Firewall stateful inspection interfering with QUIC

Diagnostic Commands:

# Check supported QUIC version
journalctl -u spooky | grep "PROTOCOL_VERSION"

# Test MTU path
tracepath -n -b 443 target-host

# Verify UDP egress
nc -u -v -w 1 target-host 443 < /dev/null

# Monitor version negotiation packets
sudo tcpdump -i any udp port 443 -v | grep -i version

Resolution:

# Configure smaller UDP payload size
# Edit config or quiche parameters:
# set_max_recv_udp_payload_size(1200)
# set_max_send_udp_payload_size(1200)

# Adjust firewall rules
sudo iptables -I INPUT -p udp --dport 443 -j ACCEPT
sudo iptables -I OUTPUT -p udp --sport 443 -j ACCEPT

QUIC Timeout and Idle Connections

Error Messages:

QUIC recv failed: Done
Connection closed, not storing

Root Causes: - Idle timeout exceeded (default 5000ms in Spooky) - Network path timeout - Client terminated connection without proper close - NAT binding expired

Diagnostic Commands:

# Check connection lifetimes
journalctl -u spooky | grep -E "Creating new connection|Connection closed" | tail -20

# Monitor timeout events
journalctl -u spooky -f | grep "on_timeout"

# Analyze connection duration distribution
journalctl -u spooky --since "1 hour ago" | \
  grep "Creating new connection" | wc -l

# Check NAT timeout settings (if behind NAT)
cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout

Resolution:

// Adjust idle timeout in quiche configuration
quic_config.set_max_idle_timeout(10000);  // Increase to 10 seconds

// Tune UDP stream limits
quic_config.set_initial_max_streams_bidi(200);
quic_config.set_initial_max_streams_uni(200);

Backend Connectivity Failures

Unknown Backend Errors

Error Messages:

No route found for path: /api/users (host: Some("api.example.com"))
Upstream pool not found for: api
unknown backend: 10.0.1.10:8080

Root Causes: - Request path/host does not match any configured upstream route - Upstream pool not properly initialized - Backend not registered in H2 connection pool - Route matching logic precedence issue

Diagnostic Commands:

# List configured routes
yq '.upstream[] | {route}' config.yaml

# Test route matching
journalctl -u spooky -f | grep "No route found"

# Verify H2 pool initialization
journalctl -u spooky --since "10 minutes ago" | grep -i "pool"

# Check backend registration
ss -t | grep :8080 | wc -l

Resolution:

# Ensure proper route specificity (longest prefix matching)
upstream:
  api_v2:
    route:
      host: "api.example.com"
      path_prefix: "/api/v2"   # More specific
    backends: [...]

  api_v1:
    route:
      host: "api.example.com"
      path_prefix: "/api"      # Less specific
    backends: [...]

  default:
    route:
      path_prefix: "/"         # Catch-all
    backends: [...]

HTTP/2 Connection Pool Errors

Error Messages:

Transport error: send: connection error detected: frame with invalid size
Transport error: send: connection closed
Transport error: body: stream error received: stream no longer needed
Backend timeout

Root Causes: - Backend closed HTTP/2 connection unexpectedly - H2 frame size violation - Backend service crashed or restarted - Connection pool exhaustion (>64 inflight requests per backend) - Network timeout (2s default in Spooky)

Diagnostic Commands:

# Monitor H2 connection errors
journalctl -u spooky -f | grep "Transport error"

# Check backend H2 support
curl -I --http2 http://10.0.1.10:8080/

# Test backend health endpoint
curl -v http://10.0.1.10:8080/health

# Monitor connection pool saturation
journalctl -u spooky | grep "semaphore closed"

# Check backend service status
systemctl status backend-service

Resolution:

# Increase backend timeout if needed
# Edit quic_listener.rs: BACKEND_TIMEOUT = Duration::from_secs(5);

# Adjust max inflight requests per backend
# Edit quic_listener.rs: MAX_INFLIGHT_PER_BACKEND = 128;

# Restart backend service
sudo systemctl restart backend-service

# Monitor backend connection states
watch -n 1 'ss -t -a | grep :8080'

Backend Health Check Failures

Error Messages:

Backend 10.0.1.10:8080 became unhealthy
Health checks disabled: no Tokio runtime available

Root Causes: - Backend failing health check endpoint - Health check timeout too aggressive - Backend intermittently unavailable - Network path to backend unreliable - Health check threshold too sensitive

Diagnostic Commands:

# Monitor health transitions
journalctl -u spooky -f | grep -E "became healthy|became unhealthy"

# Manual health check
curl -w "@-" -o /dev/null -s http://10.0.1.10:8080/health <<< \
  'time_total: %{time_total}s\nhttp_code: %{http_code}\n'

# Check health check configuration
yq '.upstream[].backends[].health_check' config.yaml

# Monitor backend response times
httping -c 10 http://10.0.1.10:8080/health

Resolution:

# Adjust health check parameters for stability
backends:
  - id: backend1
    address: "10.0.1.10:8080"
    health_check:
      path: "/health"
      interval: 10000           # Increase interval
      timeout_ms: 5000          # Increase timeout
      failure_threshold: 5      # Require more failures
      success_threshold: 2      # Require consecutive successes
      cooldown_ms: 30000        # Longer cooldown

Load Balancing Issues

No Healthy Backends Available

Error Messages:

no healthy servers
no servers configured for upstream

Root Causes: - All backends failed health checks - Empty backend list for upstream - Backends in cooldown period after failures - Circuit breaker triggered

Diagnostic Commands:

# Check backend health status
journalctl -u spooky | grep -E "became healthy|became unhealthy" | tail -20

# Monitor 503 Service Unavailable responses
journalctl -u spooky | grep "status 503"

# Count healthy vs total backends per upstream
yq '.upstream[].backends | length' config.yaml

# Check recent health transitions
journalctl -u spooky --since "5 minutes ago" | grep "Backend"

Resolution:

# Verify backend services are running
for backend in 10.0.1.10:8080 10.0.1.11:8080; do
  echo -n "$backend: "
  curl -s -o /dev/null -w "%{http_code}" http://$backend/health || echo "FAIL"
  echo
done

# Temporarily disable health checks for debugging
# Set failure_threshold very high in config.yaml

# Restart Spooky to reset health state
sudo systemctl restart spooky

Uneven Load Distribution

Symptoms: - One backend receives disproportionate traffic - Round-robin not cycling through backends - Consistent hash not distributing evenly

Root Causes: - Backend weight misconfiguration - Inconsistent hash key selection (always same key) - Some backends marked unhealthy - Hash ring replica count too low for consistent-hash

Diagnostic Commands:

# Analyze backend selection distribution
journalctl -u spooky | grep "Selected backend" | \
  awk '{print $(NF-2)}' | sort | uniq -c

# Check backend weights
yq '.upstream[].backends[] | "\(.id): \(.weight)"' config.yaml

# Monitor load balancing algorithm
journalctl -u spooky | grep "via round-robin\|via consistent-hash\|via random"

# Check hash key consistency
journalctl -u spooky | grep "request_hash_key"

Resolution:

# Ensure proper weight distribution
backends:
  - id: backend1
    address: "10.0.1.10:8080"
    weight: 100
  - id: backend2
    address: "10.0.1.11:8080"
    weight: 100    # Equal weight for even distribution

# For consistent-hash, increase replica count
# Edit lb/src/lib.rs: DEFAULT_REPLICAS = 128;

Performance Problems

High Latency

Symptoms: - latency_ms in logs consistently >1000ms - Slow response times reported by clients - Backend timeout errors (503 status)

Root Causes: - Backend processing delay - Network congestion - Connection pool saturation - CPU saturation on Spooky host - Inefficient load balancing

Diagnostic Commands:

# Analyze latency distribution
journalctl -u spooky --since "1 hour ago" | \
  grep "latency_ms" | \
  awk '{print $(NF)}' | \
  sort -n | \
  awk '{sum+=$1; arr[NR]=$1} END {
    print "min:", arr[1];
    print "p50:", arr[int(NR*0.5)];
    print "p95:", arr[int(NR*0.95)];
    print "p99:", arr[int(NR*0.99)];
    print "max:", arr[NR];
    print "avg:", sum/NR;
  }'

# Monitor CPU usage
top -b -n 1 | grep spooky

# Check connection pool contention
journalctl -u spooky | grep "semaphore" | tail -20

# Measure backend response time directly
time curl http://10.0.1.10:8080/api/test

Resolution:

# Increase backend timeout if backends are slow but reliable
# Edit BACKEND_TIMEOUT in quic_listener.rs

# Scale backend capacity
# Add more backends to upstream pool

# Increase connection pool size
# Edit MAX_INFLIGHT_PER_BACKEND in quic_listener.rs

# Optimize backend application
# Profile and optimize backend code

Memory Growth

Symptoms: - RSS memory continuously increasing - Out of memory errors - System swapping

Root Causes: - Connection leak (connections not properly closed) - Request/response body buffering - Metrics accumulation - QUIC connection state not cleaned up

Diagnostic Commands:

# Monitor memory usage over time
while true; do
  ps -p $(pgrep spooky) -o pid,vsz,rss,cmd | tail -1
  sleep 10
done

# Check connection count
ss -u | grep -c :443

# Analyze memory map
sudo pmap -x $(pgrep spooky)

# Check for file descriptor leaks
ls -l /proc/$(pgrep spooky)/fd | wc -l

Resolution:

# Restart Spooky periodically (temporary mitigation)
sudo systemctl restart spooky

# Monitor connection cleanup
journalctl -u spooky -f | grep "Connection closed"

# Reduce max idle timeout
# Edit quic_config.set_max_idle_timeout(3000);

# Limit connection count
# Implement connection limit in accept logic

UDP Packet Loss

Symptoms: - Retransmissions in QUIC logs - Client timeout errors - Degraded throughput

Root Causes: - Network congestion - UDP buffer overflow (receive buffer too small) - Firewall dropping packets - MTU fragmentation

Diagnostic Commands:

# Check UDP buffer sizes
sysctl net.core.rmem_max net.core.rmem_default
sysctl net.core.wmem_max net.core.wmem_default

# Monitor UDP statistics
netstat -su | grep -E "packet receive errors|receive buffer errors"

# Capture packet loss
sudo tcpdump -i any -c 1000 udp port 443 -w capture.pcap
tshark -r capture.pcap -q -z io,stat,1

# Check interface statistics
ip -s link show eth0

Resolution:

# Increase UDP buffer sizes
sudo sysctl -w net.core.rmem_max=26214400
sudo sysctl -w net.core.wmem_max=26214400
sudo sysctl -w net.core.rmem_default=26214400
sudo sysctl -w net.core.wmem_default=26214400

# Make permanent
echo "net.core.rmem_max=26214400" | sudo tee -a /etc/sysctl.conf
echo "net.core.wmem_max=26214400" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Reduce UDP payload size
# Edit quic_config.set_max_recv_udp_payload_size(1350);

Debugging Techniques

Enable Debug Logging

# config.yaml
log:
  level: haunt  # debug level

# Restart to apply
sudo systemctl restart spooky

# Monitor debug logs
journalctl -u spooky -f --output=cat

Analyze Request Flow

# Trace specific request path
journalctl -u spooky | grep -E "HTTP/3 request|Selected backend|Upstream.*status" | \
  grep "/api/users"

# Monitor complete request lifecycle
journalctl -u spooky -f | \
  grep -E "Creating new connection|HTTP/3 request|Selected backend|status.*latency_ms|Connection closed"

Packet Capture Analysis

# Capture QUIC traffic
sudo tcpdump -i any -w spooky.pcap udp port 443

# Analyze with tshark
tshark -r spooky.pcap -Y quic -T fields \
  -e frame.time -e ip.src -e ip.dst -e quic.header_form

# Decrypt QUIC (requires SSLKEYLOGFILE)
SSLKEYLOGFILE=/tmp/keys.log curl --http3 https://localhost:443/
tshark -r spooky.pcap -o tls.keylog_file:/tmp/keys.log -Y http3

Performance Profiling

# CPU profiling with perf
sudo perf record -F 99 -p $(pgrep spooky) -g -- sleep 30
sudo perf report --stdio | head -50

# Flamegraph generation
sudo perf record -F 99 -p $(pgrep spooky) -g -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

# Memory profiling (if built with jemalloc)
export MALLOC_CONF=prof:true,prof_prefix:/tmp/jeprof
sudo systemctl restart spooky
# Send traffic, then analyze with jeprof

Common Error Reference

Error Message	HTTP Status	Cause	Resolution
`invalid request`	400	Malformed HTTP/3 headers	Check client request format
`no servers configured for upstream`	503	Empty backend list	Add backends to upstream config
`no healthy servers`	503	All backends unhealthy	Check backend health, adjust thresholds
`invalid server`	503	Backend index out of bounds	Configuration reload race condition
`upstream error`	502	Backend connection failed	Verify backend connectivity
`upstream timeout`	503	Backend exceeded 2s timeout	Increase BACKEND_TIMEOUT or optimize backend
`internal server error`	500	TLS configuration error	Check certificate/key files
`Wrong QUIC HEADER`	(dropped)	Malformed QUIC packet	Check for network corruption
`No route found for path`	(internal)	No matching upstream route	Add route configuration
`Upstream pool not found`	(internal)	Pool initialization failure	Check logs for startup errors

Support and Escalation

When reporting issues, include:

# 1. Version information
spooky --version

# 2. Configuration (sanitized)
yq eval 'del(.listen.tls.key, .listen.tls.cert)' config.yaml

# 3. System information
uname -a
cat /etc/os-release

# 4. Error logs (last 100 lines)
journalctl -u spooky --no-pager -n 100 --since "1 hour ago"

# 5. Resource utilization
ps aux | grep spooky
ss -u | grep -c :443
free -h

# 6. Network diagnostics
ss -ulnp | grep spooky
sudo iptables -L -n -v | grep 443

For production incidents, capture diagnostic bundle:

#!/bin/bash
mkdir -p spooky-diagnostics
cd spooky-diagnostics

spooky --version > version.txt
uname -a > system.txt
journalctl -u spooky --no-pager -n 500 > logs.txt
yq eval 'del(.listen.tls.key, .listen.tls.cert)' ../config.yaml > config.yaml
ps aux | grep spooky > processes.txt
ss -tulnp > sockets.txt
free -h > memory.txt
sudo tcpdump -i any -c 100 -w capture.pcap udp port 443

cd ..
tar czf spooky-diagnostics-$(date +%Y%m%d-%H%M%S).tar.gz spooky-diagnostics/