Operational Runbook
This runbook provides step-by-step instructions for standard operations, maintenance, and incident response for the Tower Defense project.
1. Regular Maintenance
Log Monitoring
Check the logs of the API and Game Server regularly to identify potential issues before they escalate.
- Command: pnpm nx run api:logs (or check your container logging platform).
- What to look for: ERROR or CRITICAL levels, high latency spikes, or frequent 5xx responses.
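As a rough illustration, the log check could be partly automated by counting the signals listed above. This is a minimal sketch, not the project's tooling: the function name summarizeLogs, the line layout ("<timestamp> <LEVEL> <message>"), and the status=NNN field are assumptions to adapt to the real log format.

```typescript
// Assumed (hypothetical) log shape: "<ISO timestamp> <LEVEL> <message>",
// with HTTP status codes written as "status=NNN".
interface LogSummary {
  errors: number;       // lines containing the ERROR level
  criticals: number;    // lines containing the CRITICAL level
  serverErrors: number; // lines reporting a 5xx response
}

function summarizeLogs(lines: string[]): LogSummary {
  const summary: LogSummary = { errors: 0, criticals: 0, serverErrors: 0 };
  for (const line of lines) {
    if (/\bERROR\b/.test(line)) summary.errors += 1;
    if (/\bCRITICAL\b/.test(line)) summary.criticals += 1;
    if (/\bstatus=5\d\d\b/.test(line)) summary.serverErrors += 1;
  }
  return summary;
}
```

Feeding it the output of the logs command (split into lines) gives a quick count to compare between checks.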
Database Backups
Ensure that automated backups are running correctly in your production environment.
- Verification: Check the last backup timestamp in your database management console.
- Test Restore: Periodically restore a backup to a staging environment to verify data integrity.
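The timestamp verification can be reduced to a simple freshness check. The sketch below assumes you can obtain the last backup time from your database console or its API; the function name isBackupStale and the 24-hour limit in the usage note are illustrative assumptions, not project settings.

```typescript
// Returns true when the most recent backup is older than the allowed age.
// maxAgeHours is an assumed policy value; set it to match your backup schedule.
function isBackupStale(lastBackup: Date, now: Date, maxAgeHours: number): boolean {
  const ageMs = now.getTime() - lastBackup.getTime();
  return ageMs > maxAgeHours * 60 * 60 * 1000;
}
```

For example, isBackupStale(lastBackup, new Date(), 24) would flag a daily backup job that has missed a run.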
2. Standard Operations
System Restart (Graceful)
To restart the system without dropping active user sessions:
- Trigger a rolling restart of the API instances.
- Ensure the Database remains reachable during the process.
- Refresh the Frontend cache (if using a CDN).
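The rolling restart can be sketched as a loop that restarts one instance at a time and waits for it to report healthy before moving on, so the remaining instances keep serving traffic. The restart and isHealthy hooks are hypothetical placeholders for whatever your orchestrator provides, and the retry count and delay are assumed defaults.

```typescript
// Hypothetical hooks: supply real implementations for your orchestrator.
type RestartFn = (instance: string) => Promise<void>;
type HealthFn = (instance: string) => Promise<boolean>;

// Restarts instances one at a time; halts the rollout if one fails to recover.
async function rollingRestart(
  instances: string[],
  restart: RestartFn,
  isHealthy: HealthFn,
  maxChecks = 10,  // assumed default
  delayMs = 1000,  // assumed default
): Promise<string[]> {
  const restarted: string[] = [];
  for (const instance of instances) {
    await restart(instance);
    let healthy = false;
    for (let attempt = 0; attempt < maxChecks && !healthy; attempt++) {
      healthy = await isHealthy(instance);
      if (!healthy) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
    if (!healthy) {
      throw new Error(`Instance ${instance} did not become healthy; rollout halted`);
    }
    restarted.push(instance);
  }
  return restarted;
}
```

Halting on the first unhealthy instance is a deliberate choice here: it leaves the not-yet-restarted instances untouched while you investigate.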
Applying Database Migrations
Run any migration that involves breaking changes during a maintenance window.
- Command: pnpm nx run db:push (for dev) or use a migration runner for prod.
- Rollback: Keep a SQL rollback script ready for every schema change.
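One way to enforce the rollback rule is a pre-deploy check that every migration has a matching rollback script. The sketch below assumes a hypothetical naming convention (name.up.sql paired with name.down.sql); the project may organize its migrations differently, so treat the suffixes and the function name findMissingRollbacks as placeholders.

```typescript
// Assumed (hypothetical) convention: "<name>.up.sql" is paired with "<name>.down.sql".
// Returns the migration files that have no rollback script.
function findMissingRollbacks(files: string[]): string[] {
  const present = new Set(files);
  return files
    .filter((file) => file.endsWith(".up.sql"))
    .filter((file) => !present.has(file.replace(/\.up\.sql$/, ".down.sql")));
}
```

A non-empty result is a reason to stop before applying the migration in production.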
3. Incident Management
High Latency / Load
If the API response time exceeds 500ms:
- Check CPU and Memory usage of the API instances.
- Verify Database connection pool saturation.
- Scale out by adding more API instances if necessary.
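The triage steps above can be expressed as a small decision helper. This is a sketch under stated assumptions: it uses the 95th-percentile latency as "the API response time", and the 80% resource and 90% pool thresholds are illustrative values, none of which come from the project itself.

```typescript
interface ApiSnapshot {
  latenciesMs: number[]; // recent response times in milliseconds
  cpuPercent: number;
  memoryPercent: number;
  poolInUse: number;     // database connections currently checked out
  poolSize: number;      // maximum connections in the pool
}

// Nearest-rank percentile; returns 0 for an empty sample.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Returns findings only when p95 latency exceeds the 500 ms limit.
function triageLatency(s: ApiSnapshot): string[] {
  const findings: string[] = [];
  const p95 = percentile(s.latenciesMs, 95);
  if (p95 <= 500) return findings;
  findings.push(`p95 latency ${p95}ms exceeds 500ms`);
  if (s.cpuPercent >= 80 || s.memoryPercent >= 80) { // assumed threshold
    findings.push("API instances are resource-bound: consider scaling out");
  }
  if (s.poolInUse / s.poolSize >= 0.9) { // assumed threshold
    findings.push("database connection pool is near saturation");
  }
  return findings;
}
```

Checking the pool separately matters because adding API instances will not help, and can make things worse, when the database connection pool is the bottleneck.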
Security Breach
If a security vulnerability is exploited:
- Isolate the affected instances.
- Revoke all active sessions via Better Auth.
- Rotate sensitive secrets (e.g., AUTH_SECRET, DATABASE_URL).
- Follow the Security Policy for reporting and disclosure.
Important: Every manual intervention should be documented in the incident log. For recurring issues, update the Troubleshooting guide.
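To keep incident log entries consistent, a fixed record shape can help. The fields and the one-line format below are assumptions for illustration; the runbook does not prescribe a format for the incident log.

```typescript
// Hypothetical record shape for one manual intervention.
interface IncidentLogEntry {
  timestamp: Date;
  operator: string;
  action: string;     // what was done, e.g. "rolling restart of API"
  reason: string;     // why it was needed
  recurring: boolean; // true if this issue has been seen before
}

// Formats an entry as a single pipe-separated line with a UTC timestamp.
function formatIncidentEntry(entry: IncidentLogEntry): string {
  const flag = entry.recurring ? " [recurring: update Troubleshooting guide]" : "";
  return `${entry.timestamp.toISOString()} | ${entry.operator} | ${entry.action} | ${entry.reason}${flag}`;
}
```

The recurring flag gives a simple prompt to carry the fix into the Troubleshooting guide.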