Alerts & Notifications
Alert System Architecture
Creating Alert Rules
Navigate to Project → Alerts → New Alert Rule. Select a metric, choose a condition, set the threshold, pick a severity (Low/Medium/High/Critical), configure notification channels (Email/Slack/Teams), set a cooldown period to prevent alert spam, and optionally enable auto-resolve so the alert clears when the condition does. Test alert delivery before enabling a rule in production.
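The same rule can also be expressed programmatically. The sketch below assumes a hypothetical Noah REST endpoint (POST /api/v1/projects/{id}/alert-rules) and illustrative field names; check them against your deployment's actual API before use.
# create_alert_rule.py -- illustrative sketch; the endpoint and field names are assumptions
import os
import requests

NOAH_API_URL = os.getenv('NOAH_API_URL', 'https://noah.example.com')
NOAH_API_KEY = os.getenv('NOAH_API_KEY')
NOAH_PROJECT_ID = os.getenv('NOAH_PROJECT_ID')

rule = {
    'name': 'High Request Cost',
    'metric': 'cost',
    'condition': 'greater_than',     # assumed condition keyword
    'threshold': 0.50,
    'severity': 'high',              # low | medium | high | critical
    'channels': ['email', 'slack'],
    'cooldown_minutes': 30,          # suppress repeat notifications
    'auto_resolve': True,            # clear the alert when the condition clears
}

response = requests.post(
    f"{NOAH_API_URL}/api/v1/projects/{NOAH_PROJECT_ID}/alert-rules",
    json=rule,
    headers={'Authorization': f'Bearer {NOAH_API_KEY}'},
    timeout=10,
)
response.raise_for_status()
print('Created rule:', response.json())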
Pre-Built Alert Templates
Alert Template Library
Cost Management Templates
1. High Request Cost
- Trigger: cost > $0.50
- Severity: High
- Channels: Email, Slack
- Use Case: Detect expensive requests that exceed budget
2. Daily Budget Exceeded
- Trigger: daily_spend > $100
- Severity: Critical
- Channels: Email, Slack, Teams
- Use Case: Prevent runaway costs
3. Cost Spike Detection
- Trigger: cost > 3x rolling average
- Severity: Medium
- Channels: Email
- Use Case: Early warning for cost anomalies (see the detection sketch after this list)
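For reference, the cost-spike condition compares each request's cost to a rolling average of recent request costs. The sketch below shows one way to evaluate that condition locally; the 3x multiplier mirrors the template above, while the window size and class name are illustrative only.
# cost_spike.py -- illustrative check mirroring the "Cost Spike Detection" template
from collections import deque

class CostSpikeDetector:
    """Flags a request whose cost exceeds 3x the rolling average of recent requests."""

    def __init__(self, window_size: int = 100, multiplier: float = 3.0):
        self.window = deque(maxlen=window_size)
        self.multiplier = multiplier

    def is_spike(self, cost: float) -> bool:
        # Require some history before the comparison is meaningful
        if len(self.window) >= 10:
            rolling_avg = sum(self.window) / len(self.window)
            spike = cost > self.multiplier * rolling_avg
        else:
            spike = False
        self.window.append(cost)
        return spike

detector = CostSpikeDetector()
for cost in (0.02, 0.03, 0.02, 0.02, 0.03, 0.02, 0.02, 0.03, 0.02, 0.02, 0.25):
    if detector.is_spike(cost):
        print(f"Cost spike detected: ${cost:.2f}")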
Performance Templates
1. High Latency
- Trigger: task_e2e_latency > 10000ms
- Severity: High
- Channels: Slack, PagerDuty
- Use Case: Ensure responsive user experience
2. Low Success Rate
- Trigger: task_success_rate < 0.95
- Severity: Critical
- Channels: All
- Use Case: Detect system failures
3. High Error Rate
- Trigger: error_count > 50/hour
- Severity: High
- Channels: Email, Slack
- Use Case: Monitor system health
Content Quality Templates
1. Content Drift
- Trigger: content_drift > 0.7
- Severity: Medium
- Channels: Email
- Use Case: Detect model behavior changes
2. Low Readability
- Trigger: readability_score < 30
- Severity: Low
- Channels: Email
- Use Case: Maintain content quality standards
3. Negative Sentiment
- Trigger: sentiment_score < -0.5
- Severity: Medium
- Channels: Email, Slack
- Use Case: Monitor customer satisfaction
Security & Compliance Templates
1. PII Output Leakage
- Trigger: pii_output_total > 0
- Severity: Critical
- Channels: All + Incident Management
- Use Case: Prevent data breaches
2. Excessive PII Input
- Trigger: pii_input_total > 5
- Severity: Medium
- Channels: Email, Slack
- Use Case: Monitor PII exposure
3. Robustness Test Failure
- Trigger: robustness_score < 0.8
- Severity: High
- Channels: Email, Slack
- Use Case: Ensure model safety
Alert Automation Examples
1. Auto-Scaling Based on Alerts
# webhook_handler.py
from flask import Flask, request
import boto3

app = Flask(__name__)
ec2 = boto3.client('ec2')

MAX_INSTANCES = 10  # upper bound for auto-scaling

@app.route('/webhook/noah-alerts', methods=['POST'])
def handle_noah_alert():
    alert = request.json

    # High latency detected - scale up
    if alert['metric'] == 'task_e2e_latency' and alert['severity'] == 'high':
        current_instances = get_instance_count()
        if current_instances < MAX_INSTANCES:
            scale_up_instances(current_instances + 2)
            return {
                'action': 'scaled_up',
                'from': current_instances,
                'to': current_instances + 2
            }

    # Cost exceeded - switch to a cheaper model
    if alert['metric'] == 'cost' and alert['value'] > 0.50:
        switch_to_model('gpt-3.5-turbo')
        return {
            'action': 'switched_model',
            'from': 'gpt-4',
            'to': 'gpt-3.5-turbo'
        }

    return {'action': 'no_action_required'}

def get_instance_count():
    # Implementation: e.g. count running instances via ec2.describe_instances()
    pass

def scale_up_instances(count):
    # Implementation: launch additional instances up to `count`
    pass

def switch_to_model(model_name):
    # Implementation: update routing config to send requests to `model_name`
    pass
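To exercise the handler locally, post a sample alert to it. The fields below match what the handler reads (metric, severity, value); the rest of the payload shape is an assumption about Noah's webhook format.
# test_webhook.py -- post a sample alert to the handler above (payload shape assumed)
import requests

sample_alert = {
    'metric': 'task_e2e_latency',
    'severity': 'high',
    'value': 12500,
    'threshold': 10000,
    'title': 'High Latency',
    'project_name': 'demo-project',
}

resp = requests.post(
    'http://localhost:5000/webhook/noah-alerts',
    json=sample_alert,
    timeout=5,
)
print(resp.json())  # e.g. {'action': 'scaled_up', ...}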
2. Incident Management Integration
# pagerduty_integration.py
import os
import requests

PAGERDUTY_API_KEY = os.getenv('PAGERDUTY_API_KEY')
PAGERDUTY_SERVICE_ID = os.getenv('PAGERDUTY_SERVICE_ID')
PAGERDUTY_FROM_EMAIL = os.getenv('PAGERDUTY_FROM_EMAIL')  # requester email for the incidents API

def create_incident(alert):
    """Create PagerDuty incident from Noah alert"""
    # Only create incidents for critical alerts
    if alert['severity'] != 'critical':
        return None

    payload = {
        'incident': {
            'type': 'incident',
            'title': f"Noah Alert: {alert['title']}",
            'service': {
                'id': PAGERDUTY_SERVICE_ID,
                'type': 'service_reference'
            },
            'urgency': 'high',
            'body': {
                'type': 'incident_body',
                'details': format_alert_details(alert)
            }
        }
    }

    response = requests.post(
        'https://api.pagerduty.com/incidents',
        json=payload,
        headers={
            'Authorization': f'Token token={PAGERDUTY_API_KEY}',
            'Content-Type': 'application/json',
            'Accept': 'application/vnd.pagerduty+json;version=2',
            'From': PAGERDUTY_FROM_EMAIL  # PagerDuty requires a valid user email when creating incidents
        }
    )
    response.raise_for_status()
    return response.json()

def format_alert_details(alert):
    return f"""
Project: {alert['project_name']}
Metric: {alert['metric']}
Current Value: {alert['value']}
Threshold: {alert['threshold']}
Triggered: {alert['triggered_at']}
View in Noah: {alert['dashboard_url']}
"""
3. Slack Bot Commands
# slack_bot.py
import os

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

client = WebClient(token=os.getenv('SLACK_BOT_TOKEN'))

# get_active_alerts, mute_alerts, and get_system_status are assumed helpers
# that query the Noah API; their implementations are omitted here.

def handle_slash_command(command, user_id, channel_id):
    """Handle /noah commands in Slack"""
    if command == '/noah alerts':
        # Get active alerts
        alerts = get_active_alerts()
        blocks = []
        for alert in alerts:
            blocks.append({
                'type': 'section',
                'text': {
                    'type': 'mrkdwn',
                    'text': f"*{alert['title']}* ({alert['severity'].upper()})\n{alert['description']}"
                },
                'accessory': {
                    'type': 'button',
                    'text': {'type': 'plain_text', 'text': 'Acknowledge'},
                    'action_id': f"ack_{alert['id']}"
                }
            })
        client.chat_postMessage(
            channel=channel_id,
            blocks=blocks,
            text=f"You have {len(alerts)} active alerts"
        )

    elif command.startswith('/noah mute'):
        # Mute project alerts
        project_id = command.split()[-1]
        mute_alerts(project_id, duration='1h')
        client.chat_postMessage(
            channel=channel_id,
            text=f"Alerts muted for project {project_id} for 1 hour"
        )

    elif command == '/noah status':
        # Get system status
        status = get_system_status()
        client.chat_postMessage(
            channel=channel_id,
            text=(
                "System Status:\n"
                f"- Active Projects: {status['active_projects']}\n"
                f"- Requests (24h): {status['requests_24h']}\n"
                f"- Active Alerts: {status['active_alerts']}\n"
                f"- System Health: {status['health']}"
            )
        )
Notification Channels
Email Notifications
- Customizable templates
- Recipient lists per severity
- HTML or plain text format
- Digest mode available (daily/weekly)
Slack Integration
- Real-time notifications
- Interactive buttons (acknowledge, mute)
- Thread replies for updates
- Custom channel routing
Microsoft Teams
- Adaptive cards with rich formatting
- Action buttons
- Channel mentions
- Priority notifications
Webhooks
- Custom endpoint integration
- JSON payload
- Retry logic with exponential backoff
- Signature verification (see the sketch below)
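Webhook consumers should confirm that an incoming request really came from Noah before acting on it. The sketch below assumes an HMAC-SHA256 signature carried in an X-Noah-Signature header and a shared secret; the header name and signing scheme are assumptions, so verify them against your deployment.
# verify_webhook.py -- HMAC signature check; header name and signing scheme are assumptions
import hashlib
import hmac
import os

WEBHOOK_SECRET = os.getenv('NOAH_WEBHOOK_SECRET', '').encode()

def is_valid_signature(raw_body: bytes, signature_header: str) -> bool:
    """Return True if the request body matches the signature header."""
    expected = hmac.new(WEBHOOK_SECRET, raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, signature_header or '')

# Example use inside a Flask handler:
#   if not is_valid_signature(request.get_data(), request.headers.get('X-Noah-Signature')):
#       return {'error': 'invalid signature'}, 401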
Best Practices
Alert Configuration
- Start with pre-built templates
- Adjust thresholds based on baseline metrics
- Use cooldown periods to prevent alert fatigue
- Enable auto-resolve for transient issues
- Test alert delivery before production
Notification Strategy
- Route critical alerts to multiple channels
- Use escalation policies for unacknowledged alerts
- Configure quiet hours for non-critical alerts
- Set up on-call rotation for critical systems
- Document alert response procedures
Alert Tuning
- Review alert frequency weekly
- Adjust thresholds based on false positive rate
- Archive unused alert rules
- Consolidate similar alerts
- Update documentation as thresholds change