Monitoring your active Deployments is key to delivering business value and staying ahead of potential problems. Skafos provides three primary tools for diagnosing, investigating, and analyzing your Deployments:
- Logs - live and historical, queryable, analyzable
- Metrics - live and historical, system performance, model performance, training metrics
- Alerts - user-defined notifications sent via email or viewable on the Skafos Dashboard
Your Deployment generates logs for each Job. These include both system logs and logs that you define with a simple Skafos SDK call from within your Job. To see what’s happening in real time, or to examine historical logs, use the CLI or the Dashboard. On the Dashboard, you can search logs by specific keywords or phrases.
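As a rough illustration of user-defined logs and keyword search, the sketch below uses Python's standard logging module as a stand-in. It does not use the Skafos SDK or the Dashboard; the ListHandler class, the search_logs helper, and the "example_job" name are hypothetical and exist only for this example.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted log lines in memory so they can be searched later."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def search_logs(lines, keyword):
    """Return the log lines that contain `keyword`, ignoring case."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]


# Stand-in for a job's logger; this is NOT the Skafos SDK logging call.
handler = ListHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
job_log = logging.getLogger("example_job")  # hypothetical job name
job_log.setLevel(logging.INFO)
job_log.addHandler(handler)
job_log.propagate = False  # keep the example's output out of the root logger

# Messages a job might define for itself.
job_log.info("Loaded 1000 rows from the input table")
job_log.info("Training started")
job_log.warning("Training loss plateaued at epoch 7")

# Keyword search over the collected lines, similar in spirit to a log search.
matches = search_logs(handler.lines, "training")
```

Here matches holds the two lines that mention "training", while the line about loading rows is filtered out.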
Skafos also tracks various system and job-specific metrics for your active Deployments:
- System Metrics
-- Resource Utilization - CPUs, Memory
-- Deployment Status
- User-Defined Metrics
To view custom metrics that have been reported for a specific job, navigate to the job. You'll see a metric name for each type of metric that has been reported. In the example below, the job has only one type of metric: "Model Loss".
"Model Loss" metrics have been reported for this job.
Charts are initially collapsed. To see a chart populated with data, click "Show". If the job is running, the chart periodically checks for new data and updates itself.
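To make the reporting and refresh behavior concrete, here is a minimal sketch of a job reporting a "Model Loss" metric while a chart periodically picks up new points. The MetricStore class and its report, names, and points_since methods are hypothetical stand-ins, not the Skafos SDK, and the loss values are made up for illustration.

```python
class MetricStore:
    """Hypothetical in-memory stand-in for reported metrics (not the Skafos SDK)."""

    def __init__(self):
        self._series = {}

    def report(self, name, x, y):
        """Record one (x, y) point under the metric called `name`."""
        self._series.setdefault(name, []).append((x, y))

    def names(self):
        """List each type of metric that has been reported."""
        return sorted(self._series)

    def points_since(self, name, offset):
        """Return the points for `name` recorded after the first `offset` points."""
        return self._series.get(name, [])[offset:]


store = MetricStore()
chart = []  # points the chart has already drawn
seen = 0    # how many points the chart has fetched so far

for epoch in range(1, 7):
    loss = 1.0 / epoch  # made-up loss value for illustration
    store.report("Model Loss", x=epoch, y=loss)

    # The chart checks for new data only every second epoch, mimicking a
    # periodic refresh while the job is still running.
    if epoch % 2 == 0:
        fresh = store.points_since("Model Loss", seen)
        chart.extend(fresh)
        seen += len(fresh)
```

Tracking an offset means each refresh fetches only the points added since the previous one, so the chart never redraws data it already has.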
If a Job fails, or something unexpectedly goes wrong, you need to be the first to know. Using the Skafos SDK, you can log special alerts that trigger a notification on the Skafos Dashboard and in your email inbox.
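One common pattern is to wrap the body of a job so that any failure raises an alert before the error propagates. The sketch below assumes a hypothetical AlertSink class in place of the real alerting call; it only records alerts in memory and sends no email or Dashboard notification. The run_job and flaky_job functions are likewise illustrative.

```python
class AlertSink:
    """Hypothetical stand-in for an alerting call (not the Skafos SDK)."""

    def __init__(self):
        self.sent = []

    def alert(self, title, message):
        """Record an alert; a real sink would notify the user."""
        self.sent.append({"title": title, "message": message})


def run_job(job, sink):
    """Run `job`; on failure, raise an alert and then re-raise the error."""
    try:
        return job()
    except Exception as exc:
        sink.alert("Job failed", f"{type(exc).__name__}: {exc}")
        raise


def flaky_job():
    """Illustrative job that always fails."""
    raise ValueError("input table is empty")


sink = AlertSink()
try:
    run_job(flaky_job, sink)
except ValueError:
    pass  # the failure was already reported through the sink
```

Re-raising after the alert keeps the job's failure visible to whatever is running it, so the alert adds a notification without hiding the error.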