Logging, Monitoring, and Debugging

Overview

There are a variety of ways to log and debug the execution of platform application services, tools, and APIs.

Note
To learn how to use the platform's default monitoring service and pre-deployed Grafana dashboards to monitor application services, see Monitoring Platform Services.
Note
If you are integrating the platform with other logging tools (for example, Datadog), contact Iguazio Support.

For further troubleshooting assistance, visit Iguazio Support.

Logging Application Services

The platform has a default tenant-wide log-forwarder application service (log-forwarder) for forwarding application-service logs. The logs are forwarded to an instance of the Elasticsearch open-source search and analytics engine by using the open-source Filebeat log-shipper utility. The log-forwarder service is disabled by default. To enable it, on the Services dashboard page, select to edit the log-forwarder service; on the Custom Parameters tab, set the Elasticsearch URL parameter to the Elasticsearch instance that will be used to store and index the logs; then save and apply your changes to deploy the service.

Typically, the log-forwarder service should be configured to work with your own remote off-cluster instance of Elasticsearch.

Note
  • The default transfer protocol, which is used when the URL doesn't begin with "http://" or "https://", is HTTPS.
  • The default port, which is used when the URL doesn't end with ":<port number>", is port 80 for HTTP and port 443 for HTTPS.
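For example, under these defaults, the following hypothetical URL values (the host name is a placeholder) would resolve as shown:

    my-elastic.example.com               ->  https://my-elastic.example.com:443
    http://my-elastic.example.com        ->  http://my-elastic.example.com:80
    https://my-elastic.example.com:9200  ->  https://my-elastic.example.com:9200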

Checking Service Status

On the Services page, users with both the Service Admin and the Application Read Only view policies can check the status of the pods.

  • Press Inspect to view the status. You can also download the status as a TXT file from the pop-up window.

Kubernetes Tools

You can use the Kubernetes kubectl CLI from a platform web-shell or Jupyter Notebook application service to monitor, log, and debug your Kubernetes application cluster:

  • Use the get pods command to display information about the cluster's pods:
    kubectl get pods
    
  • Use the logs command to view the logs for a specific pod; replace POD with the name of one of the pods returned by the get command:
    kubectl logs POD
    
  • Use the top pod command to view pod resource metrics and monitor resource consumption; replace [POD] with the name of one of the pods returned by the get command, or omit it to display metrics for all pods:
    kubectl top pod [POD]
    
Note

To run kubectl commands from a web-shell service, the service must be configured with an appropriate service account; for more information about the web-shell service accounts, see The Web-Shell Service.

  • The get pods and logs commands require the "Log Reader" service account or higher.
  • The top pod command requires the "Service Admin" service account.

For more information about the kubectl CLI, see the Kubernetes documentation.
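For example, a typical debugging flow might chain these commands as in the following sketch; the pod name is illustrative (taken from the sample output later on this page), so replace it with a name returned by get pods on your cluster:

    # List the pods and identify the one to inspect
    kubectl get pods
    # Stream the last 100 log lines of the selected pod
    kubectl logs --tail=100 -f jupyter-58c5bf598f-z86qm
    # Display resource consumption for all pods (append a pod name to focus on one)
    kubectl top pod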

IGZTOP - Performance Reporting Tool

IGZTOP is a small performance-troubleshooting tool that displays useful information about pods in the default-tenant namespace. It provides insights into how your MLRun jobs consume cluster resources and complements the standard set of Kubernetes tools.

Prerequisites

To use IGZTOP, you need a shell service that is configured with Service Admin rights. To set up a new service, see Creating a New Service. To open the shell service:

  1. In the side navigation menu, select Services.
  2. On the Services page, press the shell service's name (link) in the list of services. A shell window opens.

How to Run IGZTOP

Run the following command in the shell-service window, and use the flags (options) to customize the results.

  igztop [--sort=<KEY>] [--limit=<KEY>] [--wide] [--all] [--kill-job=<JOB>] [--no-borders] [--check-permissions] [--update]

Option | Description
-h, --help | Displays the command help
-v, --version | Displays the current version
-s, --sort=<KEY> | Sorts by a key (for example, --sort memory)
-l, --limit=<KEY> | Filters by a key=value pair (for example, 'node=k8s-node1', 'name=trino')
-w, --wide | Shows additional fields, including "requests" and "limits"
-a, --all | Includes pods that are not currently running
-b, --no-borders | Prints the table without borders
-k, --kill-job=<JOB> | Kills one or more MLRun job pods (use "*" to kill multiple matching pods)
-c, --check-permissions | Checks the permissions of the user running this service
-u, --update | Fetches the latest version of IGZTOP
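
For instance, assuming your shell service has the required permissions, the documented flags can be combined; the following invocation would show the wide view, sorted by memory and printed without table borders:

  igztop --wide --sort memory --no-borders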

Examples

The following is an example of the default output.

$ igztop
+--------------------------------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+
| NAME                                             | CPU(m) | MEMORY(Mi) | NODE      | STATUS  | GPUs | GPU Util. | MLRun Proj. | MLRun Owner |
+--------------------------------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+
| authenticator-fb7dc6df7-vg9tl                    | 1      | 6          | k8s-node2 | Running |      |           |             |             |
| docker-registry-676df9d6f6-x788h                 | 1      | 6          | k8s-node2 | Running |      |           |             |             |
| framesd-8b4c489f-sx9rw                           | 1      | 17         | k8s-node1 | Running |      |           |             |             |
| grafana-5b456dc59-vjszb                          | 1      | 31         | k8s-node1 | Running |      |           |             |             |
| jupyter-58c5bf598f-z86qm                         | 9      | 641        | k8s-node1 | Running | 1/1  | 0.00%     |             |             |
| metadata-envoy-deployment-786b65df7d-lldp2       | 3      | 11         | k8s-node2 | Running |      |           |             |             |
| metadata-grpc-deployment-7cfc8d9b8-prdr8         | 1      | 1          | k8s-node2 | Running |      |           |             |             |
| metadata-writer-7bfd6bf6dd-x8fj9                 | 1      | 140        | k8s-node2 | Running |      |           |             |             |
| metrics-server-exporter-75b49dd6b8-k9r5q         | 5      | 14         | k8s-node2 | Running |      |           |             |             |
| ml-pipeline-565cb8497d-wkx2b                     | 5      | 16         | k8s-node2 | Running |      |           |             |             |
| ml-pipeline-persistenceagent-746949f667-822ch    | 1      | 9          | k8s-node1 | Running |      |           |             |             |
| ml-pipeline-scheduledworkflow-6fd754dfcf-slrtk   | 1      | 9          | k8s-node1 | Running |      |           |             |             |
| ml-pipeline-ui-7694dc85ff-4cjc5                  | 4      | 52         | k8s-node1 | Running |      |           |             |             |
| ml-pipeline-viewer-crd-6fbf9f8fbf-9cwvq          | 2      | 10         | k8s-node2 | Running |      |           |             |             |
| ml-pipeline-visualizationserver-66ccfd8f98-l6g86 | 5      | 80         | k8s-node1 | Running |      |           |             |             |
| mlrun-api-68cfd974db-cgph4                       | 4      | 317        | k8s-node2 | Running |      |           |             |             |
| mlrun-ui-88cbcffc4-8s76d                         | 0      | 11         | k8s-node1 | Running |      |           |             |             |
| monitoring-prometheus-server-78d5bbc9d4-vgrhm    | 5      | 42         | k8s-node2 | Running |      |           |             |             |
| mpi-operator-7589fbff58-87xr7                    | 1      | 13         | k8s-node1 | Running |      |           |             |             |
| mysql-kf-67b6cb589d-dvh94                        | 4      | 460        | k8s-node1 | Running |      |           |             |             |
| nuclio-controller-5bcbd8bcf6-6hqvv               | 1      | 10         | k8s-node2 | Running |      |           |             |             |
| nuclio-dashboard-84ddf6f9bd-c4scb                | 1      | 43         | k8s-node1 | Running |      |           |             |             |
| nuclio-dlx-86cbc4cdb9-wcds2                      | 1      | 5          | k8s-node2 | Running |      |           |             |             |
| nuclio-scaler-7b96584d57-l9cwj                   | 1      | 10         | k8s-node1 | Running |      |           |             |             |
| oauth2-proxy-5bd48c96d8-ghvqq                    | 1      | 4          | k8s-node2 | Running |      |           |             |             |
| trino-coordinator-65d6d64b5f-vc7gh              | 17     | 1319       | k8s-node1 | Running |      |           |             |             |
| trino-worker-76d774b948-dxxjr                   | 8      | 1229       | k8s-node2 | Running |      |           |             |             |
| trino-worker-76d774b948-vlxcn                   | 13     | 1236       | k8s-node1 | Running |      |           |             |             |
| provazio-controller-controller-c9bff698c-xk6r9   | 3      | 50         | k8s-node2 | Running |      |           |             |             |
| shell-7b47994578-z7s6k                           | 50     | 108        | k8s-node2 | Running |      |           |             |             |
| spark-operator-658bdfb6f5-bl2lb                  | 2      | 10         | k8s-node1 | Running |      |           |             |             |
| v3io-webapi-b9b9f                                | 8      | 1147       | k8s-node1 | Running |      |           |             |             |
| v3io-webapi-p4kbq                                | 8      | 1431       | k8s-node2 | Running |      |           |             |             |
| v3iod-fztqm                                      | 13     | 2541       | k8s-node2 | Running |      |           |             |             |
| v3iod-jg68d                                      | 14     | 2536       | k8s-node1 | Running |      |           |             |             |
| v3iod-locator-599b9d5c9f-76vrr                   | 0      | 5          | k8s-node1 | Running |      |           |             |             |
| workflow-controller-7d59d94444-5n7r7             | 1      | 12         | k8s-node2 | Running |      |           |             |             |
+--------------------------------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+
| SUM                                              | 197m   | 13582Mi    |           |         |      |           |             |             |
+--------------------------------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+

The following example compares Trino memory usage across all cluster nodes.

$ igztop --sort memory --limit name=trino
+-------------------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+
| NAME                                | CPU(m) | MEMORY(Mi) | NODE      | STATUS  | GPUs | GPU Util. | MLRun Proj. | MLRun Owner |
+-------------------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+
| trino-coordinator-65d6d64b5f-vc7gh  | 17     | 1320       | k8s-node1 | Running |      |           |             |             |
| trino-worker-76d774b948-vlxcn       | 12     | 1237       | k8s-node1 | Running |      |           |             |             |
| trino-worker-76d774b948-dxxjr       | 12     | 1230       | k8s-node2 | Running |      |           |             |             |
+-------------------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+
| SUM                                 | 41m    | 3787Mi     |           |         |      |           |             |             |
+-------------------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+

The following example displays which pods are using GPUs.

$ igztop --sort memory --limit gpu
+--------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+
| NAME                     | CPU(m) | MEMORY(Mi) | NODE      | STATUS  | GPUs | GPU Util. | MLRun Proj. | MLRun Owner |
+--------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+
| jupyter-58c5bf598f-z86qm | 10     | 644        | k8s-node1 | Running | 1/1  | 0.00%     |             |             |
+--------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+
| SUM                      | 10m    | 644Mi      |           |         |      |           |             |             |
+--------------------------+--------+------------+-----------+---------+------+-----------+-------------+-------------+

The following example uses the --all flag to include pods that are not currently running.

$ igztop -a
+---------------------------------------------------------------+--------+------------+-----------+-----------+--------------------------------+-------------+
| NAME                                                          | CPU(m) | MEMORY(Mi) | NODE      | STATUS    | MLRun Proj.                    | MLRun Owner |
+---------------------------------------------------------------+--------+------------+-----------+-----------+--------------------------------+-------------+
| v3io-webapi-dcknw                                             | 19     | 5069       | k8s-node1 | Running   |                                |             |
| trino-coordinator-68df99bdf9-8swbp                            | 20     | 1855       | k8s-node1 | Running   |                                |             |
| ml-pipeline-viewer-crd-656775dc5c-n4tzl                       | 2      | 11         | k8s-node1 | Running   |                                |             |
| prep-data-5vpsb                                               |        |            | k8s-node1 | Completed | getting-started-tutorial-admin | admin       |
| nuclio-dlx-555d6c4cc5-64mns                                   | 1      | 5          | k8s-node1 | Running   |                                |             |
| workflow-controller-64fdf54f56-8wq2v                          | 1      | 11         | k8s-node1 | Running   |                                |             |
| shell-service-admin-6f4bb85cc8-5d9v7                          | 56     | 89         | k8s-node1 | Running   |                                |             |
| nuclio-dashboard-5bcff7dc4c-b7r4j                             | 7      | 43         | k8s-node1 | Running   |                                |             |
| prep-data-fblfn                                               |        |            | k8s-node1 | Completed | getting-started-tutorial-admin | admin       |
| v3iod-x5fmt                                                   | 16     | 5080       | k8s-node1 | Running   |                                |             |
| prep-data-77k4m                                               |        |            | k8s-node1 | Completed | getting-started-tutorial-admin | admin       |
| spark-m9zrumpebc-rdxs6-worker-8647d4466f-rjk9h                | 9      | 302        | k8s-node1 | Running   |                                |             |
| metadata-grpc-deployment-6475f89f9b-t2gns                     | 1      | 1          | k8s-node1 | Running   |                                |             |
| metadata-writer-85ddb586d-wzz8v                               | 1      | 140        | k8s-node1 | Running   |                                |             |
| spark-m9zrumpebc-rdxs6-master-5fd5d6964b-bk527                | 8      | 361        | k8s-node1 | Running   |                                |             |
| nuclio-controller-dd6fcbc74-4pt82                             | 1      | 13         | k8s-node1 | Running   |                                |             |
| spark-operator-7f698b78dc-8x2sh                               | 2      | 10         | k8s-node1 | Running   |                                |             |
| prep-data-g6dps                                               |        |            | k8s-node1 | Completed | getting-started-tutorial-admin | admin       |
| mpi-operator-68dbcc88cc-z8nls                                 | 1      | 14         | k8s-node1 | Running   |                                |             |
| spark-operator-init-zxp8r                                     |        |            | k8s-node1 | Completed |                                |             |
| mlrun-ui-657cb7f889-rpn7t                                     | 0      | 11         | k8s-node1 | Running   |                                |             |
| prep-data-bgwg2                                               |        |            | k8s-node1 | Completed | getting-started-tutorial-admin | admin       |
| metadata-envoy-deployment-5f6c59895d-59qk9                    | 4      | 11         | k8s-node1 | Running   |                                |             |
| mysql-kf-77b69d9d9-4wwkc                                      | 3      | 463        | k8s-node1 | Running   |                                |             |
| framesd-7b89dc9cfb-t4p7x                                      | 1      | 17         | k8s-node1 | Running   |                                |             |
| ml-pipeline-86dc8d885-z6zpm                                   | 5      | 16         | k8s-node1 | Running   |                                |             |
| spark-master-7cf99697df-9dprz                                 | 7      | 408        | k8s-node1 | Running   |                                |             |
| ml-pipeline-visualizationserver-78ccd99c4b-zqwzb              | 4      | 76         | k8s-node1 | Running   |                                |             |
| ml-pipeline-scheduledworkflow-868c7467f5-6jxst                | 1      | 8          | k8s-node1 | Running   |                                |             |
| metrics-server-exporter-594f9c9c97-hvnsd                      | 8      | 12         | k8s-node1 | Running   |                                |             |
| mlrun-api-5dbb8899cb-5zwh5                                    | 28     | 403        | k8s-node1 | Running   |                                |             |
| shell-none-6d64c8675c-tndv4                                   | 5      | 66         | k8s-node1 | Running   |                                |             |
| provazio-controller-controller-77866b6b98-n7psm               | 4      | 50         | k8s-node1 | Running   |                                |             |
| monitoring-prometheus-server-5c959fc895-hfqg2                 | 6      | 43         | k8s-node1 | Running   |                                |             |
| v3iod-locator-597b4c7c87-jn2gx                                | 0      | 5          | k8s-node1 | Running   |                                |             |
| spark-worker-664878f4b6-mwq2r                                 | 8      | 404        | k8s-node1 | Running   |                                |             |
| prep-data-pq8jm                                               |        |            | k8s-node1 | Completed | getting-started-tutorial-admin | admin       |
| docker-registry-6d8b657f8d-h5297                              | 1      | 45         | k8s-node1 | Running   |                                |             |
| hive-5bfbc84589-5l9rg                                         | 7      | 284        | k8s-node1 | Running   |                                |             |
| ml-pipeline-persistenceagent-56f77bdb5f-vn9ms                 | 1      | 9          | k8s-node1 | Running   |                                |             |
| shell-app-admin-57474fc864-dkzct                              | 5      | 95         | k8s-node1 | Running   |                                |             |
| trino-worker-758dcccf6f-hst2f                                 | 12     | 1809       | k8s-node1 | Running   |                                |             |
| jupyter-77bfdd798d-vbt6r                                      | 9      | 585        | k8s-node1 | Running   |                                |             |
| mariadb-74d5c8cb74-zjq82                                      | 4      | 76         | k8s-node1 | Running   |                                |             |
| shell-log-reader-7d67d7f567-vbldj                             | 5      | 52         | k8s-node1 | Running   |                                |             |
| ml-pipeline-ui-7db966bb54-znch6                               | 4      | 46         | k8s-node1 | Running   |                                |             |
| oauth2-proxy-85b8c78fb7-8ktzx                                 | 1      | 3          | k8s-node1 | Running   |                                |             |
| nuclio-getting-started-tutorial-admin-serving-6d8fbdc9d-r4x2w | 1      | 165        | k8s-node1 | Running   |                                |             |
| authenticator-6647dd8467-r2g9n                                | 1      | 7          | k8s-node1 | Running   |                                |             |
| nuclio-scaler-78cc778b6c-kr7m9                                | 1      | 10         | k8s-node1 | Running   |                                |             |
+---------------------------------------------------------------+--------+------------+-----------+-----------+--------------------------------+-------------+
| SUM                                                           | 339m   | 18868Mi    |           |           |                                |             |
+---------------------------------------------------------------+--------+------------+-----------+-----------+--------------------------------+-------------+

To delete MLRun job pods, use igztop with the -k (--kill-job) option followed by a job name. Use '*' as a wildcard at the start or end of a job name to kill multiple job pods.

$ igztop -k '*data*'
pod "prep-data-5vpsb" deleted
pod "prep-data-77k4m" deleted
pod "prep-data-bgwg2" deleted
pod "prep-data-fblfn" deleted
pod "prep-data-g6dps" deleted
pod "prep-data-pq8jm" deleted

Event Logs

The Events page of the dashboard displays different platform event logs:

  • The Event Log tab displays system event logs.
  • The Alerts tab displays system alerts.
  • The Audit tab displays a subset of the system events for audit purposes — security events (such as a failed login) and user actions (such as creation and deletion of a container).

The Events page is visible to users with the IT Admin management policy — who can view all event logs — or to users with the Security Admin management policy — who can view only the Audit tab.

You can specify the email of a user with the IT Admin management policy to receive email notifications of events. Press the Settings icon, then type the user name in Users to Notify and press Apply. Verify that a test email is received; if not, check its status in the Events > Event Log tab.

Events in the Event Log Tab

Event class | Event kind | Event description
System | System.Cluster.Offline | Cluster 'cluster_name' moved to offline mode
System | System.Cluster.Shutdown | Cluster 'cluster_name' shutdown
System | System.Cluster.Shutdown.Aborted | Cluster 'cluster_name' shutdown aborted
System | System.Cluster.Online | Cluster 'cluster_name' moved to online mode
System | System.Cluster.Maintenance | Cluster 'cluster_name' moved to maintenance mode
System | System.Cluster.OnlineMaintenance | Cluster 'cluster_name' moved to online maintenance mode
System | System.Cluster.Degraded | Cluster 'cluster_name' is running in degraded mode
System | System.Cluster.Failback | Cluster 'cluster_name' moved to failback mode
System | System.Cluster.DataAccessType.ReadOnly | Successfully changed cluster 'cluster_name' data access type to read only
System | System.Cluster.DataAccessType.ReadWrite | Successfully changed cluster 'cluster_name' data access type to read/write
System | System.Cluster.DataAccessType.ContainerSpecific | Successfully changed data access type of data containers
System | System.Node.Down | Node 'node_name' is down
System | System.Node.Offline | Node 'node_name' is offline
System | System.Node.Online | Node 'node_name' is online
System | System.Node.Initialization | Node 'node_name' is in initialization state
Software | Software.ArtifactGathering.Job.Started | Artifact gathering job started on node 'node_name'
Software | Software.ArtifactGathering.Job.Succeeded | Artifact gathering completed successfully on node 'node_name'
Software | Software.ArtifactGathering.Job.Failed | Artifact gathering failed on node 'node_name'
Software | Software.ArtifactBundle.Upload.Succeeded | System logs were uploaded to 'upload_paths' successfully
Software | Software.ArtifactBundle.Upload.Failed | Logs collection could not be uploaded to 'upload_paths'
Software | Software.IDP.Synchronization.Started | IDP synchronization with 'IDP server' has been started.
Software | Software.IDP.Synchronization.Completed | IDP synchronization with 'IDP server' has been completed.
Software | Software.IDP.Synchronization.Periodic.Failed | IDP synchronization with 'IDP server' failed to complete periodic update.
Software | Software.IDP.Synchronization.Failed | IDP synchronization with 'IDP server' failed
Hardware | Hardware.UPS.NoAcPower | UPS 'upsId' connected to Node 'nodeName' lost AC power
Hardware | Hardware.UPS.LowBattery | UPS 'upsId' connected to Node 'nodeName' battery is low
Hardware | Hardware.UPS.PermanentFailure | UPS 'upsId' connected to Node 'nodeName' in failed state
Hardware | Hardware.UPS.AcPowerRestored | UPS 'upsId' connected to Node 'nodeName' AC power restored
Hardware | Hardware.UPS.Reachable | UPS 'upsId' connected to Node 'nodeName' is reachable
Hardware | Hardware.UPS.Unreachable | UPS 'upsId' connected to Node 'nodeName' is unreachable
Hardware | Hardware.Network.Interface.Up | Network interface to 'interfaceName' on node 'nodeName' - link regained
Hardware | Hardware.Network.Interface.Down | Network interface to 'interfaceName' on node 'nodeName' - link disconnected
Hardware | Hardware.temperature.high | Drive on node 'nodeName' temperature is above normal. Temperature is 'temp'.
Capacity | Capacity.StoragePool.UsedSpace.High | Space on storage pool 'pool_name' has reached current% of the total pool size.
Capacity | Capacity.StoragePoolDevice.UsedSpace.High | Space on storage pool device 'storage_pool_device_name' on storage device 'storage_device_name' has reached current% of the total size.
Capacity | Capacity.Tenant.UsedSpace.High | Space on tenant has reached of the total size
Alert | Alert.Test.External | Test description
Software | Software.Cluster.Reconfiguration.Completed | Reconfiguration on cluster 'cluster_name' completed successfully
Software | Software.Cluster.Reconfiguration.Failed | Reconfiguration on cluster 'cluster_name' failure
Software | Software.Events.Reconfiguration.Completed | Reconfiguration on cluster 'cluster_name' completed successfully
Software | Software.Events.Reconfiguration.Failed | Reconfiguration on cluster 'cluster_name' failure
Software | Software.AppServices.Reconfiguration.Completed | Reconfiguration on cluster 'cluster_name' completed successfully
Software | Software.AppServices.Reconfiguration.Failed | Reconfiguration on cluster 'cluster_name' failure
Software | Software.ArtifactVersionManifest.Reconfiguration.Completed | Reconfiguration on cluster 'cluster_name' completed successfully
Software | Software.ArtifactVersionManifest.Reconfiguration.Failed | Reconfiguration on cluster 'cluster_name' failure
System | System.DataContainer.Normal | DataContainer 'data_container_id' is running in normal mode.
System | System.DataContainer.Degraded | DataContainer 'data_container_id' is running in degraded mode.
System | System.DataContainer.Mapping.GenerationFailed | Failed to generate container mapping for DataContainer 'data_container_id'
System | System.DataContainer.Mapping.DistributionFailed | Failed to distribute container mapping for DataContainer 'data_container_id'
System | System.DataContainer.Resync.Complete | Resync completed on container 'data_container_id'
System | System.DataContainer.DataAccessType.ReadOnly | Data container 'data_container_id' is running in read only mode
System | System.DataContainer.DataAccessType.ReadWrite | Data container 'data_container_id' is running in read/write mode
System | System.DataContainer.DataAccessType.Update.Failed | Failed to set data access type for data container 'data_container_id'
System | System.Failover.Completed | Failover completed successfully
System | System.Failover.Failed | Failover failed
Software | Software.Email.Sending.Failed | Sending email failed due to 'reason'
Capacity | Capacity.StoragePool.UsableCapacity.CalculationFailed | Failed to calculate usable capacity of storage pool
Hardware | Hardware.Disks.DiskFailed | Storage device 'device_name' on node 'node_name' has failed
System | System.AppCluster.Initialization.Succeeded | App cluster 'name' was initialized successfully
System | System.AppCluster.Initialization.Failed | Failed to initialize app cluster 'name'
System | System.AppCluster.Services.Deployment.Succeeded | Default app services manifest for tenant 'tenant_name' was deployed successfully
System | System.AppCluster.Services.Deployment.Failed | Failed to deploy default app services manifest for tenant 'tenant_name'
System | System.Tenancy.Tenant.Creation.Succeeded | Tenant 'tenant_name' was successfully created
System | System.Tenancy.Tenant.Creation.Failed | Failed to create tenant
System | System.Tenancy.Tenant.Deletion.Succeeded | Tenant 'tenant_name' was successfully deleted
System | System.Tenancy.Tenant.Deletion.Failed | Failed to delete tenant 'tenant_name'
System | System.AppCluster.Tenant.Creation.Succeeded | Tenant 'tenant_name' was successfully created on app cluster 'app_cluster'
System | System.AppCluster.Tenant.Creation.Failed | Failed to create tenant on app cluster
System | System.AppCluster.Tenant.Deletion.Succeeded | Tenant 'tenant_name' was successfully deleted from app cluster 'app_cluster'
System | System.AppCluster.Tenant.Deletion.Failed | Failed to delete tenant 'tenant_name' from app cluster
Capacity | Capacity.StorageDevice.OutOfSpace | Space on storage device under 'service_id' on node 'node_id' is depleted
System | System.AppCluster.Tenant.Update.Succeeded | App services for tenant 'tenant_name' were successfully updated
System | System.AppCluster.Tenant.Update.Failed | Failed to update app services for tenant 'tenant_name'
System | System.AppNode.Created | App node record 'name' was created successfully
System | System.AppNode.Online | App node 'name' is online
System | System.AppNode.Unstable | App node 'name' is unstable
System | System.AppNode.Down | App node 'name' is down
System | System.AppNode.Deleted | App node 'name' was successfully deleted
System | System.AppNode.Offline | App node 'name' is offline
System | System.AppNode.NotReady | App node 'name' is not ready
System | System.AppNode.Preemptible.NotReady | Preemptible app node 'name' is not ready
System | System.AppNode.ScalingUp | App node 'name' is scaling up
System | System.AppNode.ScalingDown | App node 'name' is scaling down
System | System.AppNode.OutOfDisk | App node 'name' is out of disk space
System | System.AppNode.MemoryPressure | App node 'name' is low on memory
System | System.AppNode.DiskPressure | App node 'name' is low on disk space
System | System.AppNode.PIDPressure | App node 'name' has too many processes
System | System.AppNode.NetworkUnavailable | App node 'name' has network connectivity problem
System | System.AppCluster.Shutdown.Failed | App cluster shutdown failed
System | System.AppCluster.Online | App cluster 'name' is online
System | System.AppCluster.Unstable | App cluster 'name' is unstable
System | System.AppCluster.Down | App cluster 'name' is down
System | System.AppCluster.Offline | App cluster 'name' is offline
System | System.AppCluster.Degraded | App cluster 'name' is degraded
System | System.AppService.Online | App service 'name' is online
System | System.AppService.Offline | App service 'name' is down
System | System.CoreAppService.Online | App service 'name' is online (Core services: v3iod, webapi, framesd, nuclio, docker_registry, pipelines, mlrun)
System | System.CoreAppService.Offline | App service 'name' is down
Background Process | Task.Container.ImportS3.Started | S3 container 'container_id' import started.
Background Process | Task.Container.ImportS3.Failed | S3 container 'container_id' import failed.
Background Process | Task.Container.ImportS3.Completed | S3 container 'container_id' import completed successfully.
Security | Security.User.Login.Succeeded | user 'username' successfully logged into the system
Security | Security.User.Login.Failed | user 'username' failed logging into the system
Security | security.Session.Verification.Failed | Failed to verify session for user 'username', session id 'session_id'

Events in the Audit Tab

Event class | Event kind | Event description
UserAction | UserAction.Container.Created | container 'container_id' created on cluster 'cluster_name'
UserAction | UserAction.Container.Deleted | container 'container_id' deleted on cluster 'cluster_name'
UserAction | UserAction.Container.Updated | container 'container_id' updated on cluster 'cluster_name'
UserAction | UserAction.Container.Creation.Failed | container 'container_id' on cluster 'cluster_name' could not be created
UserAction | UserAction.Container.Update.Failed | container 'container_id' on cluster 'cluster_name' could not be updated
UserAction | UserAction.Container.Deletion.Failed | container 'container_id' on cluster 'cluster_name' could not be deleted
UserAction | UserAction.User.Created | user 'username' created on cluster 'cluster_name'
UserAction | UserAction.User.Creation.Failed | user 'username' on cluster 'cluster_name' could not be created
UserAction | UserAction.UserGroup.Created | User group 'group' created on cluster 'cluster_name'
UserAction | UserAction.UserGroup.Deletion.Failed | User group 'group' on cluster 'cluster_name' could not be deleted
UserAction | UserAction.User.Deleted | user 'username' deleted on cluster 'cluster_name'
UserAction | UserAction.User.Deletion.Failed | user 'username' on cluster 'cluster_name' could not be deleted
UserAction | UserAction.User.Updated | user 'username' updated on cluster 'cluster_name'
UserAction | UserAction.User.Update.Failed | user 'username' on cluster 'cluster_name' could not be updated
UserAction | UserAction.UserGroup.Updated | User 'group name' updated on cluster 'cluster_name'.
UserAction | UserAction.UserGroup.Update.Failed | User 'group name' on cluster 'cluster_name' could not be updated
UserAction | UserAction.UserGroup.Creation.Failed | User 'group name' on cluster 'cluster_name' could not be created
UserAction | UserAction.UserGroup.Deleted | User 'group name' deleted on cluster 'cluster_name'
UserAction | UserAction.DataAccessPolicy.Applied | Data access policy for container 'name' on cluster 'cluster' applied
UserAction | UserAction.Tenant.Creation.FailedPasswordEmail | Sending password creation email on tenant creation failed
UserAction | UserAction.User.Creation.FailedPasswordEmail | Sending password creation email on user creation failed
UserAction | UserAction.Services.Deployment.Succeeded | App services for tenant 'tenant_name' were deployed successfully
UserAction | UserAction.Services.Deployment.Failed | Failed to deploy app services for tenant 'tenant_name'
UserAction | UserAction.Project.Created | Project 'name' was created successfully
UserAction | UserAction.Project.Creation.Failed | Project 'name' creation failed
UserAction | UserAction.Project.Updated | Project 'name' updated successfully
UserAction | UserAction.Project.Update.Failed | Project 'name' update failed
UserAction | UserAction.Project.Deleted | Project 'name' deleted successfully
UserAction | UserAction.Project.Deletion.Failed | Project 'name' deletion failed
UserAction | UserAction.Project.Owner.Updated | Owner in project 'name' was changed from %s to %s
UserAction | UserAction.Project.User.Role.Updated | Role for user 'username' in project 'name' was updated from 'old_owner' to 'new_owner'
UserAction | UserAction.Project.UserGroup.Role.Updated | Role for user 'group name' in project 'project_name' was updated from 'old_role' to 'new_role'
UserAction | UserAction.Project.User.Added | User 'group name' was added to project 'name' as 'role_name'
UserAction | UserAction.Project.UserGroup.Added | user 'username' was added to project 'name' as 'role_name'
UserAction | UserAction.Project.User.Removed | user 'username' was removed from project 'name'
UserAction | UserAction.Project.UserGroup.Removed | User 'group name' was removed from project 'name'
UserAction | UserAction.Network.Created | Network 'name' created on cluster 'cluster_name'
UserAction | UserAction.Network.Creation.Failed | Network 'name' on cluster 'cluster_name' could not be created
UserAction | UserAction.Network.Updated | Network 'name' updated on cluster 'cluster_name'
UserAction | UserAction.Network.Update.Failed | Network 'name' on cluster 'cluster_name' could not be updated
UserAction | UserAction.Network.Deleted | Network 'name' deleted on cluster 'cluster_name'
UserAction | UserAction.Network.Deletion.Failed | Network 'name' on cluster 'cluster_name' could not be deleted
UserAction | UserAction.StoragePool.Created | storage pool 'name' created on cluster 'cluster_name'
UserAction | UserAction.StoragePool.Creation.Failed | storage pool 'name' on cluster 'cluster_name' could not be created
UserAction | UserAction.Cluster.Updated | Cluster 'cluster_name' updated
UserAction | UserAction.Cluster.Update.Failed | Cluster 'cluster_name' could not be updated
UserAction | UserAction.Cluster.Deleted | Cluster 'cluster_name' deleted
UserAction | UserAction.Cluster.Deletion.Failed | cluster 'cluster_name' could not be deleted
UserAction | UserAction.Cluster.Shutdown | Cluster 'cluster_name' is down per user request 'username'

Cluster Support Logs

Users with the IT Admin management policy can collect and download support logs for the platform clusters from the dashboard. Log collection is triggered for a data cluster, but the logs are collected from both the data and application cluster nodes.

You can trigger collection of cluster support logs from the dashboard in one of two ways (note that you cannot run multiple collection jobs concurrently):

  • On the Clusters page, open the action menu for a data cluster in the clusters table (Type = "Data"); then select the Collect logs menu option.

  • On the Clusters page, display the Support Logs tab for a specific data cluster, either by selecting the Support logs option from the cluster's action menu or by selecting the data cluster and then selecting the Support Logs tab; then select Collect Logs from the action toolbar. Optionally, select filter criteria in the Select a filter dialog and press Collect Logs again.

    Filters reflect both the log source and the log level. The non-full options produce more concise logs; the full versions provide complete logs, which might be requested by Customer Support. The context filter is usually used by Customer Support, who supplies the context string, if required.

You can view the status of all collection jobs and download archive files of the collected logs from the data-cluster's Support Logs dashboard tab.

API Error Information

The platform APIs return error codes and error and warning messages to help you debug problems with your application. See, for example, the Error Information section in the Data-Service Web-API General Structure reference documentation.
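
For example, when calling a platform web API with curl, you can add the -i flag to print the HTTP status code and response headers along with the body, which makes any returned error code and message easy to inspect; the endpoint and path below are placeholders, and you should add the authentication headers described in the Web-API reference:

    curl -i "https://<web-api-endpoint>/<container>/<path>"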

See Also