Common
Intended audience
This manual is intended mainly for those who are going to support Nussknacker deployments (especially production ones). We assume prior knowledge of Kafka, Flink, InfluxDB, Kubernetes
The parts describing scenario management and tuning should be accessible also for those authoring them with the Nussknacker Designer.
Prerequisites
To better understand the documentation below one needs to grasp some concepts of metrics, etc. We won’t go into much detail here, only give references to worthy resources.
Some of them are external, but it also helps to read about important Nussknacker concepts in glossary and scenario authoring.
Integration with Apache Kafka
The usual way to ingest and produce events in Nussknacker is via integration with Kafka. To understand and monitor the performance of the Kafka sources it's important to know Kafka consumer details. A good explanation is available at Confluent site: Kafka consumer
Metrics architecture
In the documentation we assume standard metrics setup (available in docker-compose and Helm chart):
- InfluxDB to store and query data
- Grafana to visualize metrics
- InfluxDB reporter with Telegraf relay to deliver metrics to InfluxDB
It’s possible to configure Nussknacker installation to use other metrics setup (e.g. using Prometheus), but then you’re on your own to figure out the details - using provided metrics reporter documentation it shouldn’t be too hard.
Monitoring - metrics
Kafka cluster
Below we provide some valuable resources on how to monitor the Kafka cluster:
Upgrading to new versions
Upgrading Nussknacker
Since all scenarios are deployed to an execution engine, it's possible to first upgrade Nussknacker Designer and redeploy all scenarios with a new version in the next step.
Please see Migration Guide before doing the upgrade - especially when using custom extensions.
Nussknacker operations REST API
Nussknacker provides a few REST endpoints which may be helpful to check if everything is OK - e.g. can be used to implement health checks using standard monitoring tools.
Below we describe endpoints that return general information about the Nussknacker state. Please see Deployment REST API section for endpoints allowing for management of a single scenario.
/api/app/buildInfo GET
- returns basic information about deployed version of Nussknacker. Sample output looks like this:
{
"name": "nussknacker-api",
"gitCommit": "edcbd3b964e09271f54a7e5a8700c4a3e73d0275",
"buildTime": "2022-02-01T09:42:30.748539",
"version": "1.1.1",
"processingType": {
"streaming": {
"name": "generic",
"version": "1.1.1",
"buildTime": "2022-02-01T09:42:30.748539",
"gitCommit": "edcbd3b964e09271f54a7e5a8700c4a3e73d0275"
}
}
}
/api/app/healthCheck GET 200
- use to check if Nussknacker Designer is up/api/app/healthCheck/process/deployment GET
- 200 - if all deployed scenarios are running
- 500 - basic information about deployed scenarios that are not running (e.g. failed, restarting etc.)
/api/app/healthCheck/process/validation GET
- 200 - if all scenarios are valid
- 500 - list of not-valid scenarios
/api/app/config GET
(requires admin permissions) - serialized configuration of Nussknacker Designer and components (NOTE: configuration returned by this endpoint does not have all substitutions resolved, e.g. some environmental variables will not be returned)/api/app/processingtype/reload POST
(requires admin permissions) - reload configuration of models. Used mostly if you use custom components which have dynamic configuration (e.g. list of components depend on external registry, like MLFlow or OpenAPI)
Deployment REST API
Endpoints under “admin”
/api/processManagement/deploy/{processName} POST
- deploy scenario with given name passed as aprocessName
/api/processManagement/cancel/{processName} POST
- cancel scenario with given name passed as aprocessName
. Savepoint won't be saved/api/adminProcessManagement/snapshot/{processName}?savepointDir={path} POST
- make a savepoint of current scenario's state. Returns path where savepoint was saved as a plain string e.g.hdfs:///flink/savepoints/savepoint-71a39f-b2078dc48f16
. Can be used indeploy
endpoint as asavepointPath
parameter/api/adminProcessManagement/stop/{processName}?savepointDir={path} POST
- stop scenario and make a savepoint. Returns path where savepoint was saved as a plain string e.g.hdfs:///flink/savepoints/savepoint-71a39f-b2078dc48f16
. Can be used indeploy
endpoint as asavepointPath
parameter/api/adminProcessManagement/deploy/{processName}?savepointPath={path} POST
- deploy scenario from given savepoint. Example savepoint path:hdfs:///flink/savepoints/savepoint-71a39f-b2078dc48f16
Scenarios - monitoring and troubleshooting
Each scenario has its own performance characteristics and considerations. This section describes common ways to monitor the health of a running scenario and how to handle common problems. Most of the sections are aimed not only at operations people but also at (especially advanced) editors of Nussknacker scenarios.
Nussknacker metrics
One of the crucial aspects of running production streaming jobs is monitoring. In this section we'll explain how the Nussknacker scenario running on the execution engine gives rise to certain metrics, how to process them and display them in Grafana.
Metric types
Below we describe common
Name | Fields reported (with InfluxDB) | Description |
---|---|---|
counter | count | |
histogram | count, max, mean, min, p50, p75, p95, p98, p99, p999, stddev | mainly used for measuring invocation times |
gauge | value | value in given point in time |
instantRate | value | special gauge for measuring throughput in given moment (standard meter has only mean throughput in 1, 5 or 15 minutes) |
Common metrics
Measurement | Additional tags | Metric type | Notes |
---|---|---|---|
nodeCount | nodeId | counter | used e.g. by count functionality |
error.instantRate | - | instantRate | |
error.instantRateByNode | nodeId | instantRate | nodeId is unknown if we fail to detect exact place |
service.OK | serviceName | histogram + instantRate | histogram of successful invocation times |
service.FAIL | serviceName | histogram + instantRate | histogram of successful invocation times |
Streaming metrics
Measurement | Additional tags | Metric type | Description |
---|---|---|---|
source | nodeId | instantRate + counter | |
eventtimedelay.histogram | nodeId | histogram | only for sources with eventTime, measures delay from event time to system time |
eventtimedelay.minimalDelay | nodeId | gauge | time from last event (eventTime) to system time |
end | nodeId | instantRate + counter | for sinks and end processors |
dead_end | nodeId | instantRate + counter | for event filtered out on filters, switches etc. |
Each of these metrics comes with the tags specific for execution engine:
Nussknacker comes with a Grafana dashboard; you can use it or extend/modify according to your needs.
Monitoring - is my scenario doing well?
The two most important things that should be monitored after deployment of a scenario are:
- Performance - is scenario processing events fast enough?
- Error rates - are there any problems with execution
Checking source lag
There are two types of metrics that can give useful information about potential Kafka lags
- How large is lag on Kafka source partitions
- How much time has passed since the last processed event
Please note that:
- Kafka lag is reported with some delay (offsets are not committed immediately). It’s normal to have temporary spikes.
- Large differences between slots in lag may point to data skew or e.g. problematic data for some keys
Handling typical scenario errors
When you see errors in monitoring, like on screen below:
it means that there were failures during scenario execution - most often they are caused by errors during SpeL expression evaluation, which couldn't be detected during validation.
The most common problems are:
- NullPointerException - please see SpEL documentation [link Safe Navigation] to see how to check if some value is null
- Conversion errors - e.g. #NUMBER.toNumber - this usually indicates problem with the source data
To see the exact error you can look at (depending on the configuration of the Exception handler):
- Check logs of your execution engine e.g. in ELK (be sure to check all of them)
2021-06-24 07:43:32,071 INFO pl.touk.nussknacker.engine.spel.SpelExpression [] - Expression evaluation failed. Original #client.status != "GOLD", ctxId: DetectLargeTransactions-kafka-registry-typed-json-1-3847, message: EL1021E: A problem occurred whilst attempting to access the property 'status': 'Cannot invoke method/property status on null object'
- Contents of configured errors' topic on Kafka:
{
"processName": "DetectLargeTransactions",
"nodeId": "only large ones",
"message": "Expression [20 / (#input.amount % 4) > 0] evaluation failed, message: / by zero",
"exceptionInput": "20 / (#input.amount % 4) > 0",
"inputEvent": null,
"stackTrace": "pl.touk.nussknacker.engine.spel.SpelExpressionEvaluationException:... 34 more",
"timestamp": 1624863781881,
"host": "0bda71c51449",
"additionalData": {}
}
General production readiness checklist
Nussknacker configuration
- Make sure to configure RocksDB properly (e.g.
rocksdb.enable
) - in case of Flink execution engine - Configure restart strategy and error handler
- By default, scenarios are not restarted, so that errors do not go unnoticed
- Kafka exception handler should be used in production - configure error topic, with long enough retention
- Check if Nussknacker has access to all needed services:
- Flink REST API
- K8s REST API
- Savepoints location
- Schema registry - to detect data types
- Kafka cluster - to be able to generate test data
- InfluxDB - to have access to counts data
- Test correctness of the configuration
- Perform a few test scenario deployments, try also redeployment to check if savepoints are accessible
- Generate test data from some Kafka topic
- Check if counts can be accessed
- See if metrics links are working properly
Setup metrics/alerting
- Prepare dashboards/alerts for monitoring cluster health, according to your needs.
- Setup healthchecks, based on Nussknacker API