I have a small icinga2 installation set up, which was using ido and from time to time ran into mariadb deadlocks. So I migrated to icingadb, which now also gives me a major headache. Here is some info about the system:
icinga2 r2.14.2-1
Ubuntu 22.04.4 LTS
icingadb v1.2.0
icingadb-redis 7.0.12
Icinga2 and Icingadb-redis are running. When starting icingadb, it stops soon after with a timeout message:
2024-05-08T10:22:55.323+0200 INFO icingadb Starting Icinga DB daemon (v1.2.0)
2024-05-08T10:22:55.323+0200 INFO icingadb Connecting to database at 'localhost:0'
2024-05-08T10:22:55.324+0200 INFO icingadb Connecting to Redis at '127.0.0.1:6380'
2024-05-08T10:22:55.325+0200 INFO icingadb Starting history sync
2024-05-08T10:23:15.326+0200 INFO history-sync Synced 6 state history items
2024-05-08T10:23:15.326+0200 INFO history-sync Synced 1 notification history items
2024-05-08T10:23:19.715+0200 INFO heartbeat Received Icinga heartbeat {"environment": "4be93c2398ccd33be84ecda5b98fa6ab73fd6219"}
2024-05-08T10:23:19.765+0200 INFO icingadb Taking over {"reason": "other instance's heartbeat has expired"}
2024-05-08T10:23:19.765+0200 INFO icingadb Starting config sync
2024-05-08T10:23:19.765+0200 INFO icingadb Starting initial state sync
2024-05-08T10:23:19.765+0200 INFO icingadb Starting overdue sync
2024-05-08T10:23:19.813+0200 INFO config-sync Updating 61 items of type service state
2024-05-08T10:23:19.813+0200 INFO config-sync Updating 17 items of type host state
2024-05-08T10:23:19.849+0200 INFO icingadb Starting config runtime updates sync
2024-05-08T10:23:19.854+0200 INFO config-sync Finished config sync in 87.911532ms
2024-05-08T10:23:20.068+0200 INFO icingadb Starting history retention
2024-05-08T10:23:20.068+0200 INFO config-sync Finished initial state sync in 302.970213ms
2024-05-08T10:23:20.068+0200 INFO icingadb Starting state runtime updates sync
2024-05-08T10:23:30.615+0200 FATAL icingadb read tcp 127.0.0.1:38666->127.0.0.1:6380: i/o timeout
can't perform "[xread count 4096 block 1000 streams icinga:history:stream:notification 1715156137052-0]"
github.com/icinga/icingadb/pkg/icingaredis.WrapCmdErr
github.com/icinga/icingadb/pkg/icingaredis/utils.go:121
github.com/icinga/icingadb/pkg/icingaredis.(*Client).XReadUntilResult
github.com/icinga/icingadb/pkg/icingaredis/client.go:204
github.com/icinga/icingadb/pkg/icingadb/history.Sync.readFromRedis
github.com/icinga/icingadb/pkg/icingadb/history/sync.go:114
github.com/icinga/icingadb/pkg/icingadb/history.Sync.Sync.func1
github.com/icinga/icingadb/pkg/icingadb/history/sync.go:83
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
runtime.goexit
runtime/asm_amd64.s:1695
can't read history
github.com/icinga/icingadb/pkg/icingadb/history.Sync.readFromRedis
github.com/icinga/icingadb/pkg/icingadb/history/sync.go:116
github.com/icinga/icingadb/pkg/icingadb/history.Sync.Sync.func1
github.com/icinga/icingadb/pkg/icingadb/history/sync.go:83
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
runtime.goexit
runtime/asm_amd64.s:1695
I already dropped the icingadb database and recreated it, reinstalled icingadb-redis.
Due to the log, the error is caused by Redis. As it is an I/O timeout happening almost exactly after five minutes, Icinga DBâs default timeout, I would guess that the Redis isnât operating correctly. (Edit: I failed at reading timestamps.)
Could you please verify that your Redis is running? As you mentioned icingadb-redis, please show systemctl status icingadb-redis or try to establish a connection via the Redis CLI.
Furthermore, could you please elevate the log level to debug in the configuration and post this more detailed log?
2024-05-08T15:42:21.167+0200 DEBUG database Finished executing "UPDATE \"service_state\" SET \"is_overdue\" = :is_overdue, \"id\" = :id WHERE id = :id" with 1 rows in 8.188803ms
2024-05-08T15:42:21.285+0200 DEBUG high-availability Received heartbeat from the future {"time": "2024-05-08T15:48:37.345+0200"}
At timestamp 2024-05-08T15:42:21.285+0200 there was a signal from the future: 2024-05-08T15:48:37.345+0200
This is a difference of over 6 minutes, which would explain running into a 5 minute timeout. I just donât know, how this would happen:
Itâs only one icinga master, no HA setup - thereâs no second machine.
system and hw clock are synced
I also see that a lot of services seem to be overdue, despite beeing well within the allowed timeframe.
Oh damn - the problem is fixed now.
Hereâs the reason for the problem:
Virtual environment
On the Host (VSphere) the NTP wasnât working, so time was a bit off
Syncing of time between host and guest was enabled
NTP Client on guest was enabled.
So the correct time on the client was sometimes overwritten with the incorrect time on the host but nearly immediately corrected again, which was nevertheless long enough to fire those bogous timeouts.
Well, I have done next to nothing. Thank you for debugging Icinga DB and unveiling this error.
I have taken another look at the code raising block this error and would say that there is nothing Icinga DB can do to detect the clock skew by itself.
The error is a TCP IO timeout, probably caused by the jumping system clock. Thus, I would leave things as they are and have marked your solution as solving this thread. However, if you have further ideas, please feel free to post them here.
Regarding your âdeadlocksâ: Did you do regular (initiated by cron) database-backups and did you use the option ââsingle-transactionâ? Because this might be the cause of âdeadlocksâ and IcingaDB wonât solve this problem in the long term.
I just want to say thank you. I ran into this issue on one of my satellites and, for the life of me, couldnât figure out why the icingadb service kept crashing. Once I turned on debugging, I was that same âReceived heartbeat from the futureâ message. I turned off time sync from the vm host, re-synced the ntp time on the satellite, restarted the service, and bam, no more crashes. Thank you for saving my upcoming vacation time