Icingadb stops nearly immediately: redis connection timeout?

kernelhacker · May 8, 2024, 8:26am

Hello icinga-pros

I have a small icinga2 installation set up, which was using ido and from time to time ran into mariadb deadlocks. So I migrated to icingadb, which now also gives me a major headache. Here is some info about the system:

icinga2 r2.14.2-1
Ubuntu 22.04.4 LTS
icingadb v1.2.0
icingadb-redis 7.0.12

Icinga2 and Icingadb-redis are running. When starting icingadb, it stops soon after with a timeout message:

2024-05-08T10:22:55.323+0200    INFO    icingadb        Starting Icinga DB daemon (v1.2.0)
2024-05-08T10:22:55.323+0200    INFO    icingadb        Connecting to database at 'localhost:0'
2024-05-08T10:22:55.324+0200    INFO    icingadb        Connecting to Redis at '127.0.0.1:6380'
2024-05-08T10:22:55.325+0200    INFO    icingadb        Starting history sync
2024-05-08T10:23:15.326+0200    INFO    history-sync    Synced 6 state history items
2024-05-08T10:23:15.326+0200    INFO    history-sync    Synced 1 notification history items
2024-05-08T10:23:19.715+0200    INFO    heartbeat       Received Icinga heartbeat       {"environment": "4be93c2398ccd33be84ecda5b98fa6ab73fd6219"}
2024-05-08T10:23:19.765+0200    INFO    icingadb        Taking over     {"reason": "other instance's heartbeat has expired"}
2024-05-08T10:23:19.765+0200    INFO    icingadb        Starting config sync
2024-05-08T10:23:19.765+0200    INFO    icingadb        Starting initial state sync
2024-05-08T10:23:19.765+0200    INFO    icingadb        Starting overdue sync
2024-05-08T10:23:19.813+0200    INFO    config-sync     Updating 61 items of type service state
2024-05-08T10:23:19.813+0200    INFO    config-sync     Updating 17 items of type host state
2024-05-08T10:23:19.849+0200    INFO    icingadb        Starting config runtime updates sync
2024-05-08T10:23:19.854+0200    INFO    config-sync     Finished config sync in 87.911532ms
2024-05-08T10:23:20.068+0200    INFO    icingadb        Starting history retention
2024-05-08T10:23:20.068+0200    INFO    config-sync     Finished initial state sync in 302.970213ms
2024-05-08T10:23:20.068+0200    INFO    icingadb        Starting state runtime updates sync
2024-05-08T10:23:30.615+0200    FATAL   icingadb        read tcp 127.0.0.1:38666->127.0.0.1:6380: i/o timeout
can't perform "[xread count 4096 block 1000 streams icinga:history:stream:notification 1715156137052-0]"
github.com/icinga/icingadb/pkg/icingaredis.WrapCmdErr
        github.com/icinga/icingadb/pkg/icingaredis/utils.go:121
github.com/icinga/icingadb/pkg/icingaredis.(*Client).XReadUntilResult
        github.com/icinga/icingadb/pkg/icingaredis/client.go:204
github.com/icinga/icingadb/pkg/icingadb/history.Sync.readFromRedis
        github.com/icinga/icingadb/pkg/icingadb/history/sync.go:114
github.com/icinga/icingadb/pkg/icingadb/history.Sync.Sync.func1
        github.com/icinga/icingadb/pkg/icingadb/history/sync.go:83
golang.org/x/sync/errgroup.(*Group).Go.func1
        golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
runtime.goexit
        runtime/asm_amd64.s:1695
can't read history
github.com/icinga/icingadb/pkg/icingadb/history.Sync.readFromRedis
        github.com/icinga/icingadb/pkg/icingadb/history/sync.go:116
github.com/icinga/icingadb/pkg/icingadb/history.Sync.Sync.func1
        github.com/icinga/icingadb/pkg/icingadb/history/sync.go:83
golang.org/x/sync/errgroup.(*Group).Go.func1
        golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
runtime.goexit
        runtime/asm_amd64.s:1695

I already dropped the icingadb database and recreated it, reinstalled icingadb-redis.

What can I do to fix this issue?

Thx a lot for any help!

Stefan

apenning · May 8, 2024, 1:13pm

Due to the log, the error is caused by Redis. ~~As it is an I/O timeout happening almost exactly after five minutes, Icinga DB’s default timeout, I would guess that the Redis isn’t operating correctly.~~ (Edit: I failed at reading timestamps.)

Could you please verify that your Redis is running? As you mentioned icingadb-redis, please show systemctl status icingadb-redis or try to establish a connection via the Redis CLI.

Furthermore, could you please elevate the log level to debug in the configuration and post this more detailed log?

kernelhacker · May 8, 2024, 2:57pm

Hi Alvar,

sure, here’s more info:

~$ systemctl status icingadb-redis
● icingadb-redis-server.service - Advanced key-value store
     Loaded: loaded (/lib/systemd/system/icingadb-redis-server.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2024-05-07 16:09:48 CEST; 23h ago
       Docs: http://redis.io/documentation,
             man:icingadb-redis-server(1)
   Main PID: 1861535 (icingadb-redis-)
     Status: "Ready to accept connections"
      Tasks: 5 (limit: 8137)
     Memory: 10.0M
        CPU: 2min 4.800s
     CGroup: /system.slice/icingadb-redis-server.service
             └─1861535 "/usr/bin/icingadb-redis-server 127.0.0.1:6380" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >

Notice: journal has been rotated since unit was started, output may be incomplete.

~$ pgrep icingadb -a
1861535 /usr/bin/icingadb-redis-server 127.0.0.1:6380

~$ redis-cli -h localhost -p 6380
localhost:6380>

So icingadb-redis is running and reacting.

But debug log showed something suspicious, that might explain the timeout (but I have no idea how this could happen):
Full Log: https://home.stefanpokorny.de/icingalog1.txt


2024-05-08T15:42:21.167+0200    DEBUG   database        Finished executing "UPDATE \"service_state\" SET \"is_overdue\" = :is_overdue, \"id\" = :id WHERE id = :id" with 1 rows in 8.188803ms
2024-05-08T15:42:21.285+0200    DEBUG   high-availability       Received heartbeat from the future      {"time": "2024-05-08T15:48:37.345+0200"}

At timestamp 2024-05-08T15:42:21.285+0200 there was a signal from the future: 2024-05-08T15:48:37.345+0200
This is a difference of over 6 minutes, which would explain running into a 5 minute timeout. I just don’t know, how this would happen:

It’s only one icinga master, no HA setup - there’s no second machine.
system and hw clock are synced

I also see that a lot of services seem to be overdue, despite beeing well within the allowed timeframe.

It’s very confusing.

kernelhacker · May 8, 2024, 3:11pm

Oh damn - the problem is fixed now.
Here’s the reason for the problem:

Virtual environment
On the Host (VSphere) the NTP wasn’t working, so time was a bit off
Syncing of time between host and guest was enabled
NTP Client on guest was enabled.

So the correct time on the client was sometimes overwritten with the incorrect time on the host but nearly immediately corrected again, which was nevertheless long enough to fire those bogous timeouts.

Anyway - thx a lot for your help

apenning · May 10, 2024, 8:41am

Well, I have done next to nothing. Thank you for debugging Icinga DB and unveiling this error.

I have taken another look at the code raising block this error and would say that there is nothing Icinga DB can do to detect the clock skew by itself.

github.com

Icinga/icingadb/blob/e31b101f4f41941793adc7b430a0db8bef71b8c3/pkg/icingaredis/client.go#L197-L205


      
          		cmd := c.XRead(ctx, a)
          		streams, err := cmd.Result()
          		if err != nil {
          			if errors.Is(err, redis.Nil) {
          				continue
          			}
          
          			return streams, WrapCmdErr(cmd)
          		}

The error is a TCP IO timeout, probably caused by the jumping system clock. Thus, I would leave things as they are and have marked your solution as solving this thread. However, if you have further ideas, please feel free to post them here.

mrdsam · May 13, 2024, 7:11am

Regarding your “deadlocks”: Did you do regular (initiated by cron) database-backups and did you use the option “–single-transaction”? Because this might be the cause of ‘deadlocks’ and IcingaDB won’t solve this problem in the long term.

msegovia · October 25, 2024, 10:02pm

I just want to say thank you. I ran into this issue on one of my satellites and, for the life of me, couldn’t figure out why the icingadb service kept crashing. Once I turned on debugging, I was that same “Received heartbeat from the future” message. I turned off time sync from the vm host, re-synced the ntp time on the satellite, restarted the service, and bam, no more crashes. Thank you for saving my upcoming vacation time