IDO DB Usage and Pending Queries Growing on HA Master Setup + Overdue Checks

Icinga2 Version 2.13.2-1
System CentOS 7.7.1908

Icinga Web 2 Version 2.10.1
PHP Version 7.3.29
icinga/icinga-php-library 0.8.1
icinga/icinga-php-thirdparty 0.10.0

Loaded Modules: doc/map/monitoring

Setup: 2x masters in HA configuration + 50+ satellites
Roughly 9k hosts / 90k services

Dedicated hosts for Icinga Web 2 and PostgreSQL 9.2.24

We recently upgraded from Icinga 2.12 → 2.13 and Icinga Web 2 2.8 → 2.10, and added the Maps module and the Business Process module (currently disabled).

Our PostgreSQL database has grown from ~20 GB to ~45 GB over the last few days, and on top of that all of our hosts and services have lapsed into overdue status. The logs show an increasing number of pending queries on IdoPgsqlConnection:

information/IdoPgsqlConnection: Pending queries: 382355 (Input: 1434/s; Output: 1432/s)
information/IdoPgsqlConnection: Pending queries: 382533 (Input: 1428/s; Output: 1408/s)
information/IdoPgsqlConnection: Pending queries: 382691 (Input: 1420/s; Output: 1404/s)
information/IdoPgsqlConnection: Pending queries: 383250 (Input: 1450/s; Output: 1408/s)
information/IdoPgsqlConnection: Pending queries: 383835 (Input: 1404/s; Output: 1348/s)
information/IdoPgsqlConnection: Pending queries: 384902 (Input: 1491/s; Output: 1401/s)
information/IdoPgsqlConnection: Pending queries: 385100 (Input: 1437/s; Output: 1414/s)
information/IdoPgsqlConnection: Pending queries: 385220 (Input: 1424/s; Output: 1412/s)

These queries grow until we restart the icinga2 service on the primary master.
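
To narrow down where that growth is concentrated, a query along these lines against the IDO database lists the largest tables together with their live and dead tuple counts (just a sketch, run via psql against the IDO database itself):

-- List the biggest tables in the IDO database and how many dead tuples they carry.
SELECT relname                                       AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       n_live_tup,
       n_dead_tup
FROM   pg_stat_user_tables
ORDER  BY pg_total_relation_size(relid) DESC
LIMIT  20;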

We could not figure out what was causing the output to lag behind the input and produce the pending queries, so I bumped up the shared_buffers, work_mem and maintenance_work_mem allocations in postgresql.conf, reloaded the PostgreSQL service and restarted Icinga on the master host. The pending queries dropped as low as ~5,000 before steadily climbing again to roughly 17,000 (and still rising) within ~25 minutes.
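
As a sanity check that the new values are actually in effect, they can be read back from pg_settings; note that shared_buffers is a postmaster-context setting, so it only takes effect after a full PostgreSQL restart rather than a reload:

-- Show the values the running server is actually using; "context" indicates
-- whether a setting can be picked up by a reload ('sighup'/'user') or needs a
-- restart ('postmaster', e.g. shared_buffers).
SELECT name, setting, unit, context
FROM   pg_settings
WHERE  name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem');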

During troubleshooting, I also checked the queue on our second master and found its output rate to be much, much lower than that of our first master; I am unsure why the performance of the second master is so poor. (A pg_stat_activity check for the second master's IDO connection is sketched below this log excerpt.)

information/IdoPgsqlConnection: PGSQL IDO instance id: 1 (schema version: '1.14.3')
information/IdoPgsqlConnection: Pending queries: 12255 (Input: 1202/s; Output: 26/s)
information/IdoPgsqlConnection: Pending queries: 24367 (Input: 1211/s; Output: 41/s)
information/IdoPgsqlConnection: Pending queries: 36538 (Input: 1211/s; Output: 41/s)
information/IdoPgsqlConnection: Pending queries: 48802 (Input: 1221/s; Output: 41/s)
information/IdoPgsqlConnection: Pending queries: 60777 (Input: 1175/s; Output: 28/s)
information/IdoPgsqlConnection: Pending queries: 72554 (Input: 1167/s; Output: 31/s)
information/IdoPgsqlConnection: Pending queries: 84508 (Input: 1184/s; Output: 32/s)
information/IdoPgsqlConnection: Pending queries: 96163 (Input: 1146/s; Output: 30/s)
information/IdoPgsqlConnection: Pending queries: 108317 (Input: 1205/s; Output: 31/s)
information/IdoPgsqlConnection: Pending queries: 120425 (Input: 1197/s; Output: 31/s)
information/IdoPgsqlConnection: Pending queries: 131593 (Input: 1148/s; Output: 78/s)
information/IdoPgsqlConnection: Pending queries: 142611 (Input: 1135/s; Output: 78/s)
information/IdoPgsqlConnection: Pending queries: 153818 (Input: 1152/s; Output: 79/s)
information/IdoPgsqlConnection: Pending queries: 165133 (Input: 1166/s; Output: 79/s)
information/IdoPgsqlConnection: Pending queries: 176207 (Input: 1144/s; Output: 78/s)
information/IdoPgsqlConnection: Pending queries: 187482 (Input: 1168/s; Output: 79/s)
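
To see what the second master's IDO connection is actually doing on the database side while the backlog builds, pg_stat_activity can be inspected (a sketch; it assumes the IDO database is named icinga, and on PostgreSQL 9.2 the lock-wait indicator is the boolean waiting column):

-- Show all sessions connected to the IDO database, how long their current
-- statement has been running, and whether they are waiting on a lock.
SELECT pid, usename, client_addr, state, waiting,
       now() - query_start AS statement_runtime,
       query
FROM   pg_stat_activity
WHERE  datname = 'icinga'   -- assumed IDO database name; adjust as needed
ORDER  BY query_start;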

Could any of these issues be related to the indexes below, which postgresqltuner flagged as “unused since last statistics”? (A query for cross-checking index usage is sketched after the list.)

icinga_commands.commands_i_id_idx icinga_configfiles.configfiles_i_id_idx icinga_configfilevariables.configfilevariables_i_id_idx icinga_contact_addresses.contact_addresses_i_id_idx icinga_contact_notificationcommands.contact_notifcommands_i_id_idx icinga_contactgroup_members.cntgrpmbrs_cgid_coid icinga_contactgroup_members.contactgroup_members_i_id_idx icinga_contactgroups.contactgroups_i_id_idx icinga_contactnotificationmethods.contact_notif_meth_notif_idx icinga_contacts.contacts_i_id_idx icinga_contactstatus.contactstatus_i_id_idx icinga_customvariables.customvariables_i_id_idx icinga_customvariables.icinga_customvariables_i icinga_customvariablestatus.customvariablestatus_i_id_idx icinga_customvariablestatus.icinga_customvariablestatus_i icinga_endpoints.idx_endpoints_zone_object_id icinga_endpointstatus.idx_endpointstatus_zone_object_id icinga_eventhandlers.eventhandlers_i_id_idx icinga_eventhandlers.eventhandlers_time_id_idx icinga_externalcommands.externalcommands_time_id_idx icinga_host_contactgroups.host_contactgroups_i_id_idx icinga_host_contacts.host_contacts_i_id_idx icinga_host_parenthosts.host_parenthosts_i_id_idx icinga_hostchecks.hostchecks_time_id_idx icinga_hostchecks.hostchks_h_obj_id_idx icinga_hostdependencies.hostdependencies_i_id_idx icinga_hostdependencies.idx_hostdependencies icinga_hostescalation_contactgroups.hostesc_cgroups_i_id_idx icinga_hostescalation_contacts.hostesc_contacts_i_id_idx icinga_hostescalations.hostesc_i_id_idx icinga_hostgroup_members.hostgroup_members_i_id_idx icinga_hostgroups.hostgroups_i_id_idx icinga_hosts.hosts_i_id_idx icinga_hoststatus.hoststatus_check_type_idx icinga_hoststatus.hoststatus_current_state_idx icinga_hoststatus.hoststatus_event_hdl_en_idx icinga_hoststatus.hoststatus_ex_time_idx icinga_hoststatus.hoststatus_flap_det_en_idx icinga_hoststatus.hoststatus_i_id_idx icinga_hoststatus.hoststatus_is_flapping_idx icinga_hoststatus.hoststatus_latency_idx icinga_hoststatus.hoststatus_p_state_chg_idx icinga_hoststatus.hoststatus_pas_chks_en_idx icinga_hoststatus.hoststatus_problem_ack_idx icinga_hoststatus.hoststatus_sch_downt_d_idx icinga_hoststatus.hoststatus_stat_upd_time_idx icinga_hoststatus.hoststatus_state_type_idx icinga_logentries.loge_time_idx icinga_runtimevariables.runtimevariables_i_id_idx icinga_scheduleddowntime.idx_downtimes_session_del icinga_service_contactgroups.service_contactgroups_i_id_idx icinga_service_contacts.service_contacts_i_id_idx icinga_servicechecks.servicechecks_time_id_idx icinga_servicechecks.servicechks_s_obj_id_idx icinga_servicedependencies.idx_servicedependencies icinga_serviceescalation_contactgroups.serviceesc_cgroups_i_id_idx icinga_serviceescalation_contacts.serviceesc_contacts_i_id_idx icinga_serviceescalations.serviceesc_i_id_idx icinga_servicegroup_members.servicegroup_members_i_id_idx icinga_servicegroups.servicegroups_i_id_idx icinga_services.services_i_id_idx icinga_servicestatus.srvcstatus_check_type_idx icinga_servicestatus.srvcstatus_event_hdl_en_idx icinga_servicestatus.srvcstatus_ex_time_idx icinga_servicestatus.srvcstatus_flap_det_en_idx icinga_servicestatus.srvcstatus_is_flapping_idx icinga_servicestatus.srvcstatus_latency_idx icinga_servicestatus.srvcstatus_p_state_chg_idx icinga_servicestatus.srvcstatus_pas_chks_en_idx icinga_servicestatus.srvcstatus_problem_ack_idx icinga_servicestatus.srvcstatus_sch_downt_d_idx icinga_servicestatus.srvcstatus_stat_upd_time_idx icinga_servicestatus.srvcstatus_state_type_idx icinga_statehistory.statehist_i_id_o_id_s_ty_s_ti 
icinga_systemcommands.systemcommands_i_id_idx icinga_systemcommands.systemcommands_time_id_idx icinga_timeperiod_timeranges.timeperiod_timeranges_i_id_idx icinga_timeperiods.timeperiods_i_id_idx icinga_zones.idx_zones_parent_object_id icinga_zonestatus.idx_zonestatus_parent_object_id
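
For cross-checking, the same information can be pulled straight from PostgreSQL's statistics views; idx_scan counts index scans since the statistics were last reset, so a zero there matches postgresqltuner's “unused” flag (a sketch, run against the IDO database):

-- Indexes that have never been scanned since the last statistics reset,
-- largest first (unused indexes still add write overhead and disk usage).
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  idx_scan = 0
ORDER  BY pg_relation_size(indexrelid) DESC;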

Does anyone have ideas about what I could try next? We cleared the icinga2 state files and the /var/lib/icinga2/api/log files prior to restarting Icinga, as that has worked in the past, but I am unsure of the consequences of doing so.

Hi there,

I experienced something like this myself in the past: the database grew to ~200 GB and overall performance dropped.
In my case the problem was with PostgreSQL VACUUM, which should normally be handled by autovacuum.
At some point there were so many UPDATE transactions that PostgreSQL no longer managed to autovacuum some tables (especially “servicestatus”).

I tried to manually VACUUM my tables, which worked fine for most of them, but a few (again, especially “servicestatus”) did not finish even after more than 15 hours.
So I decided to stop all applications and perform a VACUUM FULL, which was pretty fast (~5 minutes) and, as expected, brought the DB size back down to something very small.
I also ran ANALYZE on all tables. Since then I run manually scheduled VACUUM and ANALYZE jobs and, at least in my case, performance is much better and the DB stays small.
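
As a rough sketch of what I mean (the status tables are the usual suspects in the IDO schema; keep in mind that VACUUM FULL takes an exclusive lock on the table, so stop Icinga and any other writers first):

-- Check which tables have accumulated the most dead tuples and when they
-- were last (auto)vacuumed.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  10;

-- Regular maintenance: reclaims dead space for reuse without blocking writers.
VACUUM ANALYZE icinga_servicestatus;
VACUUM ANALYZE icinga_hoststatus;

-- Last resort: rewrites the table and returns disk space to the OS, but it
-- locks the table exclusively for the duration.
VACUUM FULL icinga_servicestatus;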

Not sure this will solve your problem, but it’s worth a try I guess :wink:

As for why your second master shows much lower performance than your primary, I don't have any (good) thoughts on that.
Is there any possibility that you're facing some kind of network issue on the second one, or poor host performance in general (CPU / MEM / IO), which could explain why the process can't keep up with this many events?

BR,
Chris

Thank you, I'll need to take a look at our tables and see if the “servicestatus” table is one of the culprits.

I ended up re-creating the IDO database, cutting both master nodes and Icinga Web 2 over to the new backend, and dropping the original. The queue initially grew to 1.7M pending items and then emptied itself over the course of about five hours.
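
For reference, re-creating the IDO boiled down to creating a fresh database, loading the packaged IDO schema into it, and pointing the database setting in ido-pgsql.conf (plus the IDO resource in Icinga Web 2) at the new name before restarting icinga2. Roughly, in psql (the database and role names here are placeholders, and the schema path is the one shipped by the icinga2-ido-pgsql package, so adjust both to your environment):

-- Create the new backend database (placeholder names).
CREATE DATABASE icinga_ido_new OWNER icinga2;

-- Connect to it and load the IDO schema shipped with the icinga2-ido-pgsql
-- package (path may differ by distribution/version).
\c icinga_ido_new
\i /usr/share/icinga2-ido-pgsql/schema/pgsql.sql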

We're currently experiencing overdue checks again, with a queue that is roughly 30K deep and has been hovering there for hours on end.