November 17, 2019 Categories: linux, monitoring

Lan ETS | Monitoring a large scale LAN event

Ping ping ping - it’s down!

Monitoring at an event of our scale is a critical part of our design process. The different monitoring services are our eyes and ears and allow us to quickly understand the state of our network. Since our event is over a very short period of time, our goal is to have a smooth experience for the player, from the moment he plugs his computer into our network. There is no worst feeling than looking at a row of players getting off their seats and being unsure of it’s a network issue!

Here is a brief overview of the different services we use to monitor our network.

LibreNMS

A open source fork of Observium, it’s designed as a plug and play monitoring platform. You simply point it to a SNMP aware device and it will do it’s best to associate common OIDS and MIBS to “automagically” monitor the target. It then polls the OIDS every 5 minutes and updates the corresponding RRD files. It is also possible to export the data into a more modern timeseries database named InfluxDB. This allows us to access the data in a more flexible way using other services that are not compatible with the RRD format.

That said, it’s not a perfect platform. Since it’s designed around auto discovering devices, it can only detect whatever OIDS the developers configured for each platform. This means that if you are running very recent gear or equipment from obscure manufacturers, you might be missing a lot of information from the auto discovery process. There is currently no easy way to add extra OIDS to your installation.

Another issue is the polling interval as the tool was initially built around a 5 minutes window. This is acceptable in most environments but for our event, we need something that gives us almost real-time data on the state of the network. This is why we used Shinken to provide the actual Up/Down status. We rely on LibreNMS for graph and performance data.

Shinken

Shinken is built as a Nagios drop-in replacement. We use it with a very short polling rate of 10 seconds with ICMP checks. We monitor only our switches, routers and servers. With around 100 or so pieces of equipment, it does tax our server resources, but it allows us for near real-time awareness of the state of our network. Within 10 seconds, we can know if a player kicked the power cord from a switch and if any other equipment is unreachable.

Nagvis

Nagvis was initially built as a mapping tool for Nagios. But it supports a wide variety of different back-ends as ways to fetch data. We use it in conjunction with the “Livestatus” module for Shinken. We upload a map of the venue with all of our equipment displayed. This allows us to quickly know if everything is up and running and at the same time, react if anything turns red.

Oxidized

Oxidized is a modern replacement for Rancid. It’s a network equipment configuration backup tool. It logs into a piece of equipment using credentials you provided and runs any commands you want. In the case of Cisco based gear, a “enable” followed by a “show run”. It then stores the results in a git based structure and provides a web interface to compare different backups. It allows us to backup our configurations throughout our design process. We usually don’t run it during our event as we do not change (hopefully…) any configurations during the weekend.

Grafana

We have just recently started using Grafana as a visualization tool for our performance data. It uses a InfluxDB data-source to display data in a much more friendly and dynamic way than the RRD format. It’s an excellent tool to display interface stats and traffic.

Splunk

Splunk is our log aggregation tool. We use it as a central collection point for network equipment and server logs. We use it to create dashboards to highlight important events such as port-security events or unauthorized login attempts against our network. By default, most equipment logging can be very verbose and it’s important to properly filter events to pinpoint what is important