Logging and monitoring

The OWASP Top 10 2017 introduced the risk of insufficient logging and monitoring. The inherent difficulties of this practice are often underestimated and misunderstood. But why does a seemingly simple task end up being a crucial point of information system security?

1/ Let’s Define Logging

Logging refers to the management of logs: records in which events related to the state of a system are collected. Different systems produce a multitude of different logs.

Let’s take the example of a web application: logs can record any action performed on the web service, such as a user logging in to the platform, an HTTP error being generated, or a resource being accessed on the server.
A large amount of data is collected very quickly, which implies significant material and human costs. In addition, for logs to be useful, the following actions are required:

  • Selecting useful information to store and archive
  • Ensuring the security and confidentiality of stored logs
  • Controlling the quality of log data by analysing and adding missing information to the logs
  • Analysing logs (often confused with application monitoring)
  • Contextualizing events (log enrichment) with:
    • the IP address that generated the log;
    • the user concerned;
    • the feature concerned;
    • the error details.

Contextualizing the logs is the part that requires the most experience and knowledge of the monitored system, in order to know which information should be retained and which is useless. It is also very time-consuming.
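
To give a concrete idea, here is a minimal sketch of what log enrichment could look like in Python; the `log_enriched_event` helper, its field names and the sample values are purely illustrative assumptions, not an established format.

```python
import json
import logging
import time

logger = logging.getLogger("webapp")

def log_enriched_event(message, level=logging.INFO, *, ip=None, user=None,
                       feature=None, error=None):
    """Attach context (IP address, user, feature, error detail) to a raw event."""
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "message": message,
        "ip": ip,            # IP address that generated the log
        "user": user,        # user concerned
        "feature": feature,  # feature concerned
        "error": error,      # error detail, if any
    }
    logger.log(level, json.dumps(event))

# Hypothetical example: a failed login recorded with its full context.
log_enriched_event("authentication failed", logging.WARNING,
                   ip="203.0.113.42", user="alice",
                   feature="login", error="invalid password")
```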

Once all these actions have been performed, the logs make it possible to investigate an application malfunction so that it does not happen again. In the case of an attack, they make it possible to identify the actors behind the incident and to determine which functionality was abused, in order to correct the flaw that allowed the attack.

2/ Let’s Define Monitoring

Monitoring, or supervision, of an application is the ability to have a global view of the application at a given moment, as well as a history of its past states, for several elements:

  • Performance, response time of the different server resources;
  • Integrity, checking that the content of web pages does not change;
  • Availability, verifying that the application is fully functional (UP/DOWN). 

Monitoring is also important for detecting a lack of server performance and for detecting attacks in real time. Indeed, if a server requires high availability, monitoring user actions makes it possible to identify which functionality of the application requires a lot of resources and could cause slowdowns. On the attack side, if a large number of connections arrive at the service, a denial-of-service attempt may be in progress. An alert could allow the security team to react, for example by blocking IP addresses that open partial TCP connections, or too many TCP connections too quickly.
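
As a rough illustration of that reaction, here is a minimal Python sketch of per-IP connection-rate detection; the `WINDOW_SECONDS` and `MAX_CONNECTIONS` thresholds are arbitrary assumptions, and in practice this kind of detection is usually handled by the firewall or an IDS rather than custom code.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10      # illustrative observation window
MAX_CONNECTIONS = 100    # illustrative per-IP threshold within that window

_recent = defaultdict(deque)  # source IP -> timestamps of its recent connections

def record_connection(ip):
    """Register a connection and report whether the IP exceeds the threshold."""
    now = time.monotonic()
    window = _recent[ip]
    window.append(now)
    # Drop timestamps that have left the observation window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_CONNECTIONS:
        # A real system would raise an alert and possibly block the IP here.
        print(f"ALERT: possible denial-of-service attempt from {ip}")
        return True
    return False
```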

In order to detect these anomalies, a global supervision tool must be used to centralize the different logs. This tool needs to query the monitored services in real time. It can rely on multiple elements, called metrics, such as:

  • CPU load;
  • Number of simultaneous connections (TCP, UDP, application…);
  • Server errors;
  • Simulation of an interaction with the application;
  • Network load (QoS, latency, ping);
  • Connection attempts blocked by the firewall (e.g. Nmap scan detection).

The supervision of these elements must make it possible to create events (alerts) when a significant state change occurs: an excessive CPU load, a push to a repository, a build error, too many simultaneous TCP connections, and so on. For efficient follow-up, criticality levels must then be assigned to the events, so that they can be processed in order of priority, as in a ticket management application.
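
As an illustration of a metric feeding criticality-levelled events, here is a minimal sketch of a polling check on CPU load; the thresholds, the 30-second interval and the choice of the 1-minute load average as the metric are arbitrary assumptions, and real deployments would rely on a dedicated supervision tool rather than a hand-rolled loop.

```python
import os
import time

# Arbitrary thresholds on the 1-minute load average.
CPU_WARNING, CPU_CRITICAL = 2.0, 8.0

def check_cpu_load():
    """Poll the 1-minute load average and turn it into a (criticality, message) event."""
    load1, _, _ = os.getloadavg()   # standard library, Unix-only
    if load1 >= CPU_CRITICAL:
        return ("critical", f"load average {load1:.2f} >= {CPU_CRITICAL}")
    if load1 >= CPU_WARNING:
        return ("medium", f"load average {load1:.2f} >= {CPU_WARNING}")
    return ("informational", f"load average {load1:.2f} is nominal")

if __name__ == "__main__":
    while True:                               # a real agent would push these events
        criticality, message = check_cpu_load()
        print(f"[{criticality}] {message}")   # to the central monitoring server
        time.sleep(30)
```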

Logging and monitoring are often considered to be the same thing, because the monitoring system relies on logs as its main data source, and without quality logs there is no effective monitoring. However, log analysis should not be confused with monitoring: log analysis is post-incident work, while monitoring is permanent work.

3/ Why is the lack of logging and monitoring a vulnerability?

As we have just seen, the implementation of such techniques can be very complex. Indeed, one must be able to store, sort and process this information. Without a good knowledge of the elements to be monitored, several problems can occur:

  • Unlogged state changes.
  • Logging only system errors, which does not make it possible to deal with every problem. However, if you start logging too many items, a storage space problem can quickly arise. It is therefore necessary to know what is important to log and what is less so (setting up a log classification). In this way, the retention time of the information can be adapted to its criticality (see the sketch after this list).
  • Too much information to search in.
  • Lack of correlation between data. If several micro-services are monitored together, the individual elements may not make sense on their own. Human work is therefore required to link this information (contextualization of the logs).
  • Incorrect configuration of events. As a result, some alerts go unnoticed by the monitoring team.
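
As a small illustration of such a log classification, here is a sketch of retention periods chosen per criticality; the category names and durations are arbitrary assumptions to be adapted to the application and to any legal requirements.

```python
# Illustrative log classification: retention period depends on criticality.
RETENTION_DAYS = {
    "debug": 7,           # verbose information, useful only for short-term debugging
    "application": 90,    # functional events: logins, purchases, handled errors
    "security": 365,      # authentication failures, access-control violations
}

def retention_for(category):
    """Return how long (in days) a log entry of the given category is kept."""
    return RETENTION_DAYS.get(category, 30)   # arbitrary default retention
```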

The accumulation of these problems makes the logs unusable. The monitoring systems then become more of a constraint and a waste of time than a help. This is known as insufficient logging and monitoring, which can quickly become a serious problem and a significant vulnerability.

Once log management is no longer efficient, it becomes genuinely difficult for the development team to detect a problem before its impact is significant. An attacker could therefore hide inside an application or a system without being detected until harmful actions have already been performed.

Indeed, the majority of computer attacks could be anticipated and/or stopped if logging and monitoring systems were correctly configured. There are many real cases that demonstrate the danger of such a vulnerability.
For example, Citrix, a company providing a digital workspace platform, discovered that attackers had infiltrated its network only six months after the intrusion began (from October 2018 to March 2019). This allowed the attackers to steal and delete employee files (names, social security numbers and financial information). The intrusion was carried out through brute-force attacks on user account passwords (source). This type of attack could have been detected much earlier if a monitoring system had flagged the large number of failed password attempts.
It is therefore important to select the right information so as not to be drowned in alerts that would otherwise be ignored.
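
To illustrate how such an attack can surface in correctly enriched logs, here is a minimal sketch that counts failed logins per account and source IP; it assumes the hypothetical enriched-event format sketched earlier, and the threshold of 20 failures is an arbitrary example.

```python
from collections import Counter

FAILED_LOGIN_THRESHOLD = 20   # arbitrary: alert once an account/IP pair fails this often

def detect_bruteforce(events):
    """Count failed logins per (user, source IP) pair in a batch of enriched events."""
    failures = Counter(
        (e.get("user"), e.get("ip"))
        for e in events
        if e.get("feature") == "login" and e.get("error") == "invalid password"
    )
    # Each pair returned here would become an alert for the security team,
    # long before months of undetected brute force go by.
    return [pair for pair, count in failures.items() if count >= FAILED_LOGIN_THRESHOLD]
```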

4/ Logging & Monitoring Best Practices

We have seen how complex it is to set up an efficient logging and monitoring system. To help you on this point, here are some best practices that facilitate the implementation and increase the effectiveness of such systems.

  • Know your metrics (what can I log or not?): This requires knowing your system well in order to know what to measure. We can distinguish two types of metrics, and this distinction should be taken into account from the design of the application.
    • Business metrics, which relate to the application itself: for example, the percentage of abandoned shopping carts on an e-commerce site, or the number of pages served per second.
    • Resource metrics, which relate to the resources used to serve the application: for example, the number of active CPUs, or the RAM used by the server at a given time.
  • Select your metrics (what do I have to log?): To do this, it is first recommended to classify your metrics, from the least important (information that is not needed for the health of your application) to the most important (critical information).
    • Based on this classification, select the metrics to monitor from the most important to the least important. Depending on the resources allocated to the monitoring server (disk space, RAM), you can select more or fewer metrics. This classification will allow you to revisit the metric selection if the monitoring service is allocated more resources in the future.
  • Define your alerts: The alerts raised by the supervision system should only apply to the metrics you have just selected. For each metric, you must then define the threshold above which an alert is raised.
  • Classify your alerts (informational, low, medium, important, critical): Depending on their classification, certain actions can be triggered (see the dispatch sketch after this list), for example:
    • Informational: no notification is raised.
    • Low: a notification is created in the monitoring system.
    • Medium: an email is sent to the person in charge of the involved resource.  
    • Important: an email and an SMS are sent to the person in charge of the involved resource.
    • Critical: a call is made to the technical manager of the application in addition to an email.  
  • Manage and classify metrics iteratively: Like agile development, the management of logging rules is never final. You have to revisit these rules each time you implement new features. Moreover, if an element regularly raises alerts that are not useful (false positives), its classification must be reviewed.
  • Separate the logs: The logs of a web application should only contain information related to the application’s functionality and not problems inherent to the server.
  • Define a log structure: Logs must have a format that allows useful information to be extracted easily. The JSON or key->value formats are often adequate (see the structured-logging sketch after this list).
  • Centralize the logs: Supervision must be centralized and managed by a server external to the application. This allows the system that collects all the information to add context to it and to correlate it. Correlation is very important for identifying a problem quickly. For example, suppose the application generates a large number of 500 errors on one feature and, at the same time, the server’s CPU load increases considerably. Taken independently, these two facts make it difficult to know what is causing the increased CPU load. Correlating them, however, tells us that the generation of 500 errors on that feature involves a considerable server load, which can lead to a denial of service of the application. Moreover, the context attached to the logs will make it possible to determine what generates the 500 error and thus handle this exception in the code.
  • Duplicate the supervision: If the monitoring server is out of order (disk space saturation, updates…), a backup plan must be in place. Depending on the company’s means, this can be a duplicate server with hard disks in RAID 5 or a degraded solution (storing raw information on a hard disk external to the system).
  • Choose a logging framework suited to the infrastructure of the application.
  • Document the infrastructure that manages the logs.
  • Evaluate your logging and monitoring system (see the next section).
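
To make the alert classification above concrete, here is a minimal sketch of dispatching an alert according to its criticality; `notify_dashboard`, `send_email`, `send_sms` and `call_on_duty` are hypothetical stand-ins for a real notification or paging system.

```python
# Hypothetical notification helpers, standing in for a real mail/SMS/paging system.
def notify_dashboard(message):    print(f"[dashboard] {message}")
def send_email(owner, message):   print(f"[email to {owner}] {message}")
def send_sms(owner, message):     print(f"[sms to {owner}] {message}")
def call_on_duty(owner, message): print(f"[call to {owner}] {message}")

def handle_alert(criticality, message, owner):
    """Trigger the action matching the alert's criticality level."""
    if criticality == "low":
        notify_dashboard(message)             # notification in the monitoring system
    elif criticality == "medium":
        send_email(owner, message)
    elif criticality == "important":
        send_email(owner, message)
        send_sms(owner, message)
    elif criticality == "critical":
        send_email(owner, message)
        call_on_duty(owner, message)          # call to the technical manager
    # "informational": no notification is raised

handle_alert("important", "disk usage above 90%", "ops-team")
```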
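
And here is a minimal sketch of a JSON log structure built with Python’s standard logging module; the field names are an illustrative choice, not a prescribed schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, easy to parse and to centralize."""
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("webapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("payment feature returned HTTP 500")
# -> {"time": "...", "level": "WARNING", "logger": "webapp", "message": "payment feature returned HTTP 500"}
```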

5/ Evaluating your Logging and Monitoring System

Once the various systems have been put in place, it is now necessary to evaluate their effectiveness. 

A very simple first indicator to check is whether no alerts have been raised for a long time. Indeed, there is always some anomaly to report, even if it is purely informational. Moreover, if the information system has a known problem and the monitoring system raises no alert, there is necessarily a problem with the monitoring configuration.

A good test to perform is to run a vulnerability scanner such as OpenVAS or Burp against your server and application. This type of scanner should raise a multitude of alerts. Moreover, depending on the tests they perform, these scanners allow you to add information to the alerts that are raised. For example, if you configure a scanner to test command injection on a feature, the alerts raised for abuse of that functionality could be classified as command injection attempts.

Once these internal tests and adjustments are done, one or more application penetration tests are a very good trial for your monitoring systems, as they often highlight potential problems. However, a pentester cannot, on their own, evaluate whether the audited company conscientiously performs the log management and supervision of the audited server. Generally speaking, all actions taken during a pentest should raise alerts on your system.

To be exhaustive, it is possible to perform an internal audit in white-box mode: in this case, the auditor has access to the entire infrastructure and can verify in real time that, for each test performed, the corresponding alerts are raised.