Developer Notes


Analyze NGINX Ingress traffic with Log Analytics

The most widely used ingress controller for Kubernetes is the NGINX ingress controller. It works well with Azure Kubernetes Service (AKS), but it's not particularly well integrated with the Azure ecosystem.

NGINX is commonly monitored with Prometheus, but I don't want to go down that path here; instead, I'll stick with what AKS gives us out of the box: container logs in Azure Log Analytics.

Ingress monitoring

The following metrics are commonly used for monitoring and alerting:

  • Data volume
  • Number of requests
  • Failed requests
  • Request duration

Usually, it's helpful to split these metrics by service, method, or URL.

Most of this data can be derived from the container logs, where the NGINX ingress logs all traffic by default. The container logs are streamed to an Azure Log Analytics workspace, where they can be queried with Kusto.

Finding the right container logs

Container logs are lumped into one big bucket. To filter for the relevant entries, we first need to know which containers handled ingress requests. Using the KubePodInventory table, we can figure out which containers were running for a specific pod. In this case, we're looking in the production namespace for pods that have ingress in the name.

// Find the ingress containers in the production namespace
let containers = KubePodInventory
    | where Name has "ingress" and Namespace == 'production' // adjust this filter as needed
    | extend Deployment = trim_end("-[a-f0-9]+", ControllerName)
    | distinct Deployment, ControllerName, ContainerName, ContainerID, Namespace;

This query takes a bit of a shortcut by deriving the deployment name from the controller name.
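The trim_end function removes the trailing part of the string that matches the regular expression, which strips the ReplicaSet hash from the controller name. A small illustration with a made-up controller name:

// Hypothetical example: trim_end strips the trailing hash suffix
print Deployment = trim_end('-[a-f0-9]+', 'nginx-ingress-6b9c4d5f7d')
// returns 'nginx-ingress'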

Using this list of containers we can dig into the container logs and filter for the right containers:

// Get all weblog entries from the container logs
let logs = ContainerLog
    | where TimeGenerated > ago(24h) // adjust as needed
    | where LogEntrySource != 'stderr'
    | join kind=inner (containers) on ContainerID

Parsing the logs

The NGINX log format is described on the Kubernetes site.

// Parse the formatted line into columns
    | parse kind=relaxed LogEntry with remote_addr '- ' remote_user ' [' localtime '] "' request '" ' status ' ' body_bytes_sent:int ' "' http_referrer '" "' http_user_agent '" ' request_length:int ' ' request_time:real ' [' proxy_upstream_name '] [' proxy_alternative_upstream_name '] ' upstream_addr ' ' upstream_response_length:int ' ' upstream_response_time:real ' ' upstream_status:int ' ' req_id
// Parse the request into method, path, and protocol
    | parse kind=regex request with method ' ' path ' ' protocol
// Clean up the upstream name so we can aggregate by service
    | extend service=trim_end("-[a-f0-9]+", trim_start('production-', proxy_upstream_name))
// Select only the fields we need to reduce the data volume
    | project TimeGenerated, Deployment, remote_addr, remote_user, localtime, request, method, path, protocol, status, body_bytes_sent, http_referrer, http_user_agent, request_length, request_time, upstream_addr, service, upstream_response_length, upstream_response_time, upstream_status
    | order by TimeGenerated desc;

Let's dig into this query to see what it does.

Parse is a powerful operator that expands a line of text into separate columns. It does most of the heavy lifting, but we need multiple passes to extract everything. Getting this right takes some effort and can be really difficult when rows are not uniform. Fortunately, the NGINX logs are well structured.

In the first parse, we explode the formatted line into separate values, assigning types where possible. For example, request_time is a floating-point number indicating the request duration, while request_length indicates the length of the request in bytes. We need the right data types to be able to aggregate later, for example to compute the average request duration or the total bytes transferred.

Next, the request field is parsed again to get the method, path, and protocol.

To add the service name for the request the row is extended with the trimmed-down version of the proxy_upstream_name field.

Finally, we get rid of all the raw data using the project operator. This helps optimize performance in the next steps.

  • TimeGenerated: The moment in time the log line was captured
  • Deployment: The name of the ingress deployment
  • remote_addr: The IP address of the incoming connection, usually the internal IP of the LoadBalancer that forwarded the traffic
  • remote_user: The user name supplied with Basic authentication
  • localtime: The time of the request in the Common Log Format, for example: 16/Nov/2022:21:55:46 +0000. This is a string, but it may be worth parsing if you need better precision than the TimeGenerated field provides
  • request: The full original request line (method path protocol)
  • method: The HTTP method, for example: POST (parsed from the request field)
  • path: The request path and query, for example: /api/test?q=abc (parsed from the request field)
  • protocol: The protocol, for example: HTTP/2.0 (parsed from the request field)
  • status: The HTTP status returned to the client, for example: 200 or 404
  • body_bytes_sent: The size of the response body sent to the client in bytes, excluding headers
  • http_referrer: The referrer URL, if available
  • http_user_agent: The user agent, for example: Mozilla/5.0 (Windows NT 10.0; Win64; x64)...etc. Use this to distinguish organic from synthetic traffic
  • request_length: The request length in bytes, including request line, headers, and request body
  • request_time: The time elapsed in seconds since the first bytes were read from the client
  • upstream_addr: The IP address the request was forwarded to
  • service: The Kubernetes service that handled the request (parsed from the proxy_upstream_name field)
  • upstream_response_length: The length in bytes of the response received from the service
  • upstream_response_time: The time in seconds it took the service to respond
  • upstream_status: The HTTP response status returned by the service
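If you need more precision than TimeGenerated offers, the localtime field can be converted into a proper datetime column. A minimal sketch, assuming all timestamps carry the +0000 (UTC) offset as in the example above:

// Sketch: turn the Common Log Format timestamp into a datetime (assumes UTC)
    | parse localtime with day:int '/' monthname '/' year:int ':' hour:int ':' minute:int ':' second:int ' ' tzoffset
    | extend month = indexof('JanFebMarAprMayJunJulAugSepOctNovDec', monthname) / 3 + 1
    | extend localtime_utc = make_datetime(year, month, day, hour, minute, second)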

Analyzing traffic

With this base data set, we can start analyzing traffic. Use the base query defined so far and add projections and aggregations as needed. Make sure to only include fields you need to limit the processing time.

Requests by service over time

This returns the number of requests per service, per hour, across all ingresses included in the data. The bin function sets the time interval. This kind of data is best displayed in a stacked bar chart.

// include base queries
| project TimeGenerated, service
| summarize count() by service, bin(TimeGenerated, 1h)
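When running this interactively, the render operator can turn the result directly into a chart. Support for render properties varies by client, but in the Log Analytics query editor something like this should work:

// append to the query above to chart it
| render columnchart with (kind=stacked)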

Inbound data volume per ingress over time

Summarize the incoming data volume in bytes (from the request_length field) per ingress (from the Deployment field) per 15 minutes.

// include base queries
| project TimeGenerated, Deployment, request_length
| summarize sum(request_length) by Deployment, bin(TimeGenerated, 15m)

Average response time by service

Capacity problems usually show up as slow responses. This query computes the average response time in seconds for all requests to a service, per 15 minutes. This type of data is best displayed in a line chart, so any spikes or trends become visible.

// include base queries
| project TimeGenerated, service, request_time
| summarize avg(request_time) by service, bin(TimeGenerated, 15m)
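If the average looks healthy but users still report slow responses, percentiles are a useful variation on the same query, since averages can hide slowness that only affects a fraction of requests:

// include base queries
| project TimeGenerated, service, request_time
| summarize percentiles(request_time, 50, 95, 99) by service, bin(TimeGenerated, 15m)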

Failed requests

I use two views on failed requests. The first is on a global level, counting requests and splitting them by status code. This can help spot infrastructure problems. Note that this query uses the status code returned by the ingress, from the status field.

// include base queries
| summarize count() by tostring(status), bin(TimeGenerated, 15m)

The second view shows the failed requests per service. By HTTP conventions, a failed request is any response with a status code greater than 399. If you have a public site there will always be 404 responses, so it may be good to filter those out. This query looks at service health, so the upstream_status field is preferable here.

// include base queries
| where upstream_status > 399 and upstream_status != 404
| summarize count() by service, bin(TimeGenerated, 15m)

Malicious requests

While processing data with these queries, I noticed some traffic was a bit off. Some of it looked like attempted exploits, and some was due to poorly implemented third-party software trying to access the system.

Even if you have a Web Application Firewall (WAF) in front of your ingress some bad traffic will slip through or may even come from inside your network. The following query captures most of these requests:

// include base queries
| where service == '' or method == '' or protocol !startswith 'HTTP'
| order by TimeGenerated desc

This query is by no means perfect but it's good enough to get an idea of what the outside world is throwing at you.

Final thoughts

The container logs for the NGINX ingress contain a lot of detail that is great for traffic analysis. Do keep in mind that this type of processing can be slow. Try to keep the time window small: 24 hours or less.

If you process a lot of requests, it's probably better to look at integrating Prometheus to get performance metrics.