Analyze NGINX Ingress traffic with LogAnalytics
Posted 16 Nov 2022
The most used ingress for Kubernetes is the default NGINX ingress. It works well with Azure Kubernetes Service (AKS) but it's not that well integrated with the Azure ecosystem.
NGINX is commonly monitored using Prometheus, but I don't want to go down that path here and stick with what AKS gives us out of the box: container logs in Azure LogAnalytics.
Ingress monitoring
The following metrics are commonly used for monitoring and alerting:
- Data volume
- Number of requests
- Failed requests
- Request duration
Usually, it's helpful to split these metrics by service, method, or URL.
Most of this data can be derived from the container logs, where the NGINX ingress will log all traffic by default. The container logs are streamed to an Azure LogAnalytics Workspace where they can be queried with Kusto.
Check Container Insights for the AKS cluster if you need memory and cpu usage merics for the ingress workloads.
Finding the right container logs
Container logs are lumped into one big bucket. To filter for the relevant entries we first need to know what containers were handling ingress requests. Using the KubePodInventory table we can figure out what containers were running for a specific pod. In this case, we're looking in the production namespace for pods that have ingress in the name.
// Find the ingress containers in the production namespace
let containers = KubePodInventory
| where Name has "ingress" and Namespace == 'production' // adjust this filter as needed
| extend Deployment = trim_end("-[a-f0-9]+", ControllerName)
| distinct Deployment, ControllerName, ContainerName, ContainerID, Namespace;
This query takes a bit of a shortcut by deriving the deployment name from the controller name.
Using this list of containers we can dig into the container logs and filter for the right containers:
// Get all weblog entries from the container logs
let logs = ContainerLog
| where TimeGenerated > ago (24h) // adjust as needed
| where LogEntrySource != 'stderr'
| join kind=inner (containers) on ContainerID
Parsing the logs
The NGINX log format is described on the Kubernetes site.
// Parse the formatted line into columns
| parse kind=relaxed LogEntry with remote_addr '- ' remote_user ' [' localtime '] "' request '" ' status ' ' body_bytes_sent:int ' "' http_referrer '" "' http_user_agent '" ' request_length:int ' ' request_time:real ' [' proxy_upstream_name '] [' proxy_alternative_upstream_name '] ' upstream_addr ' ' upstream_response_length:int ' ' upstream_response_time:real ' ' upstream_status:int ' ' req_id
// Parse the request into method, path, and protocol
| parse kind=regex request with method ' ' path ' ' protocol
// Clean up the upstream name so we can aggregate by service
| extend service=trim_end("-[a-f0-9]+", trim_start('production-', proxy_upstream_name))
// Select only the fields we need to reduce the data volume
| project TimeGenerated, Deployment, remote_addr, remote_user, localtime, request, method, path, protocol, status, body_bytes_sent, http_referrer, http_user_agent, request_length, request_time, upstream_addr, service, upstream_response_length, upstream_response_time, upstream_status
| order by TimeGenerated desc;
Let's dig into this query to see what it does.
Parse is a powerful operator that can expand a line of text into data. It handles a lot of work but we need multiple passes to get everything we need. It's quite a bit of work to get this right and can be really difficult when rows are not uniform. Fortunately, the NGINX logs are quite well structured.
In the first parse we explode the formatted line into separate values, assigning types where possible. For example, request_time
is a floating point number indicating the request duration while request_length
indicates the length of the request in bytes. We need to have the right data type to be able to aggregate later. For example, average request duration or total bytes transferred.
Next, the request
field is parsed again to get the method, path, and protocol.
To add the service name for the request the row is extended with the trimmed-down version of the proxy_upstream_name
field.
Finally, we get rid of all the raw data using the project operator. This helps optimize performance in the next steps.
Field | Description |
TimeGenerated | The moment in time the log line was captured |
Deployment | Ingress name |
remote_addr | The IP address of the upstream connection, usually the internal IP of the LoadBalancer that forwarded the traffic. |
remote_user | User name supplied with the Basic authentication |
localtime | The time of the request in the Common Log Format. This is a string but may be worth parsing if better precision is needed than the TimeGenerated field provides. For example: 16/Nov/2022:21:55:46 +0000 |
request | Full original request line (method path protocol) |
method | HTTP method, for example: POST (parsed from the request field) |
path | Request path and query, for example: /api/test?q=abc (parsed from the request field) |
protocol | Protocol, for example: HTTP/2.0 (parsed from the request field) |
status | HTTP status returned to the client, for example: 200 or 404 |
body_bytes_sent | The size of the request body in bytes, excluding headers |
http_referrer | The referrer URL, if available |
http_user_agent | The user agent. Use this to distinguish organic from synthetic traffic . For example: Mozilla/5.0 (Windows NT 10.0 Win64 x64) ...etc |
request_length | Request length in bytes including request line, header, and request body |
request_time | Time elapsed in seconds since the first bytes were read from the client |
upstream_addr | The IP address the request was forwarded to |
service | The Kubernetes services that handled the request (parsed from the proxy_upstream_name field) |
upstream_response_length | The length of the response in bytes received from the service |
upstream_response_time | The time in seconds it took for the service to respond |
upstream_status | HTTP response status returned by the service |
Remove fields you do not need to speed up the query.
Analyzing traffic
With this base data set, we can start analyzing traffic. Use the base query defined so far and add projections and aggregations as needed. Make sure to only include fields you need to limit the processing time.
Requests by service over time
This returns the number of requests by service across all ingresses included in the data per hour. The bin function is used to set the time interval. This kind of data is best displayed in a stacked bar chart.
// include base queries
logs
| project TimeGenerated, service
| summarize count() by service, bin(TimeGenerated, 1h)
Inbound data volume per ingress over time
Summarize the incoming data volume in bytes (from the request_length field) per ingress (from the deployment field) per 15 minutes.
// include base queries
logs
| project TimeGenerated, Deployment, request_length
| summarize sum(request_length) by Deployment, bin(TimeGenerated, 15m)
Average response time by service
Capacity problems usually show up as slow responses. This query takes the average response time for all requests to a service (in seconds) per 15 minutes. This type of data is best displayed in a line chart so any spikes or trends become visible.
// include base queries
logs
| project TimeGenerated, service, request_time
| summarize avg(request_time) by service, bin(TimeGenerated, 15m)
Failed requests
I use two views on failed requests. First on a global level, by counting requests and splitting them by status code. This can help spot infrastructure problems. Note that this query uses the status code returned by the ingress here, from the status field.
// include base queries
logs
| summarize count() by tostring(status), bin(TimeGenerated, 15m)
The second view shows the failed requests per service. A failed request, by HTTP standards, is any response code greater than 399. If you have a public site there will always be 404 responses so it may be good to filter that out. This query looks at service health so the upstream_status is preferable here.
// include base queries
logs
| where upstream_status > 399 and upstream_status != 404
| summarize count() by service, bin(TimeGenerated, 15m)
Malicious requests
While processing data with these queries I noticed some traffic was a bit off. Some of it looked like attempts at exploits and some of it was due to poorly implemented 3rd party software trying to access the system.
Even if you have a Web Application Firewall (WAF) in front of your ingress some bad traffic will slip through or may even come from inside your network. The following query captures most of these requests:
// include base queries
logs
| where service == ''or method == ''or protocol !startswith('HTTP')
| order by TimeGenerated desc
This query is by no means perfect but it's good enough to get an idea of what the outside world is throwing at you.
Final thoughts
The container logs for the NGINX ingress contain a lot of details that are great for traffic analysis. Do keep in mind that this type of processing can be slow. Try to make keep the time window small: 24 hours or less.
If process a lot of requests it's probably better to look at integrating Prometheus to get performance metrics.