Linux offers a rich set of debugging tools that allow the kernel to be instrumented, enabling a better understanding of its internals.
Today, we’d like to show you a way to understand packet drops without the need for any additional tools to be installed.
Suppose you need to understand why your machine is discarding packets, the first tools you might think about are tools like perf
1 and dropwatch
2.
However, what if none of these tools are installed, and you cannot install anything extra?
What if all we have is a shell?
In that case, it’s time to lol, living off the land.
Let’s consider a scenario where we need to understand where and at which point in the Linux network stack packets are being dropped. This is a task that senior system administrators sometimes have to tackle, and it’s also essential for pentesters when assessing the effectiveness of crafted packets.
Before we delve into the details, let’s clarify a few key aspects of packets within Linux’s network stack. In the kernel, a packet is referred to as a “socket buffer”3, commonly abbreviated as skbuff or just skb. A skb is a versatile data structure capable of holding various types of network packets. skbuffs are passed through the network stack both upwards and downwards. Different components in the stack can choose to pass the skb along, modify it, consume it, or release it.
Technically, both consuming and releasing it mean returning the buffer’s memory to the allocator and destroying the socket buffer data structure. Linux distinguishes between successful deliveries and error conditions where the network stack gives up on a particular packet. In this context, we’re concerned with the latter scenario.
A debugging gem hidden deep within the Linux kernel is the function tracer4, often referred to as “ftrace.”
Normally, it acts as the backbone for high-level tools like trace-cmd
5 or perf
.
However, even when you have virtually nothing installed, you can still leverage the function tracer through a filesystem-based interface.
This interface is provided by tracefs
, which is typically mounted at /sys/kernel/debug/tracing
.
This assumes that the function trace feature is enabled in the kernel configuration, which is usually the case.
One notable feature of ftrace is that it imposes zero runtime overhead when disabled.
While ftrace certainly merits its own extensive blog series due to its wealth of functionality, for now, let’s keep it minimal and focused.
For the purposes of this blog post, all file paths are considered relative to /sys/kernel/debug/tracing/
.
ftrace records all events into a trace buffer, and you can access this buffer by reading from the file named trace_pipe
.
A straightforward cat trace_pipe
command will provide a continuous stream of events if any are present.
However, it’s important to note that if there are no events, the cat
command will block and wait for events to appear.
Based on our exploration of the kernel source code, we’ve already identified a function called kfree_skb()
, which is responsible for destroying a socket buffer (skbuff).
Our expectation is that this function is primarily used for non-consuming destruction, which means it’s applied every time the network stack gives up on a packet, resulting in the packet being discarded.
Now, let’s search for the function using ftrace:
$ grep -e kfree_skb available_filter_functions
kfree_skbmem
kfree_skb_reason
kfree_skb_list_reason
__kfree_skb
kfree_skb_partial
__kfree_skb_defer
__dev_kfree_skb_irq
__dev_kfree_skb_any
rtnl_kfree_skbs
__traceiter_kfree_skb
trace_kfree_skb_hit
net_dm_packet_trace_kfree_skb_hit
Nice! There is a __kfree_skb
, sounds good.
Let’s tell ftrace to trace every invocation of this function and arm the function tracer.
$ echo __kfree_skb > set_ftrace_filter
$ echo function > current_tracer
$ echo > trace # just in case, clear old contents in the buffer
$ cat trace_pipe
napi/-8411-1777 [006] b..2. 713195.705113: __kfree_skb <-tcp_ack
napi/-8411-1777 [006] b..2. 713195.705120: __kfree_skb <-tcp_rcv_established
[.. over and over ...]
napi/-8411-1777 [006] b..2. 713229.810263: __kfree_skb <-tcp_ack
napi/-8411-1777 [006] b..2. 713229.810271: __kfree_skb <-tcp_rcv_established
Dang! That’s a lot of output.
Looks like __kfree_skb()
is always called for both consuming and non-consuming releases of an skb.
Let’s disable ftrace again.
$ echo nop > current_tracer
$ echo > set_ftrace_filter
$ echo > trace
Continuing our exploration of the network code, we’ve discovered the function kfree_skb_reason()
.
This function utilizes __kfree_skb()
but takes a crucial function parameter: the “reason”.
This appears to be precisely what we’re looking for.
The reason is of type enum skb_drop_reason
as defined in include/net/dropreason-core.h
.
These reasons encompass values such as SKB_CONSUMED
, SKB_DROP_REASON_NO_SOCKET
, and SKB_DROP_REASON_NETFILTER_DROP
.
Our goal is to trace every invocation of kfree_skb_reason()
when the reason is not equal to SKB_CONSUMED
.
This will help us monitor situations where the network stack abandons a packet for reasons other than consumption.
In our quest to trace and analyze events deeper in the kernel, we encounter a limitation with ftrace’s basic function tracing.
It permits us to filter events based on the function name but doesn’t provide access to function arguments.
Moreover, we only observe function calls without their associated arguments, making a simple grep -v
on the trace_pipe
file insufficient.
To overcome this limitation, we can delve deeper into ftrace’s capabilities, particularly by leveraging custom probes known as “kprobes”6. A kprobe essentially serves as a probe point within the kernel and can be set at various locations, including function entry, function return, or anywhere within a function. By probing at the function entry, we gain access to the function’s arguments, allowing for more detailed analysis and tracing.
Manually installing a new kprobe is a bit cumbersome.
In our case, we want to probe the function kfree_skb_reason()
.
To set up the kprobe, we need to inform ftrace how to access the function’s arguments, which can be machine-specific.
In this example, we’ll use an x86_64 laptop.
On this architecture, the Linux kernel adheres to the System V AMD64 ABI, where the first function argument is stored in the DI
register,
and the second one is stored in the SI
register.
Setting up a kprobe requires specifying a name and a format string. We can use the register names in the format string to access the function’s arguments for analysis.
$ echo 'p:my_skb_probe kfree_skb_reason skb=%di:u64 reason=%si:u64' > kprobe_events
After this command a new directory named my_skb_probe
will show up in events/kprobes/
.
Let’s arm it and read from the trace buffer.
$ echo 1 > events/kprobes/my_skb_probe/enable
$ echo > trace
$ cat trace_pipe
ksoftirqd/1-23 [001] ..s.. 714457.274916: my_skb_probe: (kfree_skb_reason+0x0/0x110) skb=18446617083415795712 reason=2
<idle>-0 [001] ..s1. 714457.290757: my_skb_probe: (kfree_skb_reason+0x0/0x110) skb=18446617083415795200 reason=2
<idle>-0 [001] ..s1. 714457.306989: my_skb_probe: (kfree_skb_reason+0x0/0x110) skb=18446617083415797504 reason=2
<idle>-0 [001] ..s1. 714457.323109: my_skb_probe: (kfree_skb_reason+0x0/0x110) skb=18446617064304559104 reason=2
<idle>-0 [001] .Ns1. 714457.342760: my_skb_probe: (kfree_skb_reason+0x0/0x110) skb=18446617064304564736 reason=2
<idle>-0 [001] .Ns1. 714457.358760: my_skb_probe: (kfree_skb_reason+0x0/0x110) skb=18446617064304561664 reason=2
<idle>-0 [010] ..s1. 714457.366230: my_skb_probe: (kfree_skb_reason+0x0/0x110) skb=0 reason=2
kworker/u32:33-28725 [004] ..... 714457.366293: my_skb_probe: (kfree_skb_reason+0x0/0x110) skb=18446617057572911104 reason=2
<idle>-0 [001] .Ns1. 714457.374797: my_skb_probe: (kfree_skb_reason+0x0/0x110) skb=18446617064304564992 reason=2
While we’ve made progress, there are still many events. However, we can now clearly see the reasons, with reason number 2
representing SKB_DROP_REASON_NOT_SPECIFIED
.
To filter the events more effectively using ftrace, we can instruct it to only emit events when the reason is larger than 2
. We use “larger” because values 0
and 1
correspond to SKB_NOT_DROPPED_YET
and SKB_CONSUMED
.
This approach will allow us to focus on events where the network stack discards packets for reasons other than these two.
$ echo "reason > 2" > events/kprobes/my_skb_probe/filter
$ echo > trace
$ cat trace_pipe
napi/-8412-1778 [014] b.... 714699.253491: my_skb_probe: (kfree_skb_reason+0x0/0x110) skb=18446617078699127552 reason=3
Excellent, by filtering for events with a reason larger than 2
, we’ve successfully narrowed down the events to a more manageable and meaningful set.
Now we’re able to focus on events that truly matter.
Reason number 3
corresponds to SKB_DROP_REASON_NO_SOCKET
, which, in your case, is associated with your test program attempting to connect to a closed port using telnet.
While it’s beneficial to see when Linux drops a packet, the events themselves provide limited information, primarily just the reason number. To gain deeper insights into the process and understand the path the skbuff took through the network stack, the next step is to print a stack trace. This will enable us to visualize the call chain and, hopefully, learn more about the skbuff’s journey within the network stack.
Every event in ftrace can be equipped with triggers, and one trigger type involves printing a stack trace.
$ echo 'stacktrace if reason > 2' > events/kprobes/my_skb_probe/trigger
$ echo > trace
$ cat trace_pipe
napi/-8412-1778 [014] b.... 715083.709832: my_skb_probe: (kfree_skb_reason+0x0/0x110) skb=18446617059302153472 reason=3
napi/-8412-1778 [014] b..1. 715083.709868: <stack trace>
=> kprobe_trace_func
=> kprobe_dispatcher
=> kprobe_ftrace_handler
=> create_port
=> kfree_skb_reason
=> tcp_v4_rcv
=> ip_protocol_deliver_rcu
=> ip_local_deliver_finish
=> __netif_receive_skb_core.constprop.0
=> __netif_receive_skb_list_core
=> netif_receive_skb_list_internal
=> napi_complete_done
=> ath11k_pcic_ext_grp_napi_poll
=> __napi_poll
=> napi_threaded_poll
=> kthread
=> ret_from_fork
Now we see that the dropped packet arrived through the ath11k WiFi driver, and the function tcp_v4_rcv()
decided to drop it on the floor!
To disable the custom probe run:
$ echo 0 > events/kprobes/my_skb_probe/enable
$ echo '!stacktrace if reason > 2' > events/kprobes/my_skb_probe/trigger
$ echo > kprobe_events
Delving into the intricacies of kernel internals and machine-specific function call ABI details can be quite challenging and may not be necessary for everyone. Fortunately, Linux provides a more accessible solution through predefined kprobes known as tracepoints. These tracepoints are strategically placed at crucial locations in the kernel where regular function tracing might not provide sufficient context. In the case of the Linux network stack, the maintainers have thoughtfully provided tracepoints for various skbuff operations. This makes it easier for users to access and gather meaningful information without the need to delve too deeply into the kernel’s internals.
$ ls -1 events/skb
consume_skb
enable
filter
kfree_skb
skb_copy_datagram_iovec
consume_skb
and kfree_skb
look promising! In the event directory is also a file called format
which describes the event.
$ cat events/skb/kfree_skb/format
name: kfree_skb
[.. skipped output ..]
field:void * skbaddr; offset:8; size:8; signed:0;
field:void * location; offset:16; size:8; signed:0;
field:unsigned short protocol; offset:24; size:2; signed:0;
field:enum skb_drop_reason reason; offset:28; size:4; signed:0;
[.. skipped output ..]
The event can provide us with the drop reason. Let’s enable it and filter for reason > 2
$ echo 1 > events/skb/kfree_skb/enable
$ echo "reason > 2" > events/skb/kfree_skb/filter
$ echo > trace
$ cat trace_pipe
napi/-8411-1777 [006] b.... 715880.566050: kfree_skb: skbaddr=000000002d74f5c5 protocol=2048 location=tcp_v4_rcv+0x7d/0xf20 reason: NO_SOCKET
The reason has even a symbolic name and we immediately see the caller! Having a stack trace is of course also possible.
$ echo 'stacktrace if reason > 2' > events/skb/kfree_skb/trigger
$ cat trace_pipe
napi/-8411-1777 [006] b.... 715962.431448: kfree_skb: skbaddr=0000000096888eda protocol=2048 location=tcp_v4_rcv+0x7d/0xf20 reason: NO_SOCKET
napi/-8411-1777 [006] b..1. 715962.431472: <stack trace>
=> trace_event_raw_event_kfree_skb
=> kfree_skb_reason
=> tcp_v4_rcv
=> ip_protocol_deliver_rcu
=> ip_local_deliver_finish
=> __netif_receive_skb_core.constprop.0
=> __netif_receive_skb_list_core
=> netif_receive_skb_list_internal
=> napi_complete_done
=> ath11k_pcic_ext_grp_napi_poll
=> __napi_poll
=> napi_threaded_poll
=> kthread
=> ret_from_fork
Why didn’t we immediately demonstrate the easy way using tracepoints? Because learning something the hard way helps in understanding the basics. Additionally, not all interesting areas in the kernel have matching tracepoints, so sooner or later, you may need to add your own kprobes.
Even with no tracing-specific tooling installed, Linux allows for advanced debugging using a file-based interface. Using the function trace, ftrace, it’s possible to inspect the kernel’s internals easily. In this example, we used ftrace to understand how and where network packets are dropped.
Publish date
02.11.2023
Category
linux
Authors
Richard Weinberger
+43 5 9980 400 00 (email preferred)
sigma star gmbh
Eduard-Bodem-Gasse 6, 1st floor
6020 Innsbruck | Austria