A tale of bonding interfaces, Linux kernel debugging, eBPF tools and netlink messages.
How it started
The other day a colleague and I were experimenting with bonding interfaces on a Debian
Bullseye machine. At one point we tried restarting the networking service
(systemctl restart networking) and noticed that the MAC address used by the bond0 interface had changed.
This felt weird. Why would a plain restart of the networking service not result in a consistent outcome compared with the state after a machine reboot?
Preparing a machine for testing
To see it in action all you need to do is install ifenslave and then modify /etc/network/interfaces to include bond settings as described in the Official Debian Configuration Example (with some tweaks to match your available interfaces, of course). In my case it ended up looking like so:
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
source /etc/network/interfaces.d/*
# The loopback network interface
auto lo
iface lo inet loopback
auto enp0s3
# The primary network interface
allow-hotplug enp0s3
iface enp0s3 inet dhcp
auto bond0
iface bond0 inet static
address 10.31.1.5
netmask 255.255.255.0
bond-slaves enp0s8
bond-mode active-backup
bond-miimon 100
bond-downdelay 200
bond-updelay 200
... where enp0s3 is the interface I am reaching the machine over (so just ignore it below, it is not interesting for our purposes) and enp0s8 is an unused interface that will act as the bonding slave.
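Once the machine is up with this configuration you can sanity-check that the bond came up in the intended mode and with the intended slave through the bonding driver's procfs interface (output omitted here since it varies per setup):
# cat /proc/net/bonding/bond0
This lists the bonding mode, the MII status, and each slave together with its permanent HW address.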
Setting the stage
Now, upon restarting the machine you will see something like this:
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
link/ether 08:00:27:19:f3:3e brd ff:ff:ff:ff:ff:ff
inet 192.168.1.127/24 brd 192.168.1.255 scope global dynamic enp0s3
valid_lft 43128sec preferred_lft 43128sec
inet6 fe80::a00:27ff:fe19:f33e/64 scope link
valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UNKNOWN group default qlen 1000
link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff permaddr 08:00:27:e2:37:5f
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff
inet 10.31.1.5/24 brd 10.31.1.255 scope global bond0
valid_lft forever preferred_lft forever
inet6 fe80::40e2:89ff:fe5b:6dde/64 scope link
valid_lft forever preferred_lft forever
The important detail is that bond0 and enp0s8 both have a made-up MAC address of 42:e2:89:5b:6d:de. You can also see that enp0s8 has a permaddr 08:00:27:e2:37:5f field, signifying that its MAC address is currently being overwritten by the made-up address on bond0.
Now, watch what happens if we restart the networking service:
# systemctl restart networking
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
link/ether 08:00:27:19:f3:3e brd ff:ff:ff:ff:ff:ff
inet 192.168.1.127/24 brd 192.168.1.255 scope global dynamic enp0s3
valid_lft 43198sec preferred_lft 43198sec
inet6 fe80::a00:27ff:fe19:f33e/64 scope link
valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UNKNOWN group default qlen 1000
link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff
inet 10.31.1.5/24 brd 10.31.1.255 scope global bond0
valid_lft forever preferred_lft forever
inet6 fe80::a00:27ff:fee2:375f/64 scope link
valid_lft forever preferred_lft forever
Now the MAC address of bond0 has changed from 42:e2:89:5b:6d:de to 08:00:27:e2:37:5f, and the enp0s8 slave interface no longer has the permaddr line. Basically, what seems to happen is that when the machine boots up, bond0 gets a virtual address attached to it, and this overwrites the address on the slave interface; upon restarting the networking service the opposite happens: now the slave address overwrites the bond0 address instead.
Simplifying the reproduction steps
As with any debugging, the more moving parts you can remove, the better off you are when trying to figure out what is going on. Restarting the networking service involves systemctl, the ifupdown tools and so on. What is the smallest set of steps we could use to reproduce this?
Basically it boils down to creating a fresh bond0:
# ip link del dev bond0
# ip link add dev bond0 type bond
... verifying the current MAC addresses on the bond as well as on the yet-to-be-enslaved device:
# ip link show dev bond0
5: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff
# ip link show dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff
... adding the slave to the bond by manually writing to /sys:
# echo +enp0s8 > /sys/class/net/bond0/bonding/slaves
... checking the new state of the addresses on both the bond and the slave (the bond0 address is used to overwrite the slave MAC):
# ip link show dev bond0
5: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff
# ip link show dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff permaddr 08:00:27:e2:37:5f
... removing the slave again:
# echo -enp0s8 > /sys/class/net/bond0/bonding/slaves
... checking the new state of the devices (bond0 now has a random address, and the slave interface is back to its normal MAC):
# ip link show dev bond0
5: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 4e:25:45:6e:d9:6b brd ff:ff:ff:ff:ff:ff
# ip link show dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff
... adding the slave back again:
# echo +enp0s8 > /sys/class/net/bond0/bonding/slaves
... checking the addresses of the devices again (now bond0 is using the MAC address from the slave):
# ip link show dev bond0
5: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff
# ip link show dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff
It was here that a pattern emerged:
- When creating a fresh slave-less bond0 device it would always get the made-up, but constant, 42:e2:89:5b:6d:de address.
- When adding a slave for the first time the slave would take on this 42:e2:89:5b:6d:de address.
- When removing the slave for the first time the 42:e2:89:5b:6d:de address on bond0 would be replaced by a random address.
- When adding the slave again the bond0 interface would take on the address of the slave interface.
- From this state on, any time the slave is added bond0 will use the slave MAC, and any time the slave is removed bond0 will get a new random address.
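To make the pattern easy to re-check, the whole cycle condenses into a small script (a sketch that assumes the same unused enp0s8 interface as above; run as root):
#!/bin/sh
# Reproduce the enslave/release cycle and print the bond0 MAC at each step.
set -e
ip link del dev bond0 2>/dev/null || true
ip link add dev bond0 type bond
echo "fresh bond0:   $(cat /sys/class/net/bond0/address)"
echo +enp0s8 > /sys/class/net/bond0/bonding/slaves
echo "first enslave: $(cat /sys/class/net/bond0/address)"
echo -enp0s8 > /sys/class/net/bond0/bonding/slaves
echo "after release: $(cat /sys/class/net/bond0/address)"
echo +enp0s8 > /sys/class/net/bond0/bonding/slaves
echo "re-enslave:    $(cat /sys/class/net/bond0/address)"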
What actually is the expected behaviour of MAC addresses used by a bonding interface then? There is some information at Where does a bonding device get its MAC address from?, namely:
If not explicitly configured (with ifconfig or ip link), the MAC address of the bonding device is taken from its first slave device. This MAC address is then passed to all following slaves and remains persistent (even if the first slave is removed) until the bonding device is brought down or reconfigured.
This information only seems to be a partial match for our experience. When initially creating the bonding interface it is actually the bond0 MAC winning over the slave interface MAC, and not the other way around. Only after emptying the slave list is the documented behaviour observed. This made me wonder how things are actually implemented in the bonding driver. Let's take a look!
Some kernel internals
The bonding driver lives in the drivers/net/bonding directory of the kernel source code repo, and while the real upstream repos are available at git.kernel.org I will link to a mirror on GitHub for easy reference.
Now, looking at bond_init() in bond_main.c we can see what happens during initialization of a bonding interface here:
/* Ensure valid dev_addr */
if (is_zero_ether_addr(bond_dev->dev_addr) &&
bond_dev->addr_assign_type == NET_ADDR_PERM)
eth_hw_addr_random(bond_dev);
So given that the MAC address is all zeros and the addr_assign_type is NET_ADDR_PERM, we generate and apply a random address. What does NET_ADDR_PERM actually mean? Well, it is defined here:
/* hardware address assignment types */
#define NET_ADDR_PERM 0 /* address is permanent (default) */
#define NET_ADDR_RANDOM 1 /* address is generated randomly */
#define NET_ADDR_STOLEN 2 /* address is stolen from other device */
#define NET_ADDR_SET 3 /* address is set using
* dev_set_mac_address() */
So basically when the interface is created it comes up with a zeroed-out MAC address and an assignment type of NET_ADDR_PERM. The bonding driver is then supposed to overwrite this with a random address. The call to eth_hw_addr_random(bond_dev) looks like this:
/**
* eth_hw_addr_random - Generate software assigned random Ethernet and
* set device flag
* @dev: pointer to net_device structure
*
* Generate a random Ethernet address (MAC) to be used by a net device
* and set addr_assign_type so the state can be read by sysfs and be
* used by userspace.
*/
static inline void eth_hw_addr_random(struct net_device *dev)
{
dev->addr_assign_type = NET_ADDR_RANDOM;
eth_random_addr(dev->dev_addr);
}
So at this point the assignment type should be updated from NET_ADDR_PERM to NET_ADDR_RANDOM, and the address is set to some random bytes.
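For completeness, eth_random_addr() itself is a tiny helper (quoting from memory of include/linux/etherdevice.h): it fills the buffer with random bytes and then fixes up two bits, which is also why every generated address we have seen so far is a locally administered unicast address:
static inline void eth_random_addr(u8 *addr)
{
	get_random_bytes(addr, ETH_ALEN);
	addr[0] &= 0xfe;	/* clear multicast bit */
	addr[0] |= 0x02;	/* set local assignment bit (IEEE802) */
}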
Continuing on, once we add our first slave interface, this is handled by the bond_enslave() function, which does this:
/* If this is the first slave, then we need to set the master's hardware
* address to be the same as the slave's.
*/
if (!bond_has_slaves(bond) &&
bond->dev->addr_assign_type == NET_ADDR_RANDOM) {
res = bond_set_dev_addr(bond->dev, slave_dev);
if (res)
goto err_undo_flags;
}
So here we have the code that is responsible for overwriting the MAC address on the bonding interface, given that it currently has no slaves and the state of the address is NET_ADDR_RANDOM. The specific operations in bond_set_dev_addr() look like this:
memcpy(bond_dev->dev_addr, slave_dev->dev_addr, slave_dev->addr_len);
bond_dev->addr_assign_type = NET_ADDR_STOLEN;
So it overwrites the bond dev_addr with the same field from the slave interface, and updates the assignment type to NET_ADDR_STOLEN.
Finally, let's take a quick look at what happens when we remove the slave; this is handled by __bond_release_one() here:
if (!bond_has_slaves(bond)) {
bond_set_carrier(bond);
eth_hw_addr_random(bond_dev);
}
If there are no slaves remaining we set the carrier state to down (based on not having any slaves inside bond_set_carrier()) and then generate a new random MAC address, updating the assignment type to NET_ADDR_RANDOM again.
So how does that match what we are actually seeing? Well, from bond_init() it looks like we should be getting a new random MAC address any time we create a fresh bond0, but this is not what we see on our machine. Instead we always get the same 42:e2:89:5b:6d:de address.
I did however think that I had a reasonable idea why removing the slave would "reset" the state of the bond0 interface. The code in __bond_release_one() unconditionally sets a random address with the NET_ADDR_RANDOM assignment type on the bond interface once the last slave is removed. This would then make a follow-up bond_enslave() (when adding the slave back) pass its bond->dev->addr_assign_type == NET_ADDR_RANDOM check and make bond0 use the slave MAC.
In a way I think it is a bit strange that the bonding driver honors a configured address when adding the first slave, but does not honor it when removing the slave again. Digging around in commit logs, it seems the commit that introduced the bond->dev->addr_assign_type == NET_ADDR_RANDOM check intentionally left it this way:
[...] In all cases if the bond device is left without any slaves - its MAC gets reset to a random one as before.
What I did not understand was the state of bond0 when newly created. It obviously was not able to pass the bond->dev->addr_assign_type == NET_ADDR_RANDOM check in bond_enslave(), since it opted to retain its own address instead of taking on the slave MAC. My initial goal was to verify the state in bond_init(): was it able to pass the is_zero_ether_addr(bond_dev->dev_addr) && bond_dev->addr_assign_type == NET_ADDR_PERM check, which was a requirement for setting the NET_ADDR_RANDOM state that would be necessary to get the "standard" behaviour of immediately using the slave MAC?
Unfortunately I did not think I had the appropriate debugging messages available to verify this, therefore...
printf()-debugging the kernel
What I ideally wanted to do was to verify whether the addr_assign_type state of bond0 was being set as expected before and after the bond_init() code shown earlier. If I could only print the contents of addr_assign_type before and after that code...
I have not built a Linux kernel for years, but it turns out Debian has some decent documentation on how to do this. I followed the steps laid out in 4.2. Rebuilding official Debian kernel packages, and what followed was basically a loop of me sprinkling netdev_dbg() and slave_dbg() calls around bond_main.c, running bash debian/bin/test-patches ../my-bond-patch.patch, installing the resulting .deb package, rebooting, and seeing what I would learn. The first .deb I created was not accepted by dpkg, but based on the help given to someone else in a mailing list thread, Re: Proper way to apply and test a small kernel patch?, I realized I needed to bump the abiname setting in debian/config/defines before building, and then I was able to reboot into my modified kernel.
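For illustration, the kind of debug call I was sprinkling around looked roughly like this (a sketch; the message text and exact placement were my own, not upstream code):
/* Added near the address check in bond_init() to expose the state: */
netdev_dbg(bond_dev, "bond_init: addr_assign_type=%d\n",
	   bond_dev->addr_assign_type);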
I soon noticed that my debug printing was not turning up anywhere... what now? Well, it turns out the debug messages emitted by netdev_dbg() and slave_dbg() are dynamically managed by something called Dynamic debug (dyndbg), controlled from a file called /sys/kernel/debug/dynamic_debug/control. Interesting! Looking at it, it contained lines such as the following (the bonding-related lines only appeared after the bonding kernel module was loaded):
# filename:lineno [module]function flags format
drivers/net/bonding/bond_main.c:5190 [bonding]bond_init =_ "Begin bond_init\012"
drivers/net/bonding/bond_main.c:5063 [bonding]bond_check_params =_ "Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details\012"
drivers/net/bonding/bond_main.c:4374 [bonding]bond_update_slave_arr =_ "(slave %s): Adding slave to tx hash array[%d]\012"
drivers/net/bonding/bond_main.c:4350 [bonding]bond_update_slave_arr =_ "bond_3ad_get_active_agg_info failed\012"
drivers/net/bonding/bond_main.c:4083 [bonding]bond_set_mac_address =_ "(slave %s): %s: unwind err %d\012"
drivers/net/bonding/bond_main.c:4059 [bonding]bond_set_mac_address =_ "(slave %s): %s: err %d\012"
drivers/net/bonding/bond_main.c:4049 [bonding]bond_set_mac_address =_ "(slave %s): %s: slave=%p\012"
drivers/net/bonding/bond_main.c:4036 [bonding]bond_set_mac_address =_ "%s: bond=%p\012"
The =_ means that the specific log message defined on that line is currently disabled (has no flags), and =p means the p flag ("enables the pr_debug() callsite") is set and the message will be printed. You control the flags set per message by writing strings to the file. While you can narrow this down to the specific line number you want to enable (an example follows after the next listing), I just went with enabling all logs related to bond_main.c like so:
echo 'file bond_main.c +p' > /sys/kernel/debug/dynamic_debug/control
Resulting in lines like this:
# filename:lineno [module]function flags format
drivers/net/bonding/bond_main.c:5190 [bonding]bond_init =p "Begin bond_init\012"
drivers/net/bonding/bond_main.c:5063 [bonding]bond_check_params =p "Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details\012"
drivers/net/bonding/bond_main.c:4374 [bonding]bond_update_slave_arr =p "(slave %s): Adding slave to tx hash array[%d]\012"
drivers/net/bonding/bond_main.c:4350 [bonding]bond_update_slave_arr =p "bond_3ad_get_active_agg_info failed\012"
drivers/net/bonding/bond_main.c:4083 [bonding]bond_set_mac_address =p "(slave %s): %s: unwind err %d\012"
drivers/net/bonding/bond_main.c:4059 [bonding]bond_set_mac_address =p "(slave %s): %s: err %d\012"
drivers/net/bonding/bond_main.c:4049 [bonding]bond_set_mac_address =p "(slave %s): %s: slave=%p\012"
drivers/net/bonding/bond_main.c:4036 [bonding]bond_set_mac_address =p "%s: bond=%p\012"
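Had I wanted finer granularity, dyndbg also accepts per-line matches, so enabling just the bond_init message from the listing above should be a matter of:
# echo 'file bond_main.c line 5190 +p' > /sys/kernel/debug/dynamic_debug/control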
It was pretty cool to open that file after booting my modified kernel and instantly seeing references to my own hastily added debug calls :).
So it wasn't the bonding driver after all
With my added debug logging now showing up in dmesg I pretty soon learned the following:
- The bond_init() code actually did manage to update the NET_ADDR_PERM to NET_ADDR_RANDOM on the newly created bond0.
- However, when the bond_enslave() code was checking for NET_ADDR_RANDOM being set to decide if it should use the slave MAC address, it was actually set to... NET_ADDR_SET.
OK? So we have seen how NET_ADDR_PERM is the default value when creating the fresh bond0 interface, that bond_init() updates it to NET_ADDR_RANDOM, and that bond_enslave() (if it sees that bond0 is currently in that NET_ADDR_RANDOM state) will apply the slave MAC address and mark bond0 as NET_ADDR_STOLEN. What then is this NET_ADDR_SET doing there? As we could see in the definitions shown earlier, NET_ADDR_SET is the state when the "address is set using dev_set_mac_address()". Basically, dev_set_mac_address() is what is called when someone (or something) manually configures a MAC address on a device.
It was here somewhere that I noticed that you can actually inspect the current addr_assign_type of an interface by reading the integer in /sys/class/net/bond0/addr_assign_type, so after creating the bond0 interface:
# cat /sys/class/net/bond0/addr_assign_type
3
Indeed, it is set to 3 (NET_ADDR_SET). And after adding and removing the slave interface, it had then been modified by the __bond_release_one() function to be NET_ADDR_RANDOM (1) again:
# echo +enp0s8 > /sys/class/net/bond0/bonding/slaves
# echo -enp0s8 > /sys/class/net/bond0/bonding/slaves
# cat /sys/class/net/bond0/addr_assign_type
1
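As a quick sanity check on what a value of 3 implies, you can provoke NET_ADDR_SET yourself by configuring an address from userspace (the address below is just a made-up locally administered one), since that path ends up in dev_set_mac_address():
# ip link set dev bond0 address 02:00:00:00:00:01
# cat /sys/class/net/bond0/addr_assign_type
... which should then read back as 3 again.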
Here I quickly jumped over to another machine running Debian Buster, and I noticed that when creating a bond0 interface via ip link add dev bond0 type bond it would actually end up with a state of NET_ADDR_RANDOM. Huh. So for some reason the older release brings up bond0 as expected, but on Bullseye it instead ends up with NET_ADDR_SET. Interesting.
So to reiterate: we now know that for some reason bond0 ends up in the NET_ADDR_SET state on creation, and that this is the result of something calling dev_set_mac_address(). I had already gone through and added printf-debugging to all the places in bond_main.c where dev_set_mac_address() could be called, and none of them fired on creation. For this reason I no longer thought the bonding driver was the culprit. But adding debug logging to every place in the kernel that calls dev_set_mac_address()? That didn't seem reasonable. If only there was a way to instrument the kernel to tell me when the function was called, and ideally show a backtrace of how we ended up there... oh wait, there is!
Activate X-ray vision
There is a lot of buzz surrounding eBPF these days, and it is used in all manner of places. Quoting from https://ebpf.io:
Today, eBPF is used extensively to drive a wide variety of use cases: Providing high-performance networking and load-balancing in modern data centers and cloud native environments, extracting fine-grained security observability data at low overhead, helping application developers trace applications, providing insights for performance troubleshooting, preventive application and container runtime security enforcement, and much more. The possibilities are endless, and the innovation that eBPF is unlocked has only just begun.
In this case I am not trying to build a load balancer; rather, I am interested in using eBPF to attach custom debug output to a kernel function: dev_set_mac_address(), I am looking at you. How would you go about doing that? There are multiple eBPF "frontends" available for this, offering different levels of abstraction. Brendan Gregg's Linux Extended BPF (eBPF) Tracing Tools article has a great overview of the alternatives. In my case I opted for bpftrace:
# apt install bpftrace
This installs the bpftrace binary as well as a collection of pre-written bpftrace scripts for investigating different aspects of what the machine is up to. Want to see all open() calls run by anyone on the machine? There's an app... err... script for that! One problem though:
# opensnoop.bt
/bpftrace/include/clang_workarounds.h:14:10: fatal error: 'linux/types.h' file not found
You need to install kernel headers for these tools to work, so let's try that again:
# apt install linux-headers-$(uname -r)
# opensnoop.bt
Attaching 6 probes...
Tracing open syscalls... Hit Ctrl-C to end.
PID COMM FD ERR PATH
... and now try opening a file somewhere, maybe cat /etc/passwd:
[...]
5492 cat 3 0 /etc/ld.so.cache
5492 cat 3 0 /lib/x86_64-linux-gnu/libc.so.6
5492 cat 3 0 /usr/lib/locale/locale-archive
5492 cat 3 0 /usr/share/locale/locale.alias
5492 cat -1 2 /usr/lib/locale/UTF-8/LC_CTYPE
5492 cat 3 0 /etc/passwd
In this case my idea was to use the bpftrace one-liner mode, where you can run code based on specific things you are looking for. I was interested in whether anything was calling dev_set_mac_address() and, if possible, in getting some hints about how we got there. For more details see the bpftrace Reference Guide.
Let's see if there is any probe we can use that fits this need:
# bpftrace -lv '*dev_set_mac_address*'
BTF: using data from /sys/kernel/btf/vmlinux
kprobe:dev_set_mac_address
kprobe:dev_set_mac_address_user
kfunc:dev_set_mac_address_user
struct net_device * dev;
struct sockaddr * sa;
struct netlink_ext_ack * extack;
int retval;
kfunc:dev_set_mac_address
struct net_device * dev;
struct sockaddr * sa;
struct netlink_ext_ack * extack;
int retval;
The way I understand it, kprobe is the traditional way of instrumenting kernel functions, and kfunc is a more efficient way of doing the same thing (if it is available to you). In my case I could see it was not available; here I am trying a basic print call any time the function is called:
# bpftrace -e 'kfunc:dev_set_mac_address { printf("test\n"); }'
stdin:1:1-26: ERROR: kfunc/kretfunc not available for your kernel version.
kfunc:dev_set_mac_address { printf("test\n"); }
~~~~~~~~~~~~~~~~~~~~~~~~~
So, going back to kprobe, it works better:
# bpftrace -e 'kprobe:dev_set_mac_address { printf("%s was called\n", func); }'
Attaching 1 probe...
Now, when creating a bond0 interface in another window via ip link add dev bond0 type bond, the message dev_set_mac_address was called was printed. This proves that something is calling dev_set_mac_address(), but it would be nice to get some more context. One such tool is the kstack builtin, so after restarting bpftrace with such a print and recreating bond0, the following was printed:
# bpftrace -e 'kprobe:dev_set_mac_address { printf("%s\n", kstack); }'
Attaching 1 probe...
dev_set_mac_address+1
dev_set_mac_address_user+45
do_setlink+1716
rtnl_setlink+229
rtnetlink_rcv_msg+300
netlink_rcv_skb+80
netlink_unicast+513
netlink_sendmsg+579
sock_sendmsg+94
__sys_sendto+238
__x64_sys_sendto+37
do_syscall_64+51
entry_SYSCALL_64_after_hwframe+68
OK, now we are getting somewhere. It looks like whatever is setting the MAC address for us is doing so via the netlink interface, specifically rtnetlink:
Rtnetlink allows the kernel's routing tables to be read and altered. It is used within the kernel to communicate between various subsystems, though this usage is not documented here, and for communication with user-space programs. Network routes, IP addresses, link parameters, neighbor setups, queueing disciplines, traffic classes and packet classifiers may all be controlled through NETLINK_ROUTE sockets. It is based on netlink messages; see netlink(7) for more information.
Inspecting rtnetlink signaling
I spent some time trying to figure out if there was a good way to monitor the rtnetlink messages, ideally one able to tell me the source of each message. I found the nice Debugging netlink requests post by Julia Evans, which enumerates multiple options for investigating them. I was however not able to figure out the source of the messages this way. Learning about the monitor mode for ip was pretty cool though, showing the following when adding bond0:
# ip monitor
inet bond0 forwarding off rp_filter off mc_forwarding off proxy_neigh off ignore_routes_with_linkdown off
inet6 bond0 forwarding off mc_forwarding off proxy_neigh off ignore_routes_with_linkdown off
19: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default
link/ether e6:f3:6a:ad:a8:5a brd ff:ff:ff:ff:ff:ff
19: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default
link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff
This made it clear that the bond0 interface actually was created with a random MAC in bond_init(), as my previous printf-debugging showed, but that a follow-up modification then overwrote it with that 42:e2:89:5b:6d:de address.
A quick aside
I wish I could tell you that this step-by-step sleuthing deterministically brought me forward and eventually gave me the obvious solution. But at this point I was many hours deep into this rabbit hole: wrapping my head around the bonding driver, having built multiple custom kernels, learning about the Dynamic debug machinery, trying to figure out the operation of rtnetlink messaging and trying to make bpftrace grant me the X-ray vision I was sorely needing. Eventually I got tired and leaned back for a while. This is when I casually pondered "what else might have ideas about interfaces", and it dawned on me: udev.
Can the real dev_set_mac_address() caller please stand up?
Let's quote us some Wikipedia, namely the udev article:
udev is a generic device manager running as a daemon on a Linux system and listening (via a netlink socket) to uevents the kernel sends out if a new device is initialized or a device is removed from the system. The udev package comes with an extensive set of rules that match against exported values of the event and properties of the discovered device. A matching rule will possibly name and create a device node and run configured programs to set up and configure the device.
While my interaction with udev has generally been sparse over the years, it did feel like a good suspect. First of all, is it running?
# ps auxww | grep udev
root 261 0.0 0.0 21668 5404 ? Ss 14:55 0:00 /lib/systemd/systemd-udevd
Hello there, systemd-udevd. It was at this point that I crossed my fingers and reached for my old companion strace:
# strace -f -p 261
[...]
6240 sendto(15, {{len=44, type=RTM_SETLINK, flags=NLM_F_REQUEST|NLM_F_ACK, seq=2, pid=0}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=if_nametoindex("bond0"), ifi_flags=0, ifi_change=0}, {{nla_len=10, nla_type=IFLA_ADDRESS}, 42:e2:89:5b:6d:de}}, 44, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 16 <unfinished ...>
[...]
There it was: an rtnetlink RTM_SETLINK message containing that all too familiar 42:e2:89:5b:6d:de address. It turns out systemd-udevd generates a "persistent" address when it sees the creation of bond0. Why was this not a problem on the Buster machine? Well, a quick look with journalctl on that machine revealed the following:
Sep 25 04:24:33 hostname systemd-udevd[27065]: Could not generate persistent MAC address for bond0: No such file or directory
This error is discussed in this ticket in the systemd GitHub repo. Probably the fixes have eventually landed in Bullseye, and here we are. This seems likely, as the logs on Buster include this information:
Sep 25 04:25:54 hostname systemd-udevd[27123]: Using default interface naming scheme 'v240'.
... while the Bullseye logs look like this:
Sep 24 14:18:00 hostname systemd-udevd[4156]: Using default interface naming scheme 'v247'.
... and based on the naming scheme history available at systemd.net-naming-scheme it seems like this was introduced in v241:
v241
MACAddressPolicy=persistent was extended to set MAC addresses based on the device name. Previously addresses were only based on the ID_NET_NAME_* attributes, which meant that interface names would never be generated for virtual devices. Now a persistent address will be generated for most devices, including in particular bridges.
Note: when userspace does not set a MAC address for a bridge device, the kernel will initially assign a random address, and then change it when the first device is enslaved to the bridge. With this naming policy change, bridges get a persistent MAC address based on the bridge name instead of the first enslaved device.
Funnily enough, one of the comments from the GitHub ticket has this to say:
For those who are having problem with this, you should be able work around it by copying /usr/lib/systemd/network/99-default.link to /etc/systemd/network/99-default.link and replace MACAddressPolicy=persistent with MACAddressPolicy=none in the latter, which should prevent udev from doing anything relevant. [...]
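In practice that workaround amounts to something like the following (a sketch; apart from the changed MACAddressPolicy line it is meant to mirror the 99-default.link shipped with systemd v247):
# /etc/systemd/network/99-default.link
[Match]
OriginalName=*
[Link]
NamePolicy=keep kernel database onboard slot path
MACAddressPolicy=none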
This is when I let out a sigh of relief: I had found the culprit and the world made sense again. The only thing that bothered me was the way I had actually discovered the source of the behaviour. I basically made a leap of faith into udev and got lucky. That didn't feel right.
You promised me X-ray vision!
I figured there must be a way to have bpftrace tell me what was going on, and while writing this post I realized I could just do this:
# bpftrace -e 'kprobe:dev_set_mac_address { printf("%s: %s\n", comm, kstack); }'
Attaching 1 probe...
systemd-udevd:
dev_set_mac_address+1
dev_set_mac_address_user+45
do_setlink+1716
rtnl_setlink+229
rtnetlink_rcv_msg+300
netlink_rcv_skb+80
netlink_unicast+513
netlink_sendmsg+579
sock_sendmsg+94
__sys_sendto+238
__x64_sys_sendto+37
do_syscall_64+51
entry_SYSCALL_64_after_hwframe+68
Face, meet palm.
As can be seen above, all I would have needed to do was print another of the builtin bpftrace variables: comm. This way the offending party reveals itself without any need for random guessing. I recall that the introductory chapter of the BPF Performance Tools book contains the following quote on the subject of BPF tracing visibility:
It can feel like having X-ray vision: When you need to examine some deep kernel component, device, or application library, you can see into it in a way that no one ever has before - live and in production.
As you can probably tell, my use of the word "X-ray" in this post is fully inspired by that sentence, but more importantly, this is exactly how it feels. The debugging potential of eBPF tools like bpftrace is off the charts. Here I only poked around with the kernel probes; I have not even looked into the probes aimed at userland applications. It is well worth a look by anyone wrangling misbehaving silicon for a living.
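(If you want a taste of the userland side: the uprobe equivalent of the kprobe one-liners above attaches to a function in a binary or library instead. A hypothetical example, printing the name of every process calling getaddrinfo() in glibc:)
# bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo { printf("%s\n", comm); }'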