September 26, 2021

A Bonding Exercise

A tale of bonding interfaces, Linux kernel debugging, eBPF tools and netlink messages.

How it started

The other day a colleague and I were experimenting with bonding interfaces on a Debian Bullseye machine. At one point we tried restarting the networking service (systemctl restart networking) and noticed that the MAC address used by the bond0 interface was changed.

This felt weird. Why would a plain restart of the networking service not result in the same state as after a machine reboot?

Preparing a machine for testing

To see it in action, all you need to do is install ifenslave and then modify /etc/network/interfaces to include bond settings as described in the Official Debian Configuration Example (with some tweaks to match your available interfaces, of course). In my case it ended up looking like so:

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

source /etc/network/interfaces.d/*

# The loopback network interface
auto lo
iface lo inet loopback

auto enp0s3

# The primary network interface
allow-hotplug enp0s3
iface enp0s3 inet dhcp

auto bond0

iface bond0 inet static
    address 10.31.1.5
    netmask 255.255.255.0
    bond-slaves enp0s8
    bond-mode active-backup
    bond-miimon 100
    bond-downdelay 200
    bond-updelay 200

... where enp0s3 is the interface I am reaching the machine over (so just ignore it below, it is not interesting for our purposes) and enp0s8 is an unused interface that will act as the bonding slave.

Setting the stage

Now, upon restarting the machine you will see something like this:

# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether 08:00:27:19:f3:3e brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.127/24 brd 192.168.1.255 scope global dynamic enp0s3
       valid_lft 43128sec preferred_lft 43128sec
    inet6 fe80::a00:27ff:fe19:f33e/64 scope link
       valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UNKNOWN group default qlen 1000
    link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff permaddr 08:00:27:e2:37:5f
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff
    inet 10.31.1.5/24 brd 10.31.1.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::40e2:89ff:fe5b:6dde/64 scope link
       valid_lft forever preferred_lft forever

The important detail is that bond0 and enp0s8 both have the made-up MAC address 42:e2:89:5b:6d:de. You can also see that enp0s8 has a permaddr 08:00:27:e2:37:5f field, signifying that its MAC address is currently being overwritten by the made-up address on bond0.

Now, watch what happens if we restart the networking:

# systemctl restart networking
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether 08:00:27:19:f3:3e brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.127/24 brd 192.168.1.255 scope global dynamic enp0s3
       valid_lft 43198sec preferred_lft 43198sec
    inet6 fe80::a00:27ff:fe19:f33e/64 scope link
       valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UNKNOWN group default qlen 1000
    link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff
    inet 10.31.1.5/24 brd 10.31.1.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fee2:375f/64 scope link
       valid_lft forever preferred_lft forever

Now the MAC address of bond0 has changed from 42:e2:89:5b:6d:de to 08:00:27:e2:37:5f, and the enp0s8 slave interface no longer has the permaddr line. What seems to happen is that when the machine boots up, bond0 gets a virtual address attached to it, and this overwrites the address on the slave interface; upon restarting the networking service the opposite happens: now the slave address overwrites the bond0 address instead.

Simplifying the reproduction steps

As with any debugging, the fewer moving parts involved, the easier it is to figure out what is going on. Restarting the networking service involves systemctl, the ifupdown tools and so on. What is the smallest set of steps we could use to reproduce this?

Basically it boils down to creating a fresh bond0:

# ip link del dev bond0
# ip link add dev bond0 type bond

... verifying the current MAC addresses of the bond as well as the yet-to-be-enslaved device:

# ip link show dev bond0
5: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff
# ip link show dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
    link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff

... adding the slave to the bond by manually writing to /sys:

# echo +enp0s8 > /sys/class/net/bond0/bonding/slaves

... checking the new states of addresses on both the bond and slave (the bond0 address is used to overwrite the slave MAC):

# ip link show dev bond0
5: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff
# ip link show dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff permaddr 08:00:27:e2:37:5f

... removing the slave again:

# echo -enp0s8 > /sys/class/net/bond0/bonding/slaves

... checking the new state of the devices (bond0 now has a random address, and the slave interface is back to its normal MAC):

# ip link show dev bond0
5: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 4e:25:45:6e:d9:6b brd ff:ff:ff:ff:ff:ff
# ip link show dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
    link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff

... adding the slave back again:

# echo +enp0s8 > /sys/class/net/bond0/bonding/slaves

... checking the addresses of the devices again (now bond0 is using the MAC address from the slave):

# ip link show dev bond0
5: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff
# ip link show dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 08:00:27:e2:37:5f brd ff:ff:ff:ff:ff:ff

It was here that a pattern emerged:

  • When creating a fresh, slave-less bond0 device it would always get the made-up, but constant, 42:e2:89:5b:6d:de address.
  • When adding a slave for the first time the slave would take on this 42:e2:89:5b:6d:de address.
  • When removing the slave for the first time the 42:e2:89:5b:6d:de address on bond0 would be replaced by a random address.
  • When adding the slave again the bond0 interface would take on the address of the slave interface.
  • From this state on, any time the slave is added bond0 uses the slave MAC, and any time the slave is removed bond0 gets a new random address (the sketch below condenses the whole cycle).
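
For convenience, the whole cycle can be condensed into a small script. Treat it as a rough sketch only: it assumes enp0s8 is the spare (and currently down) interface from above, that it runs as root, and the one-second pause simply gives any asynchronous interface setup a chance to finish.

#!/bin/sh
# Sketch: reproduce the MAC address cycle described above.
set -e
mac() { echo "$1: $(cat /sys/class/net/$1/address)"; }

ip link del dev bond0 2>/dev/null || true
ip link add dev bond0 type bond
sleep 1                                             # let any asynchronous setup finish
mac bond0; mac enp0s8                               # fresh bond, slave untouched

echo +enp0s8 > /sys/class/net/bond0/bonding/slaves
mac bond0; mac enp0s8                               # slave takes on the bond0 MAC

echo -enp0s8 > /sys/class/net/bond0/bonding/slaves
mac bond0; mac enp0s8                               # bond0 gets a new random MAC

echo +enp0s8 > /sys/class/net/bond0/bonding/slaves
mac bond0; mac enp0s8                               # now bond0 takes on the slave MAC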

What, then, is the expected behaviour of MAC addresses used by a bonding interface? There is some information at Where does a bonding device get its MAC address from?, namely:

If not explicitly configured (with ifconfig or ip link), the MAC address of the bonding device is taken from its first slave device. This MAC address is then passed to all following slaves and remains persistent (even if the first slave is removed) until the bonding device is brought down or reconfigured.

This information only seems to be a partial match for our experience. When initially creating the bonding interface it is actually the bond0 MAC that wins over the slave interface, and not the other way around. Only after emptying the slave list do we observe the documented behaviour. This made me wonder how things are actually implemented in the bonding driver. Let's take a look!

Some kernel internals

The bonding driver lives in the drivers/net/bonding directory of the kernel source code repo, and while the real upstream repos are available at git.kernel.org I will link to a mirror on GitHub for easy reference.

Now, looking at bond_init() in bond_main.c we can see what happens during initialization of a bonding interface here:

/* Ensure valid dev_addr */
if (is_zero_ether_addr(bond_dev->dev_addr) &&
    bond_dev->addr_assign_type == NET_ADDR_PERM)
	eth_hw_addr_random(bond_dev);

So given that the MAC address is all zeros and the addr_assign_type is NET_ADDR_PERM, we generate and apply a random address. What does NET_ADDR_PERM actually mean? Well, it is defined here:

/* hardware address assignment types */
#define NET_ADDR_PERM		0	/* address is permanent (default) */
#define NET_ADDR_RANDOM		1	/* address is generated randomly */
#define NET_ADDR_STOLEN		2	/* address is stolen from other device */
#define NET_ADDR_SET		3	/* address is set using
					 * dev_set_mac_address() */

So basically when the interface is created it comes up with a zeroed-out MAC address and an assignment type of NET_ADDR_PERM. The bonding driver is then supposed to overwrite this with a random address. The call to eth_hw_addr_random(bond_dev) looks like this:

/**
 * eth_hw_addr_random - Generate software assigned random Ethernet and
 * set device flag
 * @dev: pointer to net_device structure
 *
 * Generate a random Ethernet address (MAC) to be used by a net device
 * and set addr_assign_type so the state can be read by sysfs and be
 * used by userspace.
 */
static inline void eth_hw_addr_random(struct net_device *dev)
{
	dev->addr_assign_type = NET_ADDR_RANDOM;
	eth_random_addr(dev->dev_addr);
}

So at this point the assignment type should be updated from NET_ADDR_PERM to NET_ADDR_RANDOM and the address is set to some random bytes.

Continuing on, once we add our first slave interface this is handled by the bond_enslave() function which does this:

/* If this is the first slave, then we need to set the master's hardware
 * address to be the same as the slave's.
 */
if (!bond_has_slaves(bond) &&
    bond->dev->addr_assign_type == NET_ADDR_RANDOM) {
	res = bond_set_dev_addr(bond->dev, slave_dev);
	if (res)
		goto err_undo_flags;
}

So here we have the code that is responsible for overwriting the MAC address on the bonding interface, given that it currently has no slaves and the state of the address is NET_ADDR_RANDOM. The specific operations in bond_set_dev_addr() look like this:

memcpy(bond_dev->dev_addr, slave_dev->dev_addr, slave_dev->addr_len);
bond_dev->addr_assign_type = NET_ADDR_STOLEN;

So it overwrites the bond dev_addr with the same field from the slave interface, and updates the assignment type to NET_ADDR_STOLEN.

Finally, let's take a quick look at what happens when we remove the slave; this is handled by __bond_release_one() here:

if (!bond_has_slaves(bond)) {
	bond_set_carrier(bond);
	eth_hw_addr_random(bond_dev);
}

If there are no slaves remaining we set the carrier state to down (bond_set_carrier() sees that there are no slaves) and then generate a new random MAC address, updating the assignment type to NET_ADDR_RANDOM again.

So how does that match with what we are actually seeing? Well from bond_init() it looks like we should be getting a new random MAC address any time we create a fresh bond0, but this is not what it looks like on our machine. Instead we are always getting the same 42:e2:89:5b:6d:de address.

I did however think that I had a reasonable idea why removing the slave would "reset" the state of the bond0 interface. The code in __bond_release_one() unconditionally sets a random address with the NET_ADDR_RANDOM assignment type on the bond interface once the last slave is removed. This would then make a follow-up bond_enslave() (when adding the slave back) fulfill its bond->dev->addr_assign_type == NET_ADDR_RANDOM check and make bond0 use the slave MAC.

In a way I think it is a bit strange that the bonding driver honors a configured address when adding the first slave, but does not honor it when removing the slave again. Digging around in the commit logs, it seems the commit that introduced the bond->dev->addr_assign_type == NET_ADDR_RANDOM check intentionally left it this way:

[...] In all cases if the bond device is left without any slaves - its MAC gets reset to a random one as before.

What I did not understand is what the state of bond0 was when newly created. It obviously was not able to pass the bond->dev->addr_assign_type == NET_ADDR_RANDOM check in bond_enslave(), since it opted to retain its own address instead of taking on the slave MAC. My initial goal was to verify the state in bond_init(): was it able to pass the is_zero_ether_addr(bond_dev->dev_addr) && bond_dev->addr_assign_type == NET_ADDR_PERM check, which was a requirement for setting the NET_ADDR_RANDOM state, which in turn is necessary for the "standard" behaviour of immediately using the slave MAC?

Unfortunately I did not think I had the appropriate debugging messages available to verify this, therefore...

printf()-debugging the kernel

What I ideally wanted to do was to verify if the addr_assign_type state of bond0 was being set as expected before and after the bond_init() code shown earlier. If I could only print the contents of addr_assign_type before and after that code...

I have not built a Linux kernel for years, but it turns out Debian has some decent documentation for doing this. I followed the steps laid out in 4.2. Rebuilding official Debian kernel packages, and what followed was basically a loop of me sprinkling netdev_dbg() and slave_dbg() calls around bond_main.c, running bash debian/bin/test-patches ../my-bond-patch.patch, installing the resulting .deb package, rebooting, and seeing what I would learn. The first .deb I created was not accepted by dpkg, but based on the help given to someone else in a mailing list thread, Re: Proper way to apply and test a small kernel patch?, I realized I needed to bump the abiname setting in debian/config/defines before building, and then I was able to reboot into my modified kernel.
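
Condensed, one iteration of that loop looked roughly like the sketch below. It is only a sketch: the source and build-dependency handling follows the Debian handbook chapter mentioned above, my-bond-patch.patch stands in for whatever debug patch is being tested, and the abiname bump in debian/config/defines is done by hand before building.

# Fetch the kernel source package and its build dependencies
apt-get install build-essential fakeroot devscripts
apt-get build-dep linux
apt-get source linux
cd linux-*

# Edit debian/config/defines to bump the abiname, then build the
# patched packages with the helper script
bash debian/bin/test-patches ../my-bond-patch.patch

# Install the resulting kernel image and reboot into it
dpkg -i ../linux-image-*.deb
reboot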

I soon noticed that my debug printing was not turning up anywhere... what now? Well, it turns out the debug messages emitted by netdev_dbg() and slave_dbg() are dynamically managed by something called Dynamic debug (dyndbg), controlled from a file called /sys/kernel/debug/dynamic_debug/control. Interesting! Opening the file, I found lines such as the following (the bonding-related lines only appeared after the bonding kernel module was loaded):

# filename:lineno [module]function flags format
drivers/net/bonding/bond_main.c:5190 [bonding]bond_init =_ "Begin bond_init\012"
drivers/net/bonding/bond_main.c:5063 [bonding]bond_check_params =_ "Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details\012"
drivers/net/bonding/bond_main.c:4374 [bonding]bond_update_slave_arr =_ "(slave %s): Adding slave to tx hash array[%d]\012"
drivers/net/bonding/bond_main.c:4350 [bonding]bond_update_slave_arr =_ "bond_3ad_get_active_agg_info failed\012"
drivers/net/bonding/bond_main.c:4083 [bonding]bond_set_mac_address =_ "(slave %s): %s: unwind err %d\012"
drivers/net/bonding/bond_main.c:4059 [bonding]bond_set_mac_address =_ "(slave %s): %s: err %d\012"
drivers/net/bonding/bond_main.c:4049 [bonding]bond_set_mac_address =_ "(slave %s): %s: slave=%p\012"
drivers/net/bonding/bond_main.c:4036 [bonding]bond_set_mac_address =_ "%s: bond=%p\012"

The =_ means that the specific log message defined on that line is currently disabled (it has no flags), and =p means the p flag ("enables the pr_debug() callsite") is set and the message will be printed. You control the flags set per message by writing strings to the file. While you are able to control things down to the specific line number you want to enable, I just went with enabling all logs related to bond_main.c like so:

echo 'file bond_main.c +p' > /sys/kernel/debug/dynamic_debug/control

Resulting in lines like this:

# filename:lineno [module]function flags format
drivers/net/bonding/bond_main.c:5190 [bonding]bond_init =p "Begin bond_init\012"
drivers/net/bonding/bond_main.c:5063 [bonding]bond_check_params =p "Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details\012"
drivers/net/bonding/bond_main.c:4374 [bonding]bond_update_slave_arr =p "(slave %s): Adding slave to tx hash array[%d]\012"
drivers/net/bonding/bond_main.c:4350 [bonding]bond_update_slave_arr =p "bond_3ad_get_active_agg_info failed\012"
drivers/net/bonding/bond_main.c:4083 [bonding]bond_set_mac_address =p "(slave %s): %s: unwind err %d\012"
drivers/net/bonding/bond_main.c:4059 [bonding]bond_set_mac_address =p "(slave %s): %s: err %d\012"
drivers/net/bonding/bond_main.c:4049 [bonding]bond_set_mac_address =p "(slave %s): %s: slave=%p\012"
drivers/net/bonding/bond_main.c:4036 [bonding]bond_set_mac_address =p "%s: bond=%p\012"
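
You can also be more surgical than enabling a whole file: the control file accepts queries keyed on function (as well as module or line numbers). A small sketch, using one of the functions listed above:

# Enable only the messages emitted from bond_set_mac_address()
echo 'func bond_set_mac_address +p' > /sys/kernel/debug/dynamic_debug/control

# ... and disable them again when done
echo 'func bond_set_mac_address -p' > /sys/kernel/debug/dynamic_debug/control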

It was pretty cool to open that file after booting my modified kernel and instantly seeing references to my own hastily added debug calls :).

So it wasn't the bonding driver after all

With my added debug logging now showing up in dmesg I pretty soon learned the following:

  • The bond_init() code actually did manage to update the NET_ADDR_PERM to NET_ADDR_RANDOM on the newly created bond0.
  • However, when the bond_enslave() code was checking for NET_ADDR_RANDOM being set to decide if it should use the slave MAC address, it was actually set to... NET_ADDR_SET.

OK, so we have seen how NET_ADDR_PERM is the default value when creating the fresh bond0 interface, that bond_init() updates it to NET_ADDR_RANDOM, and that bond_enslave() (if it sees that bond0 is currently in that NET_ADDR_RANDOM state) will apply the slave MAC address and mark bond0 as NET_ADDR_STOLEN. What, then, is this NET_ADDR_SET doing there? As we could see in the definitions shown earlier, NET_ADDR_SET is the state when the "address is set using dev_set_mac_address()". Basically, dev_set_mac_address() is what gets called when someone (or something) manually configures a MAC address on a device.

It was here somewhere that I noticed that you can actually inspect the current addr_assign_type of an interface by checking the integer in /sys/class/net/bond0/addr_assign_type, so after creating the bond0 interface:

# cat /sys/class/net/bond0/addr_assign_type
3

Indeed it is set to 3 (NET_ADDR_SET). And after adding and removing the slave interface it had then been modified by the __bond_release_one() function to be NET_ADDR_RANDOM (1) again:

# echo +enp0s8 > /sys/class/net/bond0/bonding/slaves
# echo -enp0s8 > /sys/class/net/bond0/bonding/slaves
# cat /sys/class/net/bond0/addr_assign_type
1
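
Since the raw integers are easy to forget, here is a tiny helper that translates them using the definitions quoted earlier. It is just a convenience sketch; the interface name is passed as an argument:

#!/bin/sh
# Print the addr_assign_type of an interface as a symbolic name.
iface="${1:?usage: $0 <interface>}"
case "$(cat /sys/class/net/$iface/addr_assign_type)" in
    0) echo NET_ADDR_PERM ;;
    1) echo NET_ADDR_RANDOM ;;
    2) echo NET_ADDR_STOLEN ;;
    3) echo NET_ADDR_SET ;;
    *) echo unknown ;;
esac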

Here I quickly jumped over to another machine running Debian Buster, and I noticed that when creating a bond0 interface via ip link add dev bond0 type bond it would actually end up with a state of NET_ADDR_RANDOM. Huh. So for some reason the older release brings up bond0 as expected, but on Bullseye it instead ends up with NET_ADDR_SET. Interesting.

So to reiterate: we now know that for some reason bond0 ends up in the NET_ADDR_SET state on creation, and that this is a result of something calling dev_set_mac_address(). I had already gone through and added printf-debugging to all places in bond_main.c where dev_set_mac_address() could be called, and none of them fired on creation. For this reason I no longer thought it was the bond driver that was the culprit. But adding debug logging to all places in the kernel calling dev_set_mac_address()? That didn't seem reasonable. If only there was a way to instrument the kernel to tell me when the function was called and ideally show a backtrace of how we ended up there... oh wait, there is!

Activate X-ray vision

There is a lot of buzz surrounding eBPF these days, it is used in all manner of places, quoting from https://ebpf.io:

Today, eBPF is used extensively to drive a wide variety of use cases: Providing high-performance networking and load-balancing in modern data centers and cloud native environments, extracting fine-grained security observability data at low overhead, helping application developers trace applications, providing insights for performance troubleshooting, preventive application and container runtime security enforcement, and much more. The possibilities are endless, and the innovation that eBPF unlocks has only just begun.

In this case I am not trying to build a load balancer; rather, I am interested in using eBPF to attach custom debug output to a kernel function: dev_set_mac_address(), I am looking at you. How would you go about doing that? There are multiple eBPF "frontends" available for this, giving different levels of abstraction. Brendan Gregg's Linux Extended BPF (eBPF) Tracing Tools article has a great overview of these alternatives. In my case I opted for bpftrace:

# apt install bpftrace

This installs the bpftrace binary as well as a collection of pre-written bpftrace scripts for investigating different aspects of what the machine is up to. Want to see all open() calls run by anyone on the machine? There's an app... err... script for that! One problem though:

# opensnoop.bt
/bpftrace/include/clang_workarounds.h:14:10: fatal error: 'linux/types.h' file not found

You need to install kernel headers for these tools to work, so let's try that again:

# apt install linux-headers-$(uname -r)
# opensnoop.bt
Attaching 6 probes...
Tracing open syscalls... Hit Ctrl-C to end.
PID    COMM               FD ERR PATH

... and now try opening a file somewhere, maybe cat /etc/passwd:

[...]
5492   cat                 3   0 /etc/ld.so.cache
5492   cat                 3   0 /lib/x86_64-linux-gnu/libc.so.6
5492   cat                 3   0 /usr/lib/locale/locale-archive
5492   cat                 3   0 /usr/share/locale/locale.alias
5492   cat                -1   2 /usr/lib/locale/UTF-8/LC_CTYPE
5492   cat                 3   0 /etc/passwd

My idea was to use the bpftrace one-liner mode, where you can run code triggered by specific events you are looking for. I was interested in whether anything was calling dev_set_mac_address() and, if possible, getting some hints about how we got there. For more details see the bpftrace Reference Guide.

Let's see if there is any probe we can use that fits this need:

# bpftrace -lv '*dev_set_mac_address*'
BTF: using data from /sys/kernel/btf/vmlinux
kprobe:dev_set_mac_address
kprobe:dev_set_mac_address_user
kfunc:dev_set_mac_address_user
    struct net_device * dev;
    struct sockaddr * sa;
    struct netlink_ext_ack * extack;
    int retval;
kfunc:dev_set_mac_address
    struct net_device * dev;
    struct sockaddr * sa;
    struct netlink_ext_ack * extack;
    int retval;

The way I understand it, kprobe is the traditional way of instrumenting kernel functions, and kfunc is a more efficient way of doing the same thing (if it is available to you). In my case it turned out not to be available; here I am trying a basic print any time the function is called:

# bpftrace -e 'kfunc:dev_set_mac_address { printf("test\n"); }'
stdin:1:1-26: ERROR: kfunc/kretfunc not available for your kernel version.
kfunc:dev_set_mac_address { printf("test\n"); }
~~~~~~~~~~~~~~~~~~~~~~~~~

So, going back to kprobe, it works better:

# bpftrace -e 'kprobe:dev_set_mac_address { printf("%s was called\n", func); }'
Attaching 1 probe...

Now, when creating a bond0 interface in another window via ip link add dev bond0 type bond, the message dev_set_mac_address was called was printed. This proves that something is calling dev_set_mac_address(), but it would be nice to get some more context. One such tool is the kstack builtin, so after restarting bpftrace with such a print and recreating bond0, the following was printed:

# bpftrace -e 'kprobe:dev_set_mac_address { printf("%s\n", kstack); }'
Attaching 1 probe...

        dev_set_mac_address+1
        dev_set_mac_address_user+45
        do_setlink+1716
        rtnl_setlink+229
        rtnetlink_rcv_msg+300
        netlink_rcv_skb+80
        netlink_unicast+513
        netlink_sendmsg+579
        sock_sendmsg+94
        __sys_sendto+238
        __x64_sys_sendto+37
        do_syscall_64+51
        entry_SYSCALL_64_after_hwframe+68

OK, now we are getting somewhere. It looks like whatever is setting the MAC address for us is doing so via the netlink interface, specifically rtnetlink:

Rtnetlink allows the kernel's routing tables to be read and altered. It is used within the kernel to communicate between various subsystems, though this usage is not documented here, and for communication with user-space programs. Network routes, IP addresses, link parameters, neighbor setups, queueing disciplines, traffic classes and packet classifiers may all be controlled through NETLINK_ROUTE sockets. It is based on netlink messages; see netlink(7) for more information.

Inspecting rtnetlink signaling

I spent some time trying to figure out if there was a good way to monitor the rtnetlink messages, ideally one that could tell me the source of each message. I found the nice Debugging netlink requests post by Julia Evans, which enumerates multiple options for investigating them. I was, however, not able to figure out the source of the messages this way. Learning about the monitor mode of ip was pretty cool though, showing the following when adding bond0:

# ip monitor
inet bond0 forwarding off rp_filter off mc_forwarding off proxy_neigh off ignore_routes_with_linkdown off
inet6 bond0 forwarding off mc_forwarding off proxy_neigh off ignore_routes_with_linkdown off
19: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default
    link/ether e6:f3:6a:ad:a8:5a brd ff:ff:ff:ff:ff:ff
19: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default
    link/ether 42:e2:89:5b:6d:de brd ff:ff:ff:ff:ff:ff

This made it clear that the bond0 interface actually was created with a random MAC in bond_init(), just as my earlier printf-debugging showed, but that a follow-up modification then overwrites it with that 42:e2:89:5b:6d:de address.
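
If you only care about one class of events, ip monitor also accepts an object filter; something like the following should restrict the output to link changes:

# ip monitor link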

A quick aside

I wish I could tell you that this step-by-step sleuthing deterministically brought me forward and eventually gave me the obvious solution. But at this point I was many hours deep into this rabbit hole: wrapping my head around the bonding driver, having built multiple custom kernels, learning about the Dynamic debug stuff, trying to figure out the operation of rtnetlink messaging and trying to make bpftrace grant me the X-ray vision I sorely needed. I was tired and leaned back for a while. This is when I casually pondered "what else might have ideas about interfaces", and it dawned on me: udev.

Can the real dev_set_mac_address() caller please stand up?

Let's quote us some Wikipedia, namely the udev article:

udev is a generic device manager running as a daemon on a Linux system and listening (via a netlink socket) to uevents the kernel sends out if a new device is initialized or a device is removed from the system. The udev package comes with an extensive set of rules that match against exported values of the event and properties of the discovered device. A matching rule will possibly name and create a device node and run configured programs to set up and configure the device.

While my interactions with udev have generally been sparse over the years, it did feel like a good suspect. First of all, is it running?

# ps auxww | grep udev
root         261  0.0  0.0  21668  5404 ?        Ss   14:55   0:00 /lib/systemd/systemd-udevd

Hello there, systemd-udevd. It was at this point that I crossed my fingers and reached for my old companion strace:

# strace -f -p 261
[...]
6240  sendto(15, {{len=44, type=RTM_SETLINK, flags=NLM_F_REQUEST|NLM_F_ACK, seq=2, pid=0}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=if_nametoindex("bond0"), ifi_flags=0, ifi_change=0}, {{nla_len=10, nla_type=IFLA_ADDRESS}, 42:e2:89:5b:6d:de}}, 44, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 16 <unfinished ...>
[...]

There it was: an rtnetlink RTM_SETLINK message containing that all too familiar 42:e2:89:5b:6d:de address. It turns out it is systemd-udevd generating a "persistent" address when it sees the creation of bond0. Why was this not a problem on the Buster machine? Well, a quick look with journalctl on that machine revealed the following:

Sep 25 04:24:33 hostname systemd-udevd[27065]: Could not generate persistent MAC address for bond0: No such file or directory

This error is discussed in this ticket in the systemd GitHub repo. Probably the fixes have eventually landed in Bullseye, and here we are. This seems likely, as the logs on Buster include this information:

Sep 25 04:25:54 hostname systemd-udevd[27123]: Using default interface naming scheme 'v240'.

... while the Bullseye logs look like this:

Sep 24 14:18:00 hostname systemd-udevd[4156]: Using default interface naming scheme 'v247'.

... and based on the naming scheme history available at systemd.net-naming-scheme it seems like this was introduced in v241:

v241

MACAddressPolicy=persistent was extended to set MAC addresses based on the device name. Previously addresses were only based on the ID_NET_NAME_* attributes, which meant that interface names would never be generated for virtual devices. Now a persistent address will be generated for most devices, including in particular bridges.

Note: when userspace does not set a MAC address for a bridge device, the kernel will initially assign a random address, and then change it when the first device is enslaved to the bridge. With this naming policy change, bridges get a persistent MAC address based on the bridge name instead of the first enslaved device.

Funnily enough, one of the comments in the GitHub ticket has this to say:

For those who are having problem with this, you should be able work around it by copying /usr/lib/systemd/network/99-default.link to /etc/systemd/network/99-default.link and replace MACAddressPolicy=persistent with MACAddressPolicy=none in the latter, which should prevent udev from doing anything relevant. [...]
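
Spelled out, that workaround would look something like the sketch below. It follows the comment verbatim, so keep in mind it turns off the persistent MAC address policy for every link matched by that file, not just bonds; the udevadm call at the end is just one way to check which .link file udev would apply to the device:

cp /usr/lib/systemd/network/99-default.link /etc/systemd/network/99-default.link
sed -i 's/^MACAddressPolicy=persistent/MACAddressPolicy=none/' /etc/systemd/network/99-default.link

# Verify which .link file (and thereby which policy) applies to bond0
udevadm test-builtin net_setup_link /sys/class/net/bond0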

This is when I let out a sigh of relief: I had found the culprit and the world made sense again. The only thing that bothered me was the way I actually discovered the source of the behaviour. I basically made a leap of faith into udev and got lucky. That didn't feel right.

You promised me X-ray vision!

I figured there must be a way to have bpftrace tell me what was going on, and while writing this post I realized I could just do this:

# bpftrace -e 'kprobe:dev_set_mac_address { printf("%s: %s\n", comm, kstack); }'
Attaching 1 probe...
systemd-udevd:
        dev_set_mac_address+1
        dev_set_mac_address_user+45
        do_setlink+1716
        rtnl_setlink+229
        rtnetlink_rcv_msg+300
        netlink_rcv_skb+80
        netlink_unicast+513
        netlink_sendmsg+579
        sock_sendmsg+94
        __sys_sendto+238
        __x64_sys_sendto+37
        do_syscall_64+51
        entry_SYSCALL_64_after_hwframe+68

Face: meet palm

As can be seen above, all I would have needed to do was print another of the builtin bpftrace variables, comm. This way the offending party reveals itself without any need for random guessing. I recall that the introductory chapter of the BPF Performance Tools book contains the following quote on the subject of BPF tracing visibility:

It can feel like having X-ray vision: When you need to examine some deep kernel component, device, or application library, you can see into it in a way that no one ever has before - live and in production.

As you can probably tell, my use of the word "X-ray" in this post is fully inspired by that sentence, but more importantly, this is exactly how it feels. The debugging potential of eBPF tools like bpftrace is off the charts. Here I only poked around with the kernel probes; I have not even looked into probes aimed at userland applications. It is well worth a look for anyone wrangling misbehaving silicon for a living.