September 5, 2017

Atomic DNS updates with Infoblox

This post describes how you can achieve atomic DNS updates when using the Infoblox system, something that was not obvious to me when I first started using it.

Some background: a while ago I migrated the host management system at work from a home-grown service called HOSTDB to the appliance-based Infoblox system. Describing the migration in more detail is probably a blog post in itself and might be something to come back to in the future.

In the HOSTDB system mentioned above, changing zone contents followed this workflow:

  1. Update the host content either via a web interface or in plain text files handled with CVS.
  2. Issue a reload of the given zone or network.
  3. The reload would trigger rebuilds of the affected configuration files and reload services as necessary.

The necessity for a reload meant that you could sequentially update multiple entries without this leaking into the published zone before all changes were completed. When issuing a "reload" the complete set of changes would be published simultaneously.

The Infoblox system is a lot more dynamic than that. Adding or removing DNS zones and modifying DHCP information requires a reload to take effect, but basically any change to the content of an existing DNS zone goes live the moment you make it.

This becomes troublesome in some cases, such as when you want to replace one record type with another (going from a standalone A record object to a shared record group supplying a shared A record), or move information between objects (like moving a Host object alias to another Host object).

Trying to assign a zone to a shared record group when there are conflicting objects in the zone results in a "Duplicate object in the database" error. Trying to assign a Host object alias twice leads to an "Operation not possible due to uniqueness constraint" error. This means that you first have to delete the conflicting information before you can assign the new configuration.

Performing a delete operation prior to the addition of course means that for a short duration the data will not exist in the published zone. During that small window unlucky DNS resolvers will cache the NXDOMAIN or equivalent result, disrupting service for an unknown number of internal and external users.

You may argue that you can limit the problem this causes by lowering TTLs, but you have no guarantee that downstream DNS resolvers will honor this. Also, having to tell users "there might be a small chance of downtime which might get cached in servers around the world" is a lot more annoying than being able to state "the change will not be noticed".
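
To put a rough number on how long such a negative answer can linger: according to RFC 2308, a resolver may cache a negative answer for the smaller of the SOA record's own TTL and its MINIMUM field. A minimal sketch using the SOA values visible in the log excerpts further down (TTL 3600, MINIMUM 300); the function name is made up for illustration:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Upper bound (in seconds) on how long a resolver may cache a negative
# answer, following RFC 2308: min(SOA TTL, SOA MINIMUM field).
# negative_cache_ttl() is a hypothetical helper, not part of any module.
sub negative_cache_ttl {
    my ($soa_ttl, $soa_minimum) = @_;
    return min($soa_ttl, $soa_minimum);
}

# SOA from the example zone: "3600 IN SOA ... 124 14400 3600 2419200 300"
my $window = negative_cache_ttl(3600, 300);
print "Negative answers may be cached for up to $window seconds\n";
```

With these values a single unlucky lookup during the gap can stick around for up to five minutes, and the TTL of the record itself does not enter into the calculation at all.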

The question then becomes how to make such changes atomically. When I contacted Infoblox support I was told that RFE-843 (Option to convert an RR in the zone into a shared record) already exists. In the process of talking with me, support also created RFE-5137 (Ability to associate an alias with a new host without disassociating the alias from the previously associated host record).

Turning to IRC where I am lucky to be in contact with an Infoblox employee, I was informed about the "request" object available in the REST API (WAPI). It turns out this does indeed allow atomic changes in the DNS data if you are willing to write some code.

Here I will show you how it looks when working with the request object versus doing separate DELETE and PUT calls. The code is based on our homemade WAPI-wrapping Perl module, SU::API::Infoblox.

The goal of the code is to replace an existing standalone A record object with the assignment of a shared record group which supplies the same (shared) A record.

This first version does the DELETE and PUT calls separately, like you would do if the "request" object did not exist. This causes the record to be missing from the zone for a short duration. The $srgs variable contains the current list of shared record groups that the related zone is assigned to:

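# Add the new shared record group to the zone's current list of groups.
push @{$srgs}, "shared-group-name";
my $put_data = {
    srgs => $srgs
};

# First call: remove the standalone A record. It is now gone from the zone.
my $delete_result = $infoblox->do_request("DELETE", $record_a_ref);

# Second call: assign the shared record group, which brings the record back.
my $put_result = $infoblox->do_request("PUT", $zone_auth_ref, "", $put_data);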

The resulting named(8) logs look like this:

2017-01-16T11:06:29+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applying transaction 3093.
2017-01-16T11:06:29+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applied DELETE for 'testrecord01': 3600 IN A 127.0.0.1 (none).
2017-01-16T11:06:29+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applied ADD for '': 3600 IN SOA ipam.example.com. hostmaster.example.com. 124 14400 3600 2419200 300 (ro ).
2017-01-16T11:06:29+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applied transaction 3093 with SOA serial 124. Zone version is now 124.
2017-01-16T11:06:29+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applying transaction 3094.
2017-01-16T11:06:29+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applied ADD for 'testrecord01': 3600 IN A 127.0.0.1 (ro ).
2017-01-16T11:06:29+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applied ADD for '': 3600 IN SOA ipam.example.com. hostmaster.example.com. 125 14400 3600 2419200 300 (ro ).
2017-01-16T11:06:29+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applied transaction 3094 with SOA serial 125. Zone version is now 125.

Note that the DELETE and ADD database operations are carried out in separate transactions (3093 and 3094).

Doing this with the "request" object looks like this. Note how the DELETE and PUT operations are grouped in a single POST call:

push @{$srgs}, "shared-group-name";

my $request_post_data = [
    {
        method   => "DELETE",
        object   => $record_a_ref,
    },
    {
        method   => "PUT",
        object   => $zone_auth_ref,
        data     => {
            srgs => $srgs
        },
    },
];

my $request_post_result = $infoblox->do_request("POST", "request", "", $request_post_data);
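
For readers not using our module, the sketch below prints roughly what the body of such a POST looks like as JSON, using the core JSON::PP module. It assumes the wrapper does nothing more than JSON-encode the data structure it is given. The object references and group names are placeholders, not real values:

```perl
use strict;
use warnings;
use JSON::PP;

# Placeholder object references; real ones come back from earlier GET calls.
my $record_a_ref  = "record:a/EXAMPLE-REF:testrecord01.example.com/default";
my $zone_auth_ref = "zone_auth/EXAMPLE-REF:example.com/default";
my $srgs          = [ "existing-group", "shared-group-name" ];

# One array element per operation, in the order they should be applied.
my $request_post_data = [
    { method => "DELETE", object => $record_a_ref },
    { method => "PUT", object => $zone_auth_ref, data => { srgs => $srgs } },
];

# canonical() sorts hash keys so the output is stable; pretty() is only
# there to make it readable.
my $json = JSON::PP->new->canonical->pretty;
my $body = $json->encode($request_post_data);
print $body;
```

The point to take away is that the whole list travels in one HTTP request, which is what lets the appliance treat it as a unit.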

And the resulting named(8) logs look like this:

2017-01-16T11:10:33+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applying transaction 3098.
2017-01-16T11:10:33+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applied DELETE for 'testrecord01': 3600 IN A 127.0.0.1 (none).
2017-01-16T11:10:33+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applied ADD for 'testrecord01': 3600 IN A 127.0.0.1 (ro ).
2017-01-16T11:10:33+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applied ADD for '': 3600 IN SOA ipam.example.com. hostmaster.example.com. 128 14400 3600 2419200 300 (ro ).
2017-01-16T11:10:33+01:00 ns1.example.com named[11359]: zone example.com/IN: ZRQ applied transaction 3098 with SOA serial 128. Zone version is now 129.

Note how the DELETE and ADD database operations are now performed inside the same transaction (3098).
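
If you want to check this mechanically rather than by eye, the log lines can be grouped per transaction. A rough sketch follows; the regular expressions are written against the log format shown above only (other versions may well log differently), and the function name is made up:

```perl
use strict;
use warnings;

# Group "ZRQ applied <OP> for '<name>'" lines by the transaction they belong
# to. Returns a hash ref: transaction id => list of "OP name" strings.
# zrq_operations_by_transaction() is a hypothetical helper.
sub zrq_operations_by_transaction {
    my @lines = @_;
    my (%ops, $current);
    for my $line (@lines) {
        if ($line =~ /ZRQ applying transaction (\d+)\./) {
            $current = $1;
            $ops{$current} = [];
        }
        elsif (defined $current
            && $line =~ /ZRQ applied (ADD|DELETE) for '([^']*)'/)
        {
            push @{ $ops{$current} }, "$1 $2";
        }
        elsif ($line =~ /ZRQ applied transaction \d+ /) {
            undef $current;
        }
    }
    return \%ops;
}

my @log = (
    "zone example.com/IN: ZRQ applying transaction 3098.",
    "zone example.com/IN: ZRQ applied DELETE for 'testrecord01': 3600 IN A 127.0.0.1 (none).",
    "zone example.com/IN: ZRQ applied ADD for 'testrecord01': 3600 IN A 127.0.0.1 (ro ).",
    "zone example.com/IN: ZRQ applied transaction 3098 with SOA serial 128. Zone version is now 129.",
);

my $ops = zrq_operations_by_transaction(@log);
for my $txn (sort { $a <=> $b } keys %{$ops}) {
    print "$txn: ", join(", ", @{ $ops->{$txn} }), "\n";
}
```

A DELETE and an ADD for the same name listed under one transaction id is what you are hoping to see; the same name spread over two ids means there was a gap.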

Here is a final snippet that moves an alias from one Host object to another. It groups two modification (PUT) calls, where the alias has been removed from the list in $orig_aliases and added to the list in $dest_aliases:

my $request_post_data = [
    {
        method   => "PUT",
        object   => $orig_ref,
        data     => {
            aliases => $orig_aliases,
        },
    },
    {
        method   => "PUT",
        object   => $dest_ref,
        data     => {
            aliases => $dest_aliases,
        },
    },
];

my $request_post_result = $infoblox->do_request("POST", "request", "", $request_post_data);
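
For completeness, the two alias lists used above could be prepared along these lines before building the request. This is a sketch with made-up host data: in real use the current lists would be fetched from the two Host objects first, and it assumes aliases are stored as fully qualified names:

```perl
use strict;
use warnings;

# Made-up starting state for the two Host objects.
my $alias        = "aliasname.example.com";
my $orig_aliases = [ "aliasname.example.com", "other.example.com" ];
my $dest_aliases = [];

# Drop the alias from the original host's list ...
$orig_aliases = [ grep { $_ ne $alias } @{$orig_aliases} ];

# ... and add it to the destination host's list unless it is already there.
push @{$dest_aliases}, $alias
    unless grep { $_ eq $alias } @{$dest_aliases};

print "orig: @{$orig_aliases}\n";
print "dest: @{$dest_aliases}\n";
```

Both lists are only modified locally here; nothing changes on the appliance until the grouped request is sent.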

Here are the resulting logs:

2017-01-16T12:27:16+01:00 ns2.example.com named[11359]: zone example.com/IN: ZRQ applying transaction 3112.
2017-01-16T12:27:16+01:00 ns2.example.com named[11359]: zone example.com/IN: ZRQ applied DELETE for 'aliasname': 3600 IN CNAME host01.example.com. (ro host rrpos=0 ).
2017-01-16T12:27:16+01:00 ns2.example.com named[11359]: zone example.com/IN: ZRQ applied ADD for 'aliasname': 3600 IN CNAME host02.example.com. (ro host rrpos=0 ).
2017-01-16T12:27:16+01:00 ns2.example.com named[11359]: zone example.com/IN: ZRQ applied ADD for '': 3600 IN SOA ipam.example.com. hostmaster.example.com. 133 14400 3600 2419200 300 (ro ).
2017-01-16T12:27:16+01:00 ns2.example.com named[11359]: zone example.com/IN: ZRQ applied transaction 3112 with SOA serial 133. Zone version is now 134.

Running dig(1) in a while-loop proved that while the non-request way of doing things resulted in some bad lookups between the two operations, the request-style version worked seamlessly, always returning the expected responses.
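
A loop of that kind can be as simple as the following sketch. It is not the exact loop used for the test above: the server name, record name and expected address are borrowed from the log excerpts, the function name is made up, and the lookup is passed in as a code reference so the counting logic can be tried without a name server:

```perl
use strict;
use warnings;

# Call $lookup->() $count times and return how many answers differed from
# $expected. count_bad_lookups() is a hypothetical helper.
sub count_bad_lookups {
    my ($lookup, $expected, $count) = @_;
    my $bad = 0;
    for (1 .. $count) {
        my $answer = $lookup->();
        $answer = "" unless defined $answer;
        chomp $answer;
        $bad++ if $answer ne $expected;
    }
    return $bad;
}

# Real lookup: ask the authoritative server directly, bypassing caches.
my $dig = sub { scalar qx{dig +short \@ns1.example.com testrecord01.example.com A} };

# Stand-in lookup simulating one empty answer in the middle of a change.
my @canned = ("127.0.0.1\n", "", "127.0.0.1\n");
my $fake   = sub { shift @canned };

print count_bad_lookups($fake, "127.0.0.1", 3), " bad lookup(s)\n";
```

Swap $fake for $dig and use a large count while making the change; anything other than zero bad lookups means the gap was visible from the outside.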