A number of years ago, my employer began replacing its various phone systems and Centrex solutions with a single PBX to provide uniform services at all locations. Because our IP infrastructure was… well, “old as dirt” comes to mind, we decided to replace nearly all of that as part of the same project, since funding was more readily available that way. At the time, I was the only person on our IT staff who had any experience at all with VOIP, and I was also the only one who understood DHCP and DNS and how they interact. As a result, I informally became the project lead on the customer side; I made the vast majority of technical decisions, did a significant amount of the customer-side work myself, and closely monitored the work that our vendors were doing.

My coworkers didn’t understand the benefits of DHCP, very deeply mistrusted it, and were opposed to its deployment. Having previously worked in environments where DHCP and DNS “just worked” (whether implemented by me or by others), I couldn’t understand why they held that opinion; despite asking repeatedly, nobody could ever provide an explanation that I was able to understand. (In the years since, I’ve realized that I couldn’t understand it because I was depending upon their ability to understand and properly use DNS, and that ability wasn’t there either. They still, to this day, for reasons that I will never be able to comprehend, access as much as possible via IP address rather than FQDN.)

Because I had only worked for that organization for about a year, I hesitated to steam-roll over my coworkers’ opinions regarding DHCP and DNS, but I also couldn’t fathom any alternative. At one point, I made a post on ServerFault: DHCP and DNS services configuration for VOIP system, windows domain, etc. I was extremely careful in my phrasing, because I already knew what I wanted to do, but wanted someone I didn’t know to validate my plan, in a way that I could at least print out and hand to my boss for justification in the event he fired me for ignoring literally everyone else’s opinion (something I’ve had to prepare for time and time again). As you can see at that URL, Evan Anderson responded almost immediately, and while it wasn’t the /exact/ answer I had in mind, it was close enough that I was comfortable proceeding with my plan.

The Avaya Aura solution that we were deploying was designed such that redundant equipment in the primary datacenter would control/coordinate all voice traffic for the organization; if that cluster (or the entire site) failed, all phones and other voice equipment would re-register with and be controlled by equipment located at a second site. If both sites, or the necessary equipment in them, were unavailable (say, in the case of a fibre-seeking backhoe removing WAN service from a remote site), certain sites were deemed “critical” and were given equipment such that they could effectively stand alone and continue operating without the WAN, for some not-short period of time. By “not-short”, I had been told that the services could theoretically continue to operate indefinitely; in practice, those survivable sites will reboot themselves due to a license violation (because they can’t speak to the license holder at the primary site) after some number of days, but will then continue to operate once they’re done rebooting. Some sites were given equipment such that, if they enter “survivable mode”, they are only able to dial amongst themselves (internally) and to 911 (emergency services in the USA); others were additionally given the ability (via ISDN PRI) to continue connecting to the PSTN (meaning that they could effectively continue operating normally, except for voicemail).

Logically extending from that, one can see that for those sites where survivable equipment was determined to be necessary, IP services in general (switching, local routing, DHCP, DNS, and anything else vital to the operation of the local network) must also continue to be available in the event of a WAN outage. This precludes the use of DHCP ip-helpers and anything else that must cross the WAN. I was surprised to learn that DNS was, in fact, not required for the operation of the Avaya Aura equipment (at that time; this has since changed), so I didn’t need to plan for DNS to work in sites that were survivable only for emergency services. Avaya Aura phones require a web server on which they store their station configuration and preferences, but I was told that there is no user impact if that server becomes unavailable, so I didn’t need to plan for it to be available in any outage scenario.

So here’s what I did:

The web server that the phones use to store their configuration and preferences, and to retrieve firmware updates, lived only at the primary datacenter. If everything is up and working, the phones can back up to or restore from that server whenever they deem necessary; if it’s not available, they will continue running their last-known-good configuration indefinitely. In practice, the only problem with this is that the phones display a “Backup failed” message for a second or so whenever they attempt to perform a backup and are unable to do so. There is therefore little to no user impact when this service is unavailable.

DNS was handled by Windows DNS servers, using only ADDS-integrated zones. For those sites that required survivability only for emergency services, no ADDS/DNS server was present, and DNS would fail in those locations during an outage. This had no impact on voice service, but of course prevented anyone from doing anything useful with their computers. For those sites requiring long-term survivability with minimal restrictions (including the primary and secondary datacenters), ADDS/DNS servers were already present.

DHCP was somewhat more complex. I determined that it would be handled by ISC’s DHCP server at all sites. This solution permitted me to use failover configuration between servers where I chose to, and it’s FOSS, so there were no licensing fees. For hardware to run these, I got a really, really good deal on Dell PowerEdge T105 servers - very much below list price. I picked CentOS 5 as the preferred flavor of Linux. After one anaconda/kickstart config, a few bash scripts, and a weekend of physically shuffling servers around, we had identically configured DHCP servers at all sites (yes, all – I couldn’t get the concept of survivability across to a few people, so we wound up with non-survivable sites still having their own DHCP server).
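For those who haven’t used it, a failover pair in ISC dhcpd looks roughly like the sketch below; the peer name, addresses, and subnet are made up for illustration and aren’t copied from my configuration. The primary side carries the mclt and split settings, and each shared pool simply references the peer:

    # Hypothetical dhcpd.conf failover stanza (primary side); names, addresses,
    # and ranges are illustrative only.
    failover peer "dhcp-failover" {
        primary;
        address 10.0.10.5;          # this server
        port 647;
        peer address 10.0.20.5;     # the partner server
        peer port 647;
        max-response-delay 60;
        max-unacked-updates 10;
        mclt 3600;                  # primary-only settings
        split 128;
        load balance max seconds 3;
    }

    subnet 10.1.0.0 netmask 255.255.255.0 {
        option routers 10.1.0.1;
        pool {
            failover peer "dhcp-failover";
            range 10.1.0.100 10.1.0.200;
        }
    }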

ISC’s DHCP server, as I’m sure none of you will be surprised to learn, does not have any graphical user interface. As with most *nix-like services, all of its configuration is done via text files. That’s perfectly fine with me – I do as much as possible from the command line. Text user interfaces are kind of my thing. My coworkers, however… notsomuch. I very quickly developed a database that held all of the information necessary to create each ISC DHCP server’s configuration, and bash scripts that each server used to get its latest configuration. I threw all of that behind the most rudimentary of web forms, so that it could be updated from within a web browser rather than at the command line (so that I wasn’t the only one in the office who could update it). Because I wanted to eventually open-source that, and because I hadn’t gotten a formal opinion regarding code release for things I developed on the clock, I developed all of it on my own time, at home.
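Conceptually, the per-server pull script was something like the sketch below. The URL, paths, and service name here are hypothetical stand-ins, and the real script did a bit more, but the idea is: fetch the generated config, refuse to install it unless dhcpd can parse it, and only restart when something actually changed.

    #!/bin/bash
    # Hypothetical sketch: pull the freshly generated dhcpd.conf from the
    # central IPAM web app, sanity-check it, and restart dhcpd if it changed.
    set -euo pipefail

    CONFIG_URL="https://ipam.example.net/generate/dhcpd.conf?site=$(hostname -s)"
    NEW_CONF="/tmp/dhcpd.conf.new"
    LIVE_CONF="/etc/dhcpd.conf"       # /etc/dhcp/dhcpd.conf on newer builds

    curl -fsS -o "$NEW_CONF" "$CONFIG_URL"

    # Only install the new file if dhcpd can actually parse it.
    if dhcpd -t -cf "$NEW_CONF"; then
        if ! cmp -s "$NEW_CONF" "$LIVE_CONF"; then
            cp "$LIVE_CONF" "${LIVE_CONF}.bak"
            mv "$NEW_CONF" "$LIVE_CONF"
            service dhcpd restart     # CentOS 5-era init; systemctl on CentOS 7
        fi
    else
        logger -t dhcp-pull "Refused to install invalid dhcpd.conf"
        exit 1
    fi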

I implemented everything I described above, and it worked reliably, behaving and failing in the ways I expected. If I had known then what I know now, I would have pushed very, very strongly to acquire and deploy a COTS IPAM solution – not because what I did failed, but because I invested a lot of personal time in this project, and it wasn’t until late 2014 that we employed anyone else who was ready and willing (without passive-aggressive arguments) to use Linux for anything. But when I was designing this, I was told that we had no budget for it - it took a /lot/ of time and effort on my part just to get the money to buy the PowerEdge T105s (and they were cheap).

In March 2014, Microsoft clarified their requirements for Windows CALs; among other things, the clarification makes clear that any non-web workload (such as DHCP or DNS) requires a CAL. We do not have any way to predict either (a) how many people are using our phone system or (b) how many unique devices are using our guest networks; the result is that we cannot use Windows for non-web workloads exposed to the public (including phones, printers, guest networks, etc.). Windows will continue to provide DNS for anything connected to our private/internal networks, where we are able to determine approximately how many people must be licensed.

Skip from 2009 to mid-2014. The PE T105 servers were about five years old and starting to fail, and Avaya announced that they were discontinuing support for the version of Aura that we were running, because of operating-system security concerns. This led me to re-evaluate all of the above while considering the implementation changes in the latest version of Avaya Aura. The biggest difference is that DNS is now required for it to operate. We were using ADDS for DNS everywhere, but there were no servers present in the locations that were survivable only for emergency services, and we didn’t have a budget for new Windows servers for those locations.

In concert with the Avaya Aura upgrade, I implemented changes to the above design: DNS services would additionally be provided by the servers that are providing DHCP services, and there would be a separate DNS zone for all voice equipment and access. One server at the primary datacenter would be the master, and all of the other sites would get a copy of the zone periodically and serve it. There is no need for dynamic zone updating, so I’m not concerned about that.
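For illustration only – the zone name and addresses below are invented, and I’m showing BIND-style syntax simply because it’s the most familiar – the master/slave arrangement I’m describing amounts to ordinary zone transfers:

    // Master, on the primary-datacenter server (illustrative names/addresses):
    zone "voice.example.net" {
        type master;
        file "/var/named/voice.example.net.zone";
        allow-transfer { 10.0.0.0/8; };   // let the site servers pull the zone
        notify yes;
    };

    // Slave, on each survivable-site server:
    zone "voice.example.net" {
        type slave;
        file "/var/named/slaves/voice.example.net.zone";
        masters { 10.0.10.6; };           // the master at the primary datacenter
    };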

I haven’t made this change yet, but I plan to in the near future: the web server that the phones store their configuration and preferences on, and get their firmware from, should be local to each site. Phones only check for firmware updates on boot, and phones rarely reboot – when they do, it’s generally an entire site at a time (as a result of a long-lasting power failure or an intentional PBX command to reset the entire site). The effect is that this server sees very little traffic (by bandwidth) the vast majority of the time, but when a firmware update is published, it will completely saturate the smallest-throughput link between that server and the site that’s rebooting. During the mid-2014 Aura upgrade, new firmware was published, and it took three sites more than six hours (overnight, so no user impact) to return to service simply because it took that long for all of their phones to download the updated firmware. Making the server local to survivable sites will make this a non-issue. It will also prevent the rare call regarding the “Backup Failed” message, if there’s some problem preventing the centralized web server from responding.
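There’s nothing exotic about that server; on the new CentOS 7 boxes it could be as little as an Apache vhost along these lines (the hostname, path, and subnet are invented for illustration, and the backup/restore side also needs the server to accept the phones’ uploads, which this read-only sketch omits):

    # Hypothetical per-site vhost for phone firmware and settings
    # (Apache 2.4 syntax; name, path, and subnet are illustrative).
    <VirtualHost *:80>
        ServerName phone-files.site01.example.net
        DocumentRoot "/srv/phone-files"

        <Directory "/srv/phone-files">
            Options None
            AllowOverride None
            Require ip 10.1.0.0/24    # the local voice subnet only
        </Directory>
    </VirtualHost>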

By late 2015, more than a third of the PE T105 servers had failed, and we’d been scavenging them for parts to keep as many as possible in service (we bought extras up-front, but enough have failed that we now have “random” other computers serving in this capacity at some sites). I was way beyond tired of this situation, so I went looking for a way to replace all of the servers at once, again with everything being the same model, so that we could just do a forklift upgrade some weekend. Netgate’s RCC-VE 2440 was brought to my attention by a sysadmin friend, and it seems to be an ideal fit. As opposed to the tower form factor of the PE T105, the 2440 is tiny - about the size of a consumer-grade router - and CentOS 7 is supported on it. I was able to get a budget for enough of those to put one in every survivable site. These will run DHCP (for everything), DNS (for voice equipment), and HTTP (for the phone firmware/config/preferences) at each site.

While I have sent the code for the “IPAM system” I developed to a few people who have asked about it, I haven’t yet put it on Github (or any other public site), mostly because I’m ashamed of the exceedingly poor code quality. I didn’t know nearly as much when I originally wrote it as I know now, and haven’t the time to re-write it, since it “just works” (and security considerations are minimal since it’s only accessible to a select few individuals). I may release it in the future… who knows. If you’re interested in seeing it, send me an email, and I’ll send it to you.

Today is January 16, 2016, the Saturday beginning a 3-day weekend (my office is closed this coming Monday for Martin Luther King, Jr. Day). I’m planning to spend part of today doing a trial configuration of a few of the RCC-VE 2440 servers and writing up a document of what I did and how to recreate it. Assuming everything “just works” next week, I’ll give the instructions to a new coworker - one who embraces Linux and the command line - and have him deploy the rest of them in a couple of weeks.