By Mark Bowytz

It was not the best start to a Monday morning. When Chris K. got in, the entire IT department was in full-panic mode because the Linux mail server that he administered was unresponsive.

And to make matters worse, he couldn't even get to the bottom of the issue. Almost every other minute, his phone would ring with someone, somewhere, asking for a status update. When he turned his phone off, people started showing up at his desk. And then there were the status meetings where all he could report was "still broken." Finally, in desperation, Chris reserved an all-day “meeting” in one of the lesser-used conference rooms, undocked his laptop, and ducked for cover.

Once Chris was able to get his hands on the server, the system was exactly as he expected - a complete and utter disaster.

The mail server had a separate 80GB partition just to hold email, and it was completely full. The queue directory contained such a preposterous number of files that shell wildcards such as ? and * could not be expanded. That meant there was no immediate way to list only the Sendmail "q*" files, which would have contained some clue as to what went wrong. Worse still, the postmaster inbox was filled to the 2GB limit, and clearing out the messages was like bailing water with a sieve: each message deleted was replaced with a fresh new 'delivery error' message. The whole mail system had to be shut down to work on the problem.
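
When a directory holds too many entries for a shell wildcard to expand, one workaround is to stream the entries one at a time instead of building a single giant argument list. The following is a minimal sketch in Python, not what Chris actually ran; the function names `iter_queue_files` and `peek` are hypothetical, and the "q" prefix is taken from the "q*" files mentioned above.

```python
import os
from itertools import islice
from typing import Iterator, List


def iter_queue_files(queue_dir: str, prefix: str = "q") -> Iterator[str]:
    """Yield names in queue_dir that start with prefix, one at a time."""
    # os.scandir streams directory entries lazily, so nothing here depends
    # on expanding a wildcard into one enormous argument list.
    with os.scandir(queue_dir) as entries:
        for entry in entries:
            if entry.name.startswith(prefix) and entry.is_file(follow_symlinks=False):
                yield entry.name


def peek(queue_dir: str, limit: int = 20) -> List[str]:
    """Return at most `limit` matching names without walking the whole directory."""
    return list(islice(iter_queue_files(queue_dir), limit))
```

Calling `peek` on the queue directory would return a small sample of matching file names to inspect, which is often enough to spot a pattern before deciding what to delete.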

Getting Down to the Problem

After a few hours in the cramped meeting room, Chris finally worked out what happened. On the prior Friday, a program manager had been trying to send a 76MB file to several people on the west coast via email. This, naturally, caused her to hit the imposed 10MB per message size limit.

She had been explicitly ordered by the director to send the file via email and include read receipts to make sure it had been received. To get around the size limit, she had broken the file up into 11 separate 7MB files using zip spanning, and then sent these to the 16 recipients as requested.

When she called to find out whether the messages had made it, several recipients reported that they hadn't (though they didn't mention it was because they had hit their mailbox quotas), so she resent all 11 emails to those people and went home.

Over the weekend, Sendmail dutifully tried, and retried, to send each message once every hour and, whenever a message bounced, an error message including the entire original message and attachment was sent back to the sender. This repeated until the sender's inbox swelled to 2GB, at which point error messages were sent to the postmaster inbox. When that hit 2GB, more errors went to the root account. When the root account filled up, the error messages for the sender, postmaster, and root all started to clog up the queue, which then filled the entire partition. Finally, the mail server spent all of its CPU time trying to send errors to a partition that was full.
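
To get a feel for how quickly this snowballs, here is a rough back-of-the-envelope sketch in Python. The part size, part count, resend, hourly retry, and 2GB limit come from the story; the number of failing recipients is an assumption (the story only says "several"), and the estimate ignores headers and attachment encoding overhead, which would only make things worse.

```python
import math

PART_MB = 7                    # from the story: each zip span was about 7MB
PARTS = 11                     # from the story: 11 separate emails
COPIES = 2                     # from the story: the original send plus one resend
MAILBOX_LIMIT_MB = 2 * 1024    # from the story: the 2GB mailbox limit
FAILING_RECIPIENTS = 4         # ASSUMPTION: the story only says "several"


def bounces_to_fill(limit_mb: int, bounce_mb: int = PART_MB) -> int:
    """How many full-size bounce messages fit before a mailbox hits its limit."""
    return limit_mb // bounce_mb


def hours_to_fill(limit_mb: int, failing: int = FAILING_RECIPIENTS) -> int:
    """Hours of hourly retries needed to fill one mailbox with bounces."""
    # One bounce per queued part, per copy, per failing recipient, every hour.
    bounces_per_hour = PARTS * COPIES * failing
    return math.ceil(bounces_to_fill(limit_mb) / bounces_per_hour)
```

Under these assumed inputs, `bounces_to_fill(MAILBOX_LIMIT_MB)` is 292 and `hours_to_fill(MAILBOX_LIMIT_MB)` is 4, so the sender, postmaster, and root mailboxes could each plausibly fill within hours rather than days.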

Addressing the Problem

Even in his self-scheduled Fortress of Solitude, it would only be a matter of time before the next wave of status seekers would find and descend upon him, so Chris knew he would have to act fast. He quickly whipped up some shell scripts to seek out and destroy the offending error messages as well as the associated emails that would have kept the problem going. All told, the operation took several hours, but the server was back up and running, happily serving out new emails, an hour before the end of the day. He had fixed the issue, but Chris knew that this could easily happen again.

Chris approached management with the idea of having Sendmail gradually scale back delivery attempts, but they weren't buying it, nor was sending the error without the entire original message deemed acceptable.
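
For readers unfamiliar with the idea, "gradually scaling back delivery attempts" usually means waiting longer after each failure instead of retrying at a fixed interval. The sketch below illustrates that general technique in Python; it is not Sendmail configuration, and the function name `retry_delays`, the doubling factor, and the one-day cap are all assumptions chosen for illustration.

```python
from typing import List


def retry_delays(attempts: int,
                 base_minutes: int = 60,
                 factor: float = 2.0,
                 cap_minutes: int = 24 * 60) -> List[int]:
    """Return the wait (in minutes) before each retry, growing until capped."""
    delays = []
    delay = float(base_minutes)
    for _ in range(attempts):
        delays.append(int(delay))
        # Grow the delay for the next attempt, but never past the cap.
        delay = min(delay * factor, cap_minutes)
    return delays
```

With these defaults, `retry_delays(6)` gives `[60, 120, 240, 480, 960, 1440]`: six attempts spread over more than two days instead of six attempts in six hours.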

Instead, they reached a compromise - a script would run as a cron job to find and delete queued messages that had 50 or more sending attempts. Over the following months, the script's output log showed that this happened more frequently than expected. In fact, it was by sheer luck that the server hadn't been brought to a screeching halt much sooner.
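
The story doesn't show the actual script, so the following is only a sketch of what such a cleanup job might look like, written in Python. It assumes the queue uses classic Sendmail-style "qf" control files with matching "df" data files, and that each control file records its attempt count on a line of the form `N<count>`; both are assumptions to verify against the real queue format before trusting anything like this. The names `attempts_in` and `purge_stale` are hypothetical.

```python
import os
import re
from typing import List

# ASSUMPTION: the attempt count is stored on a line such as "N51".
ATTEMPTS_LINE = re.compile(r"^N(\d+)\s*$")
MAX_ATTEMPTS = 50  # threshold from the story


def attempts_in(control_path: str) -> int:
    """Return the recorded attempt count, or 0 if no such line is found."""
    with open(control_path, "r", errors="replace") as fh:
        for line in fh:
            match = ATTEMPTS_LINE.match(line)
            if match:
                return int(match.group(1))
    return 0


def purge_stale(queue_dir: str,
                max_attempts: int = MAX_ATTEMPTS,
                dry_run: bool = True) -> List[str]:
    """List (and, unless dry_run, delete) messages at or over max_attempts."""
    with os.scandir(queue_dir) as entries:
        names = sorted(e.name for e in entries if e.name.startswith("qf"))
    purged = []
    for name in names:
        control = os.path.join(queue_dir, name)
        if attempts_in(control) < max_attempts:
            continue
        purged.append(name)
        if not dry_run:
            # Remove the control file and its matching data file, if present.
            data = os.path.join(queue_dir, "df" + name[2:])
            for path in (control, data):
                if os.path.exists(path):
                    os.remove(path)
    return purged
```

The sketch defaults to a dry run so that the returned list can be logged and reviewed first, much like the output log mentioned above. It also makes no attempt to coordinate with a running mail daemon, which a real job would need to consider.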

Fast forward a few years and a few Unix admins. With more users sending larger attachments more frequently, the problem happened again on the same server, which was still in use. After years of paranoia about losing even a single message, this time the fix was much simpler - delete the entire mail queue and explain to everyone, "Sorry, you lost some mail."