Mail outage

Posted on: 2014-05-07 14:48:16+00:00

During the afternoon of May 6th we began experiencing delays in mail delivery of 1-2 hours. Initial efforts at remediation seemed to clear this up but on the morning of May 7th the problem worsened and we proactively disabled mail service to deal with the failure. The underlying hardware suffered failures on multiple disks. This outage effects all ASF mailing lists and mail forwarding.

 This service is housed at OSUOSL, and we are currently waiting on smart hands to help with replacing hardware. Our expectation at this point is that we still have multiple hours worth of outage. 

 Incoming mail is currently being received and held in queue by our mail exchangers. We also have a copy of the existing queue that hasn't been processed; so we expect no mail or data loss.  

ASF Infra's twitter bot will provide updates as we have them for the duration of the outage. Feel free to follow @infrabot on Twitter. There will be an update on this post as well as the situation progresses.

UPDATE 7 May 19:27 UTC - Drives have been replaced, array is attempting to rebuild. As indicated earlier on twitter, there likely remains hours of outage.  

UPDATE 7 May 20:44 UTC - The disk array is still in the process of repairing. Several hundred mails were processed during a reboot, but more work remains before service is restored.  Mail service has been disabled again as the array repair process is CPU-bound. The plan going forward is to allow the disk arrays to finish repairs. Once that is complete, we'll reenable the mail service and flush what is currently in the queue. Finally, once the queue is empty we'll begin receiving mail again.

UPDATE 8 May 05:00 UTC - The disk array failed to repair itself. The disks have been replaced and a new installation has been completed. Progress continues to be made towards resolution, but nothing firm enough yet for us to predict an time for restoration.

UPDATE 8 May 15:45 UTC - No material change of status has occurred. Infra worked in shifts around the clock last night and continue to do so to restore service. More updates as they become available.  

UPDATE 9 May 11:20 UTC - We are working on temporarily restoring the most essential email aliases. In the meantime, inquiries may be made to infrastructure@apache.pw or on our IRC channel, #asfinfra on Freenode. The work on restoring the service in full is still ongoing.

UPDATE 9 May 17:20 UTC - We've successfully restored a host from backups and will be starting testing soon. Based on the progress made in those tests we'll try and provide expectations around restoration of service timeline.

UPDATE 10 May 15:45 UTC - We've started pushing live mails through the system - you'll begin to see them trickle in as we gradually open the floodgates to restore service. Expect intermittent spurts for a while. 

UPDATE 10 May 21:55 UTC -  The floodgates have been opened.  As we have a significant amount of backlog to catch up on, please be patient as the service does this.  As always feel free to contact us if you have any questions. In the immediate short term (next day or so, we suggest you continue to use infrastructure@apache.pw and our IRC channel, #asfinfra on Freenode.  We would like to thank you for your patience during this extremely busy time. 

UPDATE 12 May 16:04 UTC - Clarification - we have opened the floodgates, but have a substantial amount of backlog; and with the sudden rush of mail are being throttled by various mail services. With the addition of mail that's coming through anyway; it may take us from 2-5 days to fully flush the backlog. This time is so wide because of a wide variety of factors that are largely outside of our control, such as new mail coming in and mail services individual throttling policies.  

Copyright 2024, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation...