Today I discovered why our server was bouncing all incoming email. We were able to send email successfully but hadn't received any for 2-3 days. What alerted us to the problem was a bounce message that arrived at one of our other email accounts.
An example of the bounce message:
This is an automatically generated Delivery Status Notification
THIS IS A WARNING MESSAGE ONLY.
YOU DO NOT NEED TO RESEND YOUR MESSAGE.
Delivery to the following recipient has been delayed:
Message will be retried for 2 more day(s)
Technical details of temporary failure:
Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 451 451 4.3.5 Server configuration problem (state 14).
At first I thought we had a blacklist issue, which happens from time to time as we send large amounts of email in the form of newsletters etc.
Going through /var/log/syslog I found some interesting entries related to Postfix and Postgrey:
Jun 23 06:25:43 server postfix/smtpd[14721]: NOQUEUE: reject: RCPT from connet1.connect.uwaterloo.ca[129.97.128.124]: 451 4.3.5 Server configuration problem; from=<xxx@uwaterloo.ca> to=<xxxr@ourserver.com> proto=ESMTP helo=<connect.uwaterloo.ca>
Jun 23 06:26:48 server postfix/smtpd[16600]: warning: connect to 127.0.0.1:60000: Connection refused
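Spotting that second line by eye in a busy syslog is easy to miss, so here is a minimal sketch of how the same search could be scripted. The function name refused_endpoints is my own invention for illustration, and the pattern only assumes the warning format shown in the log line above.

```python
import re

# Matches Postfix warnings of the form seen in the syslog excerpt:
#   warning: connect to 127.0.0.1:60000: Connection refused
REFUSED = re.compile(
    r"warning: connect to (?P<host>[\d.]+):(?P<port>\d+): Connection refused"
)


def refused_endpoints(lines):
    """Return the set of (host, port) pairs that Postfix failed to reach."""
    found = set()
    for line in lines:
        match = REFUSED.search(line)
        if match:
            found.add((match.group("host"), int(match.group("port"))))
    return found


# The warning line from the log above, used as sample input.
sample = [
    "Jun 23 06:26:48 server postfix/smtpd[16600]: warning: "
    "connect to 127.0.0.1:60000: Connection refused",
]
print(refused_endpoints(sample))  # prints {('127.0.0.1', 60000)}
```

On a real server the same function could be fed the lines of /var/log/syslog instead of the sample list.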
My next thought was that maybe Postfix was confused somehow, so I decided to restart the whole server.
When the server came back up, I found that MySQL wasn't running and all database-driven websites were down. Yikes. Looking through the syslog I discovered the second issue: not enough hard drive space. A simple df -h confirmed it: 100% use on the /var partition.
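A full partition like this is easy to catch before it takes services down. As a rough sketch (not what I was running at the time), a small check could look like the following. The function names and the 90% threshold are assumptions for illustration; the percentage is computed as used / (used + free), which should be close to the Use% column of df.

```python
import os
import shutil


def percent_used(path):
    """Return the percentage of space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free
    # Guard against pseudo-filesystems that report no space at all.
    return 100.0 * usage.used / denominator if denominator else 0.0


def is_nearly_full(path, threshold=90.0):
    """Return True when usage is at or above threshold (assumed default: 90%)."""
    return percent_used(path) >= threshold


# Check the root and /var partitions if they exist on this machine.
for mount in ("/", "/var"):
    if os.path.isdir(mount):
        print(f"{mount}: {percent_used(mount):.0f}% used, "
              f"nearly full: {is_nearly_full(mount)}")
```

Run from cron, something like this could send a warning long before a partition reaches 100%.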
I quickly removed some old log files and other junk, then started MySQL manually. Everything was running again, but the email issue remained.
I went through /etc/postfix/main.cf looking for clues and found:
smtpd_recipient_restrictions = reject_non_fqdn_recipient,
reject_unknown_recipient_domain,
permit_mynetworks,
permit_sasl_authenticated,
reject_unauth_destination,
reject_unlisted_recipient,
check_policy_service inet:127.0.0.1:12525,
check_policy_service inet:127.0.0.1:60000,
permit
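The two check_policy_service entries point at local daemons that Postfix consults for every incoming message, so if one of them is not listening, mail gets deferred. The following is a minimal sketch of how one might pull those endpoints out of the configuration and probe each port. The names policy_endpoints and is_listening are hypothetical, and the parsing only handles the inet:host:port form shown above.

```python
import re
import socket

# Finds entries such as: check_policy_service inet:127.0.0.1:60000
POLICY = re.compile(r"check_policy_service\s+inet:([^\s:,]+):(\d+)")


def policy_endpoints(config_text):
    """Return (host, port) pairs for every inet policy service in the text."""
    return [(host, int(port)) for host, port in POLICY.findall(config_text)]


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# The restriction list from main.cf, pasted in as sample input.
restrictions = """
smtpd_recipient_restrictions = reject_non_fqdn_recipient,
    reject_unknown_recipient_domain,
    permit_mynetworks,
    permit_sasl_authenticated,
    reject_unauth_destination,
    reject_unlisted_recipient,
    check_policy_service inet:127.0.0.1:12525,
    check_policy_service inet:127.0.0.1:60000,
    permit
"""

for host, port in policy_endpoints(restrictions):
    state = "listening" if is_listening(host, port) else "NOT listening"
    print(f"{host}:{port} is {state}")
```

A port reported as not listening corresponds to the "Connection refused" warning in syslog.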
Looking up info on Postgrey led me to its config file, /etc/default/postgrey, which contained:
POSTGREY_OPTS="--inet=127.0.0.1:60000"
That port matches the one in the syslog warning, so the refused connection was Postfix trying to reach Postgrey. I tried to restart Postgrey and discovered that it wasn't running and failed to start. So I ran it manually in verbose mode:
/usr/sbin/postgrey -v --pidfile=/var/run/postgrey.pid --inet=127.0.0.1:60000
It turned out Postgrey failed to find its database. I checked /var/lib/postgrey and found that it did in fact have files there. It seems Postgrey tells you it can't find its database when the database is there but can't be opened because it is corrupt!
To test this, I moved all files within /var/lib/postgrey to a temporary location and started Postgrey manually again. It started up like a charm and created a new database.
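Moving the files aside rather than deleting them keeps the old database around in case it is needed later. I did this by hand, but as a hedged sketch the step could be scripted as below. The function name move_aside and the file name example.db are placeholders, and the demo runs in a throwaway temporary directory rather than touching /var/lib/postgrey. The daemon should be stopped before doing this on a real system.

```python
import shutil
import tempfile
from pathlib import Path


def move_aside(data_dir, backup_dir):
    """Move every entry in data_dir into backup_dir and return the names moved."""
    data_dir, backup_dir = Path(data_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() builds the full list first, so moving entries is safe.
    for entry in sorted(data_dir.iterdir()):
        shutil.move(str(entry), str(backup_dir / entry.name))
        moved.append(entry.name)
    return moved


# Demonstration in a scratch directory with a placeholder file.
with tempfile.TemporaryDirectory() as scratch:
    data = Path(scratch) / "postgrey"
    data.mkdir()
    (data / "example.db").write_text("placeholder")
    print(move_aside(data, Path(scratch) / "postgrey.bak"))  # prints ['example.db']
```

On the server, the equivalent call would be given /var/lib/postgrey and a backup directory of your choosing.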
To make sure everything was running like new, I restarted the server, loaded up a few websites, and sent some test emails from an external source (Gmail).
Voila! I called up my clients, updated them on the issue, and let them know that they had not lost any email and would still receive the messages sent during the downtime. A few minutes later, emails from the weekend started to arrive.