Late update.. actually the 23rd, explained below.
A year ago-ish I said: ...notice of surrender
This update is late - not one year exactly - because when I was starting to document things for this update (number of servers, uptime of each, bans/blocks, lines of custom code, number and types of off-the-shelf apps, on and on) one of the customers updated their web/php page(s) with a highly breakable piece of php code that was immediately, accidentally exploited and completely trashed a server. The server in question had been up 142 days, and back in late October 2010 I had stopped the every-15-minute cronjob that checked *everything* and reset anything *fishy*. For the technocrats out there: the php basically presented a one-entry-box form, did not validate the entry, and attempted to email a HUGE file to that email address - wrapped in a bad while loop. A blank submission sent this madness regenerating the mail at full computer speed and pressing it into the "mail queue" with no mailto address until the queue overran, syslog died, RAM and swap filled up and overflowed, and the kernel panicked. The end result: reset button, then - before the http/php could get blank-submitted again - go find the bad code and disable it. So this update is late, and all aggro-rambling instead of neat and bullet-pointed like I intended.
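For anyone wondering what the missing guard rails look like, here is a minimal sketch, in Python rather than the customer's php, and not the actual fix that went on the server. The names handle_form, MAX_SENDS and the send callable are made up for illustration, and the address pattern is deliberately strict and simplified. The idea is just: refuse blank or malformed input, and put a hard cap on the loop.

```python
import re

# Deliberately strict, simplified pattern (an assumption for this sketch);
# real-world address validation is looser than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

MAX_SENDS = 1  # hypothetical cap: one message per form submission


def handle_form(address, send):
    """Validate the single form field, then hand off mail a bounded number of times.

    `send` is a hypothetical callable standing in for the real mail hand-off.
    Returns how many messages were handed to it.
    """
    address = (address or "").strip()
    if not EMAIL_RE.fullmatch(address):
        return 0  # blank or malformed entry: queue nothing at all
    sent = 0
    for _ in range(MAX_SENDS):  # bounded loop, cannot spin forever
        send(address)
        sent += 1
    return sent
```

Calling handle_form("", queue_mail) with any queue_mail function returns 0 and never touches the queue, which is the whole point.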
By October/November 2010 the aggregate inbound connects had been reduced from 5-10 million/day to 300,000-1 million/day.
3 domains were *surrendered* and used as bait to draw down the DDoS, sync, flood(s). Since all three domains are now expired I will reveal the network security trickery used to reduce or 'backfire' the DDoS. All three domains were given (registrar) and 2 other name servers. The 2 name servers we/I used were virtual computers with only DNS port(s) open, on extremely throttled bandwidth (slower than a 300-baud modem), and all entries pointed to 'local' (192.168) or 'loopback/local' (127.0) IP addresses. The zombies and badware-bots just abuse domain names, and their original authors don't seem careful enough to stop the bots from targeting their own machines by number. So the bots semi-self-destruct trying to ping/email/inject the local or loopback addresses resolved from the FQDNs. The virtuals are offline and deleted, the domain names are expired, and the 500+/-baud of bandwidth is back in use elsewhere.
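A rough sketch of the 'all entries point home' idea, for the curious. This is not the zone data we actually served: the host names and addresses below are placeholders, and BAIT_RECORDS, is_sink_address and zone_lines are names invented for this example. It uses Python's standard ipaddress module to refuse any record that is not a loopback or private address before writing it out as a plain A-record line.

```python
import ipaddress

# Hypothetical bait records; names and addresses are placeholders.
BAIT_RECORDS = {
    "www.bait-domain.example.": "127.0.0.1",
    "mail.bait-domain.example.": "192.168.0.10",
}


def is_sink_address(addr):
    """True if addr is loopback or private, i.e. it resolves back onto the asker."""
    ip = ipaddress.ip_address(addr)
    return ip.is_loopback or ip.is_private


def zone_lines(records, ttl=300):
    """Build A-record lines, refusing anything that points at a routable address."""
    lines = []
    for name, addr in sorted(records.items()):
        if not is_sink_address(addr):
            raise ValueError(f"{name} points at routable address {addr}")
        lines.append(f"{name} {ttl} IN A {addr}")
    return lines
```

The check matters because one typo'd public address in a bait zone would aim the whole flood at some innocent bystander.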
A Windows XP Pro server, along with its syslog and custom log reading-ware, has also been retired. Approximately 20GB of compressed logs and a 'few thousand' lines of Visual C code have been put away in the archives - hopefully never to be seen again.
Still in 'the racks': 1 Gentoo server, 3 CentOS servers, 1 Windows (version private for security) server, 1 Windows XP 'server'.
RAW has 2 servers offsite, both Windows, and one in 'the racks' (Windows also, versions private also).
SBS put a virtual server up offsite for reasons - again - private for security.
There is a ridiculous number of smart-switches, routers, firewall devices (one Cisco, the rest private) in MLD's 'horde' also. The 'racks' means that the 'real racks' overflowed and there are server(s) on tables, on plastic storage boxes, and one bunch of spaghetti and a UPS on a chair.
I would detail things further, for the assistance of other admins managing (suffering) domains under assault - but I can't risk giving the crackers too much information. I/We barely survived a huge run of dumb-bots running the wire - don't need smart-bots taking a stab at us.
There is a total of 8TB exclusively for backups (4TB in each backup device) showing approximately 60% used. I tried to count on my fingers the number of hard drives and the disk space in use in the 'live' environment, and I gave up. Somewhere around 10-12 physical drives, various sizes, adding up to just over 3TB of 'live' environment.
In the 100,000/day to ten times that per day 'eased' environment of February 2011 - 90% of the current abuse is what I call the self-inflicted gunshot wound. This is URI and DNS Black-Listed email that we have to allow thru/in based on end-user 'whitelisting.' This is where some-end-user crawling the internet happens across the world famous webpage or phish that says Win FREE* Millions of Stuffs* NOW* FREELY* FREE* just enter your email address and click You Are A Winner button. And, the webpage tells them, you MUST WHITELIST blahblahwinnersclickwinboxesdawtru to be sure to get your FREE* Millions* of OFFERS* for FREE* STUFFS* at a DISCOUNT* super-duper special key-code in email... and don't forget to get all your friends to sign up, too, you'll get a FREE* PhD in FREE*eMail*Tronix (* PhD means PhisheD - not an actual doctorate). So, some-end-user did not get their superduperKey in eMail and we get email, phonecalls, sometimes yelled at, and 100,000-1,000,000 log entries per day like: blahblahdomain URI and DNS BL (spam/abuse) - passing because whitelisted. I say it with big love some end user, don't be mad.
I don't guess it gives away too much to give some credit to the out-of-the-box, off-the-shelf and especially the extremely useful software packages that have been utilized in this digital fort defense:
One 'final note' section about fail2ban. The version(s) of fail2ban I got started with kept their own internal, in-RAM list/array of banned IPs: when they got banned, when the ban would expire. The fact that fail2ban is written in Python (interpreted, not compiled) made this 'interpret the list' system very loady on the system (big CPU and RAM use). The way around this (at the time) that I used:
- fail2ban filter triggers a php script to load ip, banstart, bantime into a mysql database (the action is a script)
- fail2ban self-expires (removes from its internal list) the ban in 2 minutes
- cron fires every 10 minutes (or 15 on some servers) to do 'actual' ban-removal from the much-more-efficient php/mysql (another php script)
* My way put fail2ban back into the 300-400 bans max in a 2-minute range, rather than fail2ban sifting-and-sorting-and-testing a list of sometimes 100,000 or more bans. In early 'let fail2ban handle it' trials CPU load would get to 80+ percent, and a 4GB RAM system was constantly 'going to swap.' Letting mysql do the data storage and selection, and php or compiled c (.o) do the ban/unban (notify other devices, log, sometimes modify a local (software) firewall), put the system load back in the 1-2GB RAM and 4-10% CPU use range.
- In/On a system under attack - use logrotate to keep the logs that fail2ban 'reads' or 'watches' down to very small sizes, this helps also.
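The store-and-sweep idea above can be sketched in a few lines. To be loud about the swaps: this is Python with the standard sqlite3 module standing in for the php/mysql pair I actually used, and the table, column and function names (bans, open_ban_db, record_ban, expire_bans) are invented for the example. The fail2ban action would call something like record_ban; the cron job would call something like expire_bans and then do the real unban for each IP it gets back.

```python
import sqlite3


def open_ban_db(path=":memory:"):
    """Open (or create) the ban store; sqlite3 stands in for mysql here."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bans ("
        " ip TEXT PRIMARY KEY,"
        " banstart INTEGER NOT NULL,"  # epoch seconds when the ban began
        " bantime INTEGER NOT NULL)"   # ban length in seconds
    )
    return conn


def record_ban(conn, ip, banstart, bantime):
    """What the fail2ban action script would do: store ip, banstart, bantime."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO bans (ip, banstart, bantime) VALUES (?, ?, ?)",
            (ip, banstart, bantime),
        )


def expire_bans(conn, now):
    """What the cron script would do: return expired IPs and drop them from the store."""
    with conn:
        rows = conn.execute(
            "SELECT ip FROM bans WHERE banstart + bantime <= ? ORDER BY ip",
            (now,),
        ).fetchall()
        conn.execute("DELETE FROM bans WHERE banstart + bantime <= ?", (now,))
    return [ip for (ip,) in rows]
```

The database does the selecting, so the sweep only ever touches the rows that are actually due, no matter how long the full ban list has grown.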
* * * Feb 24, 2011 * * *
Cascade Effect(s)
Just a 'quick followup' - not really DDoS related - but related to that php/script/mail error. BOOM!
You know what happens to a linux server when things start to go crazy?, when there is panic in the kernel?, when there are daemons and services killed by the kernel to free up resources? BOOM! happens, and then it cascades. (1) Apache/PHP go nutzo and take out DNS (thereby corrupting DNS and its affiliated cache). (2) The mailserver (qmail in this case), apparently unable to DNS from [local] to [blank], starts spazzing out and queue-ing and defer-ing to infinity and beyond (thereby further corrupting DNS + affiliated cache and the spam/virus/email-conglomerate-daemons). (3) Somewhere in the few seconds it takes for a server-snake to devour its tail: sshd, various devs/udevs, iptables, and mysql get sig'd to die, and they do, self-destructing their config and data files (so as to go out with a bang, no going quietly into the night for production servers). (4) Sometime after a hard power-off and server-hup, when things seem to have recovered themselves - like oil on the pond - corruption spreads slowly and steadily and gets you 5-ish more hours putting a server back together basically one .conf file at a time, one port INPUT/OUTPUT in the firewall at a time, one corrupted-beyond-repair program removed/reinstalled at a time....
The beast is back to normal operations now - email that got 'deferred' at 8:00am is finally de-queue-ing and delivering at 7:30(ish)pm.