Are you backing up your AIX systems over Virtual Ethernet adapters? Of course you are; who isn't, right? Are your backup server and clients on the same physical POWER system? If so, you are most likely backing up over Virtual Ethernet to another AIX LPAR that is running your enterprise backup software, such as TSM or Legato Networker. And you probably have a dedicated private virtual network (and adapters) on both the clients and the server to handle the traffic for the nightly backups. The next question is, have you tuned your Virtual Ethernet adapters?
There are several tips available for tuning your Virtual Ethernet adapters for better performance on AIX. These tips include changing settings such as MTU size, TCP window sizes, enabling largesend, etc. I highly recommend the following blog posts from Anthony English and Nigel Griffiths on this subject:
OK, so you've got everything humming along nicely, your backups are flying over the virtual network (across the POWER hypervisor) and everybody is happy. After a period of time, you notice that the backups have started to slow down. They are taking longer to finish and the overall throughput of each backup drops. Some backups start in the evening around 9pm and are still running the next morning at 7am! In some cases you need to kill the backups or even reboot the backup server LPAR for things to return to normal.
"What is going on!?" you cry.
Well, there are a number of reasons why this could be happening. For example, your shared processor pool may be overwhelmed during the backup window. As we know, Virtual Ethernet adapters require CPU to do their work. If the CPU pool is running low on available CPU resources, this could contribute to the problem. And of course there could be tuning issues with the Virtual Ethernet adapters or the AIX OS in general. Or there may be issues with other pieces of the infrastructure, like network and SAN switches, adapters, etc. Perhaps there's an issue with the applications and/or databases on the AIX systems? They often have their own mechanisms/tools for backing up their data to your enterprise backup software. Is the backup server sized to cope with the load, i.e. CPU, memory, disk layout and I/O, sufficient tape drives, disk storage pools, etc?
So assuming you've checked all of the above (and more), then perhaps you've hit a problem that I encountered recently. In my particular case, backups over the hypervisor were slowing down, without any discernible cause. Initially the backups would be very fast but after a month or so, things would start to slow down dramatically.
We noticed that there were very large (and increasing) values for Packets Dropped, Hypervisor Receive Failures and No Resource Errors in the output from the netstat -v command.
ETHERNET STATISTICS (ent1) :
Device Type: Virtual I/O Ethernet Adapter (l-lan)
Hardware Address: 41:ba:13:e7:25:0b
Elapsed Time: 42 days 4 hours 3 minutes 34 seconds
Transmit Statistics:            Receive Statistics:
Packets: 5978589961             Packets: 26139832411
Bytes: 779465989202             Bytes: 711051516630458
Interrupts: 0                   Interrupts: 6804561727
Transmit Errors: 0              Receive Errors: 0
Packets Dropped: 0              Packets Dropped: 86012309
Max Collision Errors: 0         No Resource Errors: 46113807
Hypervisor Send Failures: 0
Receiver Failures: 0
Send Errors: 0
Hypervisor Receive Failures: 46113807
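If you want to keep an eye on these counters over time, a small filter like the one below might help. It is only a sketch: the function name rx_counters and the output labels are my own, and it assumes the statistics arrive on stdin in the layout shown above, with the receive-side value as the last field on each line.

```shell
# Sketch: pull the receive-side trouble counters out of adapter statistics on stdin.
# Assumption: the receive counter is the last field on each matching line,
# as in the netstat -v output above. rx_counters is a made-up helper name.
rx_counters() {
    awk '
        /Packets Dropped:/             { print "rx_packets_dropped=" $NF }
        /No Resource Errors:/          { print "no_resource_errors=" $NF }
        /Hypervisor Receive Failures:/ { print "hyp_receive_failures=" $NF }
    '
}

# Example on AIX, for the adapter used in this post:
#   entstat -d ent1 | rx_counters
```

Running it twice, some minutes apart during the backup window, and comparing the numbers should show whether the counters are still climbing.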
After some discussion with IBM AIX support, we discovered that we should increase some of the buffer sizes for our Virtual Ethernet adapter (the entX device). This would alleviate the no resource issues we'd been experiencing. Looking at the output from the netstat -v command, we also noticed that the Medium, Large and Huge buffers had all reached their maximum values in the past (compare the Max Allocated row with the Max Buffers row).
Buffer Type           Tiny   Small  Medium   Large    Huge
Min Buffers            512     512     128      24      24
Max Buffers           2048    2048     256      64      64
Allocated              513     535     148      28      64
Registered             512     510     127      24      13
Max Allocated          576     951     256      64      64
Lowest Registered      502     502      64      12      11
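The check we did by eye on this table (has Max Allocated ever reached Max Buffers?) can be scripted. The following is a rough sketch, not an IBM-supplied tool: exhausted_buffers is a hypothetical helper, and it assumes the buffer table appears on stdin with the row labels shown above.

```shell
# Sketch: report buffer types whose "Max Allocated" has reached "Max Buffers".
# Assumption: the table rows are labelled exactly as in the output above.
# exhausted_buffers is a made-up helper name.
exhausted_buffers() {
    awk '
        $1 == "Buffer" && $2 == "Type"   { n = NF - 2; for (i = 3; i <= NF; i++) name[i - 2]  = $i }
        $1 == "Max" && $2 == "Buffers"   { for (i = 3; i <= NF; i++) limit[i - 2] = $i }
        $1 == "Max" && $2 == "Allocated" { for (i = 3; i <= NF; i++) peak[i - 2]  = $i }
        END {
            for (i = 1; i <= n; i++)
                if (limit[i] != "" && peak[i] + 0 >= limit[i] + 0)
                    print name[i], "peak=" peak[i], "max=" limit[i]
        }
    '
}

# Example on AIX, for the adapter used in this post:
#   entstat -d ent1 | exhausted_buffers
```

Fed the table above, this lists Medium, Large and Huge, which matches what we saw. Point it at one adapter at a time; if the input holds several buffer tables, only the last one is reported.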
The advice from IBM support was to increase these buffers using the chdev command (they also advised that we should reboot for the changes to take effect):
# chdev -l ent1 -a min_buf_medium=512 -a max_buf_medium=1024 -a min_buf_large=96 -a max_buf_large=256 -a min_buf_huge=96 -a max_buf_huge=128 -P
# shutdown -Fr
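After the reboot it is worth confirming that the new values actually took effect. Here is a minimal sketch of one way to do that. It assumes lsattr -El output with the attribute name in the first column and its value in the second; check_buf_attrs is a hypothetical helper, and the expected values are simply the ones from the chdev command above.

```shell
# Sketch: compare adapter attributes on stdin against the values set with chdev above.
# Assumption: input is in "lsattr -El" layout (attribute name, then value, per line).
# check_buf_attrs is a made-up helper name.
check_buf_attrs() {
    awk '
        BEGIN {
            want["min_buf_medium"] = 512; want["max_buf_medium"] = 1024
            want["min_buf_large"]  = 96;  want["max_buf_large"]  = 256
            want["min_buf_huge"]   = 96;  want["max_buf_huge"]   = 128
        }
        ($1 in want) {
            seen[$1] = 1
            if ($2 + 0 != want[$1]) { print $1 ": got " $2 ", want " want[$1]; bad = 1 }
        }
        END {
            for (a in want) if (!(a in seen)) { print a ": not found"; bad = 1 }
            exit bad + 0   # non-zero exit status if anything is wrong or missing
        }
    '
}

# Example on AIX, for the adapter used in this post:
#   lsattr -El ent1 | check_buf_attrs && echo "buffer settings applied"
```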
Since implementing this tuning change (to the adapter on the backup server), we have not had a repeat of the problem. We will continue to monitor the performance and I'll be sure to let everyone know if we have further issues.