Performance tuning (AWS)
Note: This page applies to SFTP Gateway version 2.x. Visit Here for documentation on version 3.x.
Overview
SFTP Gateway 2.001.x is designed to support many SFTP users, with a
high volume of file uploads.
Some customers have experienced server slowness, or have needed several minutes and multiple attempts to connect via SSH or SFTP.
If this is happening to your server, it can be a sign of more serious underlying issues, so it's important to address them while the server is still operational.
Disk space
If the server is out of disk space, it will become unstable. So, the first thing to check is the remaining disk space:
sudo su
df -h
You should see output similar to the following:
Filesystem         Size  Used Avail Use% Mounted on
/dev/xvda1          32G   14G   19G  43% /
In this case, the root partition is only using up 43% of the disk.
If the root partition (/) is full, follow the instructions on
this page to free up some space.
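The disk check above can be wrapped in a small script for repeated use. This is only a sketch: the function names and the 80% warning threshold are assumptions made for illustration, not part of SFTP Gateway.

```shell
# Print the Use% of a mount point as a bare integer.
# Reads `df -P` output on stdin; the mount point defaults to "/".
use_pct_from_df() {
    awk -v mnt="${1:-/}" 'NR > 1 && $NF == mnt { sub(/%/, "", $(NF-1)); print $(NF-1) }'
}

# Warn when the root partition reaches the threshold (assumed default: 80).
check_root_disk() {
    threshold="${1:-80}"
    pct=$(df -P / | use_pct_from_df /)
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: / is ${pct}% full (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: / is ${pct}% full"
}
```

Calling check_root_disk 80 prints a one-line status and returns non-zero once the root partition is at or above 80%.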
Memory and CPU
The next step is to check the server memory. Run this command:
top
You should see output similar to the following:
top - 21:37:48 up 237 days, 23:19,  1 user,  load average: 0.04, 0.04, 0.01
Tasks: 127 total,   1 running,  97 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8168864k total,  4576804k used,  3592060k free,   213368k buffers
Swap:  2097148k total,        0k used,  2097148k free,  1913304k cached
Note the following examples from the output above:
- CPU usage: 0.0%us. A high sustained number is a sign of trouble.
- Memory used: 4576804k used. Divide this by the total, and watch for a high sustained percentage.
- Swap used: 0k used. Anything other than 0k (and rising) is a bad sign.
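As a worked example of the memory calculation, the following sketch divides the used figure by the total from the sample output above. The function name mem_used_pct is made up for illustration.

```shell
# Integer percentage of memory in use: used * 100 / total (both in kB).
mem_used_pct() {
    used_kb="$1"
    total_kb="$2"
    echo $(( used_kb * 100 / total_kb ))
}

# Values taken from the "Mem:" line of the sample top output.
mem_used_pct 4576804 8168864   # prints 56
```

In this sample, the used figure equals total minus free, so it includes buffers and cache. The percentage therefore overstates what running processes hold, which is why the trend over time matters more than a single reading.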
If the server is using up a lot of memory, chances are there are too many processes running in memory.
Too many running processes
One common cause of high memory usage is an excess of running processes.
Run the following command:
ps -ef | wc -l
This will count the total processes (the count includes the ps header line, so it is off by one). Anything over a few thousand could be a sign of trouble, especially if that number keeps rising.
If a server crash is imminent, you can stop all the task spooler processes:
sudo killall ts
Warning: Any file events in the task spooler will be lost, so this could result in stuck files.
Check the number of processes again:
ps -ef | wc -l
This time, the number should be much smaller.
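To see whether the process count keeps rising, it can help to sample it a few times rather than read it once. The sketch below does that; the function names, the sample count, and the interval are assumptions chosen for illustration.

```shell
# Count processes whose command name is exactly "$1" (for example: ts).
count_procs_named() {
    ps -e -o comm= | awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Print a timestamped sample of total and named process counts.
# Usage: watch_procs NAME [SAMPLES] [INTERVAL_SECONDS]
watch_procs() {
    name="$1"
    samples="${2:-5}"
    interval="${3:-60}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        echo "$(date '+%H:%M:%S') total=$(ps -e -o pid= | wc -l) $name=$(count_procs_named "$name")"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, watch_procs ts 5 60 prints five samples one minute apart. If the ts count climbs on every sample, the task spooler is accumulating work faster than it can finish it.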
Temporarily stop specific services
Forcefully terminating task spooler processes is just a temporary measure.
There are two services that emit events:
- cron: Kicks off a folder scan every minute for every user.
- incron: The folder scan kicks off a movetos3 process for every stuck file.
It's possible that these services are burying your server in task spooler processes. If so, you may want to temporarily stop these services:
service crond stop
service incrond stop
(Just remember to start these services later, so that SFTP Gateway can function normally.)
Performance tuning: reduce the folderscan frequency
By default, SFTP Gateway triggers a folder scan every minute for every SFTP user. This helps ensure that file uploads are promptly moved to S3.
As you start adding more users on a production system, these folder scans can add up quickly -- especially if you have not upgraded the EC2 instance class along the way.
The biggest bang for the buck, in terms of performance, is to make the folder scans less frequent.
Edit the file:
/usr/local/bin/globals.sh
Look for the following line:
DEFAULT_USER_FOLDER_SCAN_FREQUENCY_IN_MINUTES=1
Change this to the following:
DEFAULT_USER_FOLDER_SCAN_FREQUENCY_IN_MINUTES=30
(Valid values are: 1, 2, 3, 4, 5, 12, 15, 20, 30 -- any number
that divides evenly into 60)
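If you prefer to script the edit, the following sketch validates the value against the list above, backs up the file, and rewrites the line in place. The function name set_scan_frequency and the .bak backup suffix are assumptions for illustration. It relies on GNU sed's -i option.

```shell
# Set DEFAULT_USER_FOLDER_SCAN_FREQUENCY_IN_MINUTES in globals.sh.
# Usage: set_scan_frequency MINUTES [FILE]
set_scan_frequency() {
    minutes="$1"
    file="${2:-/usr/local/bin/globals.sh}"
    # Accept only the values listed in this article.
    case "$minutes" in
        1|2|3|4|5|12|15|20|30) ;;
        *) echo "unsupported frequency: $minutes" >&2; return 1 ;;
    esac
    # Keep a backup copy before editing in place.
    cp -p "$file" "$file.bak" || return 1
    sed -i "s/^DEFAULT_USER_FOLDER_SCAN_FREQUENCY_IN_MINUTES=.*/DEFAULT_USER_FOLDER_SCAN_FREQUENCY_IN_MINUTES=$minutes/" "$file"
}
```

Running set_scan_frequency 30 as root changes the default path shown above. The previous version of the file remains next to it with a .bak suffix.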
To apply your changes, you will need to run the following command for each SFTP user (robtest is an example username):
usersetup robtest
Behind the scenes, the cron service runs a folder scan every minute for every SFTP user. There is a cron file for each SFTP user:
/etc/cron.d/robtest.userfolderscan.cron
In this file, the following line runs the folder scan every minute:
*/1 * * * * root ${BIN_DIR}/ts-folderscan ...
When you change the frequency to 30 in globals.sh and apply the change with usersetup,
you end up with the following:
*/30 * * * * root ${BIN_DIR}/ts-folderscan ...
Now, it runs once every 30 minutes.
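With many SFTP users, running usersetup by hand for each one gets tedious. Because each user has a cron file named after them, the usernames can be derived from those file names. The sketch below assumes usersetup runs without prompting when given only a username, as in the example above; the function names are made up for illustration.

```shell
# Directory holding the per-user folder scan cron files.
CRON_DIR="/etc/cron.d"

# Turn /etc/cron.d/robtest.userfolderscan.cron into robtest.
user_from_cron_file() {
    base=${1##*/}
    echo "${base%.userfolderscan.cron}"
}

# Re-run usersetup for every user that has a folder scan cron file.
reapply_all_users() {
    for f in "$CRON_DIR"/*.userfolderscan.cron; do
        # Skip the unexpanded pattern when no cron files exist.
        [ -e "$f" ] || continue
        user=$(user_from_cron_file "$f")
        echo "Applying scan frequency for $user"
        usersetup "$user"
    done
}
```

After calling reapply_all_users as root, a quick grep for ts-folderscan in one of the cron files should show the new schedule.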
EC2 instance class
When testing SFTP Gateway, you can use an inexpensive burstable instance, like a t3a.medium.
This should give you enough resources to run SFTP Gateway's Java backend and LDAP database.
When using SFTP Gateway in production, it's easy to forget to change the instance class
to something more robust like an m5.large, because it's really not necessary to do so at first
(at least not for those first few pilot users).
But as you add more SFTP users, and as these users start uploading files, an undersized instance class will start to suffer.
A t2 or t3 instance is burstable, meaning that you get decent performance,
as long as you only draw on that performance occasionally.
Once you start adding more SFTP users, you get more folder scans. These run incessantly and will quickly chew through your burst credits.
The solution is to move to a proper production grade class, such as an m5.large.
If you want to save money, you could go with the a1.medium if it's available in your region.
Taking advantage of a larger EC2 instance
If your SFTP users are uploading a lot of files (hundreds of thousands),
you can upgrade your EC2 instance to a much larger size (e.g. m5.2xlarge).
But you may notice that most of the CPU and memory sits idle, and the backlog of files waiting to move to S3 does not clear out any faster than before.
The solution is to edit the file:
/opt/sftpgw/sftpgateway.properties
And change the line:
concurrent_cloudtransfers=10
To something higher:
concurrent_cloudtransfers=15
To apply your change, run the command:
/usr/local/bin/apply-sftpgw-props
By default, SFTP Gateway will run 10 concurrent threads that move files to S3.
If you want to move more files to S3 at a time, you can increase this number.
Be careful, though: if you set this number too high, the server will run out of memory.
Some trial and error is needed. While the server is under load, increase the
number beyond 10 in small steps, and monitor memory usage against your
baseline after each change.
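One way to keep the increases gradual is to raise the setting by a fixed step each time. The sketch below reads the current value, adds a step, and writes it back after making a backup. The function names, the default step of 1, and the .bak suffix are assumptions for illustration. It relies on GNU sed's -i option.

```shell
PROPS_FILE="/opt/sftpgw/sftpgateway.properties"

# Print the current concurrent_cloudtransfers value from the properties file.
current_transfers() {
    sed -n 's/^concurrent_cloudtransfers=\([0-9][0-9]*\).*/\1/p' "${1:-$PROPS_FILE}"
}

# Raise concurrent_cloudtransfers by STEP (default 1).
# Usage: bump_transfers [STEP] [FILE]
bump_transfers() {
    step="${1:-1}"
    file="${2:-$PROPS_FILE}"
    old=$(current_transfers "$file")
    if [ -z "$old" ]; then
        echo "concurrent_cloudtransfers not found in $file" >&2
        return 1
    fi
    new=$(( old + step ))
    # Keep a backup copy before editing in place.
    cp -p "$file" "$file.bak" || return 1
    sed -i "s/^concurrent_cloudtransfers=.*/concurrent_cloudtransfers=$new/" "$file"
    echo "concurrent_cloudtransfers: $old -> $new"
}
```

After each bump, run /usr/local/bin/apply-sftpgw-props and watch memory and swap in top for a while before raising the value again.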