BioHPC – Storage Cheat Sheet

Filesystems Overview

/home2 - Home directories only: configuration, code, etc. Not for active data analysis. Mirror backup twice weekly, on Monday and Wednesday.

/project - Large space, high-performance**** for large files. Not for working with large numbers of small files: archive large collections of small files (<1MB each) and avoid working on very small files (<100KB). No backup by default. Incremental backup is available; PIs should email biohpc-help@utsouthwestern.edu if they would like any content on /project backed up.

/work - Since our recent upgrade, this is also a high-performance filesystem, intended for users' LIVE HOT data. When using /work, you do not need to stripe large single files for performance as you do on /project. Each user has 5 TB of space. /work is mirror backed up once per week (Friday/Saturday, no old versions).

/archive - This is a place for users to store COLD data. Each lab has 5TB of space by default; the quota can be increased upon approval. Accounted usage is exactly the actual usage. The /archive file system has a similar directory tree setup to /project. Files that have been unused for more than a year and are also over 1 GB in size will be moved to Tape Storage and become 'free'.

Overall, single-thread writing to /work or /archive can reach up to 2.3 GB/s, slightly faster than /project, while metadata queries are slightly slower. For most applications, you will not notice a performance difference among /work, /archive and /project.

** For applications that need to read large files from multiple threads concurrently (e.g. sequencing applications reading a large reference database), /work or /archive are better choices than /project, since the I/O throughput is served by more disk arrays.
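The advice above to archive large collections of small files on /project can be sketched in shell as follows. This is a minimal illustration, not a BioHPC tool: the function names count_small_files and bundle_dir and the example path are hypothetical, and it assumes the standard find and tar utilities.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the example path are hypothetical.

# Count regular files under a directory that are smaller than about 1 MiB.
# "-size -1024k" is used because find rounds sizes up to whole units,
# so "-size -1M" would match only empty files.
count_small_files() {
  find "$1" -type f -size -1024k | wc -l | tr -d ' '
}

# Pack a directory into a single compressed tar archive next to it.
# The original directory is left in place; remove it yourself only
# after checking the archive.
bundle_dir() {
  local dir=${1%/}
  tar -czf "$dir.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")"
}

# Example (hypothetical path):
#   count_small_files /project/department/myuser/many_small_files
#   bundle_dir /project/department/myuser/many_small_files
```

Listing the archive with tar -tzf before deleting anything is a cheap way to confirm that the bundle is complete.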

**** April 19, 2023: BioHPC has implemented a new two-tier file system as part of the /project upgrade to ensure that our users receive the best performance and storage capacity we can provide. Please see the user communication message below for more information, especially on the quota policy change for the new /project file system:

Dear BioHPC Community,

We have implemented a new two-tier file system for the /project directory upgrade to ensure that our users receive the best performance and storage capacity we can provide. This new system has a high-performance tier (NVMe pool, composed of solid-state drives) and a long-term storage tier (HDD pool, composed of spinning-disk drives). Whenever a user saves a new file to /project, it is initially stored in the NVMe pool and counted towards their quota usage. After 10 minutes without the file being accessed, the file is mirrored to the HDD pool. Because there is a temporary period of time during which there are two copies, the file in question will count ‘double’ against your quota.

If the overall NVMe pool usage is high, the system will begin purging old files out of the NVMe pool and retain only a single copy in the HDD pool. Once this occurs, the file will once again count towards the quota usage only a single time.

If all of a user's files are in both the NVMe pool and HDD pool, meaning they are newer than most other users' files, their quota usage count will be around twice their actual total file sizes due to these ‘fresh’ files not being purged from the NVMe pool. If all of their files are only in the HDD pool, meaning they are older and untouched for many weeks compared to other users' files, their quota usage count will be very close to their actual total file sizes.

If a user has a mixture of old and new files, their quota usage count will be between 1 and 2 times their actual total file size.

To facilitate this change, the BioHPC steering committee approved an increase in each lab's /project quota limit by 1.2 times their original quota limit, based on our detailed research and evaluation. This increase is considered reasonable for most labs with a mix of old and new files.

However, if your lab continually generates new data and stores that data on the /project file system, it might be the case that this 1.2x adjustment is insufficient. If you find this is the case, please let us know, with your PI involved in the discussion, and we can work on a case-by-case basis to determine a new quota that accommodates your needs.

If you have any questions or concerns, please contact us at BioHPC-Help@UTSouthwestern.edu.

Thank you for your attention!

Sincerely,

The BioHPC Team
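The quota arithmetic described in the message above can be sketched as a small calculation. This is an illustration under simplifying assumptions: sizes are whole terabytes, a file is either in both pools (counted twice) or in a single pool (counted once), and the function names counted_usage_tb and adjusted_quota_tb are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the /project quota arithmetic; function names are hypothetical.

# Usage counted against quota: files present in both the NVMe and HDD
# pools count twice, files held in only one pool count once.
counted_usage_tb() {
  local in_both_pools=$1 single_pool=$2
  echo $(( 2 * in_both_pools + single_pool ))
}

# Quota after the 1.2x adjustment, in integer arithmetic (rounds down).
adjusted_quota_tb() {
  echo $(( $1 * 12 / 10 ))
}

counted_usage_tb 3 2    # 3 TB of fresh files + 2 TB of old files: prints 8
adjusted_quota_tb 5     # a 5 TB quota becomes: prints 6
```

In this example 5 TB of actual data is counted as 8 TB, which is 1.6 times the real size and falls inside the 1 to 2 times range the message describes.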

Quotas

/home2 - 50GB per user
/project - 6TB+ for the lab as agreed with PI/department chair (default 5TB for each new lab)
/work - 5TB per user (but not for long term storage)
/archive - 5TB+ for the lab as agreed with PI/department chair (default 5TB for each new lab)

Quota stats show soft and hard limits. You need to keep within the soft limit. The hard limit only exists to give a margin of safety so that jobs generating more data than expected do not fail.

Checking Quotas / Usage

The biohpc_quota command shows your quota status on each filesystem:

$ biohpc_quota
Current BioHPC Storage Quotas for test (group: Test_lab):

FILE     |        SPACE USAGE         |         NUMBER OF FILES
SYSTEM   |     USED     SOFT     HARD |       USED       SOFT       HARD
---------|----------------------------|---------------------------------
home2    |     334M   51200M   71680M |      12393          0          0
project  |   35.65T      80T      90T |   28305097          0          0
work     |    6.24T    5.00T    7.00T |   84370898  unlimited  unlimited
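As a hedged sketch, the output above could be filtered to list only the filesystems where usage has passed the soft limit. The helper name over_soft_limit is hypothetical, and the parsing assumes rows of the form "name | USED SOFT HARD | ..." with K/M/G/T suffixes treated as binary units; adjust it if the real layout differs.

```shell
#!/usr/bin/env bash
# Sketch only: over_soft_limit is a hypothetical helper, not a BioHPC tool.
# Reads biohpc_quota-style text on stdin and prints "name used soft"
# for every filesystem whose used space exceeds its soft limit.
over_soft_limit() {
  awk -F'|' '
    # Convert a value such as 334M or 6.24T to bytes (binary units assumed).
    function to_bytes(v,    n, u) {
      n = v + 0
      u = toupper(substr(v, length(v), 1))
      if (u == "K") return n * 1024
      if (u == "M") return n * 1024 * 1024
      if (u == "G") return n * 1024 * 1024 * 1024
      if (u == "T") return n * 1024 * 1024 * 1024 * 1024
      return n
    }
    {
      # Skip titles, headers and separator rows.
      if (split($2, f, " ") != 3) next
      if (f[1] !~ /^[0-9]/) next
      soft = to_bytes(f[2])
      # A soft limit of 0 is treated as "no limit set".
      if (soft > 0 && to_bytes(f[1]) > soft) {
        name = $1
        gsub(/[ \t]/, "", name)
        print name, f[1], f[2]
      }
    }
  '
}

# Usage on a BioHPC node:
#   biohpc_quota | over_soft_limit
```

With the sample output shown above, only the work row is reported, since 6.24T used exceeds the 5.00T soft limit.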

To see individual usage for a user on /project, use the lfs quota command:

lfs quota -u <username> -h /project

$ lfs quota -u test -h /project
Disk quotas for user test (uid 123456):
     Filesystem    used   quota   limit   grace     files   quota   limit   grace
       /project    9.3T      0k      0k       -  37321317       0       0       -

Faster Performance for Very Large Files on /project

The /project filesystem is a parallel filesystem consisting of 40 storage targets. By default each file is stored on a single target. Each target can provide read speeds of up to 1GB/s depending on use.

Faster speeds can be achieved for very large files by striping the file across multiple targets. Most software can’t read files fast enough to benefit from striping – but some can. If you have many processes all reading from a single file then striping can also help improve the aggregate speed.

Some important rules:

            NEVER use a stripe count of more than 8 – usually no benefit, and it slows things down for others.
            ONLY stripe large files. Striping files <1GB will increase the load on the system with no real benefit.
            ONLY use the -c stripe count option for setstripe. Never change stripe index or stripe size!
            Try to set striping on directories – and keep large and small files separate so you can do this.

When you set striping on a directory it only applies to new files in that directory. To apply striping to old files you must copy (not move) them inside the directory that has striping set.

To set striping for a directory:

# The -c option specifies the number of stripes, 4 in this case
lfs setstripe -c 4 /project/department/myuser/bigfiles

To see striping settings for a directory or file:

lfs getstripe /project/department/myuser/bigfiles

To apply striping to an existing file in a directory:

# Set striping on the directory
lfs setstripe -c 4 /project/department/myuser/bigfiles
cd /project/department/myuser/bigfiles
# Copy the existing file to create a striped version
cp myfile myfile.striped
# Replace the original file with its striped copy
mv myfile.striped myfile

BE CAREFUL – make sure you are certain you don’t overwrite the wrong thing. It can be safer to create a new directory and copy files into it.
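The safer approach mentioned above, creating a new directory and copying files into it, might look like the following sketch. The helper name restripe_copy is hypothetical; it uses only the lfs setstripe -c option shown earlier, enforces the limit of 8 stripes, and never touches the original file.

```shell
#!/usr/bin/env bash
# Sketch only: restripe_copy is a hypothetical helper, not a BioHPC tool.
# Arguments: source file, new striped directory, stripe count.
restripe_copy() {
  local src=$1 dest_dir=$2 count=$3
  # Follow the rule above: never use a stripe count of more than 8.
  case $count in
    [1-8]) ;;
    *) echo "stripe count must be between 1 and 8" >&2; return 1 ;;
  esac
  [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
  mkdir -p "$dest_dir" || return 1
  # Striping set on a directory applies only to files created in it afterwards.
  lfs setstripe -c "$count" "$dest_dir" || return 1
  cp "$src" "$dest_dir/" || return 1
  # Verify the copy byte for byte; the original is left where it was.
  cmp -s "$src" "$dest_dir/$(basename "$src")"
}

# Example, using the paths from the commands above:
#   restripe_copy /project/department/myuser/myfile \
#                 /project/department/myuser/bigfiles 4
```

Because the original is kept, you can check the new copy with lfs getstripe and only then decide whether to remove the old file.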

How many Stripes?

The following general rules are appropriate for our storage system:

1	Default – any file that doesn’t fit the criteria below.
2	Moderate size files (2-10GB) that are read by 1-2 concurrent processes.
4	Moderate size files (2-10GB) that are read by 3+ concurrent processes regularly.
	Large files (10GB+) that are read by 1-2 concurrent processes.
8	Large files (10GB+) that are read by 3+ concurrent processes regularly.
	Any very large files (200GB+), to balance storage target usage.

Remember – performance is very good even without striping. You only have to worry about striping at all if you have a real need to increase performance, or are storing files that are 100s of GBs in size.
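The stripe count rules above can be written down as a small lookup. This is only a sketch: recommend_stripes is a hypothetical helper, sizes are given in whole GB, and a file of exactly 10GB is assumed to count as "large" because the table leaves that boundary open.

```shell
#!/usr/bin/env bash
# Sketch only: recommend_stripes is a hypothetical helper encoding the table above.
# Arguments: file size in whole GB, number of concurrent reader processes.
# Assumption: a file of exactly 10GB is treated as "large".
recommend_stripes() {
  local size_gb=$1 readers=$2
  if [ "$size_gb" -ge 200 ]; then
    echo 8
  elif [ "$size_gb" -ge 10 ]; then
    if [ "$readers" -ge 3 ]; then echo 8; else echo 4; fi
  elif [ "$size_gb" -ge 2 ]; then
    if [ "$readers" -ge 3 ]; then echo 4; else echo 2; fi
  else
    echo 1
  fi
}

recommend_stripes 5 2     # prints 2
recommend_stripes 50 1    # prints 4
recommend_stripes 250 1   # prints 8
```

The result could be passed straight to the -c option of lfs setstripe, for example lfs setstripe -c "$(recommend_stripes 50 1)" on a directory that holds files of that kind.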