Data Management on UL HPC Facility
Copyright (c) 2020-2021 UL HPC Team <email@example.com>
Author: Sarah Peter
Note: To make it clear where you should execute a certain command, the prompt is prefixed with the location, i.e.
(access)$> for commands on the cluster access/login nodes
(node)$> for commands on a cluster node inside a job
(laptop)$> for commands locally on your machine
The actual command comes only after this prefix.
- Access to the UL HPC clusters.
- Basic knowledge of the Linux command line.
- How can I check my quotas on file sizes and number of files?
- What do "soft quota", "hard quota" and "grace period" mean?
- How can I see how much space is used on a specific file system?
- How can I compute checksums?
- How can I verify checksums?
- How can I encrypt a file?
- How can I decrypt a file?
- How can I encrypt a directory of files?
- How can I read the files in an encrypted directory?
- Explain the df-ulhpc commands to check disk usage and quota status.
- Compute MD5 and SHA-256 checksums and understand the difference between them.
- Verify MD5 and SHA-256 checksums.
- Encrypt a single file with GPG.
- Decrypt a GPG-encrypted file.
- Encrypt a whole folder with gocryptfs.
- Mount a gocryptfs-encrypted folder to read and write files in it.
- Unmount a gocryptfs-encrypted folder to secure it from unauthorized access.
Connect to the cluster.
We provide the df-ulhpc command on the cluster login nodes, which displays current usage, soft quota, hard quota and grace period. Any directories that have exceeded the quota will be highlighted in red.
Check your file size quota with:
(access)$> df-ulhpc
You will see a list of directories on which quotas are applied, how much space you are currently using, your soft quota, hard quota and the grace period.
Once your usage reaches the soft quota you can still write data until the grace period expires (7 days) or you reach the hard quota. After you reach the end of the grace period or the hard quota, you have to reduce your usage to below the soft quota to be able to write data again.
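The rule above can be written down as a small decision function. This is only an illustrative sketch: the function name, its arguments and the exact boundary handling (for example, treating "reached" as greater-or-equal) are assumptions and not part of df-ulhpc. All sizes are plain integers in the same unit, and the grace period defaults to the 7 days mentioned above.

```shell
# usage: quota_state USED SOFT HARD DAYS_OVER_SOFT [GRACE_DAYS]
# Hypothetical helper, not provided on the cluster.
quota_state() (
    used=$1 soft=$2 hard=$3 days_over=$4 grace=${5:-7}
    if [ "$used" -lt "$soft" ]; then
        echo "ok"        # below the soft quota: no restriction
    elif [ "$used" -ge "$hard" ]; then
        echo "blocked"   # hard quota reached: no more writes
    elif [ "$days_over" -ge "$grace" ]; then
        echo "blocked"   # grace period expired: no more writes
    else
        echo "grace"     # over the soft quota, writes still allowed
    fi
)

quota_state 400 500 550 0   # ok
quota_state 520 500 550 3   # grace
quota_state 520 500 550 7   # blocked
quota_state 550 500 550 1   # blocked
```

In both "blocked" cases, writing becomes possible again only once the usage drops below the soft quota.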
Check your inode quota with:
(access)$> df-ulhpc -i
Check the free space on all file systems with:
(access)$> df -h
Check the free space on the current file system with:
(access)$> cd
(access)$> df -h .
(access)$> cd /mnt/isilon/projects
(access)$> df -h .
To see what directories are using your disk space and quota:
(access)$> cd
(access)$> ncdu
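If you prefer a non-interactive overview, for instance to save it to a file, a rough equivalent can be sketched with du and find. This assumes the GNU versions of both tools; the function names are made up for this example, and hidden first-level directories are not covered by the second function.

```shell
# Disk space per first-level subdirectory in KiB, largest first.
space_by_dir() {
    du -k --max-depth=1 "$1" | sort -rn
}

# Number of entries (files and directories) below each first-level
# subdirectory; these entries are what counts towards the inode quota.
inodes_by_dir() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        printf '%s\t%s\n' "$(find "$d" | wc -l)" "$d"
    done | sort -rn
}

# Show the ten largest entries of each kind for the current directory.
space_by_dir . | head -n 10
inodes_by_dir . | head -n 10
```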
Integrity of data files is critical for the verifiability of computational and lab-based analyses. The way to seal a data file's content at a point in time is to generate a checksum. A checksum is a small datum generated by running an algorithm, called a cryptographic hash function, on a file. As long as a data file does not change, the calculation of the checksum will always result in the same datum. If you recalculate the checksum and it is different from a past calculation, then you know the file has been altered or corrupted in some way.
Below are typical situations that call for checksum generation:
- A data file has been newly downloaded or received from a collaborator.
- You have copied data files to a new storage location, for instance you moved data from your local computer to the HPC to start an analysis.
- You want to create a snapshot of your data, for instance when you’re creating a supplementary material folder for a paper/report.
SHA is short for Secure Hash Algorithm. There are several versions of SHA that have been developed over time: SHA-1, SHA-2 and SHA-3. SHA-2 is a whole family of algorithms that create hash values of different lengths.
The most commonly used version and the one recommended by the National Institute of Standards and Technology (NIST) is SHA-256, which creates a hash value of 256 bits.
First, we need to start an interactive job and prepare some test data:
(access)$> si -t 01:00:00
(node)$> mkdir -p $SCRATCH/data_management
(node)$> cd $SCRATCH/data_management
(node)$> echo 'Happy secure computing!' > message.txt
We can create the SHA-256 checksum with the following command:
(node)$> sha256sum message.txt
40d61ef3ba32cc17f2c90db65e6c4d884d220b1999cbded7e80988541c9db11b  message.txt
Since we want to store the checksum, we should save it to a file:
(node)$> sha256sum message.txt > message.sha256
Given a data file and its checksum, you can verify the file against the checksum with the following command:
(node)$> sha256sum -c message.sha256
message.txt: OK
Let us change the file and see what happens to the checksum:
(node)$> echo "This file has been changed." >> message.txt
(node)$> sha256sum message.txt
01eed8eccae1728091ddc6b65fba5016a93757cab11f0c2d5c26b8e4d9321d11  message.txt
(node)$> sha256sum -c message.sha256
message.txt: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
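The same commands scale to a whole folder, which covers the "copied data to a new storage location" situation listed earlier. The following is a minimal sketch, assuming GNU coreutils; the two temporary directories and the file name SHA256SUMS are placeholders chosen for this example.

```shell
src=$(mktemp -d)   # stands in for the original data folder
dst=$(mktemp -d)   # stands in for the copy on the new storage location
echo 'Happy secure computing!' > "$src/message.txt"
mkdir "$src/results" && echo '1,2,3' > "$src/results/table.csv"

# Record one checksum per file. Relative paths allow the checksum file
# to be verified from wherever the folder ends up.
(cd "$src" && find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS)

# Copy the folder, including the checksum file.
cp -r "$src/." "$dst/"

# --quiet prints only the files that fail; the exit status is non-zero
# if any file does not match.
if (cd "$dst" && sha256sum --quiet -c SHA256SUMS); then
    echo "copy verified"
else
    echo "copy differs from the original"
fi
```

Keeping the checksum file inside the folder means it travels with the data, so the verification can be repeated after every further transfer.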
The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value. Although MD5 was initially designed to be used as a cryptographic hash function, it has been found to suffer from extensive vulnerabilities. It can still be used as a checksum to verify data integrity, but only against unintentional corruption.
(...) it should not be relied on if there is a chance that files have been purposefully and maliciously tampered. In the latter case, the use of a newer hashing tool such as sha256sum is recommended.
We can create the MD5 checksum with the following command:
(node)$> md5sum message.txt > message.md5
We can verify the file against the checksum in the same way as above:
(node)$> md5sum -c message.md5
message.txt: OK
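One visible difference between the two algorithms is the length of the hash value. As a quick illustration, the loop below hashes the same test message with both tools and counts the hexadecimal characters of each digest; since each hexadecimal character encodes 4 bits, this gives the 128 and 256 bits mentioned above.

```shell
for tool in md5sum sha256sum; do
    # Keep only the digest, i.e. the first field of the tool's output.
    digest=$(printf 'Happy secure computing!\n' | "$tool" | cut -d' ' -f1)
    echo "$tool: ${#digest} hex characters = $(( ${#digest} * 4 )) bits"
done
# md5sum: 32 hex characters = 128 bits
# sha256sum: 64 hex characters = 256 bits
```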
Encryption is an effective measure to protect sensitive data.
- Encryption keys and passphrases need to be kept safe and protected from unauthorised access.
- Losing your encryption key means losing your data.
- Ensure you have an off-site backup of critical data stored on the platform under encryption.
- (Disaster) recovery of encrypted data is not guaranteed to be viable, depending on internal consistency when the recovery snapshot is taken.
Note that any use of user-level encryption remains the responsibility of the user, who accepts any inherent risks, such as:
- loss of access to data due to loss of decryption password/keys
- data corruption due to encryption store corruption, or improper use of the encryption tools
We can encrypt our test file using GPG with the following command:
(node)$> gpg -c message.txt
Since you did not specify what to encrypt your file with, GPG will ask for a passphrase. Enter the same passphrase twice and make sure you remember it (in this case at least until the next step).
This command will also prompt GPG to generate a keyring, if you do not have one yet. The passphrase will be cached for the current SSH session.
This will create the encrypted file message.txt.gpg next to the unencrypted file, so let us delete the unencrypted file:
(node)$> rm message.txt
You can decrypt the file with the following command:
(node)$> gpg message.txt.gpg
gpg: CAST5 encrypted data
gpg: encrypted with 1 passphrase
gpg: WARNING: message was not integrity protected
You will see some output telling you that the file was encrypted with CAST5 and a warning that it was not integrity protected.
To avoid that warning and since CAST5 is an older algorithm, let us encrypt the file with a newer and stronger algorithm:
(node)$> rm message.txt.gpg
(node)$> gpg --cipher-algo AES256 -c message.txt
(node)$> rm message.txt
(node)$> gpg message.txt.gpg
gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase
Instead of using a passphrase, you can also encrypt files using an encryption key. You create an encryption key with GPG using the following command:
(node)$> gpg --gen-key
Keep in mind that DSA keys are deprecated, so you should generate an RSA key, ideally with 4096 bits. You will have to enter a couple of other details and then the key will be stored in the keyring.
You can list the contents of your keyring with
(node)$> gpg --list-keys
If you are going to transfer the data to someone else, however, you should encrypt it with the public key of the recipient and not with your own key. You can upload public keys to public repositories for easier sharing. Never share, upload or transfer private keys!
You can export (the public part of) your key to a file with:
(node)$> gpg --output jane-doe.key --armor --export firstname.lastname@example.org
Make sure to replace jane-doe with your name and use the email address that you specified when generating the key.
You can import other keys to your keyring using the following command:
(node)$> gpg --import jane-doe.key
To encrypt using a key, you must specify as recipient the email address associated with the key you want to use:
(node)$> gpg --encrypt --sign --recipient email@example.com message.txt
Gocryptfs is a modern implementation of an encryption overlay filesystem.
For more details on its inner workings, see the following:
- Security design documentation
- Threat model
- 2017 security audit, Audit report as PDF
- Gocryptfs source code
To use gocryptfs on the HPC platform you need to do the following steps:
1. Load the gocryptfs module from the modules system
(node)$> module load tools/gocryptfs
2. Create two folders
- dir.crypt, which will act as the storage for the encrypted files (let’s call it crypt)
- dir, which will present (on demand) the unencrypted view (let’s call it view)
(node)$> cd
(node)$> mkdir data_management
(node)$> cd data_management
(node)$> mkdir dir.crypt dir
Currently gocryptfs does not work well on the Lustre filesystem ($SCRATCH). You need to specify the additional option -noprealloc to use it on Lustre. It works well on SpectrumScale/GPFS ($HOME and project directories), though.
3. Initialize the crypt folder with a password
(node)$> gocryptfs -init dir.crypt
Choose a password for protecting your files.
Password:
Repeat:

Your master key is:

    e1ecdcf1-6bcebaa0-cbe6cfb8-8e27d4ad-
    acefb9d4-bd98de59-311d1898-31d7e4e4

If the gocryptfs.conf file becomes corrupted or you ever forget your password,
there is only one hope for recovery: The master key. Print it to a piece of
paper and store it in a drawer. This message is only printed once.

The gocryptfs filesystem has been created successfully.
You can now mount it using: gocryptfs dir.crypt MOUNTPOINT
(node)$> ls dir.crypt/
gocryptfs.conf  gocryptfs.diriv
(node)$> ls dir
On crypt store initialization, gocryptfs provides us with the master key that can be used to restore access to the data files, especially useful in case the password is lost.
Keep the master key safe and never store it unencrypted on the platform itself!
After initialization, the crypt store contains two internal configuration files:
- gocryptfs.conf is the global configuration for the crypt store, while
- gocryptfs.diriv is created per-directory for encryption of file names
Note that you should never modify (any) files within the crypt store.
4. Mount the crypt folder into the view folder
To be able to access/store data, the crypt store needs to be mounted in the view folder
- this can be done by supplying the initially set password, either on the command line or from a file (the passfile option)
- … or with the generated master key
- with the passfile option, it means that you have stored your password unencrypted on the filesystem - this is then a security risk!
- when using the master key mode, you should be in a full-node or exclusive job reservation such that there are no other users able to see the master key in the system
(node)$> gocryptfs dir.crypt dir
Password:
Decrypting master key
Filesystem mounted and ready.
(node)$> ls dir
5. Add files to the view folder
All your processing (new file/folder creation, modification and transfers) will happen in the view folder.
Once the crypt store is mounted in the view directory we can create files in the latter:
- any folder/file created in the unencrypted view will have a 1:1 counterpart in the crypt store
- the plain text message.txt file is stored in encrypted format as 5o_WSYN-Tn59W3vrPiHXEA in the underlying crypt store (file name metadata is encrypted as well)
- the same permissions applied to message.txt are also set for its encrypted counterpart
(node)$> echo "Happy secure computing" > dir/message.txt
(node)$> ls dir
message.txt
(node)$> ls dir.crypt/
5o_WSYN-Tn59W3vrPiHXEA  gocryptfs.conf  gocryptfs.diriv
6. Unmount the view folder
At the end of our processing, we use fusermount explicitly to unmount the encrypted overlay, so that the unencrypted view of your data is closed and data is flushed to the regular filesystem.
Note that you should always ensure that this happens before your job reservation expires.
(node)$> fusermount -u dir
(node)$> ls dir
(node)$> ls dir.crypt/
5o_WSYN-Tn59W3vrPiHXEA  gocryptfs.conf  gocryptfs.diriv
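One way to make sure the view is always unmounted, even if the processing step fails, is to wrap mount, processing and unmount in a single shell function. The following is only a sketch of that pattern: with_crypt_view is a hypothetical helper, and GOCRYPTFS and FUSERMOUNT are override variables introduced here for illustration (by default the real gocryptfs and fusermount commands are used, exactly as in the steps above).

```shell
# usage: with_crypt_view CRYPT_DIR VIEW_DIR COMMAND [ARGS...]
# The body runs in a subshell so that the EXIT trap fires as soon as
# the wrapped command has finished.
with_crypt_view() (
    crypt=$1 view=$2
    shift 2
    # Mount first; give up without running the command if this fails.
    "${GOCRYPTFS:-gocryptfs}" "$crypt" "$view" || exit 1
    # From here on, unmount on exit, whether the command succeeds or not.
    trap '"${FUSERMOUNT:-fusermount}" -u "$view"' EXIT
    "$@"
)

# Example (asks for the password, lists the view, then unmounts):
#   with_crypt_view dir.crypt dir ls -l dir
```

The function returns the exit status of the wrapped command. It does not protect against the job being killed when the reservation expires, so the processing still has to finish in time.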
Other important details
- Data stored in a crypt store should not be used concurrently (e.g. by multiple users at the same time). The special option -sharedstorage exists for this use-case, but is not guaranteed to work for all applications.
- (Parallel) applications run through srun on the Iris cluster cannot "see" the unencrypted view folder as they are run in a different context. This is also the case if you use srun --jobid to attach your terminal to a running job.
- The -init command has an option -plaintextnames to preserve file names.
Gocryptfs store password management
You can change the password of an existing crypt store with the -passwd option:
(node)$> gocryptfs -passwd dir.crypt/
Password: [your current password here]
Decrypting master key
Please enter your new password.
Password: [your new password here]
Repeat: [your new password here]
Password changed.
Note that the master key does not change.
For running batch processing on a gocryptfs-based folder, you can provide the decryption password through an external application with the -extpass option:
(node)$> gocryptfs -extpass "echo foobar" dir.crypt dir
Reading password from extpass program
Decrypting master key
Filesystem mounted and ready.
Note that this means that another application stores/has access to the password - this is then a security risk!
- Wikipedia - MD5
- Wikipedia - Md5sum
- Wikipedia - SHA-2
- LCSB How-to Card on checksums
- LCSB How-to Card on encryption
- UL HPC blog post on sensitive data encryption
Many thanks to Valentin Plugaru for the initial version of the gocryptfs part and the LCSB data stewards and R3 teams.