Data Management on UL HPC Facility

 Copyright (c) 2020 UL HPC Team <hpc-sysadmins@uni.lu>

Author: Sarah Peter

Note: To make it clear where you should execute a certain command, the prompt is prefixed with the location, i.e.

  • (access)$> for commands on the cluster access/login nodes
  • (node)$> for commands on a cluster node inside a job
  • (laptop)$> for commands locally on your machine

The actual command comes only after this prefix.

Overview

Requirements

  • Access to the UL HPC clusters.
  • Basic knowledge of the Linux command line.

Questions

  • How can I check my quotas on file sizes and number of files?
  • What do "soft quota", "hard quota" and "grace period" mean?
  • How can I see how much space is used on a specific file system?
  • How can I compute checksums?
  • How can I verify checksums?
  • How can I encrypt a file?
  • How can I decrypt a file?
  • How can I encrypt a directory of files?
  • How can I read the files in an encrypted directory?

Objectives

  • Explain the df and df-ulhpc commands to check disk usage and quota status.
  • Compute MD5 and SHA-256 checksums and understand the difference between them.
  • Verify MD5 and SHA-256 checksums.
  • Encrypt a single file with GPG.
  • Decrypt a GPG-encrypted file.
  • Encrypt a whole folder with gocryptfs.
  • Mount a gocryptfs-encrypted folder to read and write files in it.
  • Unmount a gocryptfs-encrypted folder to secure it from unauthorized access.

Preliminaries

Connect to the cluster.

Quotas

We provide the df-ulhpc command on the cluster login nodes, which displays current usage, soft quota, hard quota and grace period. Any directories that have exceeded the quota will be highlighted in red.

Check your file size quota with:

(access)$> df-ulhpc

You will see a list of directories on which quotas are applied, how much space you are currently using, your soft quota, hard quota and the grace period.

Once your usage reaches the soft quota you can still write data until the grace period expires (7 days) or you reach the hard quota. After you reach the end of the grace period or the hard quota, you have to reduce your usage to below the soft quota to be able to write data again.
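
The rules above can be summarised as a small decision function. This is only an illustrative sketch: quota_state is a hypothetical helper and not part of df-ulhpc. It assumes that usage and both quotas are given in the same unit and that you already know how many grace days remain.

```shell
# Hypothetical helper (not part of df-ulhpc): classify write access from
# usage, soft quota, hard quota (all in the same unit) and the number of
# grace days left.
quota_state() {
    usage=$1
    soft=$2
    hard=$3
    grace_days_left=$4
    if [ "$usage" -ge "$hard" ]; then
        echo "blocked: hard quota reached"
    elif [ "$usage" -lt "$soft" ]; then
        echo "ok: below soft quota"
    elif [ "$grace_days_left" -gt 0 ]; then
        echo "warning: over soft quota, $grace_days_left grace day(s) left"
    else
        echo "blocked: grace period expired"
    fi
}

quota_state 450 500 550 7   # below the soft quota
quota_state 520 500 550 3   # over the soft quota, still writable
quota_state 520 500 550 0   # grace period is over
quota_state 550 500 550 7   # hard quota reached
```

In both "blocked" states, writing only becomes possible again once the usage is back below the soft quota.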

Check your inode quota with:

(access)$> df-ulhpc -i

Check the free space on all file systems with:

(access)$> df -h

Check the free space on the current file system with:

(access)$> cd
(access)$> df -h .
(access)$> cd /mnt/isilon/projects
(access)$> df -h .
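
If you want to use the same information in a script, the fill level can be extracted from the df output. The following is a minimal sketch; fs_used_percent is a hypothetical helper name, and it assumes that the file system name contains no spaces (otherwise the columns shift).

```shell
# Print the usage percentage of the file system that holds a given path
# (default: the current directory).
# -P selects the POSIX output format, so the data is always on line 2
# and the percentage is always the fifth column.
fs_used_percent() {
    df -P "${1:-.}" | awk 'NR == 2 { print $5 }'
}

fs_used_percent .
fs_used_percent /
```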

To see what directories are using your disk space and quota:

(access)$> cd
(access)$> ncdu
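
ncdu is interactive. For a non-interactive overview, for example inside a script, du and find can give similar information. The helper names below are hypothetical, and the sketch assumes the GNU versions of du and sort that are usual on Linux.

```shell
# List the size of each entry directly below a directory, smallest first.
# -d 1 limits the depth; sort -h understands suffixes such as K, M and G.
dir_sizes() {
    du -h -d 1 "${1:-.}" 2>/dev/null | sort -h
}

# Count the files and directories below a path (including the path itself).
# Each of them uses one inode, so this approximates your inode usage.
inode_count() {
    find "${1:-.}" | wc -l
}

dir_sizes .
inode_count .
```

The count is an approximation: file names that contain newlines are counted more than once.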

Checksums

Integrity of data files is critical for the verifiability of computational and lab-based analyses. The way to seal a data file's content at a point in time is to generate a checksum. A checksum is a small-sized datum generated by running an algorithm, called a cryptographic hash function, on a file. As long as a data file does not change, the calculation of the checksum will always result in the same datum. If you recalculate the checksum and it is different from a past calculation, then you know the file has been altered or corrupted in some way.

Below are typical situations that call for checksum generation:

  • A data file has been newly downloaded or received from a collaborator.
  • You have copied data files to a new storage location, for instance you moved data from local computer to HPC to start an analysis.
  • You want to create a snapshot of your data, for instance when you’re creating a supplementary material folder for a paper/report.

LCSB How-to Cards

MD5

The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value. Although MD5 was initially designed to be used as a cryptographic hash function, it has been found to suffer from extensive vulnerabilities. It can still be used as a checksum to verify data integrity, but only against unintentional corruption.

Wikipedia - MD5

(...) it should not be relied on if there is a chance that files have been purposefully and maliciously tampered. In the latter case, the use of a newer hashing tool such as sha256sum is recommended.

Wikipedia - Md5sum

First, we need to start an interactive job and prepare some test data:

(access)$> si
(node)$> mkdir -p $SCRATCH/data_management
(node)$> cd $SCRATCH/data_management
(node)$> echo 'Happy secure computing!' > message.txt

We can create the MD5 checksum with the following command:

(node)$> md5sum message.txt
05babfa77d503a60561ceab31d767b34  message.txt

Since we want to store the checksum, we should save it to a file:

(node)$> md5sum message.txt > message.md5

Given a data file and its checksum, you can verify the file against the checksum with the following command:

(node)$> md5sum -c message.md5
message.txt: OK

Let us change the file and see what happens to the checksum:

(node)$> echo "This file has been changed." >> message.txt
(node)$> md5sum message.txt
ebaf49a0b2482da084bca535db0423ba  message.txt
(node)$> md5sum -c message.md5
message.txt: FAILED
md5sum: WARNING: 1 computed checksum did NOT match

SHA

SHA is short for Secure Hash Algorithm. There are several versions of SHA that have been developed over time: SHA-1, SHA-2 and SHA-3. SHA-2 is a whole family of algorithms that create hash values of different lengths.

The most commonly used version and the one recommended by the National Institute of Standards and Technology (NIST) is SHA-256, which creates a hash value of 256 bits.
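
To see the different hash lengths for yourself, you can measure the digests printed by the standard checksum tools. This is a small sketch; digest_bits is a hypothetical helper name. It relies on the fact that these tools print the digest in hexadecimal, where each character encodes 4 bits.

```shell
# Print the digest length in bits for a checksum tool such as sha256sum.
# The tool hashes empty input here; the digest length does not depend on
# the size of the input.
digest_bits() {
    digest=$("$1" < /dev/null | cut -d' ' -f1)
    echo $(( ${#digest} * 4 ))
}

for tool in md5sum sha1sum sha256sum sha512sum; do
    echo "$tool: $(digest_bits "$tool") bits"
done
```

This prints 128 bits for md5sum, 160 for sha1sum, 256 for sha256sum and 512 for sha512sum.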

We can create the SHA-256 checksum with the following command:

(node)$> sha256sum message.txt > message.sha256

We can verify the file against the checksum in the same way as above:

(node)$> sha256sum -c message.sha256
message.txt: OK
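
The same approach extends to whole directories, for example when you copy a data set to a new storage location. The sketch below is one possible way to do it: make_manifest, verify_manifest and the manifest name SHA256SUMS are choices made for this illustration, and the -print0, -z, -0 and -r options assume the GNU tools that are usual on Linux.

```shell
# Record SHA-256 checksums for every file below a directory.
# Paths are stored relative to that directory, so the manifest can be
# checked from any copy of it. The manifest itself is excluded.
make_manifest() {
    ( cd "$1" && find . -type f ! -name SHA256SUMS -print0 \
        | sort -z | xargs -0 -r sha256sum > SHA256SUMS )
}

# Verify a directory against the manifest stored inside it.
# --quiet prints only the files that fail; the exit status is non-zero
# if any file is missing or has changed.
verify_manifest() {
    ( cd "$1" && sha256sum --quiet -c SHA256SUMS )
}

# Usage (the paths are placeholders):
#   make_manifest   /path/to/original
#   cp -r /path/to/original /path/to/copy
#   verify_manifest /path/to/copy && echo "copy is intact"
```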

Encryption

Encryption is an effective measure to protect sensitive data.

IMPORTANT NOTICE:

  • Encryption keys and passphrases need to be kept safe and protected from unauthorised access.
  • Losing your encryption key means losing your data.
  • Ensure you have an off-site backup of critical data stored on the platform under encryption.
  • (Disaster) recovery of encrypted data is not guaranteed to be viable, depending on internal consistency when the recovery snapshot is taken.

Note that any use of user-level encryption remains the responsibility of the user, who accepts any inherent risks, such as:

  • loss of access to data due to loss of decryption password/keys
  • data corruption due to encryption store corruption, or improper use of the encryption tools

GPG

We can encrypt our test file using GPG with the following command:

(node)$> gpg -c message.txt

Since you did not specify what to encrypt your file with, GPG will ask for a passphrase. Enter the same passphrase twice and make sure you remember it (in this case at least until the next step).

This command will also prompt GPG to generate a keyring, if you do not have one yet. The passphrase will be cached for the current SSH session.

This will create the encrypted file message.txt.gpg next to the unencrypted file, so let us delete the unencrypted file:

(node)$> rm message.txt

You can decrypt the file with the following command:

(node)$> gpg message.txt.gpg
gpg: CAST5 encrypted data
gpg: encrypted with 1 passphrase
gpg: WARNING: message was not integrity protected

You will see some output telling you that the file was encrypted with CAST5 and a warning that it was not integrity protected.

To avoid that warning and since CAST5 is an older algorithm, let us encrypt the file with a newer and stronger algorithm:

(node)$> rm message.txt.gpg
(node)$> gpg --cipher-algo AES256 -c message.txt
(node)$> rm message.txt
(node)$> gpg message.txt.gpg
gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase
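
In the steps above we deleted the unencrypted file right after encrypting it. With real data it can be worth checking first that the encrypted file really decrypts to the original content. The following is a minimal sketch; decrypts_to_original is a hypothetical helper name. The decrypted data is only passed through a pipe and never written to disk.

```shell
# Hypothetical helper: succeed only if ENCRYPTED decrypts to exactly the
# bytes of ORIGINAL. gpg asks for the passphrase unless it is cached.
decrypts_to_original() {
    original=$1
    encrypted=$2
    # A failed decryption produces no output, which would look the same
    # as an empty original, so empty files are rejected up front.
    [ -s "$original" ] || return 1
    # --decrypt writes the plain text to standard output;
    # cmp -s compares it silently with the original file.
    gpg --quiet --decrypt "$encrypted" | cmp -s - "$original"
}

# Usage: delete the plain file only after the check has succeeded.
#   decrypts_to_original message.txt message.txt.gpg && rm message.txt
```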

Optional

Instead of using a passphrase, you can also encrypt files using an encryption key. You create an encryption key with GPG using the following command:

(node)$> gpg --gen-key

Keep in mind that DSA keys are deprecated, so you should generate an RSA key, ideally with 4096 bits. You will have to enter a couple of other details and then the key will be stored in the keyring.

You can list the contents of your keyring with

(node)$> gpg --list-keys

If you are going to transfer the data to someone else, however, you should encrypt it with the public key of the recipient and not with your own key. You can upload public keys to public repositories for easier sharing.

You can export your key to a file with:

(node)$> gpg --output jane-doe.key --armor --export jane.doe@uni.lu

Make sure to replace jane-doe with your name and use the email address that you specified when generating the key.

You can import other keys to your keyring using the following command:

(node)$> gpg --import jane-doe.key

To encrypt using a key, you must specify as recipient the email address associated with the key you want to use:

(node)$> gpg --encrypt --sign --recipient jane.doe@uni.lu message.txt

Gocryptfs

Gocryptfs is a modern implementation of an encryption overlay filesystem.

For more details on its inner workings, see the gocryptfs project documentation.

To use gocryptfs on the HPC platform you need to do the following steps:

1. Load the gocryptfs module from the modules system

(node)$> module load tools/gocryptfs

2. Create two folders

  • dir.crypt, which will act as the storage for the encrypted files (let’s call it crypt)
  • dir, which will present (on demand) the unencrypted view (let’s call it view)

(node)$> cd $SCRATCH/data_management
(node)$> mkdir dir.crypt dir

3. Initialize the crypt folder with a password

(node)$> gocryptfs -init dir.crypt
Choose a password for protecting your files.
Password: 
Repeat: 

Your master key is:

e1ecdcf1-6bcebaa0-cbe6cfb8-8e27d4ad-
acefb9d4-bd98de59-311d1898-31d7e4e4

If the gocryptfs.conf file becomes corrupted or you ever forget your password, there is only one hope for recovery: The master key. Print it to a piece of paper and store it in a drawer. This message is only printed once.

The gocryptfs filesystem has been created successfully.
You can now mount it using: gocryptfs dir.crypt MOUNTPOINT

(node)$> ls dir.crypt/
gocryptfs.conf  gocryptfs.diriv
(node)$> ls dir

On crypt store initialization, gocryptfs provides us with the master key that can be used to restore access to the data files, especially useful in case the password is lost.

Keep the master key safe and never store it unencrypted on the platform itself!

After initialization, the crypt store contains two internal configuration files:

  • gocryptfs.conf is the global configuration for the crypt store, while

  • gocryptfs.diriv is created per-directory for encryption of file names

Note that you should never modify (any) files within the crypt store.

4. Mount the crypt folder into the view folder

To be able to access/store data, the crypt store needs to be mounted in the view folder

  • this can be done by supplying the initially set password, either on the command line or from a file with -passfile option
  • … or with the generated master key, with the -masterkey option
  • using the -passfile option means that you have stored your password unencrypted on the filesystem - this is then a security risk!
  • when using the master key mode, you should be in a full-node or exclusive job reservation so that no other users can see the master key in the system

(node)$> gocryptfs dir.crypt dir
Password: 
Decrypting master key
Filesystem mounted and ready.
(node)$> ls dir

5. Add files to the view folder

All your processing (new file/folder creation, modification and transfers) will happen in the view folder.

Once the crypt store is mounted in the view directory we can create files in the latter:

  • any folder/file created in the unencrypted view will have a 1:1 correspondent in the crypt store
  • the plain text message.txt file is stored in encrypted format as 5o_WSYN-Tn59W3vrPiHXEA in the underlying crypt store (file name metadata is encrypted as well)
  • the same permissions applied on message.txt are also set for its encrypted correspondent file

(node)$> echo "Happy secure computing" > dir/message.txt
(node)$> ls dir
message.txt
(node)$> ls dir.crypt/
5o_WSYN-Tn59W3vrPiHXEA  gocryptfs.conf  gocryptfs.diriv

6. Unmount the view folder

At the end of our processing, we use fusermount to explicitly unmount the encrypted overlay, so that the unencrypted view of your data is closed and the data is flushed to the regular filesystem.

Note that you should always ensure that this happens before your job reservation expires.

(node)$> fusermount -u dir
(node)$> ls dir
(node)$> ls dir.crypt/
5o_WSYN-Tn59W3vrPiHXEA  gocryptfs.conf  gocryptfs.diriv
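
To make it harder to forget the unmount step, mounting, processing and unmounting can be wrapped in one function. This is only a sketch under stated assumptions: with_view and my_analysis.sh are hypothetical names, gocryptfs will still ask for the password when mounting, and the wrapper does not help if the job is killed, so the processing must still finish before the reservation expires.

```shell
# Hypothetical wrapper: mount CRYPT on VIEW, run a command, and unmount
# again whether or not the command succeeded.
with_view() {
    crypt=$1
    view=$2
    shift 2
    gocryptfs "$crypt" "$view" || return 1
    # Run the remaining arguments as the command and remember its status.
    cmd_status=0
    "$@" || cmd_status=$?
    fusermount -u "$view" || echo "warning: could not unmount $view" >&2
    return "$cmd_status"
}

# Usage (my_analysis.sh is a placeholder for your own script):
#   with_view dir.crypt dir ./my_analysis.sh dir/message.txt
```

The wrapper returns the exit status of the command it ran, so a failed analysis is still reported after the view has been unmounted.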

Other important details

  • Data stored in a crypt store should not be used concurrently (e.g. by multiple users at the same time). The special option -sharedstorage exists for this use-case, but is not guaranteed to work for all applications.

  • (Parallel) Applications run through srun on the Iris cluster cannot "see" the unencrypted view folder as they are run in a different context.

This is also the case if you use sjoin or srun --jobid to attach your terminal to a running job.

Gocryptfs store password management

You can change the password of an existing crypt store with the -passwd option:

(node)$> gocryptfs -passwd dir.crypt/
Password: [your current password here]
Decrypting master key
Please enter your new password.
Password: [your new password here]
Repeat: [your new password here]
Password changed.

Note that the master key does not change.

For running batch processing on a gocryptfs-based folder, you can provide the decryption password through an external application with the -extpass option:

(node)$> gocryptfs -extpass "echo foobar" dir.crypt dir
Reading password from extpass program
Decrypting master key
Filesystem mounted and ready.

Note that this means that another application stores/has access to the password - this is then a security risk!

Acknowledgements

Many thanks to Valentin Plugaru for the initial version of the gocryptfs part and the LCSB data stewards and R3 teams.