Working with data
Metacentrum wiki is deprecated after March 2023
Dear users, due to the integration of MetaCentrum into the e-INFRA CZ infrastructure (https://docs.e-infra.cz), the user documentation will change its format and site.
The current wiki pages won't be updated after the end of March 2023. They will, however, be kept for a few months for backwards reference.
The new documentation resides at https://docs.metacentrum.cz.
This topical guide provides the most important information about data manipulation, storage and archiving in MetaCentrum.
Data storage: managing large numbers of files
Keeping a large number of files (a million or more) in user home directories is problematic, since it significantly increases the time needed to back up the home directories, as well as to manipulate them for any other purpose. At such quantities the number of files, not their size, is the limiting factor. To keep service operations sustainable, there is a quota on the number of files. We encourage users who exceed the quota either to remove the data or to pack them into suitably sized chunks. In our experience, directories with millions of files often result from a job gone amok and as such present dead weight. Users who really need to store large numbers of files can contact user support at email@example.com and ask for an exception.
Check your quota of number of files
You can see the state of your quotas at
- the table that appears after you login on a frontend
- your quota overview at MetaVO web
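If you suspect a particular directory, you can also count its files yourself from a frontend. A minimal sketch (the directory path is an example; substitute your own):

```shell
# count regular files under a directory (recursively)
DIR=${DIR:-$HOME}   # replace with the directory you suspect, e.g. ~/results
find "$DIR" -type f | wc -l
```

Running this over a very large tree can itself take a while; prefer running it on the storage path from a job if the directory is huge.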
Remove, tar, or tar and archive the excess files
If the data is junk, remove it either within a single command, e.g.
(BUSTER)melounova@tarkil:~$ ssh firstname.lastname@example.org rm -rf ~/junk_dir
or wrap the command into a batch job to avoid waiting for the command to end:
(BUSTER)melounova@tarkil:~$ qsub -l walltime=24:00:00 remove_junk_dir.sh
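For completeness, a minimal sketch of what a script like remove_junk_dir.sh might contain (the job name, resources and directory path are placeholders, not a prescribed form):

```shell
#!/bin/sh
#PBS -N remove_junk_dir
#PBS -l select=1:ncpus=1
#PBS -l walltime=24:00:00
# delete the unwanted directory; adjust the path to your own junk
rm -rf ~/junk_dir
```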
If the data are not junk, pack them into larger chunks using the tar command, either from the command line or from within a batch job:
(BUSTER)melounova@tarkil:~$ ssh email@example.com tar -cf not_junk_dir.tar ~/not_junk_dir
(BUSTER)melounova@tarkil:~$ qsub -l walltime=24:00:00 tar_my_files.sh
If you have enough space in your storage directories, you can keep the packed data there. However, we encourage users to archive data that are of permanent value, large, and not accessed frequently.
If you really need to keep large numbers of files in your home directory, contact us at user support mail firstname.lastname@example.org.
Basic commands for data transfer
Basic commands for data transfer are scp, wget and sftp.
Windows alert: Windows users will need an application that emulates the Linux commands for data transfer. See How to ssh from Windows.
scp works in pretty much the same way as the normal cp command; it additionally allows you to copy files between different machines.
marenka@home_PC:~$ scp my_file.txt email@example.com: # copy file "my_file.txt" to the home folder of user "jenicek" on the frontend "skirit.metacentrum.cz"
marenka@home_PC:~$ scp -r my_dir firstname.lastname@example.org: # as above; copy directory "my_dir" together with all subdirectories
marenka@home_PC:~$ scp -r email@example.com:~/results . # from jenicek's home on skirit, copy directory "results" to marenka's local PC
marenka@home_PC:~$ scp -r firstname.lastname@example.org:~/../fsbrno2/home/jenicek/results . # copy jenicek's folder "results" directly from /storage/brno2 (see section below for explanation of the path and server address)
An alternative way to download data is the wget command. wget works only if the file is available via the ftp or http(s) network protocols, typically when it is a downloadable file on some server. wget is faster but less safe than scp, so it may be the method of choice if you need to download a larger amount of data from the Internet and privacy is not an issue.
ssh email@example.com # login to a frontend; replace "jenicek" by your real username
firstname.lastname@example.org:~$ mkdir data; cd data # create and enter directory "data" where the data will be downloaded to
email@example.com:~/data$ wget https://www.someServer.org/someData.zip # download file "someData.zip" from the server (= webpage) "https://www.someServer.org"
With wget you can only transfer data to MetaCentrum machines; it is of no use if you want to transfer data from MetaCentrum.
sftp is just another protocol for transferring data. Contrary to scp it is interactive and slower, and apart from copying it also enables the user to manipulate files and directories on the remote side. We recommend using scp if you only need to copy data.
Windows users need an SFTP client; we recommend the WinSCP application. Keep in mind that in Step 1 you have to fill in the chosen NFS4 server as the target instead of a frontend. Make sure you have selected the SFTP file protocol, too.
Linux users can just open a terminal and use the sftp command as shown below. More about the sftp command can be found at this external link.
sftp 'META username'@target_NFS4_server # Login
help # Shows available commands
get target_file # Downloads target file to your local system
get -r target_directory # Downloads target directory to your local system
put target_file # Uploads target file to server
put -r target_directory # Uploads target directory to server
There is a bug affecting Ubuntu 14.04+ concerning the recursive copy command put -r. If put -r fails, create the target directory on the server first to work around the issue.
The rsync command is a more advanced and versatile copying tool. It enables the user to synchronize the content of two directories in a more efficient way than scp, because rsync copies only the differences between the directories. Therefore it is often used as a tool for regular backups.
Copy directory data to archive:
$ rsync -zvh /storage/brno2/home/melounova/data /storage/du-cesnet/home/melounova/VO_metacentrum-tape_tape
Copy only the content of directory data to archive:
$ rsync -zvh /storage/brno2/home/melounova/data/ /storage/du-cesnet/home/melounova/VO_metacentrum-tape_tape
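The two commands above differ only in the trailing slash, which rsync treats specially: without it the directory itself is copied, with it only its contents. A small local demonstration of this behaviour (the paths are throwaway examples; on MetaCentrum you would use the /storage paths shown above):

```shell
# demonstrate rsync's trailing-slash semantics with local directories
mkdir -p /tmp/rsdemo/data /tmp/rsdemo/dst1 /tmp/rsdemo/dst2
touch /tmp/rsdemo/data/file.txt
rsync -a /tmp/rsdemo/data  /tmp/rsdemo/dst1   # copies the directory itself: dst1/data/file.txt
rsync -a /tmp/rsdemo/data/ /tmp/rsdemo/dst2   # copies only the contents: dst2/file.txt
```

Adding -n (dry run) to any rsync command prints what would be transferred without copying anything, which is useful to check before a large backup run.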
Data transfer: moderate amount of data (<=1000 files, <= 100 GB)
A moderate amount of data (hundreds of individual files and/or less than 100 GB) can be transferred to/from MetaCentrum machines in a straightforward way.
The server you log in to is one of the frontends.
The path to your destination/source data is the same path you see when you are logged in on a frontend.
melounova@home_PC:~$ scp firstname.lastname@example.org:/storage/brno2/home/melounova/foo . # copy file "foo" from "brno2" storage through frontend skirit to my local PC
Data transfer: large amount of data (>1000 files, > 100 GB)
Data transfer between storages and PC, principles
When transferring large amounts of data, we ask users to avoid the frontends. Transfers of large data can overload a frontend and cause slowdowns, which is inconvenient for other users.
For example, the command
melounova@home_PC:~$ scp -r email@example.com:/storage/brno2/home/melounova/dir-with-thousands-files .
routes all the data through the frontend skirit. The data are not stored on the frontend, but the transfer loads its CPUs and RAM. Therefore, for large data it is better to access the data storages (NFS4 servers) directly.
The direct-access-equivalent to the command above is
melounova@home_PC:~$ scp -r firstname.lastname@example.org:~/../fsbrno2/home/melounova/dir-with-thousands-files .
Why do I log in to brno6, if I want to access brno2?
As hardware is changing, the user data are moved to new disk fields with new names. For convenience the old names, such as "brno2" are kept as symlinks, so that users don't need to revise all their scripts and aliases every time there is a change. When working from a frontend, everything remains the same no matter the changes in the background. For example, the brno2 still exists as a symlink, although the original brno2-hardware was replaced and the data now reside physically on brno6.
Since direct access avoids the frontend, you cannot use the symbolic links; you need to use real server names and correct paths. Although these can be figured out from the directory tree, for convenience the following table lists the storages, the server names and the corresponding paths.
| Which /storage | server name | path to user homes | example: how to copy from this storage to local PC |
| --- | --- | --- | --- |
In case something does not work as expected or some storage is missing from this table, write us at email@example.com.
Data transfer between storages using scp
If you want to move a large amount of data between storages, the setup is similar to copying data between your PC and a storage. The only difference is that you cannot access storages interactively (see Working with data), and therefore the scp command has to be passed as an argument to the ssh command.
For example, copy file foo from plzen1 to your home at brno2:
ssh USERNAME@storage-plzen1.metacentrum.cz "scp foo storage-brno6.metacentrum.cz:~/../fsbrno2/home/USERNAME/"
If you are already logged on a frontend, you can simplify the command to:
ssh storage-plzen1 "scp foo storage-brno6:~/../fsbrno2/home/USERNAME/"
The scp-command examples shown above will run only until you either disconnect or the validity of your Kerberos ticket expires. For longer-lasting copy operations, it is a good idea to submit the scp command within a job. Prepare a trivial batch script called e.g. copy_files.sh
#!/bin/sh
#PBS -N copy_files
#PBS -l select=1:ncpus=1:scratch_local=1gb
#PBS -l walltime=15:00:00
ssh storage-plzen1 "scp foo storage-brno6:~/../fsbrno2/home/USERNAME/"
and submit it as:
qsub copy_files.sh
Data transfer between storages using rsync
Another option for moving data between storages is the rsync command.
For example, to move all your data from plzen1 to brno12-cerit:
(BUSTER)USERNAME@skirit:~$ ssh storage-plzen1 "rsync -avh ~ storage-brno12-cerit:~/home_from_plzen1/"
To move only a selected directory:
(BUSTER)USERNAME@skirit:~$ ssh storage-plzen1 "rsync -avh ~/my_dir storage-brno12-cerit:~/my_dir_from_plzen1/"
You can wrap the rsync command into a job, too.
#!/bin/sh
#PBS -N copy_files
#PBS -l select=1:ncpus=1:scratch_local=1gb
#PBS -l walltime=15:00:00
ssh storage-plzen1 "rsync -avh ~ storage-brno12-cerit:~/home_from_plzen1/"
If you then look at the output of the running job, you can check how the data transfer proceeds.
USERNAME@NODE:~$ tail -f /var/spool/pbs/spool/JOB_ID.meta-pbs.metacentrum.cz.OU
Other ways to access /storage directly
Selected programs for data manipulation can be run directly at the NFSv4 storage server through SSH. On the other hand, these operations can easily overload the NFSv4 server; if you plan massive file moves, please contact us in advance.
- Apart from the cerit NFS4 servers, there is no shell available on storage servers, so typing simply
ssh user@NFS4.storage.cz will not work (you can log in, but you will be immediately logged out). Instead, use the construction
ssh user@NFS4.storage.cz command.
- It is not possible to run programs on a storage volume. No computation should be run at the NFSv4 server.
- When copying files with dd, set the block size (the bs parameter) to at least 1M (compared with the default of 512 bytes); operations will be faster.
On storage servers, only a limited set of commands is available. For example, to list the content of your home directory on the remote machine:
ssh USERNAME@storage-brno6.metacentrum.cz ls -l
Full path can be used as well:
ssh USERNAME@storage-brno6.metacentrum.cz ls -l /home/USERNAME
Mount storage on local station
For more advanced users, there is also the possibility to mount the data storages locally. The NFS4 servers can then be accessed in the same way as local disk. Follow the tutorial in Mounting_data_storages_on_local_station to learn how to mount the storages locally.
There are three types of data storage offered by MetaCentrum:
| Storage type | Basic description | Typical usage |
| --- | --- | --- |
| Scratch storages | Fast storages with minimum data capacity | Working with data during computations |
| Disk arrays | /storage volumes in MetaCentrum | Data storing between computations |
| Hierarchical storages | Storages with massive data capacity | Data archiving |
Scratch storages are accessible via the scratch directory on computational nodes. Use these storages during computations only. The batch script should clean up the scratch after the job is done, or, if data are left in the scratch (as happens when the job fails or is killed), they should be removed manually – see Beginners guide. Data on scratch storages are automatically deleted after 14 days.
Disk arrays are several connected hard drives and are accessible via /storage directories. Files are stored on multiple drives, which guarantees higher I/O data speed and reliability. Use disk arrays for preparing data and storing data between jobs.
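To see how full a particular disk array is, the standard df command works on any mounted /storage path from a frontend. A small sketch (the volume name is an example; the /tmp fallback only keeps the snippet runnable on machines without /storage):

```shell
# show capacity, usage and free space of a storage volume
VOLUME=/storage/brno2            # example volume; use any /storage path you have access to
[ -d "$VOLUME" ] || VOLUME=/tmp  # fallback so the snippet also runs outside MetaCentrum
df -h "$VOLUME"
```

Note that df reports the whole volume, not your personal quota; for quota usage see the frontend login table or the MetaCentrum portal.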
| NFS4 server | directory | capacity | back-up policy | alternative server names in Perun / note |
| --- | --- | --- | --- | --- |
| storage-brno3-cerit.metacentrum.cz | /storage/brno3-cerit/ | WILL BE decommissioned | 2 | data moved to /storage/brno12-cerit/ |
| storage-brno4-cerit-hsm.metacentrum.cz | /storage/brno4-cerit-hsm/ | decommissioned | | data archived in /storage/brno1-cerit/ |
| storage-brno5-archive.metacentrum.cz | /storage/brno5-archive/ | decommissioned | 3 | nfs.du3.cesnet.cz |
| storage-brno6.metacentrum.cz | /storage/brno6/ | decommissioned | 2 | |
| storage-brno7-cerit.metacentrum.cz | /storage/brno7-cerit/ | decommissioned | 2 | data archived in /storage/brno1-cerit/ |
| storage-brno8.metacentrum.cz | /storage/brno8/ | decommissioned | 3 | formerly /storage/ostrava1/, data moved to /storage/brno2/home/USERNAME/brno8 |
| storage-brno9-ceitec.metacentrum.cz | /storage/brno9-ceitec/ | decommissioned | 3 | storage-ceitec1.ncbr.muni.cz – for NCBR CEITEC |
| storage-brno10-ceitec-hsm.metacentrum.cz | /storage/brno10-ceitec-hsm/ | decommissioned | 3 | dedicated to NCBR CEITEC |
| storage-brno11-elixir.metacentrum.cz | /storage/brno11-elixir/ | 313 TB | 2 | dedicated to ELIXIR-CZ, storage2.elixir-czech.cz |
| storage-brno12-cerit.metacentrum.cz | /storage/brno12-cerit/ | 3.4 PB | 2 | ces-hsm.cerit-sc.cz, home directory in nfs4/home/$USER |
| storage-jihlava1-cerit.metacentrum.cz | /storage/jihlava1-cerit/ | decommissioned | | data archived to /storage/brno4-cerit-hsm/fineus, storage-brno4-cerit-hsm.metacentrum.cz, symlink /storage/jihlava1-cerit/ |
| storage-jihlava2-archive.metacentrum.cz | /storage/jihlava2-archive/ | decommissioned | | |
| storage-du-cesnet.metacentrum.cz | /storage/du-cesnet/ | | 3 | du4.cesnet.cz, optimal archive storage for all MetaCentrum users |
| storage-plzen2-archive.metacentrum.cz | /storage/plzen2-archive/ | decommissioned | | nfs.du1.cesnet.cz |
| storage-plzen3-kky.metacentrum.cz | /storage/plzen3-kky/ | decommissioned | | replaced by plzen4-ntis |
| storage-plzen4-ntis.metacentrum.cz | /storage/plzen4-ntis/ | 200 TiB | 3 | for members of the iti/kky group |
| storage-praha1.metacentrum.cz | /storage/praha1/ | decommissioned | 3 | storage-praha1(a\|b).metacentrum.cz |
| storage-praha4-fzu.metacentrum.cz | /storage/praha4-fzu/ | decommissioned (15 TB) | | |
The back-up policy classes are described at Politika zálohování (Back-up policy).
There are several /storage volumes. Their names (e.g. /storage/plzen1) reflect their physical placement in the Czech Republic (in this example, the city of Plzen). The home directory on a frontend (/home) is mapped to one of the storages (/storage/CITY_XY/home) in the same location. E.g. the home of the skirit frontend, which is physically located in Brno, is mapped to the /storage/brno2/home directory. Use the ls -l command to find out:
(STRETCH)melounova@skirit:~$ ls -l /home
lrwxrwxrwx 1 root root 19 zář 17 2018 /home -> /storage/brno2/home
It is reasonable to use one of the frontends/storages closest to you, as it is usually faster. However, this does not mean you cannot use any other storage volume you are allowed to access! The only thing to keep in mind is that in such a case you have to use the explicit path:
(STRETCH)melounova@skirit:~$ pwd # pwd - print working (=current) directory
/storage/brno2/home/melounova # I am in brno2 storage now
(STRETCH)melounova@skirit:~$ cd /storage/plzen1/home/melounova/
(STRETCH)melounova@skirit:/storage/plzen1/home/melounova$ pwd
/storage/plzen1/home/melounova # I am in plzen1 storage now
Disk arrays are backed up by daily snapshots of user data (the backup usually runs during the night). The snapshots are kept for at least 14 days back. This offers some protection in case a user unintentionally deletes some files: generally, data that existed the day before the accident can be recovered. The snapshots are, however, stored on the same disk arrays as the data, so in case of e.g. a hardware failure of the disks these backups will be lost too. Therefore we strongly recommend backing up any important data elsewhere. For archiving purposes MetaCentrum offers dedicated storage servers.
Disk arrays with hierarchical storage
Disk arrays with hierarchical storage have a more robust backup policy and should be used primarily for archiving purposes. To increase redundancy of data, they contain several layers of storage media. The first layer is a disk array, lower layers are made of MAIDs (Massive Array of Idle Drives) or magnetic tape libraries. Lower layers have bigger capacity but slower access times. Data are moved automatically among these layers based on their last usage. The most important consequence from the user's point of view is that the access to long-unused data may be slower than to the recently-used ones.
Use hierarchical storages for storing data which you do not currently use, but which are important and not easily reproducible in the future.
Data archiving and backup
Since the data in "normal" home directories are backed up only in the form of snapshots, they are not protected against data loss due to hardware failure. Data of permanent value which would be hard to recreate should be backed up on dedicated servers with a hierarchical storage policy. Among the NFS4 servers, these can be identified by their names, which contain hsm (hierarchical storage machine) or archive.
| NFS4 server | directory | status | alias in Perun / note |
| --- | --- | --- | --- |
| storage-du-cesnet.metacentrum.cz | /storage/du-cesnet/ | active | primary space for MetaCentrum users |
| storage-brno14-ceitec.metacentrum.cz | /storage/brno14-ceitec/ | active | visible only to NCBR/CEITEC users |
| storage-ostrava2-archive.metacentrum.cz | /storage/ostrava2-archive/ | decommissioned | exists formally as a symlink to /storage/du-cesnet/ |
| storage-brno5-archive.metacentrum.cz | /storage/brno5-archive/ | decommissioned | data will be archived in /storage/du-cesnet |
| storage-jihlava2-archive.metacentrum.cz | /storage/jihlava2-archive | decommissioned | data archived in /storage/du-cesnet |
| storage-plzen2-archive.metacentrum.cz | /storage/plzen2-archive/ | decommissioned | data archived in /storage/du-cesnet |
| storage-brno4-cerit-hsm.metacentrum.cz | /storage/brno4-cerit-hsm/ | decommissioned | data archived in /storage/brno1-cerit/ |
Users are free to access any active server in the table above directly; however, we recommend using the directory
/storage/du-cesnet/home/META_username/VO_metacentrum-tape_tape-archive/ for long-term archiving, or
/storage/du-cesnet/home/META_username/VO_metacentrum-tape_tape/ for the backup service.
ssh firstname.lastname@example.org # log in to any frontend, replace "jenicek" by your real username
cd /storage/du-cesnet/home/jenicek/VO_metacentrum-tape_tape-archive/ # enter the directory for archival data
cd /storage/du-cesnet/home/jenicek/VO_metacentrum-tape_tape/ # enter the directory for backup data
Both of these directories are so-called symlinks (symbolic links), which only point to an actual HSM server. Contrary to a "normal" path, a symlink is a more abstract construction and does not depend on the actual HSM server currently in use. For example, if an old HSM server (in some location) is replaced by a new one (possibly in a different location), the link will remain valid and users do not need to rewrite the path in all their backup and archiving scripts.
Never leave data directly in the home, i.e. in
/storage/du-cesnet/home/META_username/. The home directory should serve only for keeping SSH keys, links to the directories with the actual data, and other configuration files. To enforce this, a tiny quota is set on the home directory; see https://du.cesnet.cz/en/navody/home-migrace-plzen/start.
Backup or archive?
On the technical level, there is no difference between the two
VO_metacentrum-tape_tape directories. What differs is the policy applied to the data stored in each of them.
Permanent data archives are normally limited in size (typically results of some research, not raw data) and the user wants to keep them "forever". Therefore
VO_metacentrum-tape_tape-archive has a user quota set on the volume of data and/or the number of files. On the other hand, the data are not removed after a time (they do not "expire"). Use this link if you want to stash away data of permanent value.
Backed-up data serve as protection from data loss in case the primary data are lost. Typically these data need not be kept for a very long time. Therefore, in
VO_metacentrum-tape_tape, files older than 12 months are automatically removed (they are considered "expired"). Use this link if you want to protect your current data, e.g. from a HW failure of the server where the primary data are stored.
A few notes
- Actual usage of storages http://metavo.metacentrum.cz/pbsmon2/nodes/physical, search for "Hierarchical storages"
- The documentation of the directory structure in HSM servers can be found on https://du.cesnet.cz/wiki/doku.php/en/navody/home-migrace-plzen/start
- The complete storage facility documentation: https://du.cesnet.cz/wiki/doku.php/en/navody/start
- The user's quota on the particular storage can be found at MetaCentrum portal
- On the HSM storages no user quota is applied; there is only a technical limit of 5 TB for a one-time data copy, to avoid overloading the HSM.
Transferring files to/from the archive
In general: the smaller the number of files in the archive, the better (it speeds operations up and generates a lower load on the storage subsystems; on the other hand, packing the files makes searching less comfortable). If you need to archive a large number of small files, we strongly recommend packing them first, as read/write operations are slower with many small files. Often a quota is set not only on the volume but also on the number of files; hundreds of thousands of small files can hit this quota.
- if most of your files are large (hundreds of MBs, GBs, ...), don't bother with packing them and make a one-to-one copy to the archive,
- if your files are smaller and you don't plan to search individual files, pack them into tar or zip files,
- from the technical point of view, optimal "chunk" of packed data is 500 MB or bigger,
- don't use front-end servers for anything else than moving several small files! Submit a regular job and/or use an interactive job instead to handle the archival data, e.g.
qsub -I -l select=1:ncpus=1:mem=2gb:scratch_local=2gb -l walltime=48:00:00,
- keep in mind that the master HOME directory of each HSM storage is dedicated just for initialization scripts, and thus has a limited quota of just 50 MB.
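One way to get packed chunks of roughly the recommended size is to pipe tar through split. This is a generic shell technique, not a MetaCentrum-specific tool; the demo below uses tiny throwaway data and 1k chunks, while for real data you would point it at your own directory and use something like -b 500m:

```shell
# pack a directory and cut the compressed stream into fixed-size pieces
mkdir -p /tmp/chunkdemo/src
printf 'hello\n' > /tmp/chunkdemo/src/a.txt
cd /tmp/chunkdemo
tar czf - src | split -b 1k - src.tgz.part-   # use -b 500m for real data
# reassemble the pieces and list the archive content to verify nothing was lost
cat src.tgz.part-* | tar tzf -
```

To actually unpack, replace the final tar tzf - with tar xzf - in the target directory.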
tar (tape archiver) is a Linux command to pack files and directories into one file, a packed archive. tar by itself does not compress the files, so the size of the packed archive is (roughly) the same as the sum of the sizes of the individual files. tar can cooperate with compression commands such as gzip.
In all examples, the option v in tar command options means "verbose", giving a more detailed output about how the archiving progresses.
- In /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive, create (tar c) an uncompressed archive file (tar f) of the directory ~/my-archive and its content:
tar cvf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tar ~/my-archive
- In /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive, create an archive of the directory ~/my-archive and compress it with gzip (tar z):
tar czvf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tgz ~/my-archive
- List (tar t) the content of the existing archive:
tar tzf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tgz
- Unpack the WHOLE archive /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tgz into the current directory:
tar xzvf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tgz
- Unpack PART of the archive:
tar tzf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tgz # list the content of the archive
# unpack only file PATH1/file1 and directory PATH2/dir2 into the current directory
tar xzvf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tgz "PATH1/file1" "PATH2/dir2"
There are many other options to customize the tar command. For the full description, read manual pages (man tar).
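Before removing the original files it is worth checking that the archive can actually be read back; tar t exits with a non-zero code on a damaged archive. A self-contained sketch with throwaway paths (substitute your real archive on the HSM storage):

```shell
# create a small demo archive, then verify it is readable before deleting sources
mkdir -p /tmp/vdemo/src
printf 'data\n' > /tmp/vdemo/src/f.txt
tar czf /tmp/vdemo/my-archive.tgz -C /tmp/vdemo src
if tar tzf /tmp/vdemo/my-archive.tgz > /dev/null; then
    echo "archive OK"   # only now is it safe to remove the originals
fi
```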
Sharing data in group
If you want to share your data within a group, follow these instructions: Sharing data in group