Compression
From SOMWiki
Why should I use Data Compression?
Data Compression allows us to compress large, statistically redundant files into smaller files. This effectively allows us to utilize our more expensive resources, such as drive space on the SAN, more efficiently.
For example, if you have a large dataset (over 1 GB in size) that you use infrequently, perhaps once a quarter, compressing it saves disk space on the cluster. You can uncompress the file when you are ready to access it again.
If everyone compresses data they rarely use, we do not need to enforce disk quotas.
What types of compression software are available for use?
Compression software exists for every modern operating system. Some operating systems, such as Linux, Solaris, and Windows XP, come equipped with the ability to compress files in a particular format.
| Software | Notes | Solaris | Linux | OS X | Windows |
| gzip | Standard UNIX utility, excellent compression, good cross-platform support | Yes | Yes | Yes | Supported by WinZip/WinRAR |
| bzip2 | Best overall patent-free compression; this is the preferred utility | Yes | Yes | Yes | Supported by WinRAR |
| pbzip2 | Parallel version of the preferred utility | Yes | Yes | Yes | Supported by WinRAR |
| zip | PKZIP compression, good for text or recursive directories | Yes | Yes | Yes | Yes (native in Windows XP) |
| compress | Legacy compress, not recommended | Yes | Yes | Yes | No |
| rar | Good for Windows files, not recommended for UNIX usage | Yes | No | No | With the WinRAR utility |
| tar | Not a compression utility, but the preferred method for archiving directories | Yes | Yes | Yes | Supported by WinZip/WinRAR |
How to use bzip2
Bzip2 is our preferred method for compressing data. We suggest using bzip2 in conjunction with the tar utility to back up entire directory structures. In some cases, you can reduce the size of a file by up to 90%.
How to compress a single file using bzip2
The syntax for compressing a single file is very simple. Note: use -9 for all compression; this is the maximum compression level available.
bzip2 -9 <filename>
Example:
bzip2 -9 mydata.txt
This command will compress the file mydata.txt to mydata.txt.bz2 using the bzip2 utility. By default, the compressed file replaces the original.
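If you would rather keep the original until you have checked the compressed copy, the following sketch may help. It assumes a scratch directory and a generated sample file (both hypothetical stand-ins for your own data) and uses bzip2's -k (keep) and -t (test) options.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Generate a hypothetical sample file to stand in for your dataset.
seq 1 200000 > mydata.txt

# -k keeps mydata.txt; without it, bzip2 replaces the original file.
bzip2 -9 -k mydata.txt

# -t checks the integrity of the compressed file without writing any output.
bzip2 -t mydata.txt.bz2

# Compare the sizes of the original and the compressed copy.
ls -l mydata.txt mydata.txt.bz2
```

Once the integrity check passes and the sizes look reasonable, the original can be removed by hand.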
How to compress a single file using pbzip2
The syntax for compressing a single file is very simple. Note: use -9 for all compression; this is the maximum compression level available.
pbzip2 -9 <filename>
Example:
pbzip2 -9 mydata.txt
This command will compress the file mydata.txt to mydata.txt.bz2 using the pbzip2 utility.
How to uncompress a single file using bunzip2
The syntax for uncompressing a single file is very simple:
bunzip2 <filename>
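Example: the sketch below round-trips a generated sample file (a hypothetical stand-in for real data) in a scratch directory. It also shows the -c option, which writes the uncompressed data to standard output so you can look at a file without restoring it to disk.

```shell
# Work in a scratch directory with a small hypothetical sample file.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 1000 > mydata.txt
bzip2 -9 mydata.txt          # leaves only mydata.txt.bz2

# Peek at the first lines without creating an uncompressed file on disk.
bunzip2 -c mydata.txt.bz2 | head -n 3

# Restore the original: mydata.txt.bz2 is removed and mydata.txt comes back.
bunzip2 mydata.txt.bz2
```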
How to compress and archive an entire directory structure using bzip2 and tar
tar -jcvf <backup file.tar.bz2> <directory structure>
In the following example, we back up the directory structure crsp_data into the file mydata.tar.bz2.
tar -jcvf ./mydata.tar.bz2 ./crsp_data
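Before deleting the original directory, it can be worth listing what the archive actually contains. The sketch below builds a small crsp_data directory in a scratch location (the file names inside it are made up for illustration) and then uses tar's -t option, which lists the contents instead of extracting them.

```shell
# Build a small hypothetical directory tree in a scratch location.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p crsp_data/2007
seq 1 1000 > crsp_data/daily.csv
seq 1 1000 > crsp_data/2007/monthly.csv

# Archive and compress the directory, as in the example above.
tar -jcvf ./mydata.tar.bz2 ./crsp_data

# -t lists the archive's contents without extracting anything.
tar -jtvf ./mydata.tar.bz2
```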
How to uncompress an archive of a directory structure made with bzip2 and tar
tar -jxvf <backup file.tar.bz2> -C <location>
Example:
tar -jxvf ./mydata.tar.bz2 -C /home/mydirectory/data
This will uncompress the file mydata.tar.bz2 and extract the directory structure contained inside the archive to /home/mydirectory/data.
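Note that tar does not create the directory given to -C; it must already exist. A minimal sketch, using a scratch directory and a hypothetical restore/data destination in place of /home/mydirectory/data:

```shell
# Build and archive a small hypothetical directory in a scratch location.
workdir=$(mktemp -d)
cd "$workdir"
mkdir crsp_data
seq 1 1000 > crsp_data/daily.csv
tar -jcf ./mydata.tar.bz2 ./crsp_data

# tar -C fails if the destination does not exist, so create it first.
mkdir -p "$workdir/restore/data"
tar -jxvf ./mydata.tar.bz2 -C "$workdir/restore/data"

# Confirm the restored copy matches the original (no output means identical).
diff -r ./crsp_data "$workdir/restore/data/crsp_data"
```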
How to compress a file using gcomp (bzip2 over the grid)
qsub /usr/local/bin/gcomp <file>
Example:
qsub /usr/local/bin/gcomp mydata.csv
This will offload a parallel bzip2 job to one of the cluster machines. It saves computing resources by choosing the least busy node. The parallel version of bzip2 will use all available processors in the machine to help speed up compression.
How to use gzip
Gzip is an excellent overall compression utility. We suggest using gzip in conjunction with the tar utility to back up entire directory structures. In some cases, you can reduce the size of a file by up to 70%.
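The savings you actually get depend heavily on the data. One way to check on your own files is to compress the same file with both utilities and compare the sizes, as in this sketch. The sample file is generated and purely illustrative; -c writes to standard output so the original stays in place for both runs.

```shell
# Work in a scratch directory with a hypothetical sample file.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 100000 > mydata.txt

# -c writes to standard output, leaving mydata.txt untouched.
gzip  -9 -c mydata.txt > mydata.txt.gz
bzip2 -9 -c mydata.txt > mydata.txt.bz2

# Print the size in bytes of the original and of each compressed version.
wc -c mydata.txt mydata.txt.gz mydata.txt.bz2
```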
How to compress a single file using gzip
The syntax for compressing a single file is very simple. Note: use -9 for all compression; this is the maximum compression level available.
gzip -9 <filename>
Example:
gzip -9 mydata.txt
This command will compress the file mydata.txt to mydata.txt.gz using the gzip utility.
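To see how much space was saved, gzip can report on a compressed file directly. The sketch below uses a generated sample file in a scratch directory (both hypothetical) together with the -l (list) and -t (test) options.

```shell
# Work in a scratch directory with a hypothetical sample file.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 100000 > mydata.txt

# Compress; gzip replaces mydata.txt with mydata.txt.gz.
gzip -9 mydata.txt

# -l reports the compressed size, uncompressed size, and compression ratio.
gzip -l mydata.txt.gz

# -t checks the integrity of the compressed file.
gzip -t mydata.txt.gz
```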
How to uncompress a single file using gunzip
The syntax for uncompressing a single file is very simple:
gunzip <filename>
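Example: the sketch below works on a generated sample file in a scratch directory (both hypothetical). It first reads the compressed data through a pipe with -c, which avoids writing an uncompressed copy to disk, and then restores the file.

```shell
# Work in a scratch directory with a small hypothetical sample file.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 1000 > mydata.txt
gzip -9 mydata.txt           # leaves only mydata.txt.gz

# Count the lines in the compressed file without uncompressing it on disk.
gunzip -c mydata.txt.gz | wc -l

# Restore the original: mydata.txt.gz is removed and mydata.txt comes back.
gunzip mydata.txt.gz
```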
How to compress and archive an entire directory structure using gzip and tar
tar -zcvf <backup file.tar.gz> <directory structure>
In the following example, we back up the directory structure crsp_data into the file mydata.tar.gz.
tar -zcvf ./mydata.tar.gz ./crsp_data
How to uncompress an archive of a directory structure made with gzip and tar
tar -zxvf <backup file.tar.gz> -C <location>
Example:
tar -zxvf ./mydata.tar.gz -C /home/mydirectory/data
This will uncompress the file mydata.tar.gz and extract the directory structure contained inside the archive to /home/mydirectory/data.
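If you only need one file from a large archive, you can name it after the archive and destination instead of extracting everything. This is a sketch with made-up file names in a scratch directory; the member path must be written the way the archive stores it, which you can check with tar -ztf.

```shell
# Build and archive a small hypothetical directory in a scratch location.
workdir=$(mktemp -d)
cd "$workdir"
mkdir crsp_data
seq 1 1000 > crsp_data/daily.csv
seq 1 1000 > crsp_data/monthly.csv
tar -zcf ./mydata.tar.gz ./crsp_data

# The destination directory must exist before extracting into it.
mkdir -p "$workdir/restore"

# Extract only daily.csv, named exactly as the archive stores it.
tar -zxvf ./mydata.tar.gz -C "$workdir/restore" ./crsp_data/daily.csv
```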

