Gfptar

Automatically bundles a large number of input entries (files) into
multiple archive files in the output directory while transferring them in parallel.

Technical Overview

Technology for High-Speed Bundling of Numerous "Small" Files

In a large-scale distributed file system such as Gfarm, it is recommended to increase individual file sizes and to access multiple files in parallel in order to achieve efficient data access. In practice, however, there is a strong need to handle large numbers of small files efficiently, such as log files and intermediate simulation data.

Gfptar is a parallel archive command developed to solve this challenge. It automatically bundles a large number of input entries (files) into multiple archive files in the output directory while transferring them in parallel.

Features & Strengths

Speed & Flexibility to Handle Hundreds of Millions of Files

1. Overwhelming File Handling Capacity

Gfptar enables efficient access and processing even for hundreds of millions of files. By default, it groups the input files into archive files of approximately 200 MB each, which maximizes parallelism during transfer.
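
As an illustration, creating an archive might look like the following. The paths are placeholders, and the tar-like -c (create) and -C (base directory) usage, as well as the exact defaults for archive size and parallelism, should be checked against the gfptar manual of the installed version.

  # Bundle the files under /data/results into ~200 MB archive files
  # stored in a Gfarm output directory (placeholder paths)
  $ gfptar -c gfarm:/home/user1/results.gfptar -C /data results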

2. Fast List Retrieval & Restoration

The contents of a created archive can be listed quickly (list retrieval: -t option) even when it holds a large number of files. Archives can likewise be restored (extraction: -x option) at high speed.
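
For example, listing and full extraction might be invoked as follows, continuing the placeholder paths used above.

  # List the entries stored in the archive directory (-t)
  $ gfptar -t gfarm:/home/user1/results.gfptar

  # Extract the whole archive into a local directory (-x)
  $ gfptar -x /data/restored gfarm:/home/user1/results.gfptar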

3. Flexible Archive Updates & Fault Tolerance

  • Update (-u option)
    New files can be added to an archive that has already been created. If archive creation is interrupted by a network failure or similar problem, the same option can be used to resume the process (see the command sketch after this list).
  • Append (-r option)
    When files only need to be added, the -r option can be used; it runs faster than the -u option.
  • Customization
    gzip compression is applied by default, but other compression methods can be selected, and specific filenames can be excluded.
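
The sketch below shows how these modes might be invoked; the paths are placeholders, and option details such as compression selection or exclusion patterns should be confirmed in the gfptar manual.

  # Add new files to an existing archive directory, or resume an
  # interrupted archive creation (-u)
  $ gfptar -u gfarm:/home/user1/results.gfptar -C /data results

  # Append additional files only; runs faster than -u (-r)
  $ gfptar -r gfarm:/home/user1/results.gfptar -C /data results/new-run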

Usage Examples

Big Data Backup & Efficiency

Backup of Large-Scale Simulation Results
Bundle hundreds of millions of small intermediate files into a single archive directory efficiently and in parallel, completing a stable backup quickly.
Recovery from Interruptions
Even if a system failure occurs while a large archive is being created, the -u option can be used to resume processing from the point of interruption, significantly reducing rework time.
Extracting Only Required Data
Instead of restoring the entire archive, specific entries (files) can be selected and extracted (-x option), making it easier to access just the data that is needed, as sketched below.
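
A selective restore might look like the following; the member path is a placeholder and is assumed to be given relative to the archived base directory.

  # Restore only the listed entry instead of the whole archive (-x)
  $ gfptar -x /data/partial gfarm:/home/user1/results.gfptar results/run-001/summary.csv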

Target Needs & Users

HPC (High-Performance Computing) Users
Those who generate large numbers of small files as simulation or computation results and need to manage, move, and store them efficiently.
Data Managers / IT Personnel
Those seeking fast and robust archive and backup methods for systems that handle hundreds of millions of files.
Data Scientists
Those who want to manage dataset versions, or to bundle and move or share specific groups of intermediate files.
