05/07/2021 | News release | Archived content
Recently I've seen articles trying to sow FUD (fear, uncertainty, and doubt) about where BeeGFS and other parallel file systems fit for machine learning and deep learning workloads, so I thought I'd do my part to help set the record straight. This post is an attempt to summarize all the 'that's not quite right' moments I've had when reading what others have to say about BeeGFS online. If you're not familiar with BeeGFS, it's a radically simple-to-use parallel file system (PFS) that can scale from the smallest artificial intelligence proof of concept to the most demanding super-sized production requirements.
Claims and facts about BeeGFS
Lately it seems that some people have a bone to pick with BeeGFS. Perhaps they're intimidated by the available source model that allows anyone to get started with no up-front cost, then purchase enterprise support from the vendor of their choice when they're ready to scale to production. Maybe they're afraid of disruptive innovation. Regardless of the intention, some specifically cited 'drawbacks' when considering BeeGFS for AI include the following claims.
Claim: AI requires small file access with extremely low latency, and BeeGFS doesn't support new network protocols.
Fact: On the front end, BeeGFS supports InfiniBand, which can deliver extremely low latency to GPU-based systems (no starvation of expensive GPUs here). Support for RDMA minimizes CPU cycles spent on network communication on your GPU nodes (source).
Claim: BeeGFS does not support NVMe over Fabrics and is limited to legacy storage interfaces such as SAS, SATA, and FC.
Fact: On the backend, BeeGFS nodes store data chunks and metadata on block devices. Those devices can be anything from internal drives (NVMe SSDs or otherwise) to external storage systems like NetApp® E-Series attached over NVMe-oF, or virtually any SAN protocol (source).
Claim: BeeGFS is commonly used as a scratch-space file system, and in machine learning (ML) use cases the cost of data acquisition is so high that data has to be fully protected.
Fact: Although BeeGFS is often used as scratch space, this is more of a historical use, possibly due to the low cost of entry. NetApp storage provides durability for data in BeeGFS, and NetApp has developed a high-availability solution for BeeGFS to provide failover in case of BeeGFS node failure.
Even if you consider it only as 'scratch storage,' there are still many AI use cases for BeeGFS. Consider the case where data is stored elsewhere, for example in object storage, and then pulled into a faster storage medium like BeeGFS for preprocessing and training. The cost competitiveness of BeeGFS makes it an ideal candidate for cases like these, where data needs to be duplicated temporarily or permanently.
Claim: BeeGFS was designed for research environments but it doesn't scale to meet the needs of commercial high-performance computing, including AI and ML.
Fact: There are BeeGFS deployments up to 30 PB, with no theoretical capacity limits in sight. From a performance standpoint, Lawrence Livermore National Laboratory has published a paper concluding that even though data pipelines in deep learning frameworks can put tremendous pressure on PFSs, BeeGFS reasonably handles the I/O access patterns (source). NetApp's benchmarks have also concluded that BeeGFS is more than capable of keeping GPUs fully saturated (source).
Claim: BeeGFS needs separate management and metadata servers.
Fact: Access to the management daemon is not relevant for file system performance, and BeeGFS recommends running this daemon on any of the storage or metadata servers (source). Also, it's possible to run all of the BeeGFS services on a single node, which is a great way to try out BeeGFS (source).
Claim: The metadata server is often a performance bottleneck, and users will not enjoy the design benefits of a PFS like BeeGFS when working with lots of small files.
Fact: BeeGFS intelligently distributes metadata on a per-directory basis rather than per directory tree (source). To maximize metadata performance, BeeGFS recommends using hierarchical file organization; thus, the directory structure of datasets like ImageNet helps BeeGFS performance. IO500 and other benchmarking tests can certainly be contrived to achieve poor metadata results by working against best practices for file organization.
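To make the file-organization advice concrete, here is a minimal Python sketch that moves a flat directory of files into one subdirectory per class, similar to how the ImageNet training set is laid out. The function names (organize_by_label, label_from_prefix) and the file-naming convention are assumptions made for illustration; nothing here is BeeGFS-specific, and the destination would simply be a path on a mounted BeeGFS file system.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict


def label_from_prefix(file_name: str) -> str:
    """Hypothetical naming rule: the label is the text before the first underscore."""
    return file_name.split("_", 1)[0]


def organize_by_label(
    flat_dir: Path, dest_root: Path, label_of: Callable[[str], str]
) -> Dict[str, int]:
    """Move every regular file in flat_dir into dest_root/<label>/.

    Returns the number of files moved per label. With per-directory
    metadata distribution, each label directory is a separate unit that
    the file system can place independently.
    """
    counts: Dict[str, int] = {}
    # sorted() materializes the listing so moving files does not disturb iteration.
    for src in sorted(flat_dir.iterdir()):
        if not src.is_file():
            continue
        label = label_of(src.name)
        target_dir = dest_root / label
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(target_dir / src.name))
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Calling organize_by_label(Path("flat"), Path("train"), label_from_prefix) on files named like n01440764_10026.JPEG would produce train/n01440764/n01440764_10026.JPEG, so that no single directory holds the whole dataset.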
Claim: PFSs like BeeGFS are not a good fit for the typical read-only workload associated with AI training.
Fact: Unlike some storage solutions, BeeGFS can stripe files across multiple nodes, ensuring that no single node becomes a bottleneck when trying to read the same file from large numbers of GPU servers (source).
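As a rough illustration of why striping spreads read load, the following Python sketch models round-robin striping: a file is cut into fixed-size chunks and chunk i lands on target i modulo the number of targets. This is a simplified model rather than BeeGFS's actual implementation, and the chunk size and target count used in the example are assumptions chosen for illustration.

```python
from typing import List


def chunk_to_target(offset: int, chunk_size: int, num_targets: int) -> int:
    """Return the 0-based index of the target holding the byte at `offset`
    under simple round-robin striping (a simplified model)."""
    if offset < 0 or chunk_size <= 0 or num_targets <= 0:
        raise ValueError("offset must be >= 0; chunk_size and num_targets must be > 0")
    return (offset // chunk_size) % num_targets


def bytes_per_target(file_size: int, chunk_size: int, num_targets: int) -> List[int]:
    """Return how many bytes of a file each target stores."""
    if file_size < 0 or chunk_size <= 0 or num_targets <= 0:
        raise ValueError("file_size must be >= 0; chunk_size and num_targets must be > 0")
    totals = [0] * num_targets
    full_chunks, remainder = divmod(file_size, chunk_size)
    for chunk in range(full_chunks):
        totals[chunk % num_targets] += chunk_size
    if remainder:
        # The final partial chunk goes to the next target in the rotation.
        totals[full_chunks % num_targets] += remainder
    return totals


if __name__ == "__main__":
    CHUNK = 512 * 1024   # assumed chunk size for the example
    TARGETS = 4          # assumed number of storage targets
    print(bytes_per_target(10 * CHUNK, CHUNK, TARGETS))
```

In this model a 10-chunk file striped over four targets places three chunks on each of the first two targets and two on each of the others, so many clients reading the same file spread their requests over all four targets instead of hitting one node.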
Drawbacks and disadvantages of BeeGFS… or clever misinterpretations and marketing?
Other drawbacks often cited are the lack of enterprise features and support for enterprise tasks, in particular calling out backup, data tiering, encryption, user authentication, and quotas:
Also, it is possible to access files in BeeGFS over NFS, SMB, or even S3 using separate services. Unfortunately, there is currently no built-in support for snapshots. If that is a deal breaker, let me introduce you to NetApp's ONTAP® data management software. Alternatively, there are open-source version control systems for ML projects that can track models and datasets.
In summary, most FUD I've seen around BeeGFS skews, misconstrues, or blatantly falsifies the facts to paint BeeGFS in a bad light. Don't be fooled: BeeGFS is well suited for AI, ML, and DL. Flashy statistics like IO500 placement are a poor selling point unless you also compare the hardware used to obtain the numbers.