RAPIDS 24.12 introduces cuDF packages on PyPI, speeds up groupby aggregations and reading files from AWS S3, enables larger-than-GPU-memory queries in the Polars GPU engine, and delivers faster graph neural network (GNN) training on real-world graphs.
cuDF and RMM CUDA 12 packages are now available on PyPI
Starting with the 24.12 release of RAPIDS, CUDA 12 builds of rmm, cudf, dask-cudf, and all of their dependencies are now available on PyPI. As a result, installing these libraries no longer requires using --extra-index-url or other extra configuration for pip. Try it out:
pip install \
'cudf-cu12==24.12.*' \
'dask-cudf-cu12==24.12.*' \
'rmm-cu12==24.12.*'
This also means that Polars users no longer need to specify an extra index during installation to get GPU support: pip install polars[gpu] just works.
This is the first step in an ongoing effort to make RAPIDS libraries available through pypi.org. Stay tuned for more.
Polars GPU engine: Expanded support for large data sizes
Together with Polars, we launched the Polars GPU engine built on cuDF in Open Beta in September 2024, bringing accelerated computing to Polars workloads that fit into GPU memory. However, processing datasets with the Polars GPU engine could lead to GPU out-of-memory (OOM) errors as data sizes increased.
The 24.12 release introduces two features to eliminate OOMs while maintaining good performance at large dataset sizes: chunked IO and CUDA Unified Memory.
Chunked IO
First, cudf-polars now processes Parquet files in chunks (8 GiB per chunk by default) while maintaining high decompression and decoding throughput.
Compressed files need to be expanded in memory, so some workloads that would succeed once the data is resident can still run out of memory during IO. This can be particularly significant for ZSTD compression, where the actual memory footprint is typically 3x the file size.
Chunked IO reduces this peak memory pressure, enabling more workloads to succeed.
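As an illustrative sketch (the file path, column names, and query below are placeholders, not taken from the benchmark), a lazy Parquet scan collected with the GPU engine picks up chunked reading automatically, with no extra configuration:

import polars as pl

# Lazily scan a Parquet dataset; nothing is read yet.
lineitem = pl.scan_parquet("lineitem.parquet")

# A simple aggregation query over the scan.
query = (
    lineitem
    .group_by("l_returnflag")
    .agg(pl.col("l_extendedprice").sum())
)

# Collecting with the GPU engine reads the Parquet input in chunks
# (8 GiB per chunk by default), keeping peak memory in check.
result = query.collect(engine="gpu")
print(result)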
CUDA Unified Memory
Second, this release enables a managed memory pool with prefetching by default in cudf-polars, similar to previous work in cudf.pandas. For more details, see Scaling Up to One Billion Rows of Data in pandas using RAPIDS cuDF.
Unified Memory allows DataFrames to span GPU and host memory, with data migrating efficiently over the GPU interconnect (PCIe or C2C) for seamless processing.
With Unified Memory and prefetching, the Polars GPU engine can now handle much larger datasets without running out of memory.
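The managed pool is configured for you by default, but for readers who want to see the moving parts, here is a minimal sketch of an equivalent setup using RMM. The pool size is an illustrative choice, and it assumes PrefetchResourceAdaptor is available in your RMM version:

import rmm

# Managed (unified) memory lets allocations migrate between GPU and host.
managed = rmm.mr.ManagedMemoryResource()

# Wrap it in a pool to amortize allocation costs; the initial size is illustrative.
pool = rmm.mr.PoolMemoryResource(managed, initial_pool_size=16 * 1024**3)

# Prefetch pages to the GPU before kernels touch them.
mr = rmm.mr.PrefetchResourceAdaptor(pool)

# Make this the allocator that cuDF (and cudf-polars) will use.
rmm.mr.set_current_device_resource(mr)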
PDS-H benchmarks
In the prior release, many PDS-H benchmark queries running with the Polars GPU engine would run out of memory on an NVIDIA H100 GPU as the Scale Factor (dataset size) increased beyond 80 GB or 100 GB.
With these enhancements, the Polars GPU engine can now efficiently process workloads that fit in combined GPU plus CPU memory but would previously cause GPU OOM errors.
Figure 1 compares the speedup of the Polars GPU engine over the Polars CPU engine across the 22 queries of the PDS-H benchmark, using RAPIDS 24.10 and 24.12. Based on a sweep across Scale Factors, queries that previously failed with OOM errors now succeed, and the updated engine delivers large speedups over CPU-only execution even up to Scale Factor 250 (250 GB).
GPU: NVIDIA H100 | CPU: Dual socket Intel Xeon 8480CL (Sapphire Rapids) | Storage: Local NVMe. Note: PDS-H is derived from TPC-H but these results are not comparable to TPC-H results.
cuDF performance improvements
RAPIDS 24.12 also includes two performance optimizations designed to improve large-scale data processing workflows.
Faster low-cardinality groupby in cuDF
This release includes a new optimization for hash-based groupby aggregations in cuDF. Previously, processing groupby aggregations with low cardinality (a small number of groups) resulted in relatively low throughput.
The lower throughput was due to atomic contention when the number of groups fell below ~100. This bottleneck is now avoided by using GPU shared memory to track partial aggregates.
The following code example shows this effect in action:
import cudf, cupy, rmm
import time

rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

df = cudf.DataFrame({
    'key': cupy.ones(100_000_000),
    'payload': cupy.random.rand(100_000_000),
})

for _ in range(3):
    t0 = time.time()
    g = df.groupby('key').agg({'payload': 'sum'}).reset_index()
    t1 = time.time()
    print(f'cudf {cudf.__version__}: {t1-t0:0.3f} seconds')
cudf 24.10.01: 0.177 seconds
cudf 24.12.01: 0.012 seconds
pandas 2.2.3: 1.463 seconds
Executing on a system with an NVIDIA H100 GPU and an Intel Xeon Platinum 8480CL CPU delivers a 15x speedup from cuDF 24.10 to 24.12 for this use case.
This optimization will be particularly useful for Spark and Dask users, where large workloads may require aggregation results over a small number of categories.
Faster IO from AWS S3 (remote storage)
cuDF 24.12 introduces a new, opt-in capability for multithreaded S3 object reads as part of cuDF and Dask-cuDF.
The new functionality uses libcurl within KvikIO to manage reads against S3 objects and can significantly improve the overall performance and scalability of S3 reads in Dask cuDF. It can be enabled with the CUDF_KVIKIO_REMOTE_IO environment variable and controlled with KVIKIO_NTHREADS.
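As a hedged sketch (the bucket path below is a placeholder), enabling the feature looks like this: set the environment variables before importing cuDF, then read from S3 as usual with Dask cuDF.

import os

# Enable KvikIO-backed remote reads and choose a thread count before importing cuDF.
os.environ["CUDF_KVIKIO_REMOTE_IO"] = "ON"
os.environ["KVIKIO_NTHREADS"] = "128"

import dask_cudf

# Placeholder S3 path; any Parquet dataset your credentials can read works here.
ddf = dask_cudf.read_parquet("s3://my-bucket/my-dataset/*.parquet")
print(ddf.head())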
In Figure 2, KvikIO (enabled through CUDF_KVIKIO_REMOTE_IO=ON) is used to read a subset of the Red-Pajama v2 dataset from an S3 bucket to a 4-GPU g5.12xlarge EC2 instance with a varying number of threads. In this case, the overall throughput reaches 963 MiB/s with KVIKIO_NTHREADS=128, compared to 450 MiB/s without KvikIO.
Because every workload is different, the number of threads is configurable. We’re currently investigating the appropriate default configuration across a range of datasets and systems, so this feature is opt-in for now but ready for wide usage.
Faster GNN training with hierarchy-based gathers in WholeGraph
This release also introduces significant performance improvements to WholeGraph through a new communications optimization designed for power-law graphs.
A power-law graph has a degree distribution similar to the one shown in Figure 3. Most vertices have a very small degree (with many having only a single neighbor), while a few vertices have a very high degree and are highly connected. Most real-world graphs fall into this category.
When sampling a power-law graph, the highest-degree vertices can appear multiple times within a batch. For the ogbn-papers100M dataset, the percentage of repeated vertices can reach up to 70% when training with eight GPUs.
By only reading the feature tensors of repeated vertices once, we can save a significant amount of time during the feature fetching stage. This is known as a hierarchy-based gather operation, as opposed to the default distributed gather operation.
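Conceptually, the idea looks like the following CuPy sketch, which illustrates only the deduplication step, not the WholeGraph API itself (the sizes and distribution are illustrative):

import cupy as cp

# Feature table: one row of features per vertex (sizes are illustrative).
features = cp.random.rand(1_000_000, 128)

# Sampled vertex IDs for a batch; high-degree vertices repeat many times.
batch_ids = cp.random.zipf(1.5, size=100_000) % 1_000_000

# Hierarchy-based gather: fetch each unique vertex's features once, then expand.
unique_ids, inverse = cp.unique(batch_ids, return_inverse=True)
batch_features = features[unique_ids][inverse]

# Matches the naive gather below, but touches far fewer feature rows.
assert bool((batch_features == features[batch_ids]).all())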
Using a hierarchy-based gather operation in these scenarios yields up to a 2x speedup over a distributed gather operation. The faster gather step provides an end-to-end speedup of about 30-40% for a three-layer GraphSAGE model with batch size 1,024.
System(s): NVIDIA DGX H100
Conclusion
The RAPIDS 24.12 release brings significant user experience improvements and performance optimizations to both DataFrame analytics and GNN training. The Polars GPU engine is in Open Beta and the multithreaded AWS S3 file reads are experimental. We welcome your feedback on GitHub. You can also join the 3,500+ members of the RAPIDS Slack community to talk GPU-accelerated data processing.
If you’re new to RAPIDS, check out these resources to get started.