RAPIDS 24.12 introduces cuDF packages on PyPI, speeds up groupby aggregations and reading files from AWS S3, enables larger-than-GPU-memory queries in the Polars GPU engine, and delivers faster graph neural network (GNN) training on real-world graphs.
cuDF and RMM CUDA 12 packages are now available on PyPI
Starting with the 24.12 release of RAPIDS, CUDA 12 builds of rmm, cudf, dask-cudf, and all of their dependencies are now available on PyPI. As a result, installing these libraries no longer requires using --extra-index-url or other extra configuration for pip. Try it out:
pip install \
'cudf-cu12==24.12.*' \
'dask-cudf-cu12==24.12.*' \
'rmm-cu12==24.12.*'
This also means that Polars users no longer need to specify an extra index during installation to get GPU support: pip install polars[gpu] just works.
This is the first step in an ongoing effort to make RAPIDS libraries available through pypi.org. Stay tuned for more.
Polars GPU engine: Expanded support for large data sizes
Together with Polars, we launched the Polars GPU engine built on cuDF in Open Beta in September 2024, bringing accelerated computing to Polars workloads that fit into GPU memory. However, processing datasets with the Polars GPU engine could lead to GPU out-of-memory (OOM) errors as data sizes increased.
The 24.12 release introduces two features to eliminate OOMs while maintaining good performance at large dataset sizes: chunked IO and CUDA Unified Memory.
Chunked IO
First, cudf-polars now processes Parquet files in chunks (8 GiB per chunk by default) while maintaining high decompression and decoding throughput.
Compressed files need to be expanded in memory, so some workloads that would succeed once the data is resident can still run out of memory during IO. This can be particularly significant for ZSTD compression, where the actual memory footprint is typically 3x the file size.
Chunked IO reduces this peak memory pressure, enabling more workloads to succeed.
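As an illustrative sketch (the file path, column names, and query below are placeholders, not taken from the benchmark), a lazy Parquet scan collected with the GPU engine picks up chunked reading automatically, with no extra configuration:

import polars as pl

# Lazily scan a Parquet dataset; nothing is read yet.
lineitem = pl.scan_parquet("lineitem.parquet")

# A simple aggregation query over the scan.
query = (
    lineitem
    .group_by("l_returnflag")
    .agg(pl.col("l_extendedprice").sum())
)

# Collecting with the GPU engine reads the Parquet input in chunks
# (8 GiB per chunk by default), keeping peak memory in check.
result = query.collect(engine="gpu")
print(result)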
CUDA Unified Memory
Second, this release enables a managed memory pool with prefetching by default in cudf-polars, similar to previous work in cudf.pandas. For more details, see Scaling Up to One Billion Rows of Data in pandas using RAPIDS cuDF.
Unified Memory allows DataFrames to span GPU and host memory, with data migrating efficiently over the GPU interconnect (PCIe or C2C) for seamless processing.
With Unified Memory and prefetching, the Polars GPU engine can now handle much larger datasets without running out of memory.
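The managed pool is configured for you by default, but for readers who want to see the moving parts, here is a minimal sketch of an equivalent setup using RMM. The pool size is an illustrative choice, and it assumes PrefetchResourceAdaptor is available in your RMM version:

import rmm

# Managed (unified) memory lets allocations migrate between GPU and host.
managed = rmm.mr.ManagedMemoryResource()

# Wrap it in a pool to amortize allocation costs; the initial size is illustrative.
pool = rmm.mr.PoolMemoryResource(managed, initial_pool_size=16 * 1024**3)

# Prefetch pages to the GPU before kernels touch them.
mr = rmm.mr.PrefetchResourceAdaptor(pool)

# Make this the allocator that cuDF (and cudf-polars) will use.
rmm.mr.set_current_device_resource(mr)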
PDS-H benchmarks
In the prior release, many PDS-H benchmark queries running with the Polars GPU engine would run out of memory on an NVIDIA H100 GPU as the Scale Factor (dataset size) increased beyond 80 GB or 100 GB.
With these enhancements, the Polars GPU engine can now efficiently process workloads that fit in combined GPU plus CPU memory but would previously cause GPU OOM errors.
Figure 1 compares the speedup of the Polars GPU engine over the Polars CPU engine across the 22 queries of the PDS-H benchmark, using RAPIDS 24.10 and 24.12. Based on a sweep across Scale Factors, queries that previously failed with OOM errors now succeed, and the updated engine delivers large speedups over CPU-only execution even up to Scale Factor 250 (250 GB).
GPU: NVIDIA H100 | CPU: Dual socket Intel Xeon 8480CL (Sapphire Rapids) | Storage: Local NVMe. Note: PDS-H is derived from TPC-H but these results are not comparable to TPC-H results.
cuDF performance improvements
RAPIDS 24.12 also includes two performance optimizations designed to improve large-scale data processing workflows.
Faster low-cardinality groupby in cuDF
This release includes a new optimization for hash-based groupby aggregations in cuDF. Previously, processing groupby aggregations with low cardinality (a small number of groups) resulted in relatively low throughput.
The lower throughput was due to atomic contention when the number of groups fell below ~100. This bottleneck is now avoided by using GPU shared memory to track partial aggregates.
The following code example shows this effect in action:
import cudf, cupy, rmm
import time

rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

df = cudf.DataFrame({
    'key': cupy.ones(100_000_000),
    'payload': cupy.random.rand(100_000_000),
})

for _ in range(3):
    t0 = time.time()
    g = df.groupby('key').agg({'payload': 'sum'}).reset_index()
    t1 = time.time()
    print(f'cudf {cudf.__version__}: {t1-t0:0.3f} seconds')
cudf 24.10.01: 0.177 seconds
cudf 24.12.01: 0.012 seconds
pandas 2.2.3: 1.463 seconds
Executing on a system with an NVIDIA H100 GPU and an Intel Xeon Platinum 8480CL CPU delivers a 15x speedup from cuDF 24.10 to 24.12 for this use case.
This optimization will be particularly useful for Spark and Dask users, where large workloads may require aggregation results over a small number of categories.
Faster IO from AWS S3 (remote storage)
cuDF 24.12 introduces a new, opt-in capability for multithreaded S3 object reads as part of cuDF and Dask-cuDF.
The new functionality uses libcurl within KvikIO to manage reads against S3 objects and can significantly improve the overall performance and scalability of S3 reads in Dask cuDF. It can be enabled with the CUDF_KVIKIO_REMOTE_IO environment variable and controlled with KVIKIO_NTHREADS.
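As a hedged sketch (the bucket path below is a placeholder), enabling the feature looks like this: set the environment variables before importing cuDF, then read from S3 as usual with Dask cuDF.

import os

# Enable KvikIO-backed remote reads and choose a thread count before importing cuDF.
os.environ["CUDF_KVIKIO_REMOTE_IO"] = "ON"
os.environ["KVIKIO_NTHREADS"] = "128"

import dask_cudf

# Placeholder S3 path; any Parquet dataset your credentials can read works here.
ddf = dask_cudf.read_parquet("s3://my-bucket/my-dataset/*.parquet")
print(ddf.head())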
In Figure 2, KvikIO (enabled through CUDF_KVIKIO_REMOTE_IO=ON) is used to read a subset of the Red-Pajama v2 dataset from an S3 bucket to a 4-GPU g5.12xlarge EC2 instance with a varying number of threads. In this case, the overall throughput reaches 963 MiB/s with KVIKIO_NTHREADS=128, compared to 450 MiB/s without KvikIO.
Because every workload is different, the number of threads is configurable. We’re currently investigating the appropriate default configuration across a range of datasets and systems, so this feature is opt-in for now but ready for wide usage.
Faster GNN training with hierarchy-based gathers in WholeGraph
This release also introduces significant performance improvements to WholeGraph through a new communications optimization designed for power-law graphs.
A power-law graph has a degree distribution similar to the one shown in Figure 3. Most vertices have a very small degree (with many having only a single neighbor), while a few vertices have a very high degree and are highly connected. Most real-world graphs fall into this category.
When sampling a power-law graph, the highest-degree vertices can appear multiple times within a batch. For the ogbn-papers100M dataset, the percentage of repeated vertices can reach up to 70% when training with eight GPUs.
By only reading the feature tensors of repeated vertices once, we can save a significant amount of time during the feature fetching stage. This is known as a hierarchy-based gather operation, as opposed to the default distributed gather operation.
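Conceptually, the idea looks like the following CuPy sketch, which illustrates only the deduplication step, not the WholeGraph API itself (the sizes and distribution are illustrative):

import cupy as cp

# Feature table: one row of features per vertex (sizes are illustrative).
features = cp.random.rand(1_000_000, 128)

# Sampled vertex IDs for a batch; high-degree vertices repeat many times.
batch_ids = cp.random.zipf(1.5, size=100_000) % 1_000_000

# Hierarchy-based gather: fetch each unique vertex's features once, then expand.
unique_ids, inverse = cp.unique(batch_ids, return_inverse=True)
batch_features = features[unique_ids][inverse]

# Matches the naive gather below, but touches far fewer feature rows.
assert bool((batch_features == features[batch_ids]).all())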
Using a hierarchy-based gather operation in these scenarios yields up to a 2x speedup over a distributed gather operation. The faster gather step provides an end-to-end speedup of about 30-40% for a three-layer GraphSAGE model with batch size 1,024.
System(s): NVIDIA DGX H100
Conclusion
The RAPIDS 24.12 release brings significant user experience improvements and performance optimizations to both DataFrame analytics and GNN training. The Polars GPU engine is in Open Beta and the multithreaded AWS S3 file reads are experimental. We welcome your feedback on GitHub. You can also join the 3,500+ members of the RAPIDS Slack community to talk GPU-accelerated data processing.
If you’re new to RAPIDS, check out these resources to get started.