posted by janrinok on Monday March 27 2023, @01:04PM   Printer-friendly
from the c-language-still-runs-the-world dept.

Julia and Kokkos perform comparably with C/OpenMP on CPUs, while Julia implementations are competitive with CUDA and HIP on GPUs:

High-level dynamic languages such as Python, Julia, and R have been at the forefront of artificial intelligence/machine learning (AI/ML), data analysis, and interactive computing workflows in the last decade. Traditional high-performance computing (HPC) frameworks that power the underlying low-level computations for performance and scalability are written in compiled languages: C, C++, and Fortran.

[...] We analyze single node scalability on two systems hosted at the Oak Ridge Leadership Computing Facility (OLCF): Wombat, which uses Arm Ampere Neoverse CPUs and 2 NVIDIA A100 GPUs, and Crusher, which is equipped with AMD EPYC 7A53 CPUs and 8 MI250X GPUs and serves as a test bed for Frontier, the first exascale system on the TOP500 list.

[...] We run hand-rolled general matrix multiplication (GEMM) code for dense matrices using Julia, Python/Numba and Kokkos implementations and compare the performance with C for multithreaded CPU (OpenMP) and single GPU (CUDA/HIP) systems. GEMM is an important kernel in the Basic Linear Algebra Subprograms (BLAS) used across several deep learning AI frameworks, for which modern GPU architectures have been heavily optimized via tensor cores.
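
For readers unfamiliar with the term, a "hand-rolled" GEMM is just the classic triple loop. A minimal Python/Numba sketch of the idea (illustrative only, not the authors' code; their actual implementations are in the repository linked in the comments below):

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def gemm(A, B, C):
        # Naive dense GEMM: C = A * B. prange parallelizes the outer
        # loop across threads, analogous to an OpenMP parallel for.
        m, k = A.shape
        n = B.shape[1]
        for i in prange(m):
            for j in range(n):
                acc = 0.0
                for p in range(k):
                    acc += A[i, p] * B[p, j]
                C[i, j] = acc

    A = np.random.rand(512, 512)
    B = np.random.rand(512, 512)
    C = np.zeros((512, 512))
    gemm(A, B, C)
    assert np.allclose(C, A @ B)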

[...] For CPUs, Julia performance was comparable to C/OpenMP combined with LLVM-based ArmClang and AMDClang vendor compilers. For the AMD GPUs, Julia AMDGPU.jl performance was comparable to HIP. Nevertheless, there is still a performance gap on NVIDIA A100 GPUs for single-precision floating point cases.

[...] We observe that Python/Numba implementations still lack the support needed to reach comparable CPU and GPU performance on these systems, and AMD GPU support is deprecated.
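
To make that concrete, this is roughly what the Numba CUDA path looks like (a sketch under the same caveat as above, not the paper's kernel; whether such a kernel approaches CUDA C performance is exactly what the paper measures, and the AMD ROCm target is the deprecated path the authors refer to):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def gemm_kernel(A, B, C):
        # One GPU thread computes one element of C.
        i, j = cuda.grid(2)
        if i < C.shape[0] and j < C.shape[1]:
            acc = 0.0
            for p in range(A.shape[1]):
                acc += A[i, p] * B[p, j]
            C[i, j] = acc

    n = 1024
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)
    C = np.zeros((n, n), dtype=np.float32)

    threads = (16, 16)
    blocks = ((n + 15) // 16, (n + 15) // 16)
    # Numba copies the NumPy arrays to and from the device automatically.
    gemm_kernel[blocks, threads](A, B, C)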

Pre-print article:
William F. Godoy, Pedro Valero-Lara, T. Elise Dettling, Christian Trefftz, Ian Jorquera, Thomas Sheehy, Ross G. Miller, Marc Gonzalez-Tallada, Jeffrey S. Vetter, and Valentin Churavy, "Evaluating performance and portability of high-level programming models: Julia, Python/Numba, and Kokkos on exascale nodes," accepted at the 28th HIPS workshop (held in conjunction with IPDPS 2023), 2023, arXiv:2303.06195, https://doi.org/10.48550/arXiv.2303.06195


Original Submission

  • (Score: 3, Informative) by RamiK on Monday March 27 2023, @02:00PM

    by RamiK (1813) on Monday March 27 2023, @02:00PM (#1298330)

    APPENDIX A
    ARTIFACT DESCRIPTION FOR REPRODUCIBILITY
    The code used for this study is hosted on GitHub: https://github.com/williamfgc/simple-gemm [github.com].
    Each implementation has its own directory: C, Kokkos, Julia, and Python. The scripts directory contains the configurations for each experiment on OLCF systems. Figures 8 and 9 show examples of scripts to run C/OpenMP and Julia experiments on Wombat.

    ( https://arxiv.org/pdf/2303.06195.pdf [arxiv.org] )

    --
    compiling...
  • (Score: 4, Insightful) by JoeMerchant on Monday March 27 2023, @06:45PM (2 children)

    by JoeMerchant (3937) on Monday March 27 2023, @06:45PM (#1298368)

    The algorithm you're implementing in the language matters far more than the language itself.

    We had a slow performer in MATLAB. Translated to C++, it was faster, but still a dog. Parallelized, it got a 3.9x speedup on 4 cores, as expected. Candidate for a supercomputer? Yeah, that had actually been the plan since the MATLAB days.

    I threw a wrench in that plan, actually: a code review. Discovered that the core of the algorithm was 5-deep nested loops, which could be done much more efficiently as 4-deep nested loops with a little calculation in the core. Like 100x faster.

    What was going to need a rack of 50-100 machines to parallelize the algorithm to acceptable speed was now running acceptably fast on a 2-core laptop. Thrown onto a modest quad-core workstation, it was faster than the fast end of the original requirement spec.
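
    A purely hypothetical sketch of that kind of rewrite (the real algorithm isn't described here, so the closed-form sum below is invented for illustration):

        # Before: 5 nested loops; the innermost one only accumulates 0..l-1.
        def hot_loop_before(n):
            total = 0
            for i in range(n):
                for j in range(n):
                    for k in range(n):
                        for l in range(n):
                            for w in range(l):
                                total += w
            return total

        # After: 4 nested loops; the innermost loop is replaced by its
        # closed form, sum(0..l-1) = l*(l-1)/2.
        def hot_loop_after(n):
            total = 0
            for i in range(n):
                for j in range(n):
                    for k in range(n):
                        for l in range(n):
                            total += l * (l - 1) // 2
            return total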

    --
    🌻🌻 [google.com]
    • (Score: 0) by Anonymous Coward on Monday March 27 2023, @07:00PM (1 child)

      by Anonymous Coward on Monday March 27 2023, @07:00PM (#1298369)

      For numerical packages it is often wrappers around tuned libraries. In which case the development environment is also important: MATLAB is the gold standard, as you can play with the data and commands interactively. It almost never bombs, and when you program the way the language was intended (block operations) it is fast. Octave is pretty damn good these days with its GUI, but not as robust or fast. Julia is sorely in need of an IDE. I know they don't prioritize this, but it's too painful going back to vi and xterm, so I never get beyond trivial operations. exit... exit()... nuts.
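
      In Python, for instance, the wrapper point is literally one operator away (a sketch of the dispatch, nothing more):

          import numpy as np

          A = np.random.rand(2048, 2048)
          B = np.random.rand(2048, 2048)

          # The @ operator dispatches to whatever tuned BLAS this NumPy
          # build is linked against (OpenBLAS, MKL, ...), so the hot loop
          # runs in optimized C/Fortran/assembly, not in the interpreter.
          C = A @ B

          np.show_config()  # reports the linked BLAS/LAPACK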

  • (Score: 2) by jb on Tuesday March 28 2023, @02:44AM

    by jb (338) on Tuesday March 28 2023, @02:44AM (#1298436)

    We run hand-rolled general matrix multiplication (GEMM) code for dense matrices using Julia, Python/Numba and Kokkos implementations and compare the performance with C for multithreaded CPU (OpenMP) and single GPU (CUDA/HIP) systems.

    So they compared hand-optimised code in 2 high-level languages with the use of common libraries in lower-level languages and wondered why there wasn't much difference?

    FFS, compare apples with apples! If you're going to hand-optimise in one language, then hand-optimise in the language you're comparing it against as well (and have it done by someone who's just as proficient at optimising plain C as your Python etc. guy was at optimising that).

  • (Score: 1) by Zoot on Tuesday March 28 2023, @03:38AM (3 children)

    by Zoot (679) on Tuesday March 28 2023, @03:38AM (#1298440)

    "We run hand-rolled general matrix multiplication (GEMM) code
        for dense matrices using Julia, Python/Numba and Kokkos
        implementations and compare the performance with C for multithreaded
        CPU (OpenMP) and single GPU (CUDA/HIP) systems."

    But NOBODY DOES THIS! We use Python to drive massive, optimized, low-level C++ software and hyper-parallel computing hardware because it is simply the friendliest, easiest-to-use language for scripting and controlling OTHER stuff. That other stuff is generally not written in Python if it's performance critical. This is what Python is good at: wrapping some exotic, hard-to-use, low-level module in a happy little Python module that completely hides all the messy initialization and interface code.
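
    A sketch of that wrapping pattern (the library name "libfast" and its "fast_gemm" entry point are made up for illustration):

        import ctypes
        import numpy as np

        # Load the compiled low-level library (hypothetical name) and
        # declare the C signature once, up front.
        _lib = ctypes.CDLL("libfast.so")
        _lib.fast_gemm.argtypes = [
            ctypes.POINTER(ctypes.c_double),            # A
            ctypes.POINTER(ctypes.c_double),            # B
            ctypes.POINTER(ctypes.c_double),            # C (output)
            ctypes.c_int, ctypes.c_int, ctypes.c_int,   # m, n, k
        ]

        def gemm(A, B):
            """Friendly wrapper: hides all pointer/ABI plumbing from the caller."""
            A = np.ascontiguousarray(A, dtype=np.float64)
            B = np.ascontiguousarray(B, dtype=np.float64)
            m, k = A.shape
            n = B.shape[1]
            C = np.zeros((m, n))
            as_ptr = lambda a: a.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
            _lib.fast_gemm(as_ptr(A), as_ptr(B), as_ptr(C), m, n, k)
            return C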

    Stupid paper is stupid.

    Sure, if you want to prototype something and develop algorithms, it's way easier in Python too, but the performance won't be as good as an optimized C++ or Rust version, and you would likely never use that in production. But most of what's happening in AI these days is high-level Python sitting on top of super-optimized PyTorch etc., because much of the time you don't NEED to write anything new that's low level, so your interface to the whole world of high-performance AI can simply be a Python one, giving you the best of both worlds.
