High-level dynamic languages such as Python, Julia, and R have been at the forefront of artificial intelligence/machine learning (AI/ML), data analysis, and interactive computing workflows in the last decade. Traditional high-performance computing (HPC) frameworks that power the underlying low-level computations for performance and scalability are written in compiled languages: C, C++, and Fortran.
[...] We analyze single node scalability on two systems hosted at the Oak Ridge Leadership Computing Facility (OLCF)—Wombat, which uses Arm Ampere Neoverse CPUs and 2 NVIDIA A100 GPUs, and Crusher, which is equipped with AMD EPYC 7A53 CPUs and 8 MI250X GPUs and serves as a test bed for Frontier, the first exascale system on the TOP500 list.
[...] We run hand-rolled general matrix multiplication (GEMM) code for dense matrices using Julia, Python/Numba and Kokkos implementations and compare the performance with C for multithreaded CPU (OpenMP) and single GPU (CUDA/HIP) systems. GEMM is an important kernel in the Basic Linear Algebra Subprograms (BLAS) used across several deep learning AI frameworks, for which modern GPU architectures have been heavily optimized via tensor cores.
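For readers who haven't seen it, the "hand-rolled" kernel in question is just a triple loop; a minimal plain-Python sketch (purely illustrative, not the paper's actual benchmark code) looks like this:

```python
def gemm(A, B):
    """Naive dense GEMM: C[i][j] = sum_k A[i][k] * B[k][j].

    A is n x m, B is m x p, both given as nested lists.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]  # hoist the A element out of the innermost loop
            for j in range(p):
                C[i][j] += a * B[k][j]
    return C

# 2x2 sanity check
print(gemm([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
# [[19.0, 22.0], [43.0, 50.0]]
```

The paper's implementations layer threading (OpenMP, Julia threads, Numba) and GPU offload on top of this same loop nest.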
[...] For CPUs, Julia performance was comparable to C/OpenMP combined with LLVM-based ArmClang and AMDClang vendor compilers. For the AMD GPUs, Julia AMDGPU.jl performance was comparable to HIP. Nevertheless, there is still a performance gap on NVIDIA A100 GPUs for single-precision floating point cases.
[...] We observe that Python/Numba implementations still lack the support needed to reach comparable CPU and GPU performance on these systems, and AMD GPU support is deprecated.
Pre-print article:
William F. Godoy, Pedro Valero-Lara, T. Elise Dettling, Christian Trefftz, Ian Jorquera, Thomas Sheehy, Ross G. Miller, Marc Gonzalez-Tallada, Jeffrey S. Vetter, and Valentin Churavy. "Evaluating performance and portability of high-level programming models: Julia, Python/Numba, and Kokkos on exascale nodes." Accepted at the 28th HIPS workshop, held in conjunction with IPDPS 2023. arXiv:2303.06195, https://doi.org/10.48550/arXiv.2303.06195
(Score: 3, Informative) by RamiK on Monday March 27 2023, @02:00PM
( https://arxiv.org/pdf/2303.06195.pdf [arxiv.org] )
(Score: 4, Insightful) by JoeMerchant on Monday March 27 2023, @06:45PM (2 children)
The algorithm you're implementing in the language matters far more than the language itself.
We had a slow performer in Matlab. Translated it to C++, was faster, but still a dog. Parallelized it, got 3.9x speedup on 4 cores, as expected. Candidate for supercomputer? Yeah, that was actually the plan since Matlab days.
I threw a wrench in that plan, actually: a code review. Discovered the core of the algorithm was 5 deep nested loops, which could be done much more efficiently as 4 deep nested loops with a little calculation in the core. Like 100x faster.
What was going to be implemented on a rack of 50-100 machines, parallelized for acceptable speed, was now running acceptably fast on a 2-core laptop. Thrown on a modest quad-core workstation, it was faster than the fast end of the original requirement spec.
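The fix described above, hoisting a quantity that the innermost loop recomputes on every iteration, is a classic optimization that removes an entire loop level's worth of work. A toy sketch (purely illustrative, not the poster's actual code):

```python
def slow(xs, n):
    total = 0
    for i in range(n):
        for j in range(n):
            # recomputes the same sum on every iteration:
            # an extra factor of len(xs) in the running time
            total += sum(xs)
    return total

def fast(xs, n):
    s = sum(xs)  # hoisted: computed once, outside both loops
    total = 0
    for i in range(n):
        for j in range(n):
            total += s
    return total

print(slow([1, 2, 3], 4), fast([1, 2, 3], 4))  # 96 96
```

Same answer, one fewer effective loop level: the reason a 50-100 machine rack job became a laptop job.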
(Score: 0) by Anonymous Coward on Monday March 27 2023, @07:00PM (1 child)
For numerical packages it is often wrappers to tuned libraries. In which case the development environment is also important - MATLAB is the gold standard as you can play with the data and commands interactively. It almost never bombs and when you program as the language was intended (block operations) it is fast. Octave is pretty damn good these days with its GUI, but not as robust or fast. Julia is sorely in need of an IDE. I know they don't prioritize this but it's too painful going back to vi and xterm, so I never get beyond trivial operations. exit... exit()... nuts.
(Score: 2) by turgid on Wednesday March 29 2023, @08:50PM
The Human Race is doomed.
(Score: 2) by jb on Tuesday March 28 2023, @02:44AM
So they compared hand-optimised code in 2 high-level languages against common libraries in lower-level languages and wondered why there wasn't much difference?
FFS, compare apples with apples! If you're going to hand-optimise in one language, then hand-optimise in the language you're comparing it to as well (and have it done by someone who's just as proficient at optimising plain C as your Python etc. guy was at optimising that).
(Score: 1) by Zoot on Tuesday March 28 2023, @03:38AM (3 children)
"We run hand-rolled general matrix multiplication (GEMM) code
for dense matrices using Julia, Python/Numba and Kokkos
implementations and compare the performance with C for multithreaded
CPU (OpenMP) and single GPU (CUDA/HIP) systems."
But NOBODY DOES THIS! We use Python to drive massive, optimized low-level C++ software and hyper-parallel computing hardware because it is simply the friendliest, easiest-to-use language for scripting and controlling OTHER stuff. That other stuff is not generally written in Python if it's performance critical. This is what Python is good at: wrapping some exotic, hard-to-use, low-level module with a happy little Python module that completely hides all the messy initialization and interface code.
Stupid paper is stupid.
Sure, if you want to prototype something and develop algorithms, then it's way easier in Python too, but the performance won't be as good as some optimized C++ or Rust version, and you would likely never use that in production. But most of what's happening in AI these days is all high-level Python sitting on top of super-optimized PyTorch etc., because you don't NEED to write anything new that's low-level much of the time, so your interface to the whole world of high-performance AI can simply be a Python one, giving you the best of both worlds.
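The wrapping pattern described above can be sketched with nothing but the standard library's ctypes; here the system math library stands in for the "exotic, hard to use, low-level module" (library name resolution is platform-dependent, so treat this as a Linux-flavored sketch):

```python
import ctypes
import ctypes.util

# Locate and load the C math library (fall back to a common Linux soname).
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the C signature: double cos(double).
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

# The "happy little Python module" part: hide the FFI details behind
# an ordinary Python function.
def cos(x: float) -> float:
    return libm.cos(x)

print(cos(0.0))  # 1.0
```

Real bindings (NumPy, PyTorch) use the same idea at scale, usually via C extension modules or generated bindings rather than hand-written ctypes.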
(Score: 1) by jper on Tuesday March 28 2023, @12:19PM (1 child)
It doesn't use "plain" Python. It uses Numba, which is a JIT for numeric-intensive Python code.
(Score: 3, Interesting) by RamiK on Tuesday March 28 2023, @05:49PM
On top of using Numba, all their Python examples are using NumPy, which is wrapped C. So, in essence, the conclusions are:
1. Python has an unavoidable overhead even if all you're doing is calling C and using an optimized Python JIT.
2. Julia's JIT doesn't have such an overhead and can, and does, match C/C++ performance figures despite being nearly as high-level as Python (some of the types are a bit more strict here and there).
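Both ingredients, a Numba-jitted loop nest and NumPy's wrapped-C BLAS, can be seen side by side in a few lines (a hedged sketch: if Numba isn't installed, the decorator below degrades to plain Python):

```python
import numpy as np

try:
    from numba import njit        # JIT-compiles the loop nest to machine code
except ImportError:
    njit = lambda f: f            # fallback: run the loops as plain Python

@njit
def gemm_loops(A, B, C):
    # hand-rolled triple loop, the kind of kernel the paper benchmarks
    for i in range(A.shape[0]):
        for k in range(A.shape[1]):
            a = A[i, k]
            for j in range(B.shape[1]):
                C[i, j] += a * B[k, j]

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
C = np.zeros((2, 2))
gemm_loops(A, B, C)

# NumPy's @ dispatches to a tuned BLAS written in C/Fortran -- the
# "wrapped C" above -- and should agree with the hand-rolled loops:
assert np.allclose(C, A @ B)
print(C)
```

The overhead in question is everything around the kernel: the Python interpreter dispatching the call, argument boxing, and the first-call JIT compilation, none of which Julia's ahead-of-use compilation model pays in the same way.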
Anyhow, since linking the repo clearly wasn't enough to get the discussion beyond the cliches, here's an arguably representative example that should speak for itself:
Python: https://github.com/williamfgc/simple-gemm/blob/main/python/GemmDenseBLAS/GemmDenseBLAS.py [github.com]
Julia: https://github.com/williamfgc/simple-gemm/blob/main/julia/GemmDenseBLAS/src/GemmDenseBLAS.jl [github.com]
C++: https://github.com/williamfgc/simple-gemm/blob/main/cpp/gemm-dense-blas.cpp [github.com] https://github.com/williamfgc/simple-gemm/blob/main/cpp/gemm-dense-common.cpp [github.com]
C: https://github.com/williamfgc/simple-gemm/blob/main/c/gemm-dense-blas64.c [github.com]
(Score: 0) by Anonymous Coward on Tuesday March 28 2023, @03:39PM
Well, Julia can be used "this way", so I kind of think it wins big time.