In a few areas of the Rust ecosystem we are starting to see convergence on lower-level libraries that can then be shared amongst the higher-level crates. For example, wgpu is seeing broad use across a bunch of libraries, from game engines to UI libraries. This then allows the shared libraries to be made more robust, with more shared resources going into them.
Does anyone know how much of this is happening in the matrix/array space in rust? There are several libraries that have overlapping goals: ndarray, nalgebra, etc. How much do they share in terms of underlying code? Do they share data structures, or is anything like that on the horizon?
as far as i know, very little is shared. ndarray-linalg is mostly a lapack wrapper. nalgebra and faer both implement the algorithms from scratch, with nalgebra focusing more on smaller matrices
Good to see LU decomposition with full pivoting being implemented here (which is missing from BLAS/LAPACK). This gives a fast, numerically stable way to compute the rank of a matrix (with a basis of the kernel and image spaces). Details: https://www.heinrichhartmann.com/posts/2021-03-08-rank-decom....
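To make the rank connection concrete, here is a dependency-free sketch of the idea: pick the largest-magnitude entry of the trailing submatrix as the pivot, eliminate, and count pivots above a tolerance. A real implementation (faer's, or nalgebra's FullPivLU) is far more careful about numerics and performance; this is just illustrative.

```rust
// Rank via LU with full pivoting. The full pivot search makes the
// pivot-count-above-tolerance a numerically reliable rank estimate.
fn rank_full_piv_lu(mut a: Vec<Vec<f64>>, tol: f64) -> usize {
    let m = a.len();
    let n = if m > 0 { a[0].len() } else { 0 };
    let mut rank = 0;
    for k in 0..m.min(n) {
        // find the entry of largest magnitude in the trailing submatrix
        let (mut pi, mut pj, mut pmax) = (k, k, 0.0);
        for i in k..m {
            for j in k..n {
                if a[i][j].abs() > pmax {
                    pmax = a[i][j].abs();
                    pi = i;
                    pj = j;
                }
            }
        }
        if pmax <= tol {
            break; // remaining block is numerically zero
        }
        a.swap(k, pi); // row swap
        for row in a.iter_mut() {
            row.swap(k, pj); // column swap
        }
        rank += 1;
        // eliminate below the pivot
        for i in (k + 1)..m {
            let f = a[i][k] / a[k][k];
            for j in k..n {
                a[i][j] -= f * a[k][j];
            }
        }
    }
    rank
}

fn main() {
    // row 3 = row 1 + row 2, so the rank is 2
    let a = vec![
        vec![1.0, 2.0, 3.0],
        vec![4.0, 5.0, 6.0],
        vec![5.0, 7.0, 9.0],
    ];
    println!("{}", rank_full_piv_lu(a, 1e-10)); // prints 2
}
```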
If you are going to include vs MKL benchmarks in your repo, full pivoting LU might be one to consider. I think most people are happy with partial pivoting, so I sorta suspect Intel hasn’t heavily tuned their implementation, might be room to beat up on the king of the hill, haha.
it might be interesting to add butterfly lu https://arxiv.org/abs/1901.11371. it's a way of doing a numerically stable lu like factorization without any pivoting, which allows it to parallelize better.
Is there a collection of algorithms published somewhere, in a coherent manner? It seems like to me, there should be a collection of matrix algebra, symbolic algebra, etc. algorithms somewhere that makes it easy to implement these in any language or system. As it is, it seems you need to already be an expert or researcher in the space to make any movement.
>The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level-2 (matrix-vector) and level-3 (matrix-matrix) BLAS operations can be expressed and optimized in terms of very simple kernels.
That can be argued, but they're pretty esoteric. For example, what language are they written in? Fortran, assembly, etc.? Probably not something the majority of developers or mathematicians can easily dig into. And creating bindings for them is non-trivial. I don't even know where to start from the website. And on the website, all the links to the routines are broken. https://www.netlib.org/blas/
The naming convention is a bit esoteric, but once you figure it out it's not too bad. I don't think writing bindings is too hard once you know what's going on. You most likely would want to write some kind of wrapper on top of that in order to make things a little more user-friendly.
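For anyone new to it, the convention packs precision + matrix kind + operation into each name, e.g. `dgemm` = Double precision, GEneral matrix, Matrix-Matrix multiply. A tiny decoder covering a few common cases (illustrative only, nowhere near the full scheme):

```rust
// Decode a BLAS/LAPACK-style routine name into plain English.
// Covers only a handful of common prefixes/suffixes.
fn decode_blas_name(name: &str) -> Option<String> {
    let prec = match name.get(0..1)? {
        "s" => "single precision real",
        "d" => "double precision real",
        "c" => "single precision complex",
        "z" => "double precision complex",
        _ => return None,
    };
    let kind = match name.get(1..3)? {
        "ge" => "general",
        "sy" => "symmetric",
        "he" => "Hermitian",
        "tr" => "triangular",
        _ => return None,
    };
    let op = match name.get(3..)? {
        "mm" => "matrix-matrix multiply",
        "mv" => "matrix-vector multiply",
        "sv" => "solve linear system",
        _ => return None,
    };
    Some(format!("{prec}, {kind}, {op}"))
}

fn main() {
    println!("{}", decode_blas_name("dgemm").unwrap());
    // → double precision real, general, matrix-matrix multiply
}
```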
I will agree with you on the routines being broken. That happened within the past year. Not sure why it happened, but it's annoying. If you know the name of the routine you can usually find documentation on another site.
Eigen handles most (if not all, I just skimmed the tables) tasks in parallel [0]. Plus, it has hand-tuned SIMD code inside, so it needs "-march=native -mtune=native -O3" to make it "full send".
Some solvers' speed change more than 3x with "-O3", to begin with.
Related on the Eigen benchmarking, I see a lot of use of auto in the benchmarks. Eigen does not recommend using it like this (https://eigen.tuxfamily.org/dox/TopicPitfalls.html) because the template expressions can be quite complicated. I'm not sure if it matters here or not, but it would probably better not to use it in the benchmarks.
I wasn't worried about safe usage, more that some of the initialization may be moved inside the benchmarking function instead of outside of it like intended. I'm sure you know more about it than me though.
While auto is resolved purely at compile time, the resulting template expressions create a lot of load during compilation of this benchmark.
My complete Ph.D., using a ton of Eigen components plus other libraries, was compiling in 10 seconds flat on a much older computer. This requires gigs of RAM plus a minute.
the long compile times are mostly because im instantiating every dense decomposition in the library in one translation unit, for several data types (f32, f64, f128, c32, c64, c128)
Something looks dubious with the benchmarking here to me.
Top-tier numerical linear algebra libraries should all hit the same number (give or take a few percent) for matrix multiply, because they're all achieving the same hardware peak performance.
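To spell out why that convergence happens: GEMM does a fixed 2n³ flops, and every library is measured against the same theoretical machine peak (cores × clock × SIMD lanes × FMA units × 2 flops per FMA). A quick back-of-envelope sketch; the machine specs and timing below are hypothetical:

```rust
// Achieved GFLOP/s for an n x n x n matrix multiply finishing in `seconds`.
fn gemm_gflops(n: u64, seconds: f64) -> f64 {
    (2 * n * n * n) as f64 / seconds / 1e9
}

// Theoretical peak: 2 flops per FMA (multiply + add).
fn peak_gflops(cores: f64, ghz: f64, simd_lanes: f64, fma_units: f64) -> f64 {
    cores * ghz * simd_lanes * fma_units * 2.0
}

fn main() {
    // e.g. 8 cores at 3.0 GHz, AVX2 (4 f64 lanes), 2 FMA units per core
    let peak = peak_gflops(8.0, 3.0, 4.0, 2.0); // = 384 GFLOP/s
    // a 4096^3 GEMM finishing in 0.4 s (hypothetical timing)
    let achieved = gemm_gflops(4096, 0.4);
    println!("{:.0} of {:.0} GFLOP/s ({:.0}%)", achieved, peak, 100.0 * achieved / peak);
}
```

Once a library is within a few percent of that peak, there is simply no headroom left to differ by, which is why a large gap in a GEMM benchmark is a red flag for the benchmark itself.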
one issue that may be affecting the result is that openmp's threadpool doesn't play well with rayon's. i've seen some perf degradation in the past (on both sides) when both are used in the same program
i plan to address that after refactoring the benches by executing each library individually
Very cool project! I'd suggest, before running the new benchmarks, reaching out to the developers of the packages you are testing against to see if they think the benchmark you wrote is using efficient calling conventions. I work on a large open source software project and we've had people claim they are 10x faster than us while they were really using our code in some very malformed ways.
Also stops them from grumbling after you post good results!
fair enough. i try to stay in touch with the eigen and nalgebra developers so i have a good idea on how to write code efficiently with them. for openblas and mkl i've been trying recently to call into the lapack api (benches doing that are still unpublished at the moment), that way im using a universal interface for that kinda stuff.
and of course i do check the cpu utilization to make sure that all threads are spinning for multithreaded benchmarks, and occasionally check the assembly of the hot loops to make sure that the libraries were built properly and are dispatching to the right code. (avx2, avx512, etc) so overall i try to take it seriously and i'll give credit where credit is due when it turns out another library is faster
Looking at thin matrix SVD, it appears much faster than everyone else. I’m curious what it’s doing differently at a high level and if there’s any tradeoff in accuracy. I also wonder how it compares to MKL, which is typically the winner in all these benchmarks on Intel.
the former is run with no parallelism, and the latter with `parallelism::Rayon(0)`, which presumably means that it has some parallelism, though I don't write rust or know what Rayon is.
i have, yes. i can't speak for openblas or mkl, but im familiar with eigen and nalgebra's implementations to some extent
nalgebra doesn't use blocking, so decompositions are handled one column (or row) at a time. this is great for small matrices, but scales poorly for larger ones
eigen uses blocking for most decompositions, other than the eigendecomposition, but they don't have a proper threading framework. the only operation that is properly multithreaded is matrix multiplication using openmp (and the unstable tensor module using a custom thread pool)
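The blocking idea the two comments above refer to, shown on plain matrix multiply since it's the simplest case: process the matrix in small square tiles so the working set stays in cache, rather than streaming whole rows or columns. Decomposition blocking (as in Eigen or LAPACK panel factorizations) applies the same principle to QR/LU. Toy sketch only, with arbitrary sizes:

```rust
const N: usize = 64;
const B: usize = 16; // block size, chosen so tiles fit in cache

// Naive triple loop: the inner loop streams a full row of `a`
// and a full column of `b` for every output element.
fn matmul_naive(a: &[f64], b: &[f64], c: &mut [f64]) {
    for i in 0..N {
        for j in 0..N {
            let mut s = 0.0;
            for k in 0..N {
                s += a[i * N + k] * b[k * N + j];
            }
            c[i * N + j] = s;
        }
    }
}

// Blocked version: same arithmetic, reordered so each BxB tile of
// `a` and `b` is reused many times while it is still in cache.
fn matmul_blocked(a: &[f64], b: &[f64], c: &mut [f64]) {
    for x in c.iter_mut() {
        *x = 0.0;
    }
    for ii in (0..N).step_by(B) {
        for kk in (0..N).step_by(B) {
            for jj in (0..N).step_by(B) {
                for i in ii..ii + B {
                    for k in kk..kk + B {
                        let aik = a[i * N + k];
                        for j in jj..jj + B {
                            c[i * N + j] += aik * b[k * N + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let a: Vec<f64> = (0..N * N).map(|i| (i % 7) as f64).collect();
    let b: Vec<f64> = (0..N * N).map(|i| (i % 5) as f64).collect();
    let (mut c1, mut c2) = (vec![0.0; N * N], vec![0.0; N * N]);
    matmul_naive(&a, &b, &mut c1);
    matmul_blocked(&a, &b, &mut c2);
    assert_eq!(c1, c2); // same result, different memory access order
    println!("ok");
}
```

The unblocked, column-at-a-time approach described for nalgebra is exactly the naive ordering: fine while everything fits in cache (small matrices), increasingly memory-bound as sizes grow.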
Correct, linear algebra and linear programming are two very distinct things. The latter is a widely-used optimization technique, and computationally depends on the former, which is a general mathematical framework for essentially all numerical computing.
I asked because Faer describes itself as a low-level linear algebra library with a high-level wrapper, whereas good_lp describes itself as a high-level wrapper around many relevant low-level libraries.
Nixpkgs has some pluggable BLAS/LAPACK implementation infrastructure. If this does offer a shim layer implementing exactly that interface, it would be nice to see this packaged as a new alternative!
most of it's just general programming niceness. If you have to spend a few hours wrestling with make/bazel/etc every time you need to reach for a dependency, you don't depend on things and have to rewrite them yourself. If your programming language doesn't have good ways of writing generic code, you either have to write the code once per precision (and do all your bugfixes/perf improvements in triplicate) or do very hacky metaprogramming where you use Python to generate your low level code (yes the Fortran people actually do this https://github.com/aradi/fypp), or use the C preprocessor which is even worse.
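What "good generic code" buys you here, in a minimal sketch using only std traits: one dot-product implementation covers f32 and f64 (and could cover complex types with a suitable trait bound), instead of s/d/c/z copies or preprocessor tricks.

```rust
use std::iter::Sum;
use std::ops::Mul;

// One implementation, monomorphized per element type by the compiler.
fn dot<T>(x: &[T], y: &[T]) -> T
where
    T: Copy + Mul<Output = T> + Sum<T>,
{
    x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}

fn main() {
    let a32: [f32; 3] = [1.0, 2.0, 3.0];
    let b32: [f32; 3] = [4.0, 5.0, 6.0];
    let a64: [f64; 3] = [1.0, 2.0, 3.0];
    let b64: [f64; 3] = [4.0, 5.0, 6.0];
    // the same source compiles to both an f32 and an f64 version
    println!("{} {}", dot(&a32, &b32), dot(&a64, &b64)); // prints: 32 32
}
```

A bugfix or vectorization improvement in `dot` lands in every precision at once, which is the whole point the comment above is making.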