In a few areas of the Rust ecosystem we are starting to see convergence on lower-level libraries that can then be shared amongst the higher-level crates. For example, wgpu is seeing broad use across a bunch of libraries, from game engines to UI libraries. This then allows the shared libraries to be made more robust, with more shared resources going into them.
Does anyone know how much of this is happening in the matrix/array space in rust? There are several libraries that have overlapping goals: ndarray, nalgebra, etc. How much do they share in terms of underlying code? Do they share data structures, or is anything like that on the horizon?
as far as i know, very little is shared. ndarray-linalg is mostly a lapack wrapper. nalgebra and faer both implement the algorithms from scratch, with nalgebra focusing more on smaller matrices
Good to see LU decomposition with full pivoting being implemented here (which is missing from BLAS/LAPACK). This gives a fast, numerically stable way to compute the rank of a matrix (with a basis of the kernel and image spaces). Details: https://www.heinrichhartmann.com/posts/2021-03-08-rank-decom....
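To make the rank connection concrete, here is a dependency-free sketch of the idea: pick the largest-magnitude entry of the trailing submatrix as the pivot, eliminate, and count pivots above a tolerance. A real implementation (faer's, or nalgebra's FullPivLU) is far more careful about numerics and performance; this is just illustrative.

```rust
// Rank via LU with full pivoting. The full pivot search makes the
// pivot-count-above-tolerance a numerically reliable rank estimate.
fn rank_full_piv_lu(mut a: Vec<Vec<f64>>, tol: f64) -> usize {
    let m = a.len();
    let n = if m > 0 { a[0].len() } else { 0 };
    let mut rank = 0;
    for k in 0..m.min(n) {
        // find the entry of largest magnitude in the trailing submatrix
        let (mut pi, mut pj, mut pmax) = (k, k, 0.0);
        for i in k..m {
            for j in k..n {
                if a[i][j].abs() > pmax {
                    pmax = a[i][j].abs();
                    pi = i;
                    pj = j;
                }
            }
        }
        if pmax <= tol {
            break; // remaining block is numerically zero
        }
        a.swap(k, pi); // row swap
        for row in a.iter_mut() {
            row.swap(k, pj); // column swap
        }
        rank += 1;
        // eliminate below the pivot
        for i in (k + 1)..m {
            let f = a[i][k] / a[k][k];
            for j in k..n {
                a[i][j] -= f * a[k][j];
            }
        }
    }
    rank
}

fn main() {
    // row 3 = row 1 + row 2, so the rank is 2
    let a = vec![
        vec![1.0, 2.0, 3.0],
        vec![4.0, 5.0, 6.0],
        vec![5.0, 7.0, 9.0],
    ];
    println!("{}", rank_full_piv_lu(a, 1e-10)); // prints 2
}
```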
If you are going to include vs MKL benchmarks in your repo, full pivoting LU might be one to consider. I think most people are happy with partial pivoting, so I sorta suspect Intel hasn’t heavily tuned their implementation, might be room to beat up on the king of the hill, haha.
it might be interesting to add butterfly lu https://arxiv.org/abs/1901.11371. it's a way of doing a numerically stable lu like factorization without any pivoting, which allows it to parallelize better.
Is there a collection of algorithms published somewhere, in a coherent manner? It seems like to me, there should be a collection of matrix algebra, symbolic algebra, etc. algorithms somewhere that makes it easy to implement these in any language or system. As it is, it seems you need to already be an expert or researcher in the space to make any movement.
>The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level-2 (matrix-vector) and level-3 (matrix-matrix) BLAS operations can be expressed and optimized in terms of very simple kernels.
That can be argued, but they're pretty esoteric. For example, what language are they written in? Fortran, assembly, etc.? Probably not something the majority of developers or mathematicians can easily dig into. And creating bindings for them is non-trivial. I don't even know where to start from the website. And on the website, all the links to the routines are broken. https://www.netlib.org/blas/
The naming convention is a bit esoteric, but once you figure it out it's not too bad. I don't think writing bindings is too hard once you know what's going on. You most likely would want to write some kind of wrapper on top of that in order to make things a little more user-friendly.
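For anyone new to it, the convention packs precision + matrix kind + operation into each name, e.g. `dgemm` = Double precision, GEneral matrix, Matrix-Matrix multiply. A tiny decoder covering a few common cases (illustrative only, nowhere near the full scheme):

```rust
// Decode a BLAS/LAPACK-style routine name into plain English.
// Covers only a handful of common prefixes/suffixes.
fn decode_blas_name(name: &str) -> Option<String> {
    let prec = match name.get(0..1)? {
        "s" => "single precision real",
        "d" => "double precision real",
        "c" => "single precision complex",
        "z" => "double precision complex",
        _ => return None,
    };
    let kind = match name.get(1..3)? {
        "ge" => "general",
        "sy" => "symmetric",
        "he" => "Hermitian",
        "tr" => "triangular",
        _ => return None,
    };
    let op = match name.get(3..)? {
        "mm" => "matrix-matrix multiply",
        "mv" => "matrix-vector multiply",
        "sv" => "solve linear system",
        _ => return None,
    };
    Some(format!("{prec}, {kind}, {op}"))
}

fn main() {
    println!("{}", decode_blas_name("dgemm").unwrap());
    // → double precision real, general, matrix-matrix multiply
}
```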
I will agree with you on the routines being broken. That happened within the past year. Not sure why it happened, but it's annoying. If you know the name of the routine you can usually find documentation on another site.
Eigen handles most (if not all, I just skimmed the tables) tasks in parallel [0]. Plus, it has hand-tuned SIMD code inside, so it needs "-march=native -mtune=native -O3" to make it "full send".
Some solvers' speed change more than 3x with "-O3", to begin with.
Related on the Eigen benchmarking, I see a lot of use of auto in the benchmarks. Eigen does not recommend using it like this (https://eigen.tuxfamily.org/dox/TopicPitfalls.html) because the template expressions can be quite complicated. I'm not sure if it matters here or not, but it would probably better not to use it in the benchmarks.
I wasn't worried about safe usage, more that some of the initialization may be moved inside the benchmarking function instead of outside of it like intended. I'm sure you know more about it than me though.
While auto is resolved purely at compile time, the resulting template expressions create a lot of load during compilation of this benchmark.
My complete Ph.D., using a ton of Eigen components plus other libraries, was compiling in 10 seconds flat on a much older computer. This requires gigs of RAM plus a minute.
the long compile times are mostly because im instantiating every dense decomposition in the library in one translation unit, for several data types (f32, f64, f128, c32, c64, c128)
Something looks dubious with the benchmarking here to me.
Top-tier numerical linear algebra libraries should all hit the same number (give or take a few percent) for matrix multiply, because they're all achieving the same hardware peak performance.
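To spell out why that convergence happens: GEMM does a fixed 2n³ flops, and every library is measured against the same theoretical machine peak (cores × clock × SIMD lanes × FMA units × 2 flops per FMA). A quick back-of-envelope sketch; the machine specs and timing below are hypothetical:

```rust
// Achieved GFLOP/s for an n x n x n matrix multiply finishing in `seconds`.
fn gemm_gflops(n: u64, seconds: f64) -> f64 {
    (2 * n * n * n) as f64 / seconds / 1e9
}

// Theoretical peak: 2 flops per FMA (multiply + add).
fn peak_gflops(cores: f64, ghz: f64, simd_lanes: f64, fma_units: f64) -> f64 {
    cores * ghz * simd_lanes * fma_units * 2.0
}

fn main() {
    // e.g. 8 cores at 3.0 GHz, AVX2 (4 f64 lanes), 2 FMA units per core
    let peak = peak_gflops(8.0, 3.0, 4.0, 2.0); // = 384 GFLOP/s
    // a 4096^3 GEMM finishing in 0.4 s (hypothetical timing)
    let achieved = gemm_gflops(4096, 0.4);
    println!("{:.0} of {:.0} GFLOP/s ({:.0}%)", achieved, peak, 100.0 * achieved / peak);
}
```

Once a library is within a few percent of that peak, there is simply no headroom left to differ by, which is why a large gap in a GEMM benchmark is a red flag for the benchmark itself.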
one issue that may be affecting the result is that openmp's threadpool doesn't play well with rayon's. i've seen some perf degradation in the past (on both sides) when both are used in the same program
i plan to address that after refactoring the benches by executing each library individually
Very cool project! I'd suggest, before running the new benchmarks, reaching out to the developers of the packages you are testing against to see if they think the benchmark you wrote is using efficient calling conventions. I work on a large open source software project and we've had people claim they are 10x faster than us while they were really using our code in some very malformed ways.
Also stops them from grumbling after you post good results!
fair enough. i try to stay in touch with the eigen and nalgebra developers so i have a good idea on how to write code efficiently with them. for openblas and mkl i've been trying recently to call into the lapack api (benches doing that are still unpublished at the moment), that way im using a universal interface for that kinda stuff.
and of course i do check the cpu utilization to make sure that all threads are spinning for multithreaded benchmarks, and occasionally check the assembly of the hot loops to make sure that the libraries were built properly and are dispatching to the right code. (avx2, avx512, etc) so overall i try to take it seriously and i'll give credit where credit is due when it turns out another library is faster
Looking at thin matrix SVD, it appears much faster than everyone else. I’m curious what it’s doing differently at a high level and if there’s any tradeoff in accuracy. I also wonder how it compares to MKL, which is typically the winner in all these benchmarks on Intel.
the former is run with no parallelism, and the latter with `parallelism::Rayon(0)`, which presumably means that it has some parallelism, though I don't write rust or know what Rayon is.
i have, yes. i can't speak for openblas or mkl, but im familiar with eigen and nalgebra's implementations to some extent
nalgebra doesn't use blocking, so decompositions are handled one column (or row) at a time. this is great for small matrices, but scales poorly for larger ones
eigen uses blocking for most decompositions, other than the eigendecomposition, but they don't have a proper threading framework. the only operation that is properly multithreaded is matrix multiplication using openmp (and the unstable tensor module using a custom thread pool)
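The blocking idea the two comments above refer to, shown on plain matrix multiply since it's the simplest case: process the matrix in small square tiles so the working set stays in cache, rather than streaming whole rows or columns. Decomposition blocking (as in Eigen or LAPACK panel factorizations) applies the same principle to QR/LU. Toy sketch only, with arbitrary sizes:

```rust
const N: usize = 64;
const B: usize = 16; // block size, chosen so tiles fit in cache

// Naive triple loop: the inner loop streams a full row of `a`
// and a full column of `b` for every output element.
fn matmul_naive(a: &[f64], b: &[f64], c: &mut [f64]) {
    for i in 0..N {
        for j in 0..N {
            let mut s = 0.0;
            for k in 0..N {
                s += a[i * N + k] * b[k * N + j];
            }
            c[i * N + j] = s;
        }
    }
}

// Blocked version: same arithmetic, reordered so each BxB tile of
// `a` and `b` is reused many times while it is still in cache.
fn matmul_blocked(a: &[f64], b: &[f64], c: &mut [f64]) {
    for x in c.iter_mut() {
        *x = 0.0;
    }
    for ii in (0..N).step_by(B) {
        for kk in (0..N).step_by(B) {
            for jj in (0..N).step_by(B) {
                for i in ii..ii + B {
                    for k in kk..kk + B {
                        let aik = a[i * N + k];
                        for j in jj..jj + B {
                            c[i * N + j] += aik * b[k * N + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let a: Vec<f64> = (0..N * N).map(|i| (i % 7) as f64).collect();
    let b: Vec<f64> = (0..N * N).map(|i| (i % 5) as f64).collect();
    let (mut c1, mut c2) = (vec![0.0; N * N], vec![0.0; N * N]);
    matmul_naive(&a, &b, &mut c1);
    matmul_blocked(&a, &b, &mut c2);
    assert_eq!(c1, c2); // same result, different memory access order
    println!("ok");
}
```

The unblocked, column-at-a-time approach described for nalgebra is exactly the naive ordering: fine while everything fits in cache (small matrices), increasingly memory-bound as sizes grow.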
Correct, linear algebra and linear programming are two very distinct things. The latter is a widely-used optimization technique, and computationally depends on the former, which is a general mathematical framework for essentially all numerical computing.
I asked because Faer describes itself as a low-level linear algebra library with a high-level wrapper, whereas good_lp describes itself as a high-level wrapper around many relevant low-level libraries.
Nixpkgs has some pluggable BLAS/LAPACK implementation infrastructure. If this does offer a shim layer implementing exactly that interface, it would be nice to see this packaged as a new alternative!
most of it's just general programming niceness. If you have to spend a few hours wrestling with make/bazel/etc every time you need to reach for a dependency, you don't depend on things and have to rewrite them yourself. If your programming language doesn't have good ways of writing generic code, you either have to write the code once per precision (and do all your bugfixes/perf improvements in triplicate) or do very hacky metaprogramming where you use Python to generate your low level code (yes the Fortran people actually do this https://github.com/aradi/fypp), or use the C preprocessor which is even worse.
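What "good generic code" buys you here, in a minimal sketch using only std traits: one dot-product implementation covers f32 and f64 (and could cover complex types with a suitable trait bound), instead of s/d/c/z copies or preprocessor tricks.

```rust
use std::iter::Sum;
use std::ops::Mul;

// One implementation, monomorphized per element type by the compiler.
fn dot<T>(x: &[T], y: &[T]) -> T
where
    T: Copy + Mul<Output = T> + Sum<T>,
{
    x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}

fn main() {
    let a32: [f32; 3] = [1.0, 2.0, 3.0];
    let b32: [f32; 3] = [4.0, 5.0, 6.0];
    let a64: [f64; 3] = [1.0, 2.0, 3.0];
    let b64: [f64; 3] = [4.0, 5.0, 6.0];
    // the same source compiles to both an f32 and an f64 version
    println!("{} {}", dot(&a32, &b32), dot(&a64, &b64)); // prints: 32 32
}
```

A bugfix or vectorization improvement in `dot` lands in every precision at once, which is the whole point the comment above is making.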