As a grad student (and an ADHDer), I had trouble doing literature reviews systematically. To combat this, I made a website that finds similar papers based on the meaning of what I am searching for.
I used MixedBread's [1] embedding model to generate vectors from the abstracts. I store and search similar vectors using Milvus [2] and finally use Gradio [3] to serve the frontend. I update the vector database weekly by pulling the metadata dataset from Kaggle [4].
To speed up the search process on my free Oracle instance, I binarise the embeddings and use Hamming distance as the metric.
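For anyone curious what binarising plus Hamming search looks like, here is a minimal NumPy sketch. It is not the site's actual code: thresholding each dimension at zero is an assumption, the helper names (`binarise`, `hamming_distances`, `search`) are made up for illustration, and a brute-force scan stands in for the Milvus index.

```python
import numpy as np

def binarise(embeddings):
    """Turn float embeddings into packed bit vectors.

    Assumption: a dimension maps to bit 1 when its value is > 0.
    Eight dimensions are packed into each uint8 byte.
    """
    bits = (np.asarray(embeddings) > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

# Lookup table: number of set bits for every possible byte value.
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)

def hamming_distances(query_packed, corpus_packed):
    """Hamming distance from one packed query to every packed corpus row."""
    xor = np.bitwise_xor(corpus_packed, query_packed)  # differing bits
    return _POPCOUNT[xor].sum(axis=1)                  # count them per row

def search(query_embedding, corpus_packed, k=3):
    """Indices of the k corpus rows closest to the query (brute force)."""
    distances = hamming_distances(binarise(query_embedding), corpus_packed)
    return np.argsort(distances, kind="stable")[:k]
```

Packing shrinks each vector by 32x compared with float32, and the distance reduces to an XOR plus a bit count, which is why this is cheap enough for a small free instance. The trade-off is some loss in ranking quality, which is often recovered by rescoring the top hits with the original float vectors.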
I would love your feedback on the site :)
Happy Holidays!
[1]: https://www.mixedbread.ai/docs/embeddings/mxbai-embed-large-...
[2]: https://milvus.io/
[3]: https://www.gradio.app/
[4]: https://www.kaggle.com/datasets/Cornell-University/arxiv
If you expand beyond arXiv, keep in mind that coverage matters for lit reviews. Unfortunately, the big publishers (Elsevier and Springer) are forcing other indices such as OpenAlex to remove abstracts, so they're harder to get.
Have you checked out other tools like undermind.ai, scite.ai, and elicit.org?
You might also consider what else a dedicated product workflow for lit reviews includes besides search.
(used to work at scite.ai)