Show HN: I made a website to semantically search ArXiv papers

shishy · 2024-12-25T11:19:34 1735125574

I enjoy seeing projects like this!

If you expand beyond arxiv, keep in mind since coverage matters for lit reviews, unfortunately the big publishers (Elsevier and Springer) are forcing other indices like OpenAlex, etc. to remove abstracts so they're harder to get.

Have you checked out other tools like undermind.ai, scite.ai, and elicit.org?

You might consider what else a dedicated product workflow for lit reviews includes besides search

(used to work at scite.ai)

Quizzical4230 · 2024-12-25T15:02:29 1735138949

Thank you for the appreciation and great feedback!

| If you expand beyond arxiv, keep in mind since coverage matters for lit reviews,

I do have PaperMatchBio [^1] for bioRxiv and PaperMatchMed [^2] for medRxiv, however I do agree having multiple sites for domains isn't ideal. And I am yet to create a synchronization pipeline for these two so the results may be a little stale.

| unfortunately the big publishers (Elsevier and Springer) are forcing other indices like OpenAlex, etc. to remove abstracts so they're harder to get.

This sounds like a real issue in expanding the coverage.

| Have you checked out other tools like undermind.ai, scite.ai, and elicit.org?

I did, but maybe not thoroughly enough. I will check these and add complementing features.

| You might consider what else a dedicated product workflow for lit reviews includes besides search

Do you mean a reference management system like Mendeley/Zotero?

[1]: https://papermatchbio.mitanshu.tech/

[2]: https://papermatchmed.mitanshu.tech/

eric-burel · 2024-12-25T15:16:27 1735139787

Unusual use case but I write literature reviews for French R&D tax cut system, and we specifically need to: focus on most recent papers, stay on topic for a very specific problematic a company has, potentially include grey literature (tech blog articles from renowned corp), be as exhaustive as possible when it comes to freely accessible papers (we are more ok with missing paid papers unless they are really popular). A "dedicated product workflow" could be about taking business use cases like that into account. This is a real business problem, the Google Scholar lock up is annoying and I would pay for something better than what exists.

dbmikus · 2024-12-26T01:40:38 1735177238

Hey, I'm not OP, but I'm working on what seems to be the exact problem you mentioned. We (https://fixpoint.co/) search and monitor web data about companies. We are indexing patents and academic papers right now, plus we can scrape and monitor just about any website (some social media sites not supported).

We have users with very similar use cases to yours. Want to email me? dylan@fixpoint.co. I'm one of the founders :)

Quizzical4230 · 2024-12-25T15:56:13 1735142173

This is quite unique. I believe a custom solution might help you better than Google Scholar.

eric-burel · 2024-12-25T20:42:25 1735159345

This can be seen as technology watch, as opposed to a thesis literature review for instance. Google Scholar gives the best results but sadly doesn't really want you to build products on top of it : no api, no scraping. Breaking this monopoly would be a huge step forward, especially when coupled with semantic search.

mattigames · 2024-12-26T00:22:07 1735172527

"|" it's a terrible character for signaling quotes, as it looks a bit too much like "I" or "l" and sometimes even "1" or "i" depending on the font used. I believe the greater-than symbol (>) is better suited for this task.

Quizzical4230 · 2024-12-26T05:08:57 1735189737

So true ;-; I was following the Gmail protocol. I will use > from now on. Happy Holidays :D

zackmorris · 2024-12-26T16:34:41 1735230881

Edit: I moved this here from top level.

The Cloudflare challenge screen at the beginning is a dealbreaker.

Random question - does anyone know why so many papers are missing from ArXiv? Do they need to be submitted manually, perhaps by their author(s)? I'll often find papers on mathematics, physics and computer science. But papers on biology, chemistry and medicine are usually missing.

I think a database of all paper ids in existence and where they're posted or missing could be at least as useful as this. Because no papers written with any level of public funding (meaning most of them) should ever be missing.

Quizzical4230 · 2024-12-26T19:01:50 1735239710

> The Cloudflare challenge screen at the beginning is a dealbreaker.

I understand your concern, however, I do not have the know-how to properly combat bots that keep spamming the server and this seemed the easiest way for me to have a functional site. I would love to know some resources for beginners in this regard, if you have them.

>Random question...

arXiv is generally for submitting CS, maths and physics papers. There are alternate preprint repositories like biorxiv.org, chemrxiv.org and medrxiv.org for such purposes. Note: arxiv is the largest, in terms of papers hosted, among these.

zackmorris · 2024-12-27T19:33:06 1735327986

Edit: thanks for those links! I'm somewhat out of the loop academically, so have been relying on search engines whose quality seems to be in decline.

-

Combatting bots with the Cloudflare challenge screen is an X/Y problem.

The central issue is that the web has been rolled out improperly, and the way that we build websites is incorrect. The web should have been decentralized, meaning that all public-facing pages would be public domain and hosted on a peer to peer (P2P) network that grows more powerful with the number of users, similarly to how BitTorrent works. We wouldn't concern ourselves with servers at the edge, since they would already be distributed around the world and implement the caching strategies that are already part of HTTP.

Which means for example that regions in AWS would be unnecessary, and Cloudflare and other content distribution networks (CDNs) would have no business model. Coral CDN was a free working example of automatic caching that ran up until a few years ago:

https://wiki.opensourceecology.org/wiki/Coral_CDN

https://en.wikipedia.org/wiki/Coral_Content_Distribution_Net...

https://cachedview.com

https://news.ycombinator.com/item?id=19020978

Note how it's mostly been erased from history due to ensh@ttification by FAANG.

It also means that web technologies we think of as core to how external resources are included are also incorrect. Rather than Cross-Origin Resource Sharing (CORS), we should be using Subresource Integrity (SRI). That would allow us to include scripts and other media files by hash instead of just location. That also removes most of the need for build processes like Webpack, Grunt, Gulp, etc, since scripts would import other scripts directly and let the Just in Time (JIT) compiler decide what is needed.
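For readers who have not met Subresource Integrity: the integrity value is the hash algorithm name, a dash, and the base64-encoded digest of the file. A minimal sketch of computing one follows; the function name `sri_hash` and the sample script bytes are made up for illustration.

```python
import base64
import hashlib

def sri_hash(content: bytes, algorithm: str = "sha384") -> str:
    """Build an SRI integrity string: '<algorithm>-<base64 digest>'."""
    digest = hashlib.new(algorithm, content).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"

# Hypothetical script body; in practice this would be the bytes of the served file.
script = b"console.log('hello');"
integrity = sri_hash(script)
```

The resulting string is what goes into a `<script integrity="...">` attribute, so a browser can refuse a file whose bytes no longer match.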

I can go on pretty much forever with this. In 1995 I was a student at the University of Illinois in Urbana-Champaign (UIUC) where NCSA Mosaic was developed, which Netscape copied the year before when it took the internet mainstream. Stuff like Server-Side Includes (SSI) showed promise in avoiding build tools by letting developers reuse code from other servers. But there wasn't full understanding then of how hashing makes strong security guarantees. In the meantime, Marc Andreessen and other billionaires took the quick and easy path, rolling out easier (but not simpler) technologies that maximize short-term profits instead of long-term prosperity and ease of maintenance through automation.

Without a true distributed web, the endgame of all this looks like what we're seeing today. Sites that can't be scraped by alternative search engines or machine learning tools. Sites that can't be viewed securely or anonymously with Tor Browser. Sites that keep everything behind a paywall or in walled gardens, which will cause most of today's human-produced media to eventually be lost to the digital dark age.

Fixing all of this is straightforward, but it would probably require us to return to traditional values. Basically contributing some of our incomes to universities and other institutions via our taxes, so that they can work to protect the interests of the masses, who have no benefactor because it's not profitable to help them.

Billionaires and other moneyed interests don't want this, so have done everything in their power to dismantle the commons, not just on the web, but through regulatory capture to sell off public lands and other resources currently owned by everyone:

https://www.snopes.com/fact-check/elon-musk-stop-donating-wi...

Which means that this is really a cultural issue, so many of us can't see the problems or solutions without challenging our most closely-held beliefs, which creates cognitive dissonance. So even though the fixes appear obvious, they are effectively out of reach for the foreseeable future because it's easier to sabotage the system than reform it.

None of this helps you immediately though. You might be able to move from Cloudflare to a free and open source alternative like CloudFIRE, although it looks like they are copying many of its same mistakes, for example "fake browser detection and blocking" which is at the top of their list of priorities:

https://github.com/coinkite/cloudfire

I'm having trouble finding other alternatives:

https://news.ycombinator.com/item?id=34800182

So this is what I mean. If you are really interested in empowering large groups of people with free access to information, then you will be running up against the full might and momentum of the status quo.

Something that gives me hope is that most hackers and makers were originally drawn to tech as a lifeline out of subjugation doing mundane and pointless work. Tech is inherently antiauthoritarian. So all it would take is a single wealthy individual, a single internet lottery winner, to fund efforts to reevaluate what underpins the status quo from first principles. It might not take much to deliver tech which can't be unseen, which routes around artificial scarcity. We can imagine providing resources through automation, outside of any profit motive. Until then, large groups of individuals will have to keep contributing to these efforts on their own dime at a snail's pace, with what little motivation they have left after working their lives away to make rent and enrich the already wealthy.

Apologies for the wall of text, but it's the holidays so why not.

shishy · 2024-12-26T19:58:31 1735243111

There are other preprint servers. But to your question, there are centralized indices that track all papers.

DOI is the primary identifier and preprints are also issuing them now.

Crossref has papers by DOI. OpenAlex and SemanticScholar also have records, with different id types supported (doi, pmid, etc).

immibis · 2024-12-26T15:53:45 1735228425

There's always [redacted due to copyright infringement policy].se?

swyx · 2024-12-25T19:33:59 1735155239

1. why mixbread's model?

2. how much efficiency gain did you see binarising embeddings/using hamming distance?

3. why milvus over other vector stores?

4. did you automate the weekly metadata pull? just a simple cron job? anything else you need orchestrated?

user thoughts on searching for "transformers on byte level not token level" - was good but didnt turn up https://arxiv.org/abs/2412.09871 <- which is more recent, more people might want

also you might want more result density - so perhaps a UI option to collapse the abstracts and display more in the first glance.

Quizzical4230 · 2024-12-26T05:06:29 1735189589

1. The model size was small enough to process the corpus fast-ish using the limited resources I have. They also support MRL and binary embeddings, which would be helpful in case I need to downsize the VM.

2. Close to 500ms. See [^1].

3. This [^2] was the reason I went with milvus. I also assumed that more stars would result in a bigger community and hence faster bug discovery and fixes. And better feature support.

4. Yes, I automated the weekly pull here [^3]. Since I am constrained on resources available, I used HuggingFace Spaces to do the automation for me :) Although, the space keeps sleeping and to avoid that, I am planning to keep calling the same space using api/gradio_client. Let's see how that goes.

| which is more recent, more people might want

Absolutely agree. I am planning to add a 'Recency' sorting option for the same. It should balance between similarity and the date published.
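One possible way to balance similarity against publication date is a weighted blend with an exponential freshness decay. This is only a sketch under assumptions: the half-life, the weight, the function name `recency_score`, and the sample hits are all invented here and are not how PaperMatch does (or will do) it.

```python
from datetime import date

def recency_score(similarity, published, today, half_life_days=365.0, weight=0.3):
    """Blend semantic similarity with a freshness term that halves every half_life_days."""
    age_days = max((today - published).days, 0)
    freshness = 0.5 ** (age_days / half_life_days)
    return (1 - weight) * similarity + weight * freshness

today = date(2024, 12, 25)
# Hypothetical (label, similarity, published) tuples.
hits = [
    ("older", 0.82, date(2021, 5, 1)),
    ("newer", 0.78, date(2024, 12, 1)),
]
ranked = sorted(hits, key=lambda h: recency_score(h[1], h[2], today), reverse=True)
```

With these made-up numbers the slightly less similar but much newer hit moves to the top; setting `weight=0` falls back to pure similarity order.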

| also you might want more result density - so perhaps a UI option to collapse the abstracts and display more in the first glance.

Oh, I will surely look into it. Thank you so much for a detailed response. :D

[1]: https://news.ycombinator.com/item?id=42507116#42509636

[2]: https://benchmark.vectorview.ai/vectordbs.html

[3]: https://huggingface.co/spaces/bluuebunny/update_arxiv_embedd...

swyx · 2024-12-26T06:18:32 1735193912

my pleasure, thank you for the reply! ive never used milvus or heard of mixbread so this was refreshing.

curious_cat_163 · 2024-12-26T15:32:58 1735227178

This is great! I just tried some queries and the results were pretty decent, in terms of semantics. But, just thinking of it as a user, if this were to be part of my daily workflow (instead of say something like Google Scholar), I would like:

1. The option to somehow see _how_ the paper was reviewed and/or cited, if at all. There are things like OpenReview, see example [1]

2. The ability to "tell me a story to get up to speed" about a collection of papers. Generative models could help here -- but essentially, I want this thing to be able to write a paragraph for what one might find in the literature review / related work of a paper, with citations. :-)

All the best!

[1] https://openreview.net/forum?id=jhKbnNhwhc

Quizzical4230 · 2024-12-26T16:37:37 1735231057

1. I was not aware of OpenReview. I love the transparency and would definitely look into integrating it.

2. This is good feedback, making models write the Introduction section! I was planning to keep this search engine a little more traditional, however if the results are good, then it should be the way forward.

Thank you, Happy Holidays! :D

odyssey7 · 2024-12-26T17:58:15 1735235895

I have to second the idea, having hacked together something similar myself to help me complete a literature review that I wasn’t planning to publish. Simply generating summaries or pulling key quotes, paper by paper, wasn’t sufficient to be able to understand the topic in the way I wanted to for writing the literature review. In the end, the system would process a collection of hundreds of PDFs that might be related, generate summaries of what they mentioned about the topic in question, and, importantly, was also prompted to note anything about how the insights built upon or were related to insights from previous research, and the motivations behind developing that insight / the challenge it was attempting to solve and whether it was successful. This worked well enough to reduce what might have been weeks’ worth of work to just a few hours. Genuinely, I believe that research in the near future could look a lot different from what it looks like today.

fasa99 · 2024-12-25T22:19:52 1735165192

For what it's worth, back in the day (a few years ago, before the LLM boom) I found on a similar sized vector database (gensim / doc2vec), it's possible to just brute force a vector search e.g. with SSE or AVX type instructions. You can code it in C and have a python API. Your data appears to be a few gigs so that's feasible for realtime CPU brute force, <200 ms
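To make "brute force" concrete: it means scoring every stored vector against the query and keeping the best few, with no index at all. The sketch below does that in plain Python purely to show the logic; a real version would be vectorised (C with SIMD as described, or NumPy), and the toy vectors are invented.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def brute_force_search(query, docs, k=5):
    """Score every document against the query and return the indices of the k best."""
    return heapq.nlargest(k, range(len(docs)), key=lambda i: cosine(query, docs[i]))

# Toy 2-D "embeddings" for illustration only.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top2 = brute_force_search([1.0, 0.1], docs, k=2)
```

The cost is linear in corpus size, which is why it only stays interactive when the inner loop is tight and the data fits in memory.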

Quizzical4230 · 2024-12-26T05:09:45 1735189785

This is an interesting problem to tackle. Added to TODO list! :D

dmezzetti · 2024-12-25T14:18:50 1735136330

Excellent project.

As mentioned in another comment, I've put together an embeddings database using the arxiv dataset (https://huggingface.co/NeuML/txtai-arxiv) recently.

For those interested in the literature search space, a couple other projects I've worked on that may be of interest.

annotateai (https://github.com/neuml/annotateai) - Annotates papers with LLMs. Supports searching the arxiv database mentioned above.

paperai (https://github.com/neuml/paperai) - Semantic search and workflows for medical/scientific papers. Built on txtai (https://github.com/neuml/txtai)

paperetl (https://github.com/neuml/paperetl) - ETL processes for medical and scientific papers. Supports full PDF docs.

Quizzical4230 · 2024-12-25T15:24:16 1735140256

Thank you for your kind words.

These look like great projects, I will surely check them out :D

shishy · 2024-12-25T14:29:49 1735136989

paperetl is cool, saving that for later, nice! did something similar in-house with grobid in the past (great project by patrice).

dmezzetti · 2024-12-25T14:38:17 1735137497

Grobid is great. paperetl is the workhorse of the projects mentioned above. Good ole programming and multiprocessing to churn through data.

underlines · 2024-12-26T07:03:22 1735196602

hint: 8 days ago txtai released their arxiv embeddings

https://huggingface.co/NeuML/txtai-arxiv

Quizzical4230 · 2024-12-26T19:02:29 1735239749

omarhaneef · 2024-12-25T15:28:57 1735140537

For every application of semantic search, I’d love to see what the benefit is over text search. Is there a benchmark to see if it improves the search? Subjectively, did you find it surfaced new papers? Is this more useful in certain domains?

Quizzical4230 · 2024-12-25T15:51:07 1735141867

All benefits depend on the ability of the embedding model. Semantic embeddings understand nuances, so they can match abstracts that align conceptually even if no exact keywords overlap. For example, "neural networks" and "deep learning" can and should fetch similar papers.

Subjectively, yes. I sent this around my peers and they said it helped them find new authors/papers in the field while preparing their manuscripts.

| Is this more useful in certain domains?

I don't think I have the capacity to comment on this.

feznyng · 2024-12-25T19:31:39 1735155099

One of the factors is how users phrase their queries. On some level people are used to full text search but semantic shines when they ask literal questions with terminology that may not match the answer.

Quizzical4230 · 2024-12-26T05:14:07 1735190047

Exactly. The full-text paradigm has its own pros, and I believe we need those tools in the new vector search to take full advantage. I am planning to add a keywords feature where, if a user enters something in "quotes", it would need to be in the shown results. Just like you can do with a Google search.
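A rough sketch of how the quoted-keyword idea could work: pull the quoted phrases out of the query, run the usual semantic search, then drop hits that lack a required phrase. The function names and the hit dictionaries are hypothetical; this is not PaperMatch code.

```python
import re

def split_query(query):
    """Return (free_text, required_phrases); phrases are the parts in double quotes."""
    phrases = re.findall(r'"([^"]+)"', query)
    free_text = re.sub(r'"[^"]+"', " ", query)
    return " ".join(free_text.split()), phrases

def filter_hits(hits, phrases):
    """Keep hits whose abstract contains every required phrase (case-insensitive)."""
    return [
        h for h in hits
        if all(p.lower() in h["abstract"].lower() for p in phrases)
    ]

free_text, phrases = split_query('byte level "language model" tokenizer-free')
# Hypothetical search results.
hits = [
    {"id": "x1", "abstract": "A byte-level Language Model without a tokenizer."},
    {"id": "x2", "abstract": "Tokenizer-free speech recognition."},
]
kept = filter_hits(hits, phrases)
```

Filtering after retrieval is the simplest option; pushing the phrase constraint into the vector store as a filter would avoid losing results when few candidates survive.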

feznyng · 2024-12-26T13:43:22 1735220602

You might be interested in hybrid search which issues both a full text and semantic search and then merges the results via reciprocal rank fusion.
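For reference, reciprocal rank fusion only needs the rank positions from each result list, not the raw scores. A minimal sketch, assuming the commonly used constant k = 60 and made-up document ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list contributes 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical result lists from a full-text engine and a vector search.
keyword_hits = ["a", "b", "c"]
semantic_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Documents that appear in both lists ("a" and "c" here) rise above those found by only one engine, which is the behaviour that makes the merge useful.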

Quizzical4230 · 2024-12-26T14:24:10 1735223050

Thank you! I shall play with it this weekend :D

woodson · 2024-12-25T21:39:25 1735162765

Query keyword expansion works quite well for that without semantic search (although it can reduce precision).

namanyayg · 2024-12-25T14:30:59 1735137059

What are other good areas where semantic search can be useful? I've been toying with the idea for a while to play around and make such a webapp.

Some of the current ideas I had:

1. Online ads search for marketers: embed and index video + image ads, allow natural language search to find marketing inspiration.

2. Multi e-commerce platform search for shopping: find products across Sephora, zara, h&m, etc.

I don't know if either are good enough business problems worth solving tho.

bubaumba · 2024-12-25T14:45:21 1735137921

3. Quick lookup into internal documents. Almost any company needs it. Navigating a file-system-like hierarchy is slow and limited. That was the old way.

4. Quick lookup into the code to find relevant parts even when the wording in comments is different.

imadethis · 2024-12-25T17:11:31 1735146691

For 4, it would be neat to first pass each block of code (function or class or whatever) through an llm to extract meaning, and then embed some combination of llm parsed meaning, docstring and comments, and function name. Then do semantic search against that.

That way you’d cover what the human thinks the block is for vs what an LLM “thinks” it’s for. Should cover some amount of drift in names and comments that any codebase sees.
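A sketch of the chunk-building half of that idea for Python source, using the standard `ast` module to pull out each function or class with its name and docstring. The `summarize` callable stands in for the LLM step and is purely hypothetical; the resulting `text` field is what would be embedded.

```python
import ast

def build_chunks(source, summarize=None):
    """Return one dict per function/class: its name plus the text to embed."""
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            code = ast.get_source_segment(source, node) or ""
            parts = [
                f"name: {node.name}",
                f"docstring: {ast.get_docstring(node) or ''}",
            ]
            if summarize is not None:
                # Placeholder for an LLM-written description of the block.
                parts.append(f"summary: {summarize(code)}")
            chunks.append({"name": node.name, "text": "\n".join(parts)})
    return chunks

example = '''
def load(path):
    """Read a file and return its text."""
    return open(path).read()
'''
# Stand-in "summary" so the example runs without any model.
chunks = build_chunks(example, summarize=lambda code: f"{len(code.splitlines())} lines of code")
```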

jondwillis · 2024-12-25T20:41:10 1735159270

Please stop making ad tech better. Someone else might, but you don’t have to.

shigeru94 · 2024-12-25T09:58:31 1735120711

Is this similar to https://www.semanticscholar.org (from Allen Institute for AI) ?

triilman · 2024-12-25T11:14:18 1735125258

I think more like this website https://arxivxplorer.com/

Quizzical4230 · 2024-12-25T14:46:44 1735138004

It is more like what triilman commented, but with all components open-source. I plan to add filters soon enough with keywords support! (actually waiting for milvus)

zzyzek · 2024-12-26T19:41:15 1735242075

This seems like a cool idea, thanks for creating it!

Some feedback:

I tried searching for "wave function collapse algorithm", "gumin wave function collapse", "wfc" and "model synthesis" without any relevant hits to the area of research I was interested in. I got a lot of quantum computing and other physics related papers.

The "WFC algorithm" overloaded the term (and has nothing to do with quantum mechanics) so it's kind of a bad case for this type of search. Model synthesis is way too generic, so again, might be a bad case for this.

The first page of results using "wave function collapse algorithm" from arXiv itself gives relevant results.

Quizzical4230 · 2024-12-27T02:43:16 1735267396

Thank you for taking the time to try out the site!

arXiv has a keyword based search engine. It looks for words as is in the text. PaperMatch tries to find similar papers that are closer in meaning.

Here is an alternative approach: Take one paper that you like, copy the abstract from arXiv (or arXiv ID) and paste it in PaperMatch. This should help you find similar papers.

zzyzek · 2024-12-27T17:55:33 1735322133

Very nice! Putting in an arXiv ID looks to produce many results that are much more relevant.

EDIT: You should provide this in an "information"/"about"/"how to use" dialogue or page to help people use the tool better.

Quizzical4230 · 2024-12-28T05:23:25 1735363405

Thank you!

I agree, since this site has the same interface, people expect it to work the same way. Which I was going for but didn't realise the cons of it. I will add an about section!

kouteiheika · 2024-12-26T18:23:12 1735237392

Feedback: first thing I tried is searching for "leaky relu" and I got a bunch of results related to fluids, which is... not very relevant. (:

Compare that to scholar which returns all relevant results:

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=leak...

You might want to retrain/finetune your own embedding model instead of using a general-purpose one.

Quizzical4230 · 2024-12-26T18:49:47 1735238987

Thank you for taking the time to try out the site!

Google Scholar is a keyword-based search engine. It looks for words as is in the text. PaperMatch tries to find similar papers that are closer in meaning.

Here is an alternative approach: Take one paper that you like, copy the abstract from Google Scholar and paste it in PaperMatch. This should help you find similar papers.

lgas · 2024-12-25T10:42:12 1735123332

This might've saved you some time: https://huggingface.co/NeuML/txtai-arxiv

cluckindan · 2024-12-25T13:39:15 1735133955

The dataset there is almost a year old.

dmezzetti · 2024-12-25T14:13:20 1735136000

It was just updated last week. The dataset page on HF only has the scripts, the raw data resides over on Kaggle.

Quizzical4230 · 2024-12-25T15:29:36 1735140576

Actually, yeah XD

serial_dev · 2024-12-26T07:37:08 1735198628

I tried a simple search by author and it didn’t work. All the fancy stuff is great, but I’d expect the basics still work, in the end it’s a search engine for papers.

wodenokoto · 2024-12-26T08:43:34 1735202614

Maybe use the right tool for the job? Author names generally don’t have a lot of semantics associated with them and definitely not in the abstract.

Maro · 2024-12-25T18:45:18 1735152318

Very cool!

Add a "similar papers" link to each paper, that will make this the obvious way to discover topics by clicking along the similar papers.

Quizzical4230 · 2024-12-26T05:10:28 1735189828

Amazing! I will do so :D

mskar · 2024-12-25T16:10:43 1735143043

This is awesome! If you’re interested, you could add a search tool client for your backend in paper-qa (https://github.com/Future-House/paper-qa). Then paper-qa users would be able to use your semantic search as part of its workflow.

OutOfHere · 2024-12-26T14:37:41 1735223861

I advise against it since binarized hamming distance isn't exactly that good unless your vector length is say a million.

Quizzical4230 · 2024-12-27T12:34:36 1735302876

I have the fp32 embeddings saved. It is for the website that I use binarised ones to combat latency.
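For anyone following along, this is roughly what binarising and Hamming search mean, as a stdlib-only sketch. It assumes a sign threshold at zero and packs bits into Python ints; a vector database would use packed byte arrays and SIMD popcount instead, and the random vectors are stand-ins for real embeddings.

```python
import random

def binarize(vec):
    """Pack the sign of each dimension into one int (bit i is 1 if vec[i] > 0)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of bit positions where two packed vectors differ."""
    return bin(a ^ b).count("1")

random.seed(0)
dim = 256
docs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
# A query that is a slightly noisy copy of document 42.
query = [x + random.gauss(0, 0.1) for x in docs[42]]

doc_bits = [binarize(d) for d in docs]
query_bits = binarize(query)
ranked = sorted(range(len(docs)), key=lambda i: hamming(query_bits, doc_bits[i]))
```

Each fp32 dimension shrinks from 32 bits to 1, and the distance becomes an XOR plus a bit count, which is where the latency win comes from.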

Quizzical4230 · 2024-12-25T16:14:30 1735143270

paper-qa looks pretty cool. I will do so!

higty · 2024-12-28T01:04:07 1735347847

It sounds nice. How do you evaluate the performance of your way against usual embedding?

Quizzical4230 · 2024-12-30T04:40:25 1735533625

Assuming "usual embedding" means the default model, which is generally "all-MiniLM-L6-v2": I used MixedBread's embedding model instead because of this [^1].

You can evaluate how well a model is doing by subjectively going through some search results for papers you have a good grasp on. Another way I look at is to see the 2D "maps" of the embeddings and how well these are segregated, see [^2].

[1]: https://www.mixedbread.ai/blog/binary-mrl

[2]: https://raw.githubusercontent.com/mitanshu7/dumpyard/refs/he...

OutOfHere · 2024-12-26T14:58:53 1735225133

Instead of using binarized hamming, why not just use a shorter embedding that you can properly tackle? What good is Milvus if it's not giving you matches using something more proper?

Also, this site is not Reddit. You don't have to reply to every comment.

Quizzical4230 · 2024-12-26T19:12:48 1735240368

> Also, this site is not Reddit. You don't have to reply to every comment.

I am so conflicted whether to reply to this comment or not Xp

Jokes apart, Mxbai model + Milvus gives fantastic results in fp32, however it's the latency that is an issue here. I could try chopping the fp32 vectors in half without binarizing to see. Thanks!

madbutcode · 2024-12-25T17:07:29 1735146449

This looks great! I have used the biorXiv version of papermatch and it gives pretty good results!

Quizzical4230 · 2024-12-26T03:19:09 1735183149

Thank you for your kind words!

zerop · 2024-12-26T12:13:50 1735215230

This looks great, thanks for building this.

Something along similar lines which many may like: Research Rabbit - https://www.researchrabbit.ai/

Quizzical4230 · 2024-12-26T19:08:10 1735240090

I am glad you liked it!

I wanted PaperMatch to be open-source so that the users can understand the workflow behind it and hack it to their advantage instead of grumbling away when the results aren't to their liking.

mrjay42 · 2024-12-25T14:00:35 1735135235

I think you have an encoding problem <3

If you search for "UPC high performance computing evaluation", you'll see a paper with buggy characters in the authors' names (second result for that search).

Quizzical4230 · 2024-12-25T14:52:38 1735138358

Most definitely. Thank you for pointing this out!

bubaumba · 2024-12-25T14:12:19 1735135939

This is cool, but how about local semantic search through tens of thousands of articles and books? Surely I'm not the first; there should be some tools already.

Quizzical4230 · 2024-12-25T15:11:46 1735139506

I definitely was thinking about something like this for PaperMatch itself. Where anyone can pull a docker image and search through the articles locally! Do you think this idea is worthwhile pursuing?

bubaumba · 2024-12-25T16:14:17 1735143257

Absolutely worth doing. Here is an interesting related video on local RAG:

https://www.youtube.com/watch?v=bq1Plo2RhYI

I'm not an expert, but I'll do it for learning. Then open source if it works. As far as I understand this approach requires a vector database and LLM which doesn't have to be big. Technically it can be implemented as local web server. Should be easy to use, just type and get a sorted by relevance list.

Quizzical4230 · 2024-12-25T16:20:31 1735143631

Perfect!

Although, atm I am only using retrieval without any LLM involved. Might try integrating if it significantly improves UX without compromising speeds.

ttpphd · 2024-12-26T00:18:02 1735172282

Try Semantra https://github.com/freedmand/semantra

tokai · 2024-12-25T15:23:20 1735140200

Nice but I have to point out that a systematic review cannot be done with semantic search and should never be done in a preprint collection.

dmezzetti · 2024-12-25T15:30:48 1735140648

Quizzical4230 · 2024-12-25T15:58:39 1735142319

Not sure about the semantic search, but preprints are peer reviewed and hence not vetted. However, at the current pace of papers on arXiv (5k+/week) peer review alone might halt the progress.

OutOfHere · 2024-12-26T14:39:56 1735223996

You mean to say that preprints are not peer reviewed.

dmezzetti · 2024-12-25T16:03:15 1735142595

Why not semantic search was the bigger question.

WolfOliver · 2024-12-26T12:13:20 1735215200

but it can provide recommendations

Quizzical4230 · 2024-12-25T15:25:56 1735140356

Agreed.

antman · 2024-12-25T14:05:45 1735135545

Nice work. Any other technical comments: why did you use those embeddings, did you binarize them, did you use any special prompts?

Quizzical4230 · 2024-12-25T15:19:00 1735139940

At the beginning of the project, MixedBread's embedding model was small and leading the MTEB leaderboard [^1], hence I went with it.

Yes, I did binarize them for a faster search experience. However, I think the search quality degrades significantly after the first 10 results, which are same as fp32 search but with a shuffled order. I am planning to add a reranking strategy to boost better results upwards.

At the moment, this is plain search with no special prompts.

[1]: https://huggingface.co/spaces/mteb/leaderboard

andai · 2024-12-25T16:36:53 1735144613

Did you notice a difference in performance after binarization? Do you have a way to measure performance?

Quizzical4230 · 2024-12-25T16:46:45 1735145205

Absolutely!

Here is a graph showing the difference. [^1]

A known ID is an arXiv ID that is already in the vector database; unknown IDs need their metadata fetched via the API. Text is embedded via the model's API.

FLAT and IVF_FLAT are different indexes used for the search. [^2]

[1]: https://raw.githubusercontent.com/mitanshu7/dumpyard/refs/he...

[2]: https://zilliz.com/learn/how-to-pick-a-vector-index-in-milvu...

binarymax · 2024-12-25T16:53:07 1735145587

That looks great for speed, but what about recall?

Quizzical4230 · 2024-12-25T17:08:14 1735146494

That has a major downgrade. For binary embeddings, the top 10 results are the same as fp32, albeit shuffled. However, after the 10th result, I think quality degrades quite a bit. I was planning to add a reranking strategy for binary embeddings. What do you think?

amitness · 2024-12-26T09:08:25 1735204105

Try this trick that I learned from Cohere:

- Fetch top 10*k (i.e. 100) results using the hamming distance

- Rerank by taking dot product between query embedding (full precision) and binary doc embeddings

- Show top-10 results after re-ranking
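The steps above can be sketched as a two-stage search. This is a stdlib-only illustration, and it assumes the binary document bits are read back as +1/-1 for the dot product; that mapping, the function names, and the random data are assumptions made here, not details taken from Cohere or PaperMatch.

```python
import random

def binarize(vec):
    """Pack the sign of each dimension into one int (bit i is 1 if vec[i] > 0)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two packed vectors."""
    return bin(a ^ b).count("1")

def signed_dot(query, bits):
    """Dot product of a float query with a packed doc read as +1/-1 per dimension."""
    return sum(x if (bits >> i) & 1 else -x for i, x in enumerate(query))

def search(query, doc_bits, k=10, oversample=10):
    qbits = binarize(query)
    # Stage 1: cheap Hamming scan keeps the best k * oversample candidates.
    order = sorted(range(len(doc_bits)), key=lambda i: hamming(qbits, doc_bits[i]))
    candidates = order[: k * oversample]
    # Stage 2: rescore only those candidates using the full-precision query.
    candidates.sort(key=lambda i: signed_dot(query, doc_bits[i]), reverse=True)
    return candidates[:k]

random.seed(1)
dim = 128
docs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(500)]
doc_bits = [binarize(d) for d in docs]
query = [x + random.gauss(0, 0.2) for x in docs[7]]  # noisy copy of document 7
results = search(query, doc_bits)
```

Only the query stays in full precision, so the stored index remains 1 bit per dimension while the final ordering uses more information than Hamming distance alone.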

Quizzical4230 · 2024-12-26T13:13:41 1735218821

This is pretty cool. The dot product would give the unnormalized cosine similarity from a smaller pool. Thank you so much!

intalentive · 2024-12-25T18:21:39 1735150899

Recommend reranking. You basically get full resolution performance for a negligible latency hit. (Unless you need to make two network calls…)

MixedBread supports matryoshka embeddings too so that’s another option to explore on the latency-recall curve.
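For context, using a matryoshka embedding at a smaller size usually means keeping the first N dimensions and renormalising. A minimal sketch, which only makes sense for models trained for it; the helper name `truncate_embedding` and the sample vector are made up.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else list(head)

full = [3.0, 4.0, 12.0]          # stand-in for a long embedding
short = truncate_embedding(full, 2)
```

Halving the dimension halves storage and distance cost, at some loss in recall that has to be measured per model and corpus.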

Quizzical4230 · 2024-12-26T05:15:58 1735190158

> Recommend reranking.

Will explore it thoroughly then!

> MixedBread supports matryoshka embeddings too so that’s another option to explore on the latency-recall curve.

Yes, exactly why I went with this model!

maCDzP · 2024-12-25T21:27:44 1735162064

I want to crawl and plug in Sci-Hub to this and see what happens.

gaborme · 2024-12-25T15:36:29 1735140989

Nice. Why not use a full-text search like self-hosted Typesense?

Quizzical4230 · 2024-12-25T15:44:23 1735141463

Full text search would be redundant as arXiv.org already supports it. For semantic search, Typesense has limited collection of embedding models. [^1]

[1]: https://huggingface.co/typesense/models/tree/main

cryptonector · 2024-12-28T21:39:37 1735421977

This is really awesome. Thank you!

Quizzical4230 · 2024-12-29T17:10:09 1735492209

I am glad you liked it! <3

amelius · 2024-12-26T01:43:50 1735177430

Great procrastination project :)

Quizzical4230 · 2024-12-26T05:10:42 1735189842

hey hey hey! XD

ukuina · 2024-12-25T15:07:41 1735139261

Related: emergentmind.com

Quizzical4230 · 2024-12-25T15:14:58 1735139698

Thank you for the link. Would you know any reliable small model to add on top of vanilla search for a similar experience?

venice_benice · 2024-12-26T01:58:39 1735178319

interesting project; I’m not really sure how useful it is for field-specific stuff—I'm searching for “image reduction astronomy”, and it shows all sorts of related but not image-reduction work (including noise reduction which is not the same thing). I’m not really familiar with vector search enough to evaluate it well enough.

However I can give you the heads-up that the abstracts don't render well because (La)TeX is interpreted as markdown so that

    Paper~1 shows something and Paper~2 shows something else

will strikethrough the text between the tildes (whereas they are meant to be non-breaking spaces). Similarly for the backtick which makes text monospaced in the rendered output but is simply supposed to be the opening quote.
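One way to avoid this is to translate the common TeX text conventions before the abstract reaches the markdown renderer. The sketch below is an assumption about a workable fix, not the site's code; it handles ties and quote marks only, and would mangle primes such as f'' inside math, so a real fix should skip math spans.

```python
import re

def latex_to_plain(abstract):
    """Replace TeX ties and quote marks with Unicode so markdown ignores them."""
    # Unescaped ~ is a non-breaking space in TeX.
    text = re.sub(r"(?<!\\)~", "\u00a0", abstract)
    # `` and '' are opening and closing double quotes.
    text = text.replace("``", "\u201c").replace("''", "\u201d")
    # A remaining single backtick is an opening single quote.
    text = text.replace("`", "\u2018")
    return text

cleaned = latex_to_plain("Paper~1 shows ``this'' and Paper~2 shows something else")
```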

Quizzical4230 · 2024-12-26T05:24:50 1735190690

Yes, I think vector search is tricky to navigate at times since now the onus is on the user to explain the problem well. However, you can copy paste full abstracts to get similar papers well enough.

I will fix the LaTeX rendering ASAP.

Thank you for trying out the site! Happy Holidays :D

ProofHouse · 2024-12-26T18:28:46 1735237726

I could really use this, but it didn't work for me. And it HAS to have a date filter. That is a must, maybe with some time-based preset defaults like HackerNews. Good luck, I want to try again when it works. Good idea.

Quizzical4230 · 2024-12-26T19:03:59 1735239839

They are definitely planned to be integrated very soon! I probably should have waited to post on HN until then. I will ping you once the features are live.

Thanks for trying out the site!