Show HN: I made a search engine for Hacker News

209 points by ofermend 3 days ago | 80 comments

I love HN, but I've always felt that the Algolia search, while okay, has some limitations. Since I work at Vectara, I decided to try to create a better search for HN. It's based on data from roughly the last 6 months of HN stories and comments.

Would love to hear feedback and how useful this is relative to the existing search.

wjb3 3 days ago |
Excellent!
metadat 3 days ago |
Cool project, but I'm struggling to understand what is better about the Vectara solution?
Compared to Algolia.hn, this gives 0 filter controls (time window, stories vs. comments, `author:metadat`, sort order, and so on), and no ability to search for exact matches. It failed to turn up anything interesting or even relevant for the 4 or 5 queries I ran.
You've still made it further than I in the HN search engine adventures, which is commendable.
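
For reference, the filter controls listed above correspond to query parameters of the public HN Search API that sits behind hn.algolia.com. A minimal sketch, assuming that API's `tags` and `numericFilters` parameters; the helper name `build_search_url` is made up for illustration and is not part of any library.

```python
from urllib.parse import urlencode

# Base of the public HN Search API (the service behind hn.algolia.com).
BASE = "https://hn.algolia.com/api/v1"

def build_search_url(query, author=None, kind=None, since_ts=None, by_date=False):
    """Build a search URL. Hypothetical helper, for illustration only."""
    tags = []
    if kind:                       # "story" or "comment"
        tags.append(kind)
    if author:                     # author filters are spelled author_<username>
        tags.append(f"author_{author}")
    params = {"query": query}
    if tags:
        params["tags"] = ",".join(tags)   # comma-separated tags are ANDed
    if since_ts is not None:       # Unix timestamp lower bound (time window)
        params["numericFilters"] = f"created_at_i>{since_ts}"
    # search_by_date sorts newest first; search sorts by relevance.
    endpoint = "search_by_date" if by_date else "search"
    return f"{BASE}/{endpoint}?{urlencode(params)}"

print(build_search_url("iptables", author="metadat", kind="comment", by_date=True))
```

The point is only that author, type, time window, and sort order are each a single parameter there, which is the bar any replacement search has to clear.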
mikepurvis 3 days ago |
If it's searching article content as well as just titles/comments, that's already quite interesting and useful. I've long wanted something like that for my browser history, like a "help me find this article I know I read sometime last week about X."
metadat 3 days ago |
Yes, I've only recently rediscovered using the browser history for this exact purpose and it hasn't failed me yet.
It would be remarkable and interesting to have a super deep search capability that indexes all first-order links on this site.
nashashmi 3 days ago |
Browser history seems to only save the last 3 months of history. It used to be many years of data, but that also slowed the browser down.
yaj54 3 days ago |
And you think that hasn't been made yet? :-D
Show HN: DeepHN - https://news.ycombinator.com/item?id=26791582
Of note, I was able to find that item (which I recalled existed but not by what name) with hn.algolia, but I was not able to find it with the OP search engine, or with the DeepHN search itself. So in my book, Algolia is winning.
But projects like this are super fun and educational to build so props to OP.
metadat 3 days ago |
OP said they only indexed the past 6 months.
seaal 3 days ago |
Perhaps someone will be able to help me find a website I’ve been trying to Google for with no luck: the main focus of the website is a music player with 3D pipes and arrows that the camera would follow along, live rendering the entire scene while music was playing, but it was also a personal blog if you scrolled down. Domain was relatively short and I think it started with K.
ofermend 3 days ago |
Currently it's indexing the content of stories and comments. Are you also suggesting indexing the content of the main story link (which is outside HN)?
squigz 3 days ago |
That is what they're suggesting, yes.
cottsak 3 days ago |
maybe add metadata to the index, but not the full page content, else you're just gonna build another google
metadat 3 days ago |
How so? There's no PageRank in sight here, very little spam to worry about or contend with.
fluential 3 days ago |
This already exists: "searching over the titles, text and URLs of sites you've visited before" -- https://www.browserparrot.com/
kid64 3 days ago |
Mac-only
yaj54 3 days ago |
Ditto. I somewhat recently learned that Safari only keeps the history log for one year unless you explicitly set "Remove history items" to "manually." Which I've done so that I at least have a list that I could crawl in the future to build a full text index on.
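
A full-text index over an exported history list can start out very small. A rough sketch, assuming the page text has already been fetched somehow; the `pages` dict, the example.com URLs, and the function names are all invented for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query (AND semantics)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Made-up sample data standing in for crawled history entries.
pages = {
    "https://example.com/a": "Notes on iptables and firewall rules",
    "https://example.com/b": "A history of firewall design",
}
index = build_index(pages)
print(search(index, "firewall iptables"))
```

This is plain keyword matching with no ranking, so it only answers "which pages mention all of these words".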
squigz 3 days ago |
I'm fairly confident every single major browser (and probably most minor ones) has a history limit of some sort. Sometimes it's age, sometimes it's # of entries.
metadat 3 days ago |
Usually I'm only looking back the past week or three.
Otherwise I snapshot the page to evernote when I think it might be interesting later. Hopefully they don't completely end the free tier.
tjlav5 3 days ago |
I think Microsoft is working on that right now :p
rovr138 3 days ago |
That’s exactly what it sounded like to me.
Of course, local/offline only would be great.
shortrounddev2 3 days ago |
Lol you're describing Microsoft's Recall feature that everyone here is saying is the death of privacy
kelnos 3 days ago |
If Recall was open source and auditable, and didn't send any data at all about usage to the internet, I guess that would be fine.
Otherwise, no thanks.
tamimio 3 days ago |
>algolia.hn
Is that a valid link? I get an error when opening it.
miloignis 3 days ago |
It's the other way around: https://hn.algolia.com/
gnabgib 3 days ago |
No that's not valid (it's attached to the search box at the bottom of the page) https://hn.algolia.com/?q=
sattoshi 3 days ago |
They meant https://hn.algolia.com/
potatoman22 3 days ago |
You can also use !hn on duckduckgo
owenpalmer 3 days ago |
Cool, I like it!
I found a bug. Under the "When will GPT-5 be released?" search results, there are double duplicate results. On one of the duplicates, the "username (date)" says "undefined (undefined)"
mewpmewp2 3 days ago |
I tried this query as well, I was a bit frightened when I saw my own comment in the results.
ofermend 3 days ago |
Good find. Let me check why that occurs.
ofermend 2 days ago |
This should be fixed now. Thanks for the find.
owenpalmer 16 minutes ago |
Happy to help
pedalpete 3 days ago |
Nice work. I wonder if there may be a better application for the Vectara capabilities than search?
Algolia has already done the search thing, can the Vectara search be 10x better?
What I do find missing from HN is a way to see things that may be of interest to me but that I may have missed. I like how I get everything in the main feed, which is pure popularity, but I don't have the time to go through all posts, and I almost certainly miss things I would have been interested in.
Though this can be done with collaborative filtering, or other non-AI methods, might this be a decent use case for your AI?
whiplash451 2 days ago |
Unclear that the HN feed is pure popularity. There's got to be something else to it [1], otherwise it would look like all the medium-like crap you'll find elsewhere.
[1] my hunch is that some human expert curation is involved.
pedalpete 2 days ago |
It is not pure popularity; I suspect it is a combination of clicks/upvotes/comments over time. I know time is a key component in the algorithm. If you don't rise fast, you don't rise. Who is commenting/upvoting is also likely an easy metric to add in. If I have more points, my votes are probably worth more than those of someone without.
Human curation also exists, but I think that is aimed at removing spam and uplifting YC company posts.
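
The "rise fast or don't rise" effect is easy to see with the formula that is often quoted for HN-style ranking: points divided by a power of the item's age. This is a sketch of that commonly cited approximation only. It is not confirmed anywhere in this thread to be the site's real algorithm, the constants are assumptions, and it ignores who is voting, flags, and moderation.

```python
def rank_score(points, age_hours, gravity=1.8):
    """Commonly quoted approximation: (points - 1) / (age_hours + 2) ** gravity."""
    return (points - 1) / (age_hours + 2) ** gravity

# Made-up numbers: a young post with few points vs. an old post with many.
fresh = rank_score(points=50, age_hours=2)
old = rank_score(points=300, age_hours=24)
print(fresh > old)
```

With a gravity above 1, age in the denominator grows faster than points usually do in the numerator, so a day-old post needs far more votes than a two-hour-old one to hold the same position.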
flir 2 days ago |
I've been wondering if segmentation might be the way to go. Have the chatbot look at all the items, cluster them into a few buckets of its choosing, then throw each new item into the most appropriate bucket.
(I've been thinking about this not just in terms of HN, but treating all my RSS feeds as one undifferentiated stream and just having a chatbot sort incoming items into whatever bucket it deems most appropriate).
What's stopping me is that it might work, and I doubt making the internet even stickier is good for me long term.
8organicbits 2 days ago |
I've been thinking about that at length recently. RSS feeds often don't have category elements specified and there isn't a widely used taxonomy of category names. I'd prefer not to use AI to solve the problem, although encouraging the use of RSS categories will be slow work.
https://alexsci.com/blog/rss-categories/
flir 2 days ago |
Thanks for the link. It's interesting, and I hope you find a way forward with that, it would undoubtedly be a useful addition to the ecosystem.
But my gut feeling is that there's not enough interest in RSS right now to drive widespread adoption of a new version of the spec. My approach would be to focus on improved UX over existing feeds, rather than speculatively expanding the spec to make feeds richer.
The main advantage of my approach, I think, is that it adapts to the individual end user's needs. If all my subscribed feeds are tech-focused and I use a generic published taxonomy, I'm going to end up with 60% of my items in "Technology" and 30% in "Computing". If I use a chatbot to dynamically bucket stuff, I'll get "Micro PCs", "Graph theory", "Golang", etc etc.
8organicbits 2 days ago |
One nit, RSS categories have long been part of the spec, but I've found people don't add them consistently.
8organicbits 2 days ago |
This may be a good use case for RSS. Feed readers can filter posts by keyword so you can take the unofficial RSS feed [1] and filter it down by your interests.
I posted an RSS reader that can do this recently [2] and I'm actively hacking on another [3]. But there are many RSS tools that can do this.
[1] https://hnrss.github.io/
[2] https://news.ycombinator.com/item?id=40839262
[3] https://github.com/ralexander-phi/feed2pages-action
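
Keyword filtering of a feed does not need much machinery. A minimal sketch using only the standard library, assuming a plain RSS 2.0 document; the inline `SAMPLE_FEED` and its titles are made up so the example runs without a network call, and `filter_items` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Tiny inline feed standing in for a downloaded one; contents are invented.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Sample</title>
  <item><title>Show HN: A Lisp interpreter in 500 lines</title><link>https://example.com/1</link></item>
  <item><title>Ask HN: Favorite keyboard?</title><link>https://example.com/2</link></item>
  <item><title>Reverse engineering a smart plug</title><link>https://example.com/3</link></item>
</channel></rss>"""

def filter_items(feed_xml, keywords):
    """Return (title, link) pairs whose title contains any keyword, case-insensitively."""
    root = ET.fromstring(feed_xml)
    wanted = [k.lower() for k in keywords]
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if any(k in title.lower() for k in wanted):
            matches.append((title, link))
    return matches

for title, link in filter_items(SAMPLE_FEED, ["lisp", "reverse engineering"]):
    print(title, link)
```

Matching on titles alone is crude; a real reader would probably also look at the description and any category elements the feed provides.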
noman-land 2 days ago |
I would love to be able to feed my upvotes and maybe even comments into an LLM and receive search results ordered by relevancy to my interests.
smusamashah 3 days ago |
It doesn't always work correctly. For example, "Text to diagram tool" returns very few results, and some of them are not even correct, even though this topic has been discussed a lot here. I was mainly looking for the list of tools I keep sharing whenever this topic comes up, or when I share a related tool in a thread.
yoouareperfect 3 days ago |
Amazing! Congrats on launching!
marcodiego 3 days ago |
Please also add the ability to search within the links posted or commented on Hacker News. I bet it would be competitive against Google for the HN crowd.
ofermend 2 days ago |
For sure. Will test that.
ceving 3 days ago |
Searching for "iptables" returns first:
> Arm says it wants all Snapdragon X Elite laptops destroyed
Not so useful.
codetrotter 3 days ago |
It’s matching two comments in the thread. It bolds the part that talks about iptables in each.
So it’s not like it’s irrelevant, even though it is certainly not actually the most relevant one either.
It seems to give better results if you are more specific. For example, try the following search:
how to use iptables effectively
And have a look at the first five or so results.
Also, note that OP said it’s searching about six months’ worth of data. So if anything specific about iptables that you were looking for is older than that, then their search tool doesn’t know about it.
ofermend 2 days ago |
Yes, it's just about 6 months back. If requested by folks here, we can certainly crawl back more years - this was just the first crawl I did.
codetrotter 2 days ago |
Are you crawling HN via the normal website?
It’s better to use the API.
https://github.com/HackerNews/API
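
A minimal sketch of pulling a thread through that API instead of scraping pages, assuming the item endpoint and the `kids` field described in the linked repository. The function names are made up, and a real crawler would want rate limiting, retries, and caching on top of this.

```python
import json
from urllib.request import urlopen

API_ROOT = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id):
    """URL of the JSON document for one story or comment."""
    return f"{API_ROOT}/item/{item_id}.json"

def fetch_item(item_id, timeout=10):
    """Download one item; the API returns null (None) for missing items."""
    with urlopen(item_url(item_id), timeout=timeout) as resp:
        return json.load(resp)

def walk_thread(item_id, fetch=fetch_item):
    """Yield an item and all of its descendants, depth-first, following 'kids'."""
    item = fetch(item_id)
    if item is None:
        return
    yield item
    for kid in item.get("kids", []):
        yield from walk_thread(kid, fetch)

# Usage (performs network requests, so it is left as a comment):
# for item in walk_thread(40238509):
#     print(item.get("by"), item.get("type"))
```

The `fetch` parameter is there so the traversal can be tried against canned data before pointing it at the live API.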
dcoder2311 3 days ago |
Cool work!
call-me-al 3 days ago |
really like this one from [Show HN: Hacker Search – A semantic search engine for Hacker News](https://news.ycombinator.com/item?id=40238509). The URL is: https://hackersearch.net/ask
ravishing0223 3 days ago |
Nice.
https://hackernews.demo.vectara.com/?query=I+made+a+search+e...
yanko 3 days ago |
How do I do a full-text search on HN restricted to a given user's favorites? If there is no such option, I'll be disappointed.
Jiahang 3 days ago |
I use Google: {} + news.ycombinator.com
n4r9 2 days ago |
Like some other comments here I find HN search useful and powerful and am a little unsure what the added value is here. Possibly/probably it's for people that search in a different way to me.
One of the most frequent searches I do is to look for a specific comment that I know a user made recently. For example, I might want to look for my own comment here: https://news.ycombinator.com/item?id=40801389 (sorry, this is a slightly political one but I just picked it randomly for test purposes).
Searching Vectara for "n4r9 NHS" produces no results: https://hackernews.demo.vectara.com/?query=n4r9+NHS&filter=
HN's own search however produces the goods in the top result: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
[ EDIT except for this very post :p ]
Maybe 6 days ago is outside the dataset that this is based on?
Some other thoughts/suggestions:
- Ability to click through to the comment itself? At the moment it looks like the link goes just to the main comments page and then I have to find the relevant comment on the page.
- Filter comments vs posts?
- Order by datetime?
- Filter within a date range?
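
These four suggestions are cheap to support if each search hit keeps the HN item id, the item type, and a timestamp. A sketch under that assumption; the `Hit` fields and the sample data are invented and say nothing about how the demo actually stores its results.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Hit:
    item_id: int        # HN id of the story or comment itself
    kind: str           # "story" or "comment"
    created: datetime
    text: str

def permalink(hit):
    """Link straight to the item rather than to the thread's main page."""
    return f"https://news.ycombinator.com/item?id={hit.item_id}"

def refine(hits, kind=None, start=None, end=None, newest_first=True):
    """Filter by type and date range, then order by datetime."""
    selected = [
        h for h in hits
        if (kind is None or h.kind == kind)
        and (start is None or h.created >= start)
        and (end is None or h.created <= end)
    ]
    return sorted(selected, key=lambda h: h.created, reverse=newest_first)

# Made-up hits with placeholder ids and dates.
utc = timezone.utc
hits = [
    Hit(1001, "comment", datetime(2024, 6, 26, tzinfo=utc), "a comment"),
    Hit(1002, "story", datetime(2024, 5, 2, tzinfo=utc), "an older story"),
    Hit(1003, "story", datetime(2024, 6, 30, tzinfo=utc), "a newer story"),
]
for h in refine(hits, kind="story"):
    print(permalink(h), h.created.date())
```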
ofermend 2 days ago |
Thank you - these are great suggestions. Will work to add these...
omneity 2 days ago |
I searched for "Supabase" and none of the top results in your demo contained an actual post about supabase. Following the example queries I then tried "What is supabase?" and the results were equally irrelevant.
My personal opinion is that I'll keep using the HN search for the foreseeable future.
zmccormick7 2 days ago |
I had the same issue when searching for specific companies/products. It feels like a pretty basic vector search with no hybrid search component or reranking.
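
To illustrate what a hybrid component could look like: one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). This is a generic sketch and not a claim about how Vectara works; the document ids are made up and `k=60` is just the conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each list adds 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs of a keyword search and a vector search for "Supabase".
keyword_hits = ["supabase-launch", "postgres-tips", "firebase-alt"]
vector_hits = ["firebase-alt", "supabase-launch", "baas-roundup"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

A document that matches the product name exactly gets credit from the keyword list even when the embedding ranks it lower, which is the behavior that seemed to be missing for company and product queries.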
bschwarz 2 days ago |
Struggled to find a specific comment earlier today via Algolia, found it as the second result on Vectara.
d4rkp4ttern 2 days ago |
I really like this HackerSearch app:
https://hackersearch.net/
d4rkp4ttern 2 days ago |
Previously discussed here:
https://news.ycombinator.com/item?id=40238509
sharpshadow 2 days ago |
Like Algolia it doesn’t work without JS.
bckr 2 days ago |
Next someone should make a hackernews meta search engine :)
ptsd_dalmatian 2 days ago |
Am I the only one that hasn't noticed the search input in the footer? :O
sph 2 days ago |
These days I use the search feature much more than commenting or reading posts. The frontpage is the usual recent news addiction treadmill, while for research into niche topics you can find a treasure trove of interesting comments in the archives.
Want more posts about Lisp, Smalltalk and reverse engineering, for example, rather than the usual front page drivel? Search for them.
On one hand, I wish Algolia didn't give very old posts a lot of weight (it often prefers to show posts from 8+ years ago); on the other hand, old content tends to be from before the Eternal September of tech-adjacent people coming to this forum to discuss tech-adjacent light content, so it's actually a feature. The real value of HN is its archives IMO.
satvikpendem 2 days ago |
I also made a bookmarklet to show me HN posts from random dates, it is quite interesting to see what was interesting to people a decade ago, for example. Lots of comments in HN's heyday were pretty eye opening as well.
javascript:(function() {function randomDate(start, end) {var date = new Date(+start + Math.random() * (end - start));var day = ("0" + date.getDate()).slice(-2);var month = ("0" + (date.getMonth() + 1)).slice(-2);var year = date.getFullYear();return year + '-' + month + '-' + day;}var startDate = new Date(2007, 9, 9);var endDate = new Date();var randomDateStr = randomDate(startDate, endDate);var newUrl = 'https://news.ycombinator.com/front?day=' + randomDateStr;window.location.href = newUrl;})();
djeastm 2 days ago |
That's really neat. Thanks for providing it
dang 2 days ago |
Also recommended is https://news.ycombinator.com/front, which shows you the frontpage stories from any day—a little bit like archive.org would do, except that it's not a snapshot, but a composite of all the front pages from a 24 hour period.
For example, here's HN from a year ago: https://news.ycombinator.com/front?day=2023-07-02.
https://news.ycombinator.com/highlights is another good resource (and if anyone notices a great HN comment, past or present, they're welcome to nominate it for the highlights list! just email [email protected]).
satvikpendem 2 days ago |
Thanks Dan, that is in fact exactly what my bookmarklet does, it generates a random date (2012-01-02) and appends it to `https://news.ycombinator.com/front?day=`. I think there should be a `random` link on the HN header links that does what I am currently doing, it would be useful to have it built-in.
dang 2 days ago |
Ah sorry I missed that you were already pointing to those!
A 'random' link might be a good idea. For /highlights too.
satvikpendem a day ago |
Yep it's fairly easy to make a bookmarklet for /front but for /highlights it seems to depend on the id of the comment as a cursor rather than the date, so it would be nice to have them both as links in the header bar.
BiteCode_dev 2 days ago |
Thanks, 10 years of HN and I was still using "site:news.ycombinator.com"
quenix 2 days ago |
Try https://hn.algolia.com — it’s great
dewey 2 days ago |
That's where the footer search field points.
jiehong 2 days ago |
Congratulations!
Although, something I value a lot from algolia is the very fast live search as you type[0].
Vectara seems to be smarter, but much slower.
My needs are satisfied with algolia 99% of the time as a technical user.
[0]: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
jpl56 2 days ago |
Great! My first thought was "funniest XKCD". Loved the Self Driving one [0]. Thanks!
[0]: https://xkcd.com/1897/
okhuman 2 days ago |
I'm thrilled to see Vectara here – they have one of the top Product Managers I've ever had the chance to work with and be mentored by.
uwemaurer 2 days ago |
Great work!
I am currently playing with the Algolia hackernews search API myself and experimenting with spaCy Named Entity Recognition and llama3 to come up with some interesting data.
Work in progress version here: https://news.facts.dev/topic
KomoD 2 days ago |
OP mentions Algolia having limitations but this seems more limited?
It doesn't seem like it has any filtering or sorting like the Algolia one has, like comments/stories by a specific user, during certain dates, sorting by upvotes/recency, searching by just title/content/comments.
Say I wanted to search for comments by the OP, ofermend, it doesn't seem like I can...
Entering just their name returns results that aren't made by them nor mention their username, I tried other queries too without any luck.
douglaskayama 2 days ago |
I tried searching for lootitooti, Vectara found nothing, Algolia found two results, one in a post and one in a comment in another post.
PS: no, lootitooti is not my project. I decided to finally watch Game of Thrones with my wife and I remembered that site when I was watching the opening. I remembered seeing it here on HN, searched and found it.
kuzej 2 days ago |
Great project! But I'm curious about how the frontend was implemented. What's the tech stack? For example, is it Next.js + Tailwind CSS?