In this case the Ray-Ban Meta is getting accessibility features. The functionality is promising according to reviews, but requires the user to say "Hey Meta, what am I looking at?" every time a scene is to be described. The battery life seems underwhelming as well.
It would be nice to have a cheap and open-source alternative to the currently available products, where the user is fed information rather than having to continuously request it. This is where I got interested in seeing whether I could build a solution using an ESP32 WiFi camera, and learn some Arduino development in the process.
I managed to create a solution where the camera connects to the phone's personal hotspot and publishes an image every 7 seconds to an online server, which then uses the gpt-4o-mini model to describe the image and update a web page that is read back to the user with voice synthesis. The latency is under 2 seconds, and usually even lower.
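For anyone curious about the camera side, here is a minimal sketch of that upload loop, assuming an AI-Thinker ESP32-CAM board and a hypothetical upload endpoint. The hotspot credentials, URL, and pin mapping are placeholders, not the exact code I ran:

```cpp
// Minimal sketch: capture one JPEG frame and POST it to a server every 7 seconds.
// Board-specific details (pins) and the endpoint URL are assumptions; adjust to your setup.
#include "esp_camera.h"
#include <WiFi.h>
#include <HTTPClient.h>

const char* ssid     = "MyPhoneHotspot";            // phone "personal hotspot" (placeholder)
const char* password = "hotspot-password";          // placeholder
const char* endpoint = "http://example.com/upload"; // server that calls gpt-4o-mini (placeholder)

void setup() {
  Serial.begin(115200);

  // Camera configuration for an AI-Thinker ESP32-CAM; other boards use different pins.
  camera_config_t config = {};
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer   = LEDC_TIMER_0;
  config.pin_d0 = 5;  config.pin_d1 = 18; config.pin_d2 = 19; config.pin_d3 = 21;
  config.pin_d4 = 36; config.pin_d5 = 39; config.pin_d6 = 34; config.pin_d7 = 35;
  config.pin_xclk = 0;  config.pin_pclk = 22; config.pin_vsync = 25; config.pin_href = 23;
  config.pin_sccb_sda = 26; config.pin_sccb_scl = 27;
  config.pin_pwdn = 32; config.pin_reset = -1;
  config.xclk_freq_hz = 20000000;
  config.pixel_format = PIXFORMAT_JPEG;  // JPEG keeps the upload small
  config.frame_size   = FRAMESIZE_VGA;   // 640x480 is enough for a scene description
  config.jpeg_quality = 12;
  config.fb_count     = 1;
  esp_camera_init(&config);

  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) delay(500);
}

void loop() {
  camera_fb_t* fb = esp_camera_fb_get();     // grab one JPEG frame
  if (fb) {
    HTTPClient http;
    http.begin(endpoint);
    http.addHeader("Content-Type", "image/jpeg");
    int code = http.POST(fb->buf, fb->len);  // raw JPEG body; the server handles the AI call
    Serial.printf("Upload: HTTP %d\n", code);
    http.end();
    esp_camera_fb_return(fb);                // release the frame buffer
  }
  delay(7000);  // one image roughly every 7 seconds
}
```

The server side then only needs to accept the JPEG body, pass it to gpt-4o-mini for a description, and update the page that the voice synthesis reads from.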
I am happy with the result and learnt a lot, but I think I will pause this project for now, at least until some shiny new tech emerges (cheaper open-source camera glasses).
I don't see much point in this over just using a phone app for the same thing, and such apps are slowly starting to appear.
I have not done any app development, and for this project I wanted to keep some things simple, to focus on what can be expected from a low-quality camera combined with AI-generated descriptions.
My hope is that there will be "cheap" camera glasses that you can use with different image-description services. There is a company called "Be My Eyes" that is developing an AI tool for image descriptions, which is probably miles better than anything I can come up with. https://www.bemyeyes.com/blog/introducing-be-my-ai
Be My Eyes seems to support the Ray-Ban Meta glasses, so hopefully "Be My AI" will too.
I understand the "not consulting the target audience" problem all too well, for instance braille signs that are placed at eye level and hard to find. Some workplaces are very keen to make accessibility adjustments, but mostly when the adjustments are visible, so that they can show others that something has been done, regardless of whether it actually helps.
https://altayakkus.substack.com/p/you-wouldnt-download-an-ai
But if you subscribe, you may see me doing the same with a surveillance camera soon(ish) :)
I think you need to triple-check whether users actually find that nice.
Assuming that keeping the text limited to what interests the user will stay an unsolved problem for the foreseeable future, I guesspect that they'd prefer a middle ground where they aren't continuously bombarded with text, but it's easy to get that flow going. For example, having the text feed on only while a button is held down.
I guesspect that because I think users would soon be fed up with an assistant that says there's a painting on the wall or a church tower in the distance every time they turn their head.
Both can be useful information, but not when you hear them for the thousandth time while in your own house/garden.
I wanted to create the opposite of needing to say "Hey Google, describe what is in front of me" or similar. Another point was to see how cheap/simple you could go and still get valuable information.