I made Browser-Use, an open-source tool that lets any LangChain-supported LLM execute tasks directly in the browser using function calling.
It allows you to build agents that interact with web elements using natural language prompts. We created a layer that simplifies website interaction for LLMs by extracting XPaths and interactive elements like buttons and input fields (and other fancy things). This enables you to design custom web automation and scraping functions without manually inspecting pages through DevTools.
Hasn't this been done a lot of times? Good question. As a general SaaS tool, yes, but I think a lot of people are going to try to build their own web automation agents from scratch, so the idea is to provide the groundwork/library for the hard parts so that not everyone has to repeat these steps:
- parse HTML in an LLM-friendly way (clickable elements + screenshots); a simplified sketch follows this list
- provide nice function calls for everything inside the browser
- create reusable agent classes
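To make the first bullet concrete, here's a simplified sketch of the idea, not the real implementation: turn a page into an LLM-friendly action space, i.e. a numbered list of visible interactive elements that the model can refer to by index in its function calls. The selector set and field names here are just for illustration.

    from playwright.sync_api import sync_playwright

    def extract_clickable_elements(url: str) -> list[dict]:
        # Collect visible interactive elements and describe them compactly for the prompt.
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url)
            elements = []
            # Illustrative selector set; a real extraction layer covers far more cases.
            for index, locator in enumerate(page.locator("a, button, input, select").all()):
                if not locator.is_visible():
                    continue
                elements.append({
                    "index": index,
                    "tag": locator.evaluate("el => el.tagName.toLowerCase()"),
                    "text": locator.inner_text().strip()[:80],
                })
            browser.close()
            return elements

    # The list is serialized into the prompt so the LLM can answer with
    # function calls like click(index=3) or type_text(index=7, text="...").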
What this is NOT: an all-knowing AI agent that can solve all your problems.
The vision: create repeatable tasks on the web just by prompting your agent, without caring about the how.
To better showcase the power of text extraction we made a few demos such as:
- Applying for multiple software engineering jobs in San Francisco
- Opening new tabs to search for images of Albert Einstein, Oprah Winfrey, and Steve Jobs
- Finding the cheapest one-way flight from London to Kyrgyzstan for December 25th
I’d be interested in feedback on how this tool fits into your automation workflows. Try it out and let me know how it performs on your end.
We are Gregor & Magnus and we built this in 5 days.
I see the README claims it's MIT licensed, but I couldn't find an actual license file, or license information in any of the source files.
Let me know how it goes if you try it.
I'd give my interest in Hell for a way to have a script plug data into a Java app.
But it would be interesting to see what happens with our pipeline using a pure vision model. Did you mean something else?
a) There were a test / eval suite to determine which model works best for what. It could be divided into a training suite and a test suite. (Training tasks can be used for training; test tasks only for evaluation.) Possibly a combination of unit tests against known XPaths, and integration tests that are multi-step and end in a measurable result (a rough sketch of the unit-test half follows after (b)). I know the web is constantly changing, so I'm not 100% sure how this should work.
b) There were some sort of wiki, or perhaps another repo or discussion board, of community-generated prompt recipes for particular actions.
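For the "unit tests against known XPaths" bit in (a), something like this hypothetical sketch: pin a saved copy of a page as a fixture so the test isn't at the mercy of the live web, then assert the agreed-upon XPath still resolves. The fixture path, XPath, and test name are all made up for illustration.

    from playwright.sync_api import sync_playwright

    # Assumed fixture: a frozen copy of a login page shipped with the test suite.
    FIXTURE_URL = "file:///fixtures/login.html"
    KNOWN_XPATH = "//form[@id='login']//button[@type='submit']"

    def test_known_xpath_still_resolves():
        # If the extraction layer claims this XPath is clickable, the fixture
        # should contain exactly one matching element.
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(FIXTURE_URL)
            assert page.locator(f"xpath={KNOWN_XPATH}").count() == 1
            browser.close()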
b) so, shadcn for prompts for web agents haha :) but I agree, that would be SICK! Just go to browser-use and get the prompt for your specific use case.
Actually, browser-use works quite well with vision turned off; it just sometimes gets stuck on some trivial vision tasks. The interesting thing is that the screenshot approach is often cheaper than cleaned-up HTML, because some websites have HUGE action spaces.
We looked at some papers (like Ferret-UI) but I think we can do much better on HTML tasks. Also, there is a lot of room to improve the current pipeline.
The computer use stuff gets me fired up enough that I end up always sharing this, even though when delivered concisely without breaking NDAs, it can sound like a hot take:
The whole thing is a dead end.
I saw internal work at a FAANG on this for years, and even in the case where the demo is cooked up to "get everything right", intentionally, to figure out the value of investing in chasing this further... it's undesirable, for design reasons.
It's easy to imagine being wowed by the computer doing something itself, but when it's us, it's a boring and slow way to get things done that's scary to watch.
Even with the stilted 100% success rate, our meatbrains cheerily emulated knowing it's < 100%; the fear is akin to watching a toddler a month into walking, except this toddler has your credit card, a web browser, and instructions to buy a ticket.
I humbly and strongly suggest to anyone interested in this space to work towards CLI versions of this concept. Now, you're nonblocking, are in a more "native" environment for the LLM, and are much cheaper.
If that sounds regressive and hardheaded, Microsoft, in particular, has plenty of research on this subject, and there's a good amount from diverse sources.
Note the 20%-40% success rates they report, then note that completing a full task successfully is a product of those 20%-40% per-step rates, so it collapses quickly (40% per step over five steps is about 1% end-to-end; see the snippet below). To get an intuition for how this affects the design experience, think how annoying it is to have to repeat a question because Siri/Assistant/whatever voice assistant doesn't understand it, and those assistants have roughly ~5 errors per 100 words.
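Quick back-of-the-envelope to show the shape of the compounding (illustrative numbers, nothing more):

    # Per-step success rate compounded over a multi-step task.
    for per_step in (0.2, 0.4, 0.8):
        end_to_end = {steps: round(per_step ** steps, 3) for steps in (1, 3, 5, 10)}
        print(per_step, end_to_end)
    # 40% per step over 5 steps is ~1% end-to-end; even 80% per step over 10 steps is ~11%.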
I hope that's clearer, I'm a bit over-caffeinated
I'm thinking maybe the goal/state stuff might have clouded my point. Setting aside prompt engineering, just think of the stock AI UIs today, i.e. chat-based ones.
Then, we want to accomplish some goal using GUI and/or CLI. Given the premise that I'd avoid GUI automation, why am I saying CLI is the way to go?
A toy example: let's say the user says "get my current IP".
If our agent is GUI-based, maybe it does: open Chrome > type in whatismyip.com > recognize IP from screenshot.
If our agent is CLI-based, maybe it does: run the curl command to fetch the user's IP from a public API (e.g. curl whatismyip.com) > parse the output to extract the IP address > return the IP address to the user as text.
In the CLI example, the agent interacts with the system using native commands (in this case, curl) and text outputs, rather than trying to simulate GUI actions and parse screenshot contents.
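A minimal sketch of what I mean (illustrative only, not any particular project's API; the IP service is just an example that returns plain text):

    import subprocess

    def run_cli_step(command: list[str], timeout: int = 30) -> str:
        # Run one agent-proposed command and hand its text output back to the LLM.
        result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
        return result.stdout.strip() or result.stderr.strip()

    # "Get my current IP" becomes a single non-interactive command whose output
    # is already structured text -- no browser, no screenshots to parse.
    print(run_cli_step(["curl", "-s", "https://api.ipify.org"]))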
Why do I believe that's preferable over GUI-based automation?
1. More direct/efficient - no need for browser launching, screenshot processing, etc.
2. More reliable - dealing with only structured text output, rather than trying to parse visual elements
3. Parallelizable: I can have N CLI shells, but only 1 GUI shell, which is shared with the user.
4. In practice: I'm basing this off observations of the GUI-automation project I mentioned (accepting that computer automation is desirable), and... work I did building an end-to-end testing framework for devices paired to phones, both iOS and Android.
What the? Where did that come from?
TL;DR: I've loved E2E tests for years, and it was stultifying to see how little they were used beyond the testing team due to flakiness. Even small things like "launch the browser" are extremely fraught: how long do we wait? How often do we poll? How do we deal with some dialog appearing in front of the app? How do we deal with not having a textual view hierarchy for the entire OS?
If you extract the entire HTML and CSS, your cost + inference time quickly become 10x.
[0] https://chromewebstore.google.com/detail/namebrand-check-for...
I also saw one doing Captcha solving with Selenium [1].
I will keep an eye on your development, good luck!
[0] https://www.multion.ai/ [1] https://github.com/VRSEN/agency-swarm
With captchas, the worst-case scenario is using a service to solve them as part of the agent flow. See the 2Captcha service.
Playwright and Selenium automate the browser itself, but with a Chrome extension you need to use the context of the current browser.
I'm not an expert in browser automation, so I found it challenging to move from Playwright to something completely browser-based.
What I did instead is have a loop that sends instructions through Playwright.
For instance, I open the browser, then enter a loop awaiting instructions (which can come from an event source such as Redis) to execute in the same browser. But it's still based on the session instantiated by Playwright.
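Roughly, the loop looks like this (simplified sketch; the queue name and message format here are just placeholders):

    import json
    import redis
    from playwright.sync_api import sync_playwright

    r = redis.Redis()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        while True:
            # Block until the next instruction arrives, e.g. {"action": "goto", "url": "..."}
            _, raw = r.blpop("browser-instructions")
            msg = json.loads(raw)
            if msg["action"] == "goto":
                page.goto(msg["url"])
            elif msg["action"] == "click":
                page.click(msg["selector"])
            elif msg["action"] == "quit":
                break
        browser.close()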
After having spent the last several years building a popular Chrome extension for browser automation [1], I was excited to see if LLMs could actually build automations end-to-end based on a high-level description. Unfortunately, they still get confused quite easily so the holy grail has yet to come. Still fun to play around with though!
Ah, well there's your problem. Your problem isn't Docker, nor is it Claude; it's that you're running Windows.
Compatible with any LLM and any agentic framework
* Cerebellum (Typescript): https://github.com/theredsix/cerebellum
* Skyvern: https://github.com/Skyvern-AI/skyvern
Disclaimer: I am the author of Cerebellum
I see Cerebellum is vision-only. Did you try adding HTML + screenshots? I think that improves performance like crazy, and then you don't have to use only Claude.
Just saw Skyvern today on previous Show HNs haha :)
BTW I really like your handling of browser tabs, I think it's really clever.
Thanks man, Magnus came up with it this morning haha!
Do you think: 1. websites will release more API functions for agents to interact with them, or 2. with tools like this we will transform the UI into functions callable by agents, and maybe even cache all the inferred functions for websites in a third-party service?
    from browser_use import Agent
    from langchain_openai import ChatOpenAI

    agent = Agent(
        task='Go to hackernews on show hn and give me top 10 post titles, their points and hours. Calculate for each the ratio of points per hour.',
        llm=ChatOpenAI(model='gpt-4o'),
    )
    await agent.run()  # inside an async context, e.g. asyncio.run(main())
Passing prompts to an LLM agent... waiting for the black box to run and do something... I guess I wrongly assumed regular APIs are more reliable, but you're right, they're basically black boxes too…
How would you imagine the perfect scenario? What would make LLM outputs less of a black box?
Can it use a headless browser?