Current stack:

- Next.js on Vercel
- Serverless functions for AI/LLM endpoints
- Pinecone for vector storage
Questions for those running AI in production:
1. What's your serverless infrastructure choice? (Vercel/Cloud Run/Lambda)
2. How are you handling state management for long-running agent tasks?
3. What's your approach to cost optimization with LLM API calls?
4. Are you self-hosting any components?
5. How are you handling vector store scaling?
Particularly interested in hearing from teams who've scaled beyond prototype stage. Have you hit any unexpected limitations with serverless for AI workloads?
1. Probably the best is fly.io IMHO. It strikes a nice balance between running ephemeral containers that can support long-running tasks and booting quickly to respond to a tool call [1] (rough Machines sketch after this list).
2. If your task is truly long-running (I'm thinking several minutes), it's probably wise to put Trigger [2] or Temporal [3] under it (workflow sketch below).
3. A mix of prompt caching, context shedding, and progressive context enrichment [4] (shedding sketch below).
4. I'm building a platform that can be self-hosted to do a few of the above, so I can't speak to this impartially. But most of my customers do not self-host.
5. To start with, a simple Postgres table with pgvector is all you need (minimal sketch below). But I've recently been delighted with the DX of Upstash Vector [5]. They handle the embeddings for you and give you a text-in, text-out experience. If you want more control, and savings at higher scale, I've heard good things about marqo.ai [6].
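For (1), here's a minimal sketch of booting an ephemeral container per tool call via the Fly Machines REST API. Hedged heavily: the app name, image, and sizing are placeholders, and you should check the current Machines docs for the exact payload shape.

```ts
// Hedged sketch: boot an ephemeral Fly Machine for a tool call, then let it
// destroy itself when the process exits. App name, image, and sizing are
// placeholders; requires a FLY_API_TOKEN with access to the app.
const res = await fetch("https://api.machines.dev/v1/apps/my-agent-app/machines", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.FLY_API_TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    config: {
      image: "registry.fly.io/my-agent-app:latest", // your prebuilt worker image
      guest: { cpu_kind: "shared", cpus: 1, memory_mb: 512 },
      auto_destroy: true, // machine is deleted once its main process exits
    },
  }),
});
const machine = await res.json(); // contains the machine id, state, etc.
```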
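For (2), a minimal Temporal workflow sketch with the TypeScript SDK. `runAgentStep` is a hypothetical activity; the point is that each step is durably checkpointed, so a crashed worker resumes mid-task instead of starting over.

```ts
// Minimal Temporal workflow sketch (TypeScript SDK). `runAgentStep` is a
// hypothetical activity defined in ./activities; each await is durably
// recorded, so a crashed worker resumes from the last completed step.
import { proxyActivities } from "@temporalio/workflow";
import type * as activities from "./activities";

const { runAgentStep } = proxyActivities<typeof activities>({
  startToCloseTimeout: "10 minutes", // generous budget for a slow LLM call
  retry: { maximumAttempts: 3 },     // retries survive worker restarts
});

export async function agentTask(input: string): Promise<string> {
  let state = input;
  for (let step = 0; step < 5; step++) {
    state = await runAgentStep(state); // checkpointed per step
  }
  return state;
}
```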
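For (3), context shedding in its simplest form is dropping the oldest turns once the prompt exceeds a token budget. Rough sketch; the chars/4 heuristic stands in for a real tokenizer.

```ts
// Rough context-shedding sketch: keep the system prompt, drop the oldest
// turns until the transcript fits the budget. chars/4 is a crude stand-in
// for a real tokenizer like tiktoken.
type Msg = { role: "system" | "user" | "assistant"; content: string };

const approxTokens = (text: string) => Math.ceil(text.length / 4);

function shedContext(messages: Msg[], budget: number): Msg[] {
  const [system, ...rest] = messages; // assume messages[0] is the system prompt
  const kept = [...rest];
  while (
    kept.length > 1 &&
    approxTokens([system, ...kept].map((m) => m.content).join("\n")) > budget
  ) {
    kept.shift(); // shed the oldest turn first
  }
  return [system, ...kept];
}
```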
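And for (5), the "simple Postgres table with pgvector" starting point is roughly this much code (node-postgres; the table name and 1536-dim embeddings are illustrative, and embedding computation is assumed to happen elsewhere):

```ts
// Minimal pgvector sketch with node-postgres. Table name and 1536-dim
// embeddings are illustrative; assumes the `vector` extension is available.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

await pool.query(`
  CREATE EXTENSION IF NOT EXISTS vector;
  CREATE TABLE IF NOT EXISTS docs (
    id BIGSERIAL PRIMARY KEY,
    body TEXT NOT NULL,
    embedding VECTOR(1536)
  );
`);

// Nearest neighbors by cosine distance (`<=>` is pgvector's cosine operator).
async function search(queryEmbedding: number[], k = 5) {
  const { rows } = await pool.query(
    "SELECT body FROM docs ORDER BY embedding <=> $1::vector LIMIT $2",
    [JSON.stringify(queryEmbedding), k],
  );
  return rows.map((r) => r.body as string);
}
```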
Happy to talk more about this at length. (E-mail in the profile)
[1] https://fly.io/docs/reference/architecture/
[2] trigger.dev
[3] temporal.io
[4] https://www.inferable.ai/blog/posts/llm-progressive-context-...
[5] upstash.com/vector
[6] marqo.ai
I actually tried fly.io briefly with Next.js apps and the deployment experience was smooth. Really interesting to hear you're using it for AI workloads too.
For fly.io with AI workloads: Are you using their Machines or Apps? I'm particularly curious about how you're handling cold starts for LLM tasks, since that was one thing I loved about fly.io for regular Next.js deployments - the cold starts were minimal.