Dive into what LLMOps is
Rohit from Portkey joins Weaviate Research Scientist Connor for a deep dive into the differences between MLOps and LLMOps, building RAG systems, and what lies ahead for building production-grade LLM-based apps. This and much more in this podcast!
Rohit Agarwal on Portkey - Weaviate Podcast #61!
Host: Hey everyone. Thank you so much for watching the Weaviate podcast. I'm super excited to welcome Rohit Agarwal from Portkey. Portkey is a super exciting company, making it easier to use LLMs: serving them in production, routing across different LLMs, saving costs, and all sorts of interesting details that we're about to dive into.
Firstly, Rohit, thank you so much for joining the podcast.
Rohit: Excited to be here, Connor. Looking forward to the discussion.
Host: Awesome. So could we kick things off with kind of the founding vision of Portkey, sort of the problem that you set out to tackle?
Rohit: Absolutely. I've been building LLM applications for the last three years or so at multiple companies, and as we built out those applications, we realized something.
Building out the first version is super easy. You can do it as part of a hackathon, and you get really excited because in 48 hours you've built something that's worthwhile and valuable. LLMs and vector DBs are really powerful tools at your disposal. So you build it out and you're now excited to show it to the world.
The minute you start doing that, or getting to production, you realize that a lot of the engineering challenges that were solved in the traditional engineering world are probably not solved in the LLM engineering world. Think about it: you're so used to using Datadog for logging, but how do you put these large chunks of strings into Datadog?
It doesn't work. Similarly, vector databases behave a little differently than regular databases. So how are you monitoring them? How do you make sure that your app stays reliable, secure, and compliant? For a lot of these things, the core layers of DevOps have not been built for the LLM world. An example I often like to take is that for every other API, you get a success or an error.
So it's very deterministic, but for an LLM or even a vector database, it's all probability. Everything's between zero and one.
So every system sort of has to adapt to this new reality of probabilities and not, you know, deterministic outputs. That was sort of the background where we felt, okay, there needs to be an LLMOps company.
We had written down sort of a manifesto back in December and January on what this could look like, and then we started building this out in early March.
Host: There are so many exciting points here. The Datadog monitoring point is a core topic for Weaviate. I want to ask: LLMOps versus MLOps.
Can you maybe separate the two a little more?
LLMOps versus MLOps
Rohit: Absolutely. I think it's very interesting. Some people say that LLMOps is a subset of MLOps. I would disagree a little. The core difference is that MLOps focuses more on servers, while LLMOps is more on services.
For MLOps, you have core metrics like drift and accuracy. However, these terms are often unheard of in the LLM world, unless you're building your own foundational model, where you might worry about drift. In 99 percent of use cases, you're using a deployed LLM model that’s pre-trained and ready to use.
So, you're more concerned with using the APIs—what the latency is and how accurate they are. This involves evaluations, output validations, retry metrics, and other factors, in contrast to MLOps metrics.
The difference is significant, especially regarding test data and training data. For LLMs, the majority of companies aren't doing any testing or training at all; they just pick an API and start building. I think that highlights the big difference between MLOps and LLMOps.
Host: Yeah. I think before we delve further into the more academic topics around LLMs and machine learning, I'm really curious about the state of the model inference API market.
Model inference API
Rohit: I think it’s amazing. A year or two ago, you had to deploy everything on your own. People would rush toward a SageMaker setup where they were deploying their own models and testing them out. There was so much to be done, and everyone was busy for six to eight months. Then something would come out, and you’d iterate and move forward from there.
With the introduction of inference API endpoints, the game completely changed. Now, most companies are happy to let someone else manage their deployment. How they speed up their GPUs and effectively utilize the entire core of the machine is not my expertise, and I’m not going to spend time on it. What you can do is use these APIs to solve business problems.
Mosaic posted an interesting statistic, and Databricks noted that LLMs are probably their fastest-growing and fastest-adopting segment compared to anything they’ve seen before. This shift is likely because, earlier, you were constrained by a lot of data science, machine learning training, and testing before introducing business logic. Now, business logic can be introduced on day one, with testing and training happening later when you’re focusing on accuracy, latency, or cost.
I think we digressed a bit from your question, but inference API endpoints have made it super easy for more companies to utilize transformers as a technology. Consequently, fewer companies want to deploy foundational models themselves. Instead, they seek next-level fine-tuning, data privacy, or compliance metrics. There are various companies offering these services at very affordable prices.
We’ve worked with players like Banana Model before; it’s so easy to collaborate with them. They say, “Hey, we already have these models pre-deployed; you can fine-tune with us, and go live,” and it makes deploying a model straightforward. The LLM world is getting commoditized to a large extent, where I don’t need to manage all these GPUs myself.
In fact, I see three layers of companies evolving. First, there are companies providing bare-metal GPUs, where you deploy and train a model yourself; that's not for the faint-hearted. You need ML expertise and data scientists in place. Then, you have the next level of infrastructure companies that deploy open-source models or help you fine-tune them at their end. DLFix does this to some extent with MPT and others.
The third category includes companies like OpenAI, Anthropic, and Cohere, which offer end-to-end solutions. You just need to manage your API key, and everything else is taken care of. Depending on how deep you want to go or how mature you are as an AI company, organizations tend to choose different layers.
Host: Yeah. Later in the podcast, I definitely want to get your thoughts on running closed models versus fine-tuning language models, and where your sentiments lie on that. But I want to stay focused on Portkey and the product first. That was a really great overview; I love the three levels, from bare metal to running open-source models or fine-tuning them without having to worry about deployment. I think that’s a super interesting emergence in this market.
So, something I find fascinating about Portkey is that I use GPT-4, I use Cohere's Command nightly, and I use Anthropic's Claude, but I'm getting rate-limited by OpenAI. How do you think about managing multiple LLMs in applications?
Managing multiple LLMs in AI applications
Rohit: Absolutely. Think of this in parallel to the traditional DevOps world, which has the concept of API gateways and load balancers. We’ve implemented similar concepts here with an AI gateway or an LLM gateway that connects and load balances across multiple keys or accounts of OpenAI, or even between OpenAI and Anthropic.
This setup can handle load balancing, fallbacks, retries, and even canary testing. Portkey makes all of this possible by connecting to various providers—both closed-source and open-source, including Banana and others. The user only needs to call one endpoint and define a configuration for how they want the call to traverse.
Portkey orchestrates the entire call, ensuring you get the fastest response with the best accuracy while also being cost-efficient. You can choose whichever model you want at any time, and Portkey manages the rest.
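To make the gateway idea concrete, here is a minimal sketch of what such a routing layer could look like. The config fields, weights, and the `call_provider` hook are illustrative assumptions, not Portkey's actual API.

```python
import random

# Illustrative gateway-style routing config; the field names and weights are
# hypothetical, not Portkey's actual schema.
ROUTING_CONFIG = {
    "strategy": "loadbalance",
    "targets": [
        {"provider": "openai", "model": "gpt-3.5-turbo", "weight": 0.6},
        {"provider": "azure-openai", "model": "gpt-35-turbo", "weight": 0.4},
    ],
    "fallbacks": [
        {"provider": "anthropic", "model": "claude-2"},
    ],
    "retries": {"attempts": 3, "on_status": [429, 500, 503]},
}


def pick_target(config):
    """Weighted random choice over the load-balanced targets."""
    targets = config["targets"]
    weights = [t["weight"] for t in targets]
    return random.choices(targets, weights=weights, k=1)[0]


def call_with_fallback(prompt, config, call_provider):
    """Try the load-balanced target first, then walk the fallback list.
    `call_provider(target, prompt)` is a placeholder for the actual SDK call."""
    candidates = [pick_target(config)] + config["fallbacks"]
    last_error = None
    for target in candidates:
        try:
            return call_provider(target, prompt)
        except Exception as err:  # in practice: rate limits, timeouts, 5xx errors
            last_error = err
    raise last_error
```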
Host: So, let me quickly ask about the model inferences you're routing. Are you mostly seeing OpenAI, Cohere, or maybe Azure or Bedrock? What kind of ensembles of APIs do you typically see organized this way?
Rohit: Yeah, it's been surprising to me. The most common setups are actually OpenAI and Azure because people are trying to maximize their limits, and Azure offers extra rate limits. Sometimes, Azure is faster than OpenAI's endpoints. So, multiple accounts of OpenAI, or a combination of OpenAI and Azure, are the most commonly load-balanced systems we see on Portkey today.
Many companies are also trying out Anthropic as a fallback because they don’t want to be vendor-locked into only OpenAI. They’re uncertain about how OpenAI’s model will evolve. With Portkey, it’s super easy to set this up; you don’t have to configure anything on your end, and we manage the fallbacks to multiple providers seamlessly.
Another interesting trend that’s just beginning to emerge involves companies like Banana. As organizations become more mature and start serving more calls, they have the data to fine-tune a foundational model for a specific use case. They’re saying, “Let’s try load balancing between OpenAI and my fine-tuned model and evaluate the results.”
For example, we might implement an 80/20 strategy. Drawing from traditional engineering concepts, it’s similar to blue-green deployments. A company may want to use their fine-tuned model but isn’t sure how well it performs. So, they’ll send 80 percent of their calls to OpenAI and 20 percent to their fine-tuned model. They can then use evaluations and feedback to compare performance. If the fine-tuned model performs well, they can gradually increase the load until they reach 100 percent.
This approach is becoming an interesting space for companies as they mature.
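As a rough sketch of that 80/20 blue-green idea, the callables, evaluation scores, and step size below are placeholders rather than anything Portkey-specific.

```python
import random


def route_canary(prompt, call_openai, call_finetuned, finetuned_share=0.2):
    """Send roughly `finetuned_share` of traffic to the fine-tuned model and
    the rest to OpenAI, returning which arm served the request so evaluations
    and feedback can be attributed to it."""
    if random.random() < finetuned_share:
        return "finetuned", call_finetuned(prompt)
    return "openai", call_openai(prompt)


def next_share(current_share, finetuned_score, baseline_score, step=0.1):
    """Ramp the canary up only while its evaluation score keeps pace with the
    baseline; back off otherwise."""
    if finetuned_score >= baseline_score:
        return min(1.0, current_share + step)
    return max(0.0, current_share - step)
```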
Host: Yeah, I think you painted the picture perfectly regarding how to deploy a new model and integrate it with your software. It's all very clear. Now, maybe we could step into the broader topic of fine-tuning LLMs.
To share a quick story, I've been looking heavily into the Gorilla models, which involve fine-tuning a language model to use a specific set of APIs. In our case, we're fine-tuning language models to use the Weaviate API. This could result in an auto API feature where you approach Weaviate and say, "Hey, I want to make a BM25 search," but you don't yet know the GraphQL syntax. The language model would then produce the necessary GraphQL for you.
This has been my experience with fine-tuning language models. It’s becoming easier; there’s even a Replicate tutorial on Llama 2 that shows how to fine-tune an LLM. With Mosaic ML being acquired, this approach is gaining traction.
So, I’d love to hear your sentiment on the evolution of fine-tuning LLMs and how common it’s becoming in the market.
Evolution of fine-tuning LLMs
Rohit: I think we’re witnessing a maturity curve among organizations adopting LLMs. Companies typically start with something very safe and well-defined, then move toward extracting the next level of efficiency.
There are probably two or three key factors at play. First is maturity; as organizations mature, they seek to fine-tune their models for better results, to outsmart the competition, and to build their own advantages. This is where fine-tuning becomes particularly useful.
Second, some companies are very particular about data and privacy, preferring to work only with their own hosted foundational models. It's interesting that Azure now offers co-located servers.
I don't know exactly what they call it, but they say they can deploy, or "VPC-peer," the OpenAI APIs into your VPC, and that's how it works. I think that's where companies are becoming more interested in fine-tuning foundational models. Llama 2 is really good in terms of actual business use cases and how you can get started with it, especially for smaller use cases like Gorilla, which is a beautiful example.
I’d love a world where you don’t need to invoke or remember the API endpoint, make mistakes in the body of the post request, and then fail. Instead, you could just send a text query saying, “Hey, this is what I want.” If something fails, it could reply back in text, and I can respond again.
I think that can create a natural conversation, where maybe Connor and Rohit are talking on a podcast, and servers can talk to each other using natural language, establishing a new API standard. That would be fantastic, and these limited use cases could be handled much faster using a fine-tuned LLM.
I think that’s where people are now trying to figure out what these small use cases are, pick them out, and create fine-tuned LLMs on top of them. It would be unjust not to mention that almost everyone is currently looking forward to OpenAI releasing fine-tuning capabilities for GPT-3.5 and GPT-4. That could be a major step forward for a lot of people because these models are already really good. If I can fine-tune them on my own data and get all their latent capabilities to my data, that would just be fantastic.
I'm definitely looking forward to more and more fine-tuning use cases, both on foundational models and with the players that are already ruling the market.
Host: Yeah. I've been thinking about a couple of interesting things with fine-tuning. The first one, I'm not sure if it's worth continuing the conversation on, but it's this idea: earlier, maybe seven months ago, I had Jonathan Frankle from MosaicML on the podcast, and he explained the idea of continued language modeling on your data.
So, say I want to do this Gorilla model that writes Weaviate APIs. Instead of jumping right to instruction tuning—like, did you write the API correctly?—you also continue language modeling on a gigantic dataset, like Weaviate documentation. It's kind of like an intermediate step. I think about whether that's really needed. That’s one topic.
But the second topic, which I think is more interesting, especially for Weaviate and particularly retrieval-augmented generation, is something I took from the Gorilla paper. They’re going to be doing retrieval-aware fine-tuning.
What this would look like is, most of these tutorials on how to fine-tune an LLM with Replicate just say, "Give us the data." You retrieve data, put it in, and then it’s still instruction-tuned. You put the retrieval with the input. I think this kind of retrieval-aware fine-tuning is going to be important.
Especially for us—maybe I could ask you broadly—how do you see retrieval-augmented generation? With Weaviate and vector databases, that’s like the gold mine right now.
Retrieval-Augmented Generation
Rohit: Totally. Yeah, I think RAG (Retrieval-Augmented Generation) is definitely the flavor everyone’s using. It's amazing when it works, right?
I remember a blog post by Goran back in late 2020 where he talked about context stuffing to make LLM outputs better. Context stuffing used to be about examples, context, or anything you could give the LLM to constrain the output it produced. A lot of companies, especially early content generation companies, used it quite a bit.
Then, when RAG implementations came out, it was amazing how people started using it for a variety of use cases. You have base RAG, and now you also have interesting chain-of-thought experiments on top of it. It’s like, "I've asked you a question, now let's do a chain of thought, reason through it, classify, and get to the right answer accurately." So, that whole chain evolving is really fascinating.
RAG itself has multiple flavors. I was talking to someone yesterday about how internal company use cases versus external company use cases can be very different with RAG.
You need a lot more permissions, access management, and control management for internal use cases compared to external ones. For instance, if you're doing customer service, there might be fewer restrictions on permissions, but you're searching a much larger dataset. So, how do you manage that using something like a hybrid search in Weaviate?
Those are some of the interesting concepts people are now exploring. Another big concern is data leakage. We've been actively discussing this with a customer who’s worried about leakage now that all of this data, which used to be segmented in their database, is being moved to a vector database. How do the embeddings stay as segregated as their original data? And what are the compliance requirements around that?
These are all the questions people are tackling as they move from basic RAG implementations to addressing legal, security, and compliance concerns.
Host: Yeah, I'm not sure I see it with embeddings all the way yet, but I remember with language models how you'd prompt one with something like, "Emails from Bob Van Light," to get it started. That could be a problem with language models.
Yeah, and that multi-tenancy issue: when we get to 1.20, it introduces a major revamp for that. For listeners, Etienne Dilocker, the CTO, explains it well. That's something I wouldn't have grasped just by looking at the papers; you really need to go out there and see what a problem it is.
It’s all really interesting. So, regarding fine-tuning, there are significant compliance issues related to business sensitivity. This ties into whether GPT can really be your doctor, for example.
Pivoting a bit, I want to return to Portkey and discuss the cost savings of running inference. We talked about a load balancer that primarily manages rate limits. I think people know that if you exceed the limit, it says, “Hey, stop it.”
You might also consider first asking the question to a cheaper language model, seeing the answer that comes back, and then deciding whether to send that to the user, rather than immediately going to the more expensive one.
Rohit: Yeah, absolutely. We're starting to see implementations where people make an LLM call to the cheaper model for evaluation, and then decide whether to send it forward. A simpler approach is to use the cheaper API call, especially since it works 80% of the time. Users can regenerate responses, particularly for low-hanging use cases where precision isn’t critical.
In this scenario, the second call would go to GPT-4, creating a mix of both models. The pattern of an evaluation LLM call followed by a second call is an interesting trend we're seeing among some companies, particularly because there's often a 10x price difference, making it much more cost-effective.
Additionally, in terms of canary testing, many people will initially make calls to both APIs to gauge their similarities. This gives them a minimal level of confidence that the cheaper model performs at least 80% of the time, allowing them to gradually expose that option more widely.
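A minimal sketch of that cheaper-model-first pattern, assuming you already have callables for both models and some lightweight evaluation; all names and the 0.8 threshold are placeholders.

```python
def answer_with_escalation(prompt, call_cheap, call_expensive, evaluate,
                           threshold=0.8):
    """Try the cheaper model first and escalate only when a quick evaluation
    says the draft isn't good enough. `evaluate` could be a small LLM-as-judge
    call or a simple heuristic; the threshold is illustrative."""
    draft = call_cheap(prompt)
    score = evaluate(prompt, draft)
    if score >= threshold:
        return {"model": "cheap", "answer": draft, "score": score}
    return {"model": "expensive", "answer": call_expensive(prompt), "score": score}
```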
Host: Yeah, I think there’s already significant value in routing based on rate limiting and accuracy. Diving into a more academic perspective, there’s the idea of having GPT-4 as a master model that routes requests, similar to how Gorilla operates. In this scenario, it could route to a tool while a smaller LLM formats the request, particularly for something like a Weaviate GraphQL API.
There's also the concept of turning the Hugging Face model hub into a sort of app ecosystem, where each model functions like an app. For example, you might have an image segmentation model that generates a segmentation mask, which is then sent to a classifier. Do you think this kind of orchestration of models could be a potential direction for Portkey?
Rohit: Yeah, I think not right now. We've primarily focused on production use cases where companies have already implemented solutions. The idea of agents and tools is still in its early stages and hasn’t been widely deployed yet.
While there's significant academic interest in viewing models as tools within agents, I haven't seen anyone successfully deploy this concept. Instead, we often see a simple prompt router in front of the LLM, which decides which model to send the call to or whether it requires additional context from a vector database before passing it to the LLM.
In many business use cases, a smaller model tends to be more efficient. Adding latency from an entire chain of operations to get a response isn’t desirable. Businesses want to minimize delays, so we aim to drop off at the right point to avoid unnecessary latency.
We'll touch on caching later, but I think semantic caching is crucial for delivering faster responses. Latency is something businesses truly care about. For asynchronous tasks where a delayed response is acceptable, we can experiment with routing across multiple LLM calls. However, for most scenarios, the goal is to get answers as quickly as possible.
Host: Yeah, I love the idea of using LlamaIndex and query routing. For example, having one language model that analyzes a query and determines whether it’s a vector search query or a SQL symbolic query is fascinating. You don’t need a massive 300-billion parameter model to handle that kind of routing.
It’ll be interesting to see how LLM frameworks evolve to host different models for various routing or tool selection needs. This could become a whole new area of development.
But let's dive into our debate topic from our meeting in Berkeley: semantic caching. I admit that when we first discussed it, I was a bit skeptical. I thought the nuances of prompting were too specific for caching to be effective. However, you explained its relevance in question answering, and I see how semantic caching for LLM calls could be quite intriguing. Could you set the stage on this topic?
Semantic Caching for LLM calls
Rohit: Absolutely. Caching has been a topic of discussion for a long time, and we’ve seen effective caching use cases with LLMs. However, when we implemented caching for our users, we observed a cache hit rate of only about 2%.
This isn’t surprising, given that users often ask the same questions but in different ways. The promise of LLMs is that they eliminate the need for rigid, deterministic inputs, which makes caching more challenging since the queries aren't identical.
However, in specific use cases like customer support or employee support, there’s still a significant overlap. For instance, 60% to 70% of employee or customer queries tend to be similar, especially if we look at a narrow time frame, like the last seven days.
How can we ensure faster responses for previously answered questions? For example, if one person asks, “Does Weaviate have a free developer plan?” and another asks, “I’m a developer; does Weaviate have a plan?” Both inquiries essentially seek the same information.
To handle this, we could either use a complex LLM call or extract the intent from incoming prompts, storing it as an embedding in a vector database. By matching incoming queries to our list of previously answered questions, we can effectively build a semantic cache.
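A minimal in-memory sketch of that idea, assuming an `embed` function and a cosine-similarity threshold chosen for your data; in production the index would live in a vector database such as Weaviate, and the threshold would need calibration, as discussed below.

```python
import numpy as np


class SemanticCache:
    """Embed incoming prompts, compare them to embeddings of previously
    answered prompts, and reuse the stored answer when similarity clears a
    threshold."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed            # any embedding function
        self.threshold = threshold    # illustrative; calibrate per use case
        self.entries = []             # list of (embedding, prompt, answer)

    @staticmethod
    def _cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def lookup(self, prompt):
        query = self.embed(prompt)
        best_answer, best_sim = None, -1.0
        for emb, _cached_prompt, answer in self.entries:
            sim = self._cosine(query, emb)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        if best_answer is not None and best_sim >= self.threshold:
            return best_answer        # cache hit
        return None                   # cache miss: fall through to the LLM

    def store(self, prompt, answer):
        self.entries.append((self.embed(prompt), prompt, answer))
```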
We started exploring this problem and aimed to implement semantic caching. The results were impressive: without any optimizations, 20% of our queries achieved 90-96% accuracy, and these queries were 20 times faster.
However, there are challenges. I recall our discussion about Milvus launching GPTCache, which made me skeptical. The fundamental issue with a basic semantic cache implementation is that, while it can theoretically work, it doesn't address the edge cases. In the 5% of instances where it fails, it may provide absurd answers very quickly, leading to the impression that the response came from the cache.
It's absolutely wrong, and you've probably leaked some data because your question and answer are completely different. That was a question asked by somebody else, and the context was very different. This is where we started our journey to understand semantic caching and how it works.
We ended up implementing diff-match-patch libraries. We extract relevant parts of the prompt that need to be cached, as the majority of the prompt is similar and doesn't need to be cached. The challenge is determining how to extract that information. There’s a lot of post-processing involved to ensure there’s a quick evaluation check to see if the question and answer are relevant to each other.
Additionally, we run a constant evaluation model to determine the right vector confidence level to achieve 99% or 99.5% accuracy. It’s this entire production chain that builds a robust semantic cache. We’ve spent a lot of time fine-tuning these small details to ensure the entire system works effectively.
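One way to picture that calibration step: sweep candidate thresholds over a labeled set of query pairs and keep the loosest one that still hits the accuracy target. The numbers and structure here are assumptions, not Portkey's actual evaluation pipeline.

```python
def calibrate_threshold(eval_pairs, similarity, target_accuracy=0.995,
                        candidates=(0.85, 0.88, 0.90, 0.92, 0.94, 0.96, 0.98)):
    """Pick the loosest similarity threshold that keeps cache answers accurate.
    eval_pairs: list of (query, cached_query, is_same_intent) labeled by humans
    or by an evaluation model; similarity: score in [0, 1] for two queries."""
    for threshold in sorted(candidates):
        served = [(q, c, ok) for q, c, ok in eval_pairs
                  if similarity(q, c) >= threshold]
        if not served:
            continue
        accuracy = sum(ok for _, _, ok in served) / len(served)
        if accuracy >= target_accuracy:
            return threshold          # lowest passing threshold maximizes hit rate
    return None                       # no candidate meets the accuracy target
```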
Now, we see that customers start on day one with 15% to 20% of their responses served from cache. In some cases, we've seen this number rise as high as 60%, which is phenomenal for them in terms of cost savings and user experience. We still need to work continuously to improve accuracy, but I believe we've reached a stage where it's super useful for enterprise search use cases and customer support.
Host: Yeah, first, I think there were a ton of nuggets and great details, and I really enjoyed it. I liked how you ended up with the idea of checking for failure cases. Can you calibrate the vector distance of the similarity from the query to the questions you have cached? I find that to be very interesting.
We've looked into autocut, where we plot vector distances on the Y-axis and results on the X-axis, checking the slope to say, "only give me three out of the hundred search results." However, for the top-one case, I don't see a good solution. Calibrating your vector distance scores for your specific use case is tricky.
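For readers unfamiliar with the idea, a toy version of that distance-based cutoff might look like this; the `jump` size is an illustrative knob, not Weaviate's actual autocut parameter.

```python
def autocut(distances, jump=0.1):
    """Keep results up to the first large jump in vector distance.
    `distances` is assumed sorted ascending (closest result first)."""
    if len(distances) == 0:
        return 0
    keep = 1
    for prev, curr in zip(distances, distances[1:]):
        if curr - prev > jump:
            break
        keep += 1
    return keep


# e.g. autocut([0.10, 0.12, 0.13, 0.35, 0.36]) -> keep the first 3 results
```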
I might be going off-topic, but there's also the concept of conformal prediction, which seems relevant; like causal inference, it's an emerging category of research that's becoming more prominent. It appears that research on uncertainty is catching up.
Taking a step back, I remember brainstorming with Weaviate about how to add a site search. I thought one interesting approach would be to create a frequently asked questions section, where the embedding of the query could be matched to one of the frequently asked questions.
But then you mentioned this cold start phase of 15 to 20 percent evolving to 60 percent, which is interesting because you might not have a comprehensive FAQ yet. So, you can start using your language model QA system until you build up this base, and that's where the cache becomes critical.
It’s really fascinating to me. I can also imagine the risk of data leakage—if I come to you with a specific schema while asking about my bug, and then you end up providing some of that information in your response. A potential solution could be a language model that takes my query and suggests a rewrite to make it more abstract, and then that result is what gets saved in the cache.
I think this could really supercharge question-answering systems. The whole question-answering aspect is one of the biggest applications, but my question is: when is a task suitable for just embeddings and vector search alone, compared to when you would need a generative model for that task?
Rohit: Yeah, I think for anything related to classifications—like classification, topic modeling, or anything that has a deterministic answer—you need a deterministic question to receive a deterministic answer. In those cases, embeddings are probably the best solution.
Let's break it down: for deterministic input and deterministic output, embeddings work well. For subjective input but deterministic output, you might use functions or embeddings. You can have an LLM call that returns JSON or an embedding; it depends on which performs better.
For subjective input and subjective output, you definitely need an LLM call, ideally with a cache on top if you're seeing a lot of similar questions coming in. This is where I see RAG (retrieval-augmented generation) use cases, Q&A use cases, and search use cases—all of which involve subjective input and output, necessitating an LLM call with a vector database in place.
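Rohit's matrix can be summarized as a small decision table; this is just a restatement of what he said, not a prescriptive rule.

```python
def choose_approach(input_kind, output_kind):
    """Toy lookup for the deterministic/subjective framing above."""
    table = {
        ("deterministic", "deterministic"): "embeddings / vector search",
        ("subjective", "deterministic"): "function calling or embeddings, "
                                         "whichever evaluates better",
        ("subjective", "subjective"): "LLM call with retrieval (RAG), "
                                      "plus a semantic cache on top",
    }
    return table.get((input_kind, output_kind), "unclear: evaluate both")
```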
What have you learned from observing all these systems at Weaviate so far?
Host: Well, I think one example is Kapa AI, which was set up with Weaviate. They just got into Y Combinator, and it's really exciting to see their potential for growth. It makes me think about how many people criticized apps that were seen as just wrappers around the GPT API. But with Kapa, they're collecting customer question-and-answer pairs, similar to what we've done with Flag in the LlamaIndex Discord.
By building up this dataset, their competitive advantage lies in the data they can cache for cost savings. This light bulb is going off in my head because you don’t necessarily need to create a moat through fine-tuning with gradients, which is often what AI practitioners think—they believe you have to train the model intensively.
Instead, having a well-curated dataset for retrieval-augmented generation can be just as effective. I’d love to pass it back to you to discuss the debate between improving performance by fine-tuning the language model versus focusing on great data for your retrieval-augmented generation stack.
Rohit: Yeah, I would love to conduct a proper benchmark on this, as I honestly haven't seen one yet. My general guess would be that if it's a very repeatable use case, then fine-tuning might work out better. For outputs of similar character lengths that contain almost the same information, the parameters may vary, but fine-tuning is great for constraining the model. You can keep improving accuracy and refining your outputs.
Fine-tuning really shines in scenarios like product descriptions, where you can keep constraining and enhancing your dataset. However, the moment you aim for variance, that’s when you need embeddings and vector search. At least that’s been my practical experience so far.
Fine-tuning and embeddings can coexist effectively. I would love to see an impressive retrieval-augmented generation (RAG) use case where a model is fine-tuned, and the right context is passed in. This way, the model understands its helpful tone and has context about previous questions and interactions. However, providing context in the prompt is more valuable than relying solely on the LLM’s memory, as it contains so much information that it can be challenging for it to prioritize the right details at the right time.
Host: Yeah, I definitely have some thoughts on fine-tuning. The gorilla experiment I mentioned earlier is retrieval-aware fine-tuning, where you retrieve the API documentation and then get the schema for how to perform BM25 searches. It would then write the BM25 query.
So, for me, fine-tuning is about steering the model into a narrow pocket of functionality. If I fine-tune it on 60 APIs and use search APIs, but I've only trained it on 40, it might not generalize well to the remaining 20. Or if I train it on all 60 and then 20 new ones come out, it might struggle to generalize to those new APIs as well as GPT-4 can. That would significantly impact my sentiment on fine-tuning.
Additionally, I've been experimenting with summarizing podcasts using one of those summarization chains. In this method, you receive the clips one by one, along with the summary so far, and then use that to improve understanding. However, it doesn't seem to deeply comprehend the context.
I want to pivot the conversation back to being at the cutting edge with Portkey. This new layer in the software stack gets us closer to cheaper LLM inference. I’d like to discuss two things that cheaper LLM inference could unlock. Sorry for the context switch, but the first is this tree of thoughts concept, where we could explore multiple pathways. Aside from caching, I suppose there could be potential to cache these paths. As inference becomes cheaper, what do you think the potential of that could be?
Rohit: Absolutely. Many tasks that once required deep decision trees can now be handled by LLMs through reasoning. However, this will depend on both cost and latency. If your routing takes three seconds and inference takes another three, you've essentially doubled your latency, which is undesirable.
So, cost is a significant factor; without cheaper options, routing won't be feasible. However, if I could perform two queries at a manageable cost and inference time, that would be revolutionary. I can envision a scenario where many software developers would stop writing workflow code altogether.
Why do we need to create workflows and software, which are essentially just decision trees, when an LLM can make decisions extremely quickly, cheaply, and accurately? A compelling example of this is the rise of "use and throw" software. Imagine spinning up a quick Replit instance and having ChatGPT write code for me. For instance, I might say, "Create 50 PDFs with this text in it—these are appraisal letters for all my employees. Upload them to my Google Drive link attached here. Ask me any questions."
If the LLM could navigate the entire process and deliver the output, it would involve routing, inference, and a bit more. This could make things really interesting because workflows might become unnecessary; we could simply rely on the LLM to build them for us.
This differs from agents, which perform multiple inferences, gather data points from those inferences, and use them accordingly. Instead, I envision the LLM taking control and creating the entire decision tree. However, I don’t think we’re there yet; we need extremely fast and cost-effective inference for this to become a reality.
Host: I completely agree with that perspective. Earlier in a podcast, Colin Harmon described Auto-GPT as a sort of search for workflows. It also reminds me of the Demonstrate-Search-Predict approach from the lab at Stanford, where they provide a few input-output examples and you compile the workflow that would produce those inputs and outputs. All of this is moving quite fast.
As the cost of inference gets cheaper, it brings us to the next point I want to discuss: generative feedback loops. We’ve been trying to evangelize this idea, which involves generating content and then saving it back into your vector database.
A simple example could be taking my new blog posts and personalizing a message, like "Hey, why you should care about this blog post," to everyone in my CRM. After that, I save the responses back in a database. Broadly speaking, how do you gauge interest in that idea? We’ve already discussed it with semantic caching, which is essentially about answering questions and saving the responses.
What excites me most is the concept that vector search is about scaling vector distance calculations. Some people joke on Twitter about implementing a vector database just using NumPy to brute-force through a hundred thousand vectors. However, the real excitement lies in searching through tens of millions or even billions of vectors. The potential of generative feedback loops can be a game changer, especially as everyone begins to consider handling data at that scale.
For instance, if I link to my blog post and the system writes and compares relevant parts of the content, it creates a useful latent space. What do you think about that idea?
Generative feedback loops
Rohit: Yeah, I think it’s interesting. I haven’t given it much thought until now, but I can imagine people building data lakes to store a bunch of data for querying or training purposes. This concept seems almost like a vector data lake or vector link, where you have so many embeddings that you can perform various computations, calculations, and training on it.
Semantic caching is a similar idea; we're storing outputs for specific use cases. I can definitely think of evaluations, validations, and even fine-tuning as potential applications.
What other use cases have you seen for generative feedback loops?
Host: Well, I agree that semantic caching might be the new headline for this concept. The idea of using a newsletter with a CRM to write personalized messages is interesting. Another aspect specific to the search is that if I take our podcast and start indexing it, I can use the raw podcast clips. I pass those to a language model and ask it to summarize the content.
By indexing the summaries, I achieve much better search results because the transformed content enhances the embedding. This has been my favorite approach in the search context—transforming the data for a better index.
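A small sketch of that summarize-then-index loop, with all the callables (`summarize`, `embed`, `index_store`) as placeholders for whatever LLM, embedding model, and vector database you use.

```python
def index_with_summaries(clips, summarize, embed, index_store):
    """Generative feedback loop sketch: summarize each raw podcast clip with an
    LLM, then index the summary (keeping a pointer to the raw clip) so search
    runs over the cleaner, transformed text."""
    for clip_id, transcript in clips:
        summary = summarize(transcript)   # LLM call
        vector = embed(summary)           # embedding call
        index_store(
            {"clip_id": clip_id, "summary": summary, "raw": transcript},
            vector,
        )
```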
There’s also the ability to extract structured data from the text chunks, which makes chunking text with vector databases a deep topic. The headline here is that technologies like Portkey and the overall trend of LLM inference costs decreasing are providing more options. For instance, we have the latest Llama 2 model, and before that, the MPT 30 billion model was released. As these technologies continue to become cheaper, I believe concepts like the tree of thoughts and generative feedback loops are currently hindered by their cost.
Rohit: 100%. I think it's about the cost of inference, as well as the current resistance from people to using multiple language models. However, I believe that over the next few months we'll start to see people utilize maybe two, three, or even 20 different models for various use cases. There will likely be some orchestration between them to ensure optimal performance based on a sort of CAP theorem for LLMs: balancing cost, accuracy, and performance.
You’ll have a small router in the middle, distributing calls to the appropriate models. This approach will enable companies to collect and store more data. As LLM inference costs continue to decrease, I anticipate that vector searches will also become cheaper, leading people to want to store and search through larger datasets.
Host: Yeah, I definitely think of this as speculative design theory, where we’re looking far into the future. What you mentioned about managing 15 to 20 model inferences—whether that’s using GPT cloud or introducing fine-tuned models—is absolutely a massive emerging space in the AI market.
Thank you so much for joining the podcast! This was an incredible discussion, and I had a lot of fun picking your brain on these topics. You've definitely changed my mind on semantic caching; I now see it as very useful. I’m excited to follow Portkey and see how all these developments unfold.
Rohit: Absolutely. It was great chatting with you. As I was talking with you, a lot of new ideas popped into my head, so I'm going to write about them now. But thanks so much for inviting me, Connor; it was amazing doing this with you.