
Deep Papers
Deep Papers is a podcast series featuring deep dives on today’s most important AI papers and research. Hosted by Arize AI founders and engineers, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning.
LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection
For this week's paper read, we actually dive into our own research.
We wanted to create a replicable, evolving dataset that can keep pace with model training so that you always know you're testing with data your model has never seen before. We also saw the prohibitively high cost of running LLM evals at scale, and have used our data to fine-tune a series of SLMs that perform just as well as their base LLM counterparts, but at 1/10 the cost.
So, over the past few weeks, the Arize team generated the largest public dataset of hallucinations, as well as a series of fine-tuned evaluation models.
We talk about what we built, the process we took, and the bottom line results.
📃 Read the paper: https://arize.com/llm-hallucination-dataset/
Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
John Gilhuly: Hello, everybody! Good to see people. Alright, as we're maybe waiting for people to jump in, we can do a little quick intros before we go through.
Happy to kick that off from my side. I'm John, for anyone I haven't met yet. I run the devrel team here at Arize, so I get to do a lot of different things with the technical, field-facing work that we do. One part of that is some research, which is what we're going to talk about a little bit more today. So excited to get into all that as well. And, Julia, I'll let you do a quick intro, too.
Julia Gomes: Hi, everyone! I'm Julia. I'm a product manager at Arize, and I spend a lot of time working on evals and thinking about evals. So this is a fun one to dive into. And before this, I actually worked in ML research in the self-driving space. So it's been fun kind of getting back into this world.
John Gilhuly: Awesome. Yeah. So today is going to be an interesting one. A little bit of an intro as we have everybody joining: we're actually going to be going through a paper that we have written ourselves. This is a white paper that we put together that details some research we've gone through over the past six months at this point. It's taken us quite a bit of time to put everything together, and it's an area we're really excited to dive into. It's been a team of people pulled from different parts of the company doing a lot of the work on what we're calling the LibreEval project.
So very excited to share this with everybody. This is kind of the launch of this in a way. We actually just put up the website and released the paper, so you're getting a preview of this right as it goes live; we timed it around this session. As always, as we're going through, feel free to drop questions or any comments you might have in the chat, and we'll tackle questions as we go rather than saving them all for the end.
Okay, so we can probably jump in as we keep going. There's a link to our paper that just got dropped in the chat. It will link you to a homepage for the website, and then you can click on the view the paper link there to see the full research paper.
So a little bit of background on what we did here. Rewind about five or six months ago: a group of us on our side were talking about the validity of benchmarks, some of the benchmarks that were already in the space, and how much those related to real-world use cases or usage.
And we started by thinking about this idea of hallucination evaluation, which is really almost like context adherence evaluation. It's one of the most popular evaluations we see people doing on our platform. The idea is, if you're running a RAG-based system, you've retrieved some context, you ask a question about that context, and then you generate an answer based on the context and the question. And one of the really common evals is this hallucination eval or context adherence eval...
...of whether or not anything in the answer is not supported by the retrieved context. It's one of the most common ones we see. There are some existing benchmarks for this, things like HaluEval, or HotpotQA, which is a dataset that gets used to test this and is used in one of our evals as well. And so we were thinking, hey, these existing benchmarks are good, but they've now been out there for a long enough period of time...
...where they may have actually been subsumed into the training data of some of the more popular models in the space. So we were thinking, hey, the odds of GPT-4o being trained on HaluEval are pretty high, and so when it's run against that benchmark it may show strong performance, but that may not translate into real-world performance. So we had this idea that these benchmarks are maybe getting stale over time as they get incorporated into the training data of large models.
And then the other big problem that we were seeing is that a lot of our enterprise users and customers who are running large-scale evals were mentioning that it was getting prohibitively expensive to run large-scale LLM-based evaluation, especially if you're using top-of-the-line models. If you're using Claude 3.5 Sonnet, or Claude 3.7 Sonnet now, these big models, then you're essentially doubling your inference costs unless you're sampling by some amount. And so there was a big push that we were starting to see towards using smaller models to actually evaluate your application. With all this in mind, we figured we would try to take our own stab at creating a few things.
So this is all stuff that we're releasing now; it's all out in the world as of today. There's an OSS repo that we've created, linked from that page, which has all of the code we used to generate the datasets as well as to train some models to address the problems we were talking about.
We also are releasing a dataset which, at this point, is the largest open-labeled hallucination dataset. It's about 77,000 rows, I think, or somewhere in the 70,000s.
And then we are also releasing some fine-tuned models that are trained on this data that we've generated to be able to detect hallucinations. And we'll go more into this, but we've specifically selected 1.5 billion parameter models. So pretty small models to try to get them to perform well on this particular task.
And so we're going to go through a little bit of the process we used to do this today. Again, this is all available in the repo. So you could use this as is, you could actually extend it from there if this is something that is relevant to work that you guys are all doing.
And so the process that we used to generate the data sets is as follows: so we would start with a config. And what we did was we started with a base website that you could point to and say, hey, this is going to be my context. Really, all we're doing there is scraping the website. We're scraping a lot of different pages on the website and selecting paragraphs from them that we're going to use as context. So you can substitute that in for your own text if you wanted to, but in our case we would start with the base web URL.
And then we had a set of configuration parameters: what language do we want to generate the row in, and are we going to encourage hallucinations or not? So we had a concept of synthetic versus non-synthetic.
And so what that meant for us was, do we encourage the model to hallucinate in its response, or do we let it happen naturally? And then, what LLM do we want to use to generate the row, what LLMs do we want to use to judge whether the row was a hallucination or not, and then a few other configuration parameters around question types, hallucination types, and things that we can get into.
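As a rough illustration of what one of these generation configs might look like, here is a minimal sketch. The field names below are hypothetical placeholders, not the actual schema used by the LibreEval repo.

```python
import json

# Hypothetical generation config -- field names are illustrative,
# not the exact schema from the LibreEval repo.
config = {
    "base_url": "https://docs.example.com",   # website to scrape for context
    "language": "en",                         # language to generate rows in
    "synthetic": True,                        # True = explicitly encourage hallucinations
    "generator_model": "gpt-4o",              # model that produces the question/answer pair
    "judge_models": [                         # council of judges that label each row
        "gpt-4o",
        "claude-3-5-sonnet",
        "llama-3.2",
    ],
    "paragraphs_per_page": 2,                 # how many passages to pull from each page
    "question_types": ["factual", "comparative"],
    "hallucination_types": ["entity_error", "relation_error", "incompleteness"],
}

with open("generation_config.json", "w") as f:
    json.dump(config, f, indent=2)
```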
And so that config would then be executed. Here we would scrape data from the website, take what we call the most interesting passage from each page of that website, and then use that to generate a RAG pairing. So we generate a question about that particular paragraph...
...and then generate an answer to that question. And then, finally, we would have three different judges that would all label whether or not that particular example was a hallucination, and we would take the majority vote and use that as the associated label. In the end, we generated 70,000-plus rows of this data across a whole bunch of different websites, domains, and models. And again, you can trigger this within the repo just by providing one of these configs as a JSON.
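To make the labeling step concrete, here is a minimal sketch of majority voting across a council of judges. The `call_judge` callable is a hypothetical stand-in for whatever LLM client call the repo actually makes; it is not an Arize or LibreEval API.

```python
from collections import Counter
from typing import Callable, Sequence

def majority_label(context: str, question: str, answer: str,
                   judges: Sequence[str],
                   call_judge: Callable[[str, str, str, str], str]) -> str:
    """Label a RAG row by majority vote across a council of LLM judges.

    `call_judge(judge_model, context, question, answer)` is a hypothetical
    stand-in that prompts the named judge and returns 'hallucinated' or
    'factual'.
    """
    votes = [call_judge(judge, context, question, answer) for judge in judges]
    label, _count = Counter(votes).most_common(1)[0]
    return label

if __name__ == "__main__":
    # Dummy judge that always answers 'factual', just to show the call shape.
    dummy = lambda model, ctx, q, a: "factual"
    print(majority_label("some context", "a question", "an answer",
                         ["judge-1", "judge-2", "judge-3"], dummy))
```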
I see a question popping in: how can you tell what is the most interesting part of that website? Do you scrape everything and then let an LLM decide? What does the process look like?
So we did let an LLM decide what the most interesting passage was. Basically, we would scrape a web page, and there's actually a config variable that says how many paragraphs we choose from each web page; we'd do one or two. So we'd scrape the web page, and then we'd have an LLM decide what the most relevant or interesting paragraph on that page was, which we would then use as the context. It's a great question.
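A minimal sketch of that selection step, assuming a generic `call_llm` completion function (a hypothetical stand-in, not the repo's actual code or prompt):

```python
def pick_most_interesting_paragraph(page_text: str, call_llm) -> str:
    """Ask an LLM to pick the most 'interesting' passage from a scraped page.

    `call_llm` is a hypothetical stand-in for a completion call that takes a
    prompt string and returns a string; the prompt below is illustrative.
    """
    paragraphs = [p.strip() for p in page_text.split("\n\n") if len(p.strip()) > 200]
    if not paragraphs:
        raise ValueError("no sufficiently long paragraphs found on this page")
    numbered = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs))
    prompt = (
        "Below are numbered paragraphs scraped from a web page. "
        "Reply with only the number of the paragraph that is the most "
        "substantive and interesting to ask questions about.\n\n" + numbered
    )
    choice = int(call_llm(prompt).strip())
    return paragraphs[choice]
```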
And I should mention we used a combination of different models for this. For the generation process and the RAG process, we would use GPT-4o, Claude 3.5 Sonnet, and a Llama 3.2 model to do the whole RAG generation process that we talked about here.
And so we used that process to generate a whole bunch of different datasets, or one large dataset, depending on how you want to think about it. Each example in the dataset has an input, a context, and an output, and then a label attached to it, along with an explanation. There are actually a bunch more columns in the data as well, around some of the configuration and things like that, but these are the key ones you would see. And we generated data across a bunch of different languages: English, Japanese, Korean, Spanish, French, Chinese, and Portuguese. Most of the dataset we generated in English, but we covered a few different domains with each of those languages as well. So we ended up with almost 54,000 English examples here.
And when we look at the data itself, we have a couple of different distributions here. I mentioned before that we had this idea of synthetic and non-synthetic: synthetic means we actually instructed the model to hallucinate in the response it gave back for the RAG system.
Non-synthetic means we did not instruct the model to hallucinate at all; we just ran a normal RAG pipeline and let hallucinations occur naturally. That's obviously a much lower yield of hallucinations, but we were still able to get a fair amount that way, and we have about a 60/40 split of synthetic versus non-synthetic examples in the dataset. You can see there's a split of labels, factual versus hallucinated, there as well. And then we had an idea of hallucination types, which I'll talk more about in a second, and then we looked at the language split as well.
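To illustrate the difference between the two generation modes, here is a sketch of how the answer-generation prompt might differ. These templates are illustrative only, not the actual prompts from the LibreEval repo.

```python
from typing import Optional

def build_answer_prompt(context: str, question: str, synthetic: bool,
                        hallucination_type: Optional[str] = None) -> str:
    """Illustrative prompt construction for synthetic vs. non-synthetic rows."""
    base = (
        f"Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n"
    )
    if synthetic:
        # Synthetic mode: explicitly ask the model to introduce a specific
        # kind of unsupported claim, so a hallucination is very likely.
        return base + (
            f"\nWhen answering, deliberately introduce a {hallucination_type} "
            "error: include at least one claim that is not supported by the context."
        )
    # Non-synthetic mode: a normal RAG answer; any hallucination occurs naturally.
    return base
```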
So a little bit more on the types. This is something we actually pulled from another paper, in terms of the distribution of different hallucination types and what each type meant. Essentially, we took the outputs from our whole dataset and ran a model over them to categorize those outputs into different types. Those types were entity error hallucinations, relation error hallucinations, and incompleteness hallucinations, where you can see some of the split, as well as outdatedness hallucinations, overclaims, and unverifiable information.
And we tried to set this up in a way where we would have an even split and even distribution between each of these types. So we actually had different prompts for the synthetic cases, where we would try to trigger different types of hallucinations, and we tried to do that equally. But when we analyzed it afterwards, a lot of the responses tended towards the types you see here: we got many more relation error and incompleteness hallucinations than we did entity error, for example. So that was interesting to see, that there is a tendency to hallucinate in very specific ways, despite the instructions we gave the model.
One other interesting thing we did: we wanted to get a sense of how our LLMs as judges, our council of judges (we had three judges labeling each example), compared to human labels. So we took a sample of the data and had some human labelers, through a labeling service, actually label that data. Then our team manually went and checked the accuracy of the human labels and the LLM council labels, and we actually found that the LLM council labels were more accurate, at least when compared to our team's judgment of what the right answer was.
So we found they were a little bit more accurate than the human labels, and we actually did two rounds of this test. In one round, we gave the human labelers just the input, context, and output and had them choose hallucinated or factual. The second time, we also gave them the LLM council's output and told them, hey, this is what the LLM council said, is that right or wrong? What was interesting is that in the second case they almost always agreed with the LLM council, and they actually ended up getting a more accurate result.
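The comparison itself is straightforward to compute. Here is a minimal sketch, assuming hypothetical column names for a dataframe that holds the team's adjudicated ground truth alongside both sets of labels; these names are not the dataset's actual schema.

```python
import pandas as pd

def compare_label_sources(df: pd.DataFrame) -> dict:
    """Accuracy of LLM-council labels vs. human-service labels.

    Assumed (hypothetical) columns: 'ground_truth' is the team's adjudicated
    label, 'council_label' is the majority vote of the three LLM judges, and
    'human_label' comes from the external labeling service.
    """
    council_acc = (df["council_label"] == df["ground_truth"]).mean()
    human_acc = (df["human_label"] == df["ground_truth"]).mean()
    return {"council_accuracy": council_acc, "human_accuracy": human_acc}
```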
So I will say, take this with a grain of salt. This is not to say that an LLM council will always be better than humans. We used a labeling service, so we didn't have a lot of control over who our labelers were or how much time they spent on each case. There were a lot of unknowns in this setup, and I think we could probably have increased the human labeler accuracy if we did it in a more controlled way, but it was an interesting finding for our team to see.
Awesome. So that's a little bit about the dataset we generated. Then a little bit about model performance on this dataset. From the dataset, we actually tuned some of our own models: we tuned some GPT-4o mini variants and some Qwen 1.5B variants, and then we tested them against each other along with some base models as well.
And so here's a little bit of the data you can see across the synthetic and non-synthetic split. This was an interesting finding: obviously, our fine-tuned GPT-4o mini, for example, topped performance when tested against our own dataset's test set. But the other interesting thing is that there is a fairly large difference between non-synthetic and synthetic performance, that is, in models' abilities to identify synthetic versus non-synthetic hallucinations. To say this another way, existing models had a much harder time identifying hallucinations when they were naturally occurring, as opposed to ones we had encouraged the models to make.
So that was an interesting finding, and it somewhat supports the idea we had that certain existing models are trained on some of this data already, or are more readily able to recognize that kind of information.
So beyond that, we did all kinds of splits within our data, and you can see those in the paper if you want to go through them in more detail. We also checked the language split: we compared English versus the non-English samples we had, and interestingly, we saw roughly equal performance across those, which was surprising to us. We thought the English performance would be higher, but we actually saw fairly equal performance there.
And then, across our testing, at least when tested on our own test set, we started to see different tiers of models. The best performing models were either our GPT-4o mini fine-tune, or GPT-4o and Claude 3.5 Sonnet, kind of your top LLMs at the time. Right below those were our fine-tunes of the Qwen models, so a smaller model that we fine-tuned was able to improve its performance there. And then the third tier would be the existing SLMs, the base GPT-4o mini, the base Haiku, at least on our own data, which was a promising result.
However, when you start to look at some outside datasets and test against those, we saw a different set of results. For example, this is a little bit blurry, so apologies for the blurry image, but you can see the different datasets we tested against along the bottom. LibreEval was our own test set, and then we also tested against a set of datasets used in an Arize paper, as well as HaluEval, and those different splits there too.
And so you can start to see the differences in model performance as we go through that. The GPT-4o mini fine-tune, for example, beats out its base version on our LibreEval dataset, and then in some cases does slightly worse than its base model across some outside datasets, which is interesting. Again, we think this might be because some of those outside datasets are actually already in the training data for that particular model, which means that by fine-tuning it, we kind of confused it by adding extra information compared to how it originally treated those examples.
And then our Qwen fine-tunes performed more in the range of what we saw from something like GPT-3.5 Turbo when testing on outside datasets. So again, we're excited to see how this continues into the real world and whether it backs up some of our other conclusions.
And so, last little bit on this before I turn it over to Julia: a few key takeaways that we saw. There were some interesting elements of where models performed better on the synthetic versus non-synthetic examples. It was interesting to see that language didn't have as big of an effect across these different models. And then, on the fine-tuning side, our general takeaway is that fine-tuning is actually really powerful for bringing some of these small models, these 1.5-billion-parameter models or GPT-4o minis, for example, up to a reasonable or even above-reasonable performance level.
And one thing to call out here is that the cost for us to fine-tune one of these Qwen models, for example, was about $30. We did not spend a lot of money to fine-tune these. Once we got our parameter sets and everything like that, we were tuning on 40,000 examples. We used Together AI to do this, and we were able to tune it for a pretty cheap cost. And with that, you're able to reduce your inference cost by potentially a factor of 10 if you're comparing to GPT-4 or some of these top-of-the-line models. So we do think this presents a fairly strong case for a much more cost-effective way to run some of these evaluations.
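As a back-of-the-envelope illustration of that roughly 10x savings claim, here is a quick cost sketch. The per-token prices and volumes below are hypothetical placeholders; substitute your provider's actual pricing.

```python
# Back-of-the-envelope eval cost comparison (all numbers are placeholders).
EVALS_PER_DAY = 100_000
TOKENS_PER_EVAL = 1_500            # prompt + completion for one judge call

PRICE_PER_1K_TOKENS = {
    "frontier_llm_judge": 0.01,    # e.g. a top-of-the-line hosted model
    "fine_tuned_1_5b_slm": 0.001,  # e.g. a small, cheaply hosted model
}

for model, price in PRICE_PER_1K_TOKENS.items():
    daily_cost = EVALS_PER_DAY * TOKENS_PER_EVAL / 1_000 * price
    print(f"{model}: ~${daily_cost:,.0f}/day")
# With these placeholder prices, the SLM judge is ~10x cheaper per evaluation.
```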
There are a few things we could tackle as future work: improving the difficulty of our dataset, using our human labelers in a more focused way, and potentially expanding into some other evaluation tasks. But we also spent a good amount of time looking at how we could incorporate this into improvement cycles within our own product. So for that, I'm going to hand it over to you, Julia.
Julia Gomes: Awesome. I also see we have one question in the chat that we can quickly address first, about using this on scientific documents with equations and some graphs. I do think this workflow should work for that. We're basically able to fine-tune on any sort of data that you have, so it's fully customizable, and we actually use this with a lot of scientific data, in health sciences and other areas. John, I don't know if you have anything to add to that.
John Gilhuly: No, I think that covers it. We could use it for that. You'd have to change some of the process and code we had for it, because we hadn't set it up specifically to work with images, but it would just be a matter of plumbing the data together.
Julia Gomes: Well, now I'll dive into the next steps. As a product manager and someone who formerly worked in a research role, it's really exciting for me to see when you can take cutting-edge research and turn it into a really cool product. So I'm going to talk through some of the next steps we have for integrating this LibreEval project into the Arize platform, specifically our online evaluation platform, and ultimately turning this into a data flywheel. The idea behind a data flywheel is that you're constantly collecting new data that's used to improve your model.
So this is an image of what our platform looks like. Here we have online evaluations. With online evaluations, you can use an LLM judge to continuously evaluate the data that you have in production. So let's say you have a chatbot that's deployed, maybe a travel-agent chatbot that helps customers book flights and hotels. You can continuously have these evaluations running to evaluate that chatbot. In this paper, we've been focusing on hallucinations, so in this case we have a hallucination template, and we're constantly evaluating whether each example is hallucinated or factual. This is a way to collect a lot of data over time, and a lot of our customers are using it. But one challenge, which John was mentioning earlier, is that if you're using models like GPT-4o, it's very, very expensive. And in addition to being expensive, it's also not customized to the specific types of data and patterns that you have for your chatbot.
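As a rough sketch of what that continuous online-evaluation loop boils down to, here is a minimal outline. The callables (`fetch_recent_traces`, `judge`, `log_label`) are hypothetical stand-ins, not Arize APIs.

```python
import time

def run_online_hallucination_evals(fetch_recent_traces, judge, log_label,
                                   poll_seconds: int = 60) -> None:
    """Sketch of a continuous online-evaluation loop.

    Hypothetical callables: `fetch_recent_traces` yields new
    (trace_id, context, question, answer) tuples from production, `judge`
    runs the hallucination judge on one example, and `log_label` writes the
    resulting label back to your observability platform.
    """
    while True:
        for trace_id, context, question, answer in fetch_recent_traces():
            label = judge(context, question, answer)  # 'hallucinated' or 'factual'
            log_label(trace_id, label)
        time.sleep(poll_seconds)
```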
So we're trying to integrate this with LibreEval, because that has a few advantages. One is that it will be much more cost-effective, so you can label all of your data, not just a subset, at a very low cost. You can also enhance model performance by fine-tuning on your own custom data, so it can be very personalized. So if it's something like scientific documents, where maybe the large foundation models don't have tons of examples for training, you can fine-tune on those specific types of documents.
And again, this just shows, after you've set up that evaluation, the types of labels you would see in our platform. Here you can see examples labeled as either factual or hallucinated, and it's really easy to quickly scroll through and find all the examples where there was a hallucination.
I saw a question in the chat: who does the labeling? In this case, we're actually looking at labels from an LLM judge, so it would be the LibreEval model, ideally fine-tuned on your own custom data, that's providing these hallucinated and factual labels. And then, later on, we'll see that examples can also go to human labelers in a labeling queue. So this is what the label looks like when you click into one specific example.
Here we have inputs and outputs, and then the label provided by the LLM judge, which says either factual or hallucinated. We also have a correctness evaluator here. And then, in addition to that label, you get an explanation. The explanation is basically the LLM explaining why it chose to label this example as factual or hallucinated, and it usually references some of the input text in its reasoning. In addition to adding explainability and observability around why the LLM judge is doing what it's doing, the explanation also adds a reasoning component to the model, so the model is more likely to provide an accurate label because it thought about the answer first.
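As a sketch of the explanation-then-label idea, here is an illustrative judge prompt and a simple parser. This is not the Arize hallucination template; the wording and `call_llm` stand-in are assumptions.

```python
JUDGE_TEMPLATE = """You are checking a RAG answer for hallucinations.

Context:
{context}

Question: {question}
Answer: {answer}

First write a short explanation of whether every claim in the answer is
supported by the context, quoting the relevant parts of the context.
Then, on the final line, write exactly one word: factual or hallucinated.
"""

def judge_with_explanation(context: str, question: str, answer: str, call_llm):
    """`call_llm` is a hypothetical stand-in for an LLM completion call."""
    response = call_llm(JUDGE_TEMPLATE.format(
        context=context, question=question, answer=answer))
    # Everything before the last line is the explanation; the last line is the label.
    explanation, _, label = response.strip().rpartition("\n")
    return label.strip().lower(), explanation.strip()
```

Asking for the explanation before the label is what gives the reasoning effect Julia describes: the label is conditioned on the model's own written justification.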
For human labelers, this is a question that came up in chat about who is doing the human labeling, and I'll get to it in a minute. But we actually have a way to label the data inside of the Arize platform: we have these annotation queues that people can use. So you never even have to leave Arize in order to set up this full pipeline, although for the project we also used Labelbox.
So this goes over what a data flywheel would look like in this context. There are two flywheels, and we'll mostly be focusing on the one on the right. On the left, there's a loop to continuously improve your production agent, which would be the chatbot you have in production. The idea is that you're continuously collecting data from this chatbot, running these evals with an LLM as a judge (those are the labels you just saw), adding that to a dataset, and then either fine-tuning or optimizing the prompt for this model so that you can continuously improve your agent.
Now, on the right, this is what we're going to be focusing on: there's a flywheel to continuously improve the judge, which would be the LibreEval model we were discussing earlier. With this loop, you identify the failure modes for the LLM judge: examples where multiple judges disagree, or where there's a low confidence score associated with the label. Then you send those to human labelers, and you use those examples to fine-tune your LLM judge or optimize the prompt. After that, you review the experiments with the fine-tuned model and then redeploy. The idea is that this loop keeps happening continuously, so the LLM judge gets better and better over time on its own.
In this slide, I'm going to go over how you can identify examples that are challenging for an LLM judge. One thing you can do that's fairly low cost is use the log probabilities output by the model to compute a confidence score. Especially with fine-tuned models, unlike closed-source models, you'll always have access to those log probabilities, and you can do some quick computations to get essentially a confidence score for how confident the model is in the label it's providing. In this example, we can see that the hallucination label was provided with a 55% confidence score, which is pretty low. So this would be a good candidate to add to a dataset for human labeling, so that a human can determine what the ground truth should have been.
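One simple way to turn log probabilities into a confidence score, as a sketch: sum the log probabilities of the tokens that make up the emitted label and exponentiate. The data structure here is an assumption about what your model's logprob output looks like, not a specific provider's API.

```python
import math

def label_confidence(label_token_logprobs):
    """Confidence score for an emitted label from its token log probabilities.

    `label_token_logprobs` is assumed to be the list of log probabilities for
    the tokens of the label string (e.g. the tokens of 'hallucinated'), as
    returned by a model that exposes logprobs. Summing log probabilities and
    exponentiating gives the joint probability of that label string.
    """
    return math.exp(sum(label_token_logprobs))

# Example: two label tokens with logprobs of -0.4 and -0.2
print(f"confidence = {label_confidence([-0.4, -0.2]):.2f}")  # prints 0.55
```

A score like 0.55 would be flagged as low confidence and routed to the human labeling queue described next.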
And this is what the labeling queues look like in Arize. You can take any team of annotators and set them up to annotate every example that comes into a specific dataset. You can have multiple annotators annotating each record, and they'll indicate whether or not they think the example was hallucinated or factual. They can also add other information, like notes explaining why they think that's the case, and all of this data can later be used to fine-tune and improve the LLM judge.
And this is an example of the sorts of integrations we'll be doing in Arize for the fine-tuning part. Here is what the integration would look like with NVIDIA NeMo, which is one of our partners. You can specify a training, validation, and test set, and use the dataset that you just labeled with human annotations to continually fine-tune, in a purely UI-based flow, the fine-tuned model produced in LibreEval. You can specify the batch size, the learning rate, and other parameters all in the UI, which enables fairly non-technical people to contribute to this workflow of improving LLM judges through fine-tuning.
And then at the end, once you kick off these fine-tuning jobs, you'd be able to see the results as an experiment in our UI, so we're really trying to make this a purely UI workflow. What we're looking at here is the original output from the LLM judge before the fine-tuning job, then the new output from the LLM judge after the fine-tuning job. These are all the hard examples we were looking at earlier, the examples where the judge wasn't very confident initially, and we can see how this compares with the human annotations. Ideally, we want the labels from the LLM judge to match the human annotation labels, and here we can see that fine-tuning this model results in labels that are more aligned with the humans, which means our model is getting better and better over time with these fine-tuning jobs.
John Gilhuly: So I think that is all we have in terms of content today. If there are any other questions, feel free to throw them into the chat here. This is something we've obviously been working on for a long time, and if you've been joining in for any of our other content over the past few weeks, and some of the stuff we have upcoming, we're starting to talk a lot more about optimization, continuous improvement, and some of these automated improvement techniques. So very excited to see some of this coming into the product, and excited to share some of this research with you. Please check out the site and the paper on there. We're also going to be hosting the fine-tuned model on Together AI, so it'll be accessible through Together in the next few days...
...and we'll have that available for you. So if you want to play around with the model or anything like that, you're welcome to check it out, and we would love any feedback you have. Again, we see this as the baseline that we're going to use for the continuous fine-tuning that Julia talked through.
And yes, there's a question of whether we'll share the recording. We're going to share the recording by email and also within our community Slack. So if you want to join our community Slack, feel free to scan the code on the right there; we're all in there, and you can ask any follow-up questions you might have and follow along with us as we continue to improve these models.
And with that, Julia, any last thoughts, anything else you want to say before we wrap it up here?
Julia Gomes: No, I think that's a great summary, and I really appreciate all the conversations in the chat.
John Gilhuly: Thanks everybody for the questions, and we'll see you in a couple of weeks for our next paper reading.