What’s the best way to perform search queries in documents with generative AI?

There are a number of use cases that require searching for and summarizing dynamic data. Examples include news websites, which require searching for news articles, or recruitment platforms, which require searching for candidates’ CVs.

The basic way to manage this need is to take the document as is and index it. However, the resulting prompt may not be the best one. Relying on associations, i.e., vector search, which matches on words that merely look familiar, might also result in hallucinations.

Instead, it is recommended to perform data preparation: processing, structuring and keyword searching, based on the document’s structure.

Documents have a certain structure: a header, a body, tables, etc. Identify this structure. Then, determine which sections in the document hold the information you want to serve to your users, and either label them or attach a question to each one. The paragraph within that section then serves as the answer. This way, the LLM knows which section is the relevant one. Understanding the document structure is very important, because it reduces the chances of hallucinations and saves resources.
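The labeling step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes headings are lines starting with "# ", and the `QUESTION_LABELS` mapping, the `Section` class and both function names are hypothetical, chosen here for a CV example.

```python
import re
from dataclasses import dataclass


@dataclass
class Section:
    heading: str
    question: str  # label attached during data preparation
    answer: str    # the paragraph found under the heading


# Hypothetical mapping from section headings to the questions users ask.
QUESTION_LABELS = {
    "experience": "What work experience does the candidate have?",
    "education": "What is the candidate's education?",
    "skills": "What skills does the candidate have?",
}


def split_into_sections(text: str) -> list[Section]:
    """Split on '# ' heading lines and keep only the sections we have labels for."""
    # The capture group makes re.split keep each heading in the result:
    # [preamble, heading1, body1, heading2, body2, ...]
    parts = re.split(r"^# (.+)$", text, flags=re.MULTILINE)
    sections = []
    for heading, body in zip(parts[1::2], parts[2::2]):
        key = heading.strip().lower()
        if key in QUESTION_LABELS:
            sections.append(Section(heading.strip(), QUESTION_LABELS[key], body.strip()))
    return sections


def keyword_search(sections: list[Section], query: str) -> list[Section]:
    """Rank sections by how many query words appear in their heading or question."""
    query_words = set(re.findall(r"\w+", query.lower()))
    scored = []
    for section in sections:
        label_text = f"{section.heading} {section.question}".lower()
        overlap = len(query_words & set(re.findall(r"\w+", label_text)))
        if overlap:
            scored.append((overlap, section))
    # sorted() is stable, so ties keep their document order.
    return [section for _, section in sorted(scored, key=lambda pair: -pair[0])]
```

Only the top-ranked section's answer would then be passed to the LLM, rather than the whole document, which is where the savings in resources come from.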

If the data is updated regularly, it’s important to also version the data, just like in data lakes and feature stores. Be wary of rushing to build a demo and abandoning engineering practices. When versioning, it’s important to record which version of the data the index search is using (through labels and keyword searches), perform rolling upgrades, apply metadata-based filtering, and more.
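As a rough illustration of version labels, rolling upgrades and metadata-based filtering, here is a small in-memory sketch. It assumes a plain keyword match stands in for the real index, and the `VersionedIndex` class and its `add`, `promote` and `search` methods are hypothetical names rather than any library's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IndexedChunk:
    doc_id: str
    text: str
    metadata: dict  # e.g. {"version": "v1", "section": "skills"}


@dataclass
class VersionedIndex:
    chunks: list = field(default_factory=list)
    active_version: Optional[str] = None

    def add(self, doc_id: str, text: str, version: str, **metadata) -> None:
        """Index a chunk and label it with the data version it came from."""
        self.chunks.append(IndexedChunk(doc_id, text, {"version": version, **metadata}))

    def promote(self, version: str) -> None:
        """Rolling upgrade step: switch reads once the new version is indexed."""
        if not any(c.metadata["version"] == version for c in self.chunks):
            raise ValueError(f"no chunks indexed for version {version!r}")
        self.active_version = version

    def search(self, keyword: str, **filters) -> list:
        """Keyword match restricted by metadata; defaults to the active version."""
        wanted = {"version": self.active_version, **filters}
        return [
            c for c in self.chunks
            if keyword.lower() in c.text.lower()
            and all(c.metadata.get(k) == v for k, v in wanted.items())
        ]
```

Because new chunks are added next to the old ones and only become visible after `promote`, queries keep hitting a consistent version while an update is in progress.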
