Advanced configuration

After content is uploaded to the knowledge base, it undergoes chunking and data cleaning. This stage can be understood as content preprocessing and structuring.

What is text chunking and cleaning

Chunking: Because LLMs have a limited context window, the full text usually has to be split into segments, and only the segments most relevant to the user's question are recalled; this is known as the segment TopK recall mode. In addition, an appropriate segment size helps match the most relevant text content when semantically comparing user questions with text segments, and reduces information noise.
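
To illustrate the idea (a generic sketch, not Vord's internal chunker), a fixed-size splitter with a small overlap might look like the following; the chunk size and overlap values are arbitrary examples:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with a small overlap so that a
    sentence cut at a boundary still appears in the neighbouring chunk."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so consecutive chunks overlap
    return chunks
```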

Cleaning: To ensure the quality of text recall, the data usually needs to be cleaned before it is passed to the model. For example, unwanted characters or blank lines in the text may degrade the quality of the response. To help users solve this problem, Vord provides several cleaning methods to tidy the text before it is sent to downstream applications; see the ETL section for more details.
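
To show the kind of cleanup involved (example rules, not Vord's exact rule set), a cleaning step might strip control characters, collapse repeated whitespace, and drop blank lines:

```python
import re

def clean_chunk(text: str) -> str:
    """Remove noise that hurts retrieval quality (illustrative rules only)."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)  # strip control characters
    text = re.sub(r"[ \t]+", " ", text)                        # collapse runs of spaces/tabs
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)           # drop blank lines
```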

Two strategies are supported:

  • Automatic mode

  • Custom mode

Automatic

Automatic mode is designed for users unfamiliar with segmentation and preprocessing techniques. In this mode, Vord automatically segments and cleans content files, streamlining the document preparation process.


Indexing Mode

Choose an indexing method for the text to determine how the data will be matched. The indexing strategy is closely tied to the retrieval method, so select retrieval settings appropriate to your scenario.

  • High-Quality Mode

  • Economical Mode

  • Q&A Mode (community version only)

In High-Quality mode, the system first uses a configurable Embedding model to convert chunk text into numerical vectors. This enables efficient compression and persistent storage of large-scale textual data, while improving the accuracy of LLM-user interactions.
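
Conceptually, high-quality indexing embeds every chunk and persists the resulting vectors. The sketch below uses a toy hashed bag-of-words as a stand-in for the configured Embedding model and an in-memory array in place of a vector database:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for the configured Embedding model: a hashed
    bag-of-words, normalized to unit length."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def build_vector_index(chunks: list[str]) -> np.ndarray:
    """Convert every chunk into a vector; a real deployment would persist
    these in a vector database rather than an in-memory array."""
    return np.stack([embed(chunk) for chunk in chunks])
```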

The High-Quality indexing method offers three retrieval settings: vector retrieval, full-text retrieval, and hybrid retrieval. For more details on retrieval settings, please check "Retrieval Settings".


Retrieval Settings

In high-quality indexing mode, Vord offers three retrieval settings (a minimal hybrid-retrieval sketch follows the list):

  • Vector Search

  • Full-Text Search

  • Hybrid Search
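
Hybrid search blends the two signals above. A minimal sketch, assuming unit-normalized chunk vectors and precomputed full-text keyword scores, with an arbitrary 50/50 weighting (not necessarily how Vord weights them):

```python
import numpy as np

def hybrid_scores(query_vec: np.ndarray,
                  chunk_vecs: np.ndarray,
                  keyword_scores: np.ndarray,
                  alpha: float = 0.5) -> np.ndarray:
    """Blend semantic similarity (vector search) with keyword relevance
    (full-text search); alpha is an example weight, not a Vord default."""
    vector_scores = chunk_vecs @ query_vec  # cosine similarity for unit vectors
    return alpha * vector_scores + (1 - alpha) * keyword_scores
```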

In the Economical indexing mode, Vord offers a single retrieval setting:

Inverted Index:

An inverted index is an index structure designed for rapid keyword retrieval in documents. Its fundamental principle is to map keywords in documents to the lists of documents that contain them, thereby speeding up search. For a detailed explanation of the underlying mechanism, please refer to "Inverted Index".
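
A minimal illustration of the principle, mapping each keyword to the set of document IDs that contain it:

```python
from collections import defaultdict

def build_inverted_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each keyword to the IDs of the documents containing it, so a
    keyword lookup never has to scan every document."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for token in set(text.lower().split()):
            index[token].add(doc_id)
    return index

docs = {"a": "vector retrieval and embedding", "b": "keyword search"}
print(build_inverted_index(docs)["retrieval"])  # {'a'}
```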

TopK:

This parameter selects the text chunks most similar to the user's question. The system dynamically adjusts the number of chunks according to the context window of the selected model. The default value is 3; a higher value retrieves more text segments.
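
For example, given precomputed chunk vectors, TopK selection keeps only the k highest-scoring chunks (a generic sketch, with k defaulting to 3 as described above):

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray,
                 chunk_vecs: np.ndarray,
                 chunks: list[str],
                 k: int = 3) -> list[str]:
    """Return the k chunks whose vectors score highest against the query."""
    scores = chunk_vecs @ query_vec
    best = np.argsort(scores)[::-1][:k]  # indices of the top-k scores
    return [chunks[i] for i in best]
```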


Reference

Optional ETL Configuration

In production-level applications of RAG, to achieve better data recall, multi-source data needs to be preprocessed and cleaned, i.e., ETL (extract, transform, load). To enhance the preprocessing capabilities of unstructured/semi-structured data, Vord supports optional ETL solutions: Vord ETL and Unstructured ETL.

Unstructured can efficiently extract your data and transform it into clean output for subsequent steps.
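
As an illustration of this kind of extraction, the open-source unstructured Python package can partition a document into clean text elements; the file name below is a placeholder, and this is not Vord's exact pipeline:

```python
# pip install "unstructured[all-docs]"
from unstructured.partition.auto import partition

elements = partition(filename="example-report.pdf")  # auto-detects the file type
clean_text = "\n\n".join(el.text for el in elements if el.text)
print(clean_text[:500])
```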

ETL solution choices in different versions of Vord:

  • The SaaS version uses Unstructured ETL by default and cannot be changed.

  • The community version uses Vord ETL by default but can switch to Unstructured ETL through environment variables.

Differences in supported file formats for parsing:

  • Vord ETL: txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv

  • Unstructured ETL: txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv, eml, msg, pptx, ppt, xml, epub

Different ETL solutions may produce different file extraction results. For more information on how Unstructured ETL processes data, please refer to its official documentation.

Embedding Model

Embedding transforms discrete variables (words, sentences, documents) into continuous vector representations, mapping high-dimensional data to lower-dimensional spaces. This technique preserves crucial semantic information while reducing dimensionality, enhancing content retrieval efficiency.

Embedding models, specialized large language models, excel at converting text into dense numerical vectors, effectively capturing semantic nuances for improved data processing and analysis.
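
For example, using the open-source sentence-transformers package (the model name is an example, not necessarily what Vord configures), semantically related sentences map to nearby vectors:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
vectors = model.encode([
    "How do I reset my password?",
    "Password reset instructions",
    "Quarterly sales report",
])

# Related sentences land close together in the vector space.
print(util.cos_sim(vectors[0], vectors[1]))  # relatively high similarity
print(util.cos_sim(vectors[0], vectors[2]))  # relatively low similarity
```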
