After uploading content to the knowledge base, it needs to undergo chunking and data cleaning. This stage can be understood as content preprocessing and structuring.
What is text chunking and cleaning
Chunking: LLMs have a limited context window, usually requiring the entire text to be segmented and then recalling the most relevant segments to the user’s question, known as the segment TopK recall mode. Additionally, appropriate segment sizes help match the most relevant text content and reduce information noise when semantically matching user questions with text segments.
Cleaning: To ensure the quality of text recall, it is usually necessary to clean the data before passing it into the model. For example, unwanted characters or blank lines in the output may affect the quality of the response. To help users solve this problem, Vord provides various cleaning methods to help clean the output before sending it to downstream applications, check ETL to know more details.
Two strategies are supported:
Automatic mode
Custom mode
Automatic
The Automated mode is designed for users unfamiliar with segmentation and preprocessing techniques. In this mode, Vord automatically segments and sanitizes content files, streamlining the document preparation process.
Custom
Custom mode is tailored for advanced users with specific text processing requirements. This mode allows manual configuration of chunking rules and cleaning strategies based on different document formats and scenario demands.
Chunking Rules:
Delimiter: Specify a delimiter for text segmentation. For example, (newline character in regex) will chunk text at each line break.
Maximum chunk length: Set the maximum character count per segment. Chunk exceeding this limit will be forcibly divided. The maximum length for a segment is 4000 tokens.
Chunk overlap: Define the overlap between adjacent chunks. This overlap enhances information retention and analysis accuracy, improving recall effectiveness. Recommended setting is 10-25% of the segment length in tokens.
Text Preprocessing Rules: These rules help filter out insignificant content from the knowledge base.
Replace consecutive spaces, newlines, and tabs.
Delete all URLs and email addresses.mode
Indexing Mode
You need to choose the indexing method for the text to specify the data matching method. The indexing strategy is often related to the retrieval method, and you need to choose the appropriate retrieval settings according to the scenario.
High-Quality Mode
Economical Mode
Q&A Mode
High QualityEconomicalQ&A Mode (community version only)
In High-Quality mode, the system first leverages a configurable Embedding model (which can be switched) to convert chunk text into numerical vectors. This process facilitates efficient compression and persistent storage of large-scale textual data, while simultaneously enhancing the accuracy of LLM-user interactions.
The High-Quality indexing method offers three retrieval settings: vector retrieval, full-text retrieval, and hybrid retrieval. For more details on retrieval settings, please check "Retrieval Settings".
Retrieval Settings
In high-quality indexing mode, Vord offers three retrieval settings:
Vector Search
Full-Text Search
Hybrid Search
Vector Search
Definition: The system vectorizes the user's input query to generate a query vector. It then computes the distance between this query vector and the text vectors in the knowledge base to identify the most semantically proximate text chunks.
Vector Search Settings:
Rerank Model: After configuring the API key for the Rerank model on the "Model Provider" page, you can enable the “Rerank Model” in the retrieval settings. The system will then perform semantic reordering of the retrieved document results after hybrid retrieval, optimizing the ranking results. Once the Rerank model is established, the TopK and Score Threshold settings will only take effect during the reranking step.
TopK: This parameter filters the text chunks that are most similar to the user's question. The system dynamically adjusts the number of snippets based on the context window size of the selected model. The default value is 3, meaning a higher value results in more text segments being retrieved.
Score Threshold: This parameter sets the similarity threshold for filtering text chucks. Only text chucks that exceed the specified score will be recalled. By default, this setting is off, meaning there will be no filtering of similarity values for recalled text chunks. When enabled, the default value is 0.5. A higher value is likely to yield fewer recalled texts.
The TopK and Score configurations are only effective during the Rerank phase. Therefore, to apply either of these settings, it is necessary to add and enable a Rerank model.
Full-Text Search
Definition: Indexing all terms in the document, allowing users to query any terms and return text fragments containing those terms.
Rerank Model: After configuring the API key for the Rerank model on the "Model Provider" page, you can enable the “Rerank Model” in the retrieval settings. The system will then perform semantic reordering of the retrieved document results after hybrid retrieval, optimizing the ranking results. Once the Rerank model is established, the TopK and Score Threshold settings will only take effect during the reranking step.
TopK: This parameter filters the text chunks that are most similar to the user's question. The system dynamically adjusts the number of snippets based on the context window size of the selected model. The default value is 3, meaning a higher value results in more text segments being retrieved.
Score Threshold: This parameter sets the similarity threshold for filtering text chucks. Only text chucks that exceed the specified score will be recalled. By default, this setting is off, meaning there will be no filtering of similarity values for recalled text chunks. When enabled, the default value is 0.5. A higher value is likely to yield fewer recalled texts.
The TopK and Score configurations are only effective during the Rerank phase. Therefore, to apply either of these settings, it is necessary to add and enable a Rerank model.
Hybrid Search
Definition: This process performs both full-text search and vector search simultaneously, incorporating a reordering step to select the best results that match the user's query from both types of search outcomes. In this mode, users can specify "weight settings" without needing to configure the Rerank model API, or they can opt for a Rerank model for retrieval.
Weight Settings: This feature enables users to set custom weights for semantic priority and keyword priority. Keyword search refers to performing a full-text search within the knowledge base, while semantic search involves vector search within the knowledge base.
Semantic Value of 1
This activates only the semantic search mode. Utilizing embedding models, even if the exact terms from the query do not appear in the knowledge base, the search can delve deeper by calculating vector distances, thus returning relevant content. Additionally, when dealing with multilingual content, semantic search can capture meaning across different languages, providing more accurate cross-language search results.
Keyword Value of 1
This activates only the keyword search mode. It performs a full match against the input text in the knowledge base, suitable for scenarios where the user knows the exact information or terminology. This approach consumes fewer computational resources and is ideal for quick searches within a large document knowledge base.
Custom Keyword and Semantic Weights
In addition to enabling only semantic search or keyword search, we provide flexible custom weight settings. You can continuously adjust the weights of the two methods to identify the optimal weight ratio that suits your business scenario.
Rerank Model: After configuring the API key for the Rerank model on the "Model Provider" page, you can enable the “Rerank Model” in the retrieval settings. The system will then perform semantic reordering of the retrieved document results after hybrid retrieval, optimizing the ranking results. Once the Rerank model is established, the TopK and Score Threshold settings will only take effect during the reranking step.
The "Weight Settings" and "Rerank Model" settings support the following options:
TopK: This parameter filters the text chunks that are most similar to the user's question. The system dynamically adjusts the number of snippets based on the context window size of the selected model. The default value is 3, meaning a higher value results in more text segments being retrieved.
Score Threshold: This parameter sets the similarity threshold for filtering text chucks. Only text chucks that exceed the specified score will be recalled. By default, this setting is off, meaning there will be no filtering of similarity values for recalled text chunks. When enabled, the default value is 0.5. A higher value is likely to yield fewer recalled texts.
In the Economical indexing mode, Vord offers a single retrieval setting:
Inverted Index:
An inverted index is an index structure designed for rapid keyword retrieval in documents. Its fundamental principle involves mapping keywords from documents to lists of documents containing those keywords, thereby enhancing search efficiency. For a detailed explanation of the underlying mechanism, please refer to the "Inverted Index".
TopK:
This parameter filters the text chunks that are most similar to the user's question. The system dynamically adjusts the number of snippets based on the context window size of the selected model. The default value is 3, meaning a higher value results in more text segments being retrieved.
Reference
Optional ETL Configuration
In production-level applications of RAG, to achieve better data recall, multi-source data needs to be preprocessed and cleaned, i.e., ETL (extract, transform, load). To enhance the preprocessing capabilities of unstructured/semi-structured data, Vord supports optional ETL solutions: Vord ETL and Unstructured ETL.
Unstructured can efficiently extract and transform your data into clean data for subsequent steps.
ETL solution choices in different versions of Vord:
The SaaS version defaults to using Unstructured ETL and cannot be changed;
The community version defaults to using Vord ETL but can enable Unstructured ETL through environment variables;
Differences in supported file formats for parsing:
Different ETL solutions may have differences in file extraction effects. For more information on Unstructured ETL’s data processing methods, please refer to the official documentation.
Embedding Model
Embedding transforms discrete variables (words, sentences, documents) into continuous vector representations, mapping high-dimensional data to lower-dimensional spaces. This technique preserves crucial semantic information while reducing dimensionality, enhancing content retrieval efficiency.
Embedding models, specialized large language models, excel at converting text into dense numerical vectors, effectively capturing semantic nuances for improved data processing and analysis.