Populating a vector database typically involves sourcing content, transforming it into a vector representation, and storing it in a searchable format.
Data Cleaning and Preprocessing #
To store data in a vector database, you must first convert it into a numerical vector. Cyclr performs embedding by integrating with external models via its Connectors:
- Use an embedding service to convert your input text into a high-dimensional vector.
- For example, with the ChatGPT connector, you can pass a text string to the “Create Embedding” method. The output is a vector whose dimensionality is determined by the model (e.g. OpenAI’s text-embedding-3-small returns 1536-dimensional vectors by default).
- Use OCR (Optical Character Recognition) to extract text from PDFs.
- In this example, we built a custom connector for MistralAI that extracts text via OCR and passes it to external models.
- Convert extracted text to Markdown or plain text for consistency.
For example, text might be extracted from a PDF file using OCR, converted into Markdown, and then embedded via OpenAI. The resulting vector is then passed to the vector database.
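OCR output often contains stray whitespace and words split across line breaks, so a light normalization pass before embedding can help. The following is a minimal sketch using only the Python standard library; `normalize_text` is a hypothetical helper written for illustration, not a Cyclr or connector method, and the rules it applies are assumptions about typical OCR noise.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Hypothetical cleanup pass for OCR output before embedding."""
    # Fold compatibility characters (e.g. ligatures) into their plain forms.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break. Assumption: this may also
    # merge genuine hyphenated compounds that happen to wrap at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Vector  data-\nbases store\tembeddings.\n\n\n\nNext   paragraph. "
cleaned = normalize_text(sample)
print(cleaned)
```

Running the same normalization at ingestion and at query time keeps the text that reaches the embedding model consistent.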
Always refer to your embedding provider’s documentation to verify the expected output format. Use the same embedding model for data ingestion and querying, because vectors produced by different models are not comparable. Other providers and custom models can also be used, as long as they return compatible vector formats.
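One way to guard against a model mismatch is to record which model and dimensionality were used at ingestion and check every vector against that record. The sketch below is illustrative only: `EmbeddingSpec`, `check_vector`, and the toy `fake_embed` function are hypothetical names, and `fake_embed` is a deterministic stand-in for a real embedding call rather than any provider's API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Hypothetical record of the model used to build the index."""
    model: str
    dimensions: int


def fake_embed(text: str, spec: EmbeddingSpec) -> list[float]:
    """Toy stand-in for an embedding service: deterministic, not semantic."""
    values: list[float] = []
    counter = 0
    while len(values) < spec.dimensions:
        digest = hashlib.sha256(f"{spec.model}:{counter}:{text}".encode()).digest()
        values.extend(b / 255.0 for b in digest)
        counter += 1
    return values[: spec.dimensions]


def check_vector(vector: list[float], spec: EmbeddingSpec) -> None:
    """Raise if a vector's size differs from the one used at ingestion."""
    if len(vector) != spec.dimensions:
        raise ValueError(
            f"expected {spec.dimensions} dimensions for {spec.model}, "
            f"got {len(vector)}"
        )


ingest_spec = EmbeddingSpec(model="text-embedding-3-small", dimensions=1536)
query_vector = fake_embed("how do I reset my password?", ingest_spec)
check_vector(query_vector, ingest_spec)  # passes: same model and size
```

A dimension check catches only the most obvious mismatch; two different models with the same output size would still need to be distinguished by the recorded model name.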
Vector Upsertion #
Vector upsertion is the process of adding or updating vector records in a database. These records typically include:
- A unique identifier
- The embedding vector itself (a high-dimensional array)
- Optional metadata for filtering or context
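As a rough illustration, a record with these three parts could be modeled as below. The class name `VectorRecord` and the field names `id`, `values`, and `metadata` are assumptions chosen for the sketch; the exact field names and limits depend on your vector database, so check its documentation.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class VectorRecord:
    """Hypothetical shape of a single vector record."""
    id: str                      # unique identifier
    values: list[float]          # the embedding vector
    metadata: dict[str, Any] = field(default_factory=dict)  # optional context

    def to_payload(self) -> dict[str, Any]:
        """Build a plain dict, omitting metadata when none was supplied."""
        payload: dict[str, Any] = {"id": self.id, "values": self.values}
        if self.metadata:
            payload["metadata"] = self.metadata
        return payload


record = VectorRecord(
    id="doc-001#chunk-0",
    values=[0.12, -0.03, 0.88],  # shortened for readability
    metadata={"source": "handbook.pdf", "page": 1},
)
print(record.to_payload())
```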
In Cyclr, upsertion is handled via connector methods that map incoming data from source systems to the required format for your vector database. For example, the Cyclr Pinecone connector includes methods like:
- Upsert Vectors: Store one or more vectors in a specified namespace*
- Upsert Text: Embed and store text using integrated models (if supported) into a namespace*
- Delete Vectors, Update Vector, List Vector IDs: Manage vector records
*A namespace in this context refers to a logical partition within a vector database index. It is used to isolate groups of vectors under a shared identifier, allowing for targeted queries, scoped data management, and organization of content. When you upsert vectors or perform searches, specifying a namespace ensures that operations are confined to that partition.
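The upsert and namespace behavior described above can be illustrated with a small in-memory model. This is a sketch under stated assumptions: `ToyIndex` is a hypothetical class written for this example and does not reflect the Cyclr connector or the Pinecone client API.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory index partitioned by namespace."""

    def __init__(self) -> None:
        # namespace -> record id -> record
        self._namespaces: dict[str, dict[str, dict]] = defaultdict(dict)

    def upsert(self, namespace: str, records: list[dict]) -> int:
        """Insert new records, or overwrite existing ones with the same id."""
        partition = self._namespaces[namespace]
        for record in records:
            partition[record["id"]] = record
        return len(records)

    def list_ids(self, namespace: str) -> list[str]:
        """Return the ids stored in one namespace only."""
        return sorted(self._namespaces.get(namespace, {}))


index = ToyIndex()
index.upsert("contracts", [{"id": "a", "values": [0.1, 0.2]}])
index.upsert("contracts", [{"id": "a", "values": [0.9, 0.8]}])  # update, no duplicate
index.upsert("invoices", [{"id": "a", "values": [0.5, 0.5]}])   # separate partition

print(index.list_ids("contracts"))  # ['a']
print(index.list_ids("invoices"))   # ['a']
```

The same id can exist in two namespaces without conflict, and re-upserting an id within a namespace replaces the earlier record.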
Watch: How to Ingest and Store Vectors from Sheets Using Pinecone (Video Walkthrough – Episode 2)
Workflow Orchestration #
The ingestion process can be orchestrated as a Cyclr workflow. A workflow might, for example:
- Retrieve data, e.g. a document from Google Drive or rows from Google Sheets
- Call an embedding service for each content item
- Map the output into a vector record
- Upsert the result into the database
These workflows can be scheduled, triggered by events such as new uploads, or run manually.
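Outside of Cyclr's visual workflow builder, the same sequence of steps can be sketched in code to make the data flow explicit. Everything here is hypothetical: `run_ingestion`, the shape of the input rows, and the stand-in `embed` and `upsert` callables are assumptions for illustration, not Cyclr methods.

```python
from typing import Callable, Iterable


def run_ingestion(
    items: Iterable[dict],
    embed: Callable[[str], list[float]],
    upsert: Callable[[list[dict]], None],
) -> int:
    """Embed each item, map it to a vector record, and upsert the batch."""
    records = []
    for item in items:
        vector = embed(item["text"])  # call the embedding service
        records.append(               # map the output into a vector record
            {
                "id": item["id"],
                "values": vector,
                "metadata": {"source": item["source"]},
            }
        )
    if records:
        upsert(records)               # store the batch in the database
    return len(records)


# Stand-ins for the retrieval, embedding, and storage steps.
rows = [
    {"id": "row-1", "text": "Refund policy", "source": "sheet"},
    {"id": "row-2", "text": "Shipping times", "source": "sheet"},
]
stored: list[dict] = []
count = run_ingestion(
    rows,
    embed=lambda text: [float(len(text))],  # toy embedding, not semantic
    upsert=stored.extend,
)
print(count)  # 2
```

Passing the embedding and upsert steps in as callables keeps the orchestration logic independent of any particular provider, which mirrors how a workflow swaps connectors without changing its overall shape.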