Data Processing

Let’s take another look at the schema of the CSV file. Its columns are:

ticket_number,date,caller,application,query,responder,solution,timestamp

In this business scenario, we have a large volume of IT support call center logs. The data is stored in a traditional SQL server and exported to CSV or Microsoft Excel format. It has the following characteristics:

  • It is well structured and divided into specific columns.
  • Not every column is embedded and stored in the vector database. The columns chosen for embedding include the call resolution; querying the caller’s name in the vector database, for instance, is unnecessary. This choice also matters for performance when dealing with large data volumes.
  • Each record is neither too long nor too short. In theory, the token limit of the language model should not be an issue, because each column’s width is predetermined, much like the data length in the example.
  • If the solution column becomes excessively long because operators submit lengthy documents as solutions, we can truncate the column and provide a reference to the full solution in the original database. This way, users can access the complete solution later if needed.
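
The truncation idea in the last bullet can be sketched as follows. This is a minimal illustration, not part of the original pipeline: the helper name truncate_solution, the 1,000-character limit, and the wording of the reference are all assumptions, and the ticket number is used here as the pointer back to the original database.

```python
MAX_SOLUTION_CHARS = 1000  # assumed limit; tune it to the embedding model in use


def truncate_solution(row: dict, max_chars: int = MAX_SOLUTION_CHARS) -> dict:
    """Return a copy of the row whose solution is cut to at most max_chars.

    When the text is cut, a short pointer to the full record is appended so
    the user can look up the complete solution in the original database.
    """
    solution = row["solution"]
    if len(solution) <= max_chars:
        return dict(row)  # short enough, keep as-is
    shortened = solution[:max_chars].rstrip()
    reference = f" [truncated; see ticket {row['ticket_number']} for the full solution]"
    return {**row, "solution": shortened + reference}


# Hypothetical record with an overly long solution.
sample = {"ticket_number": "000001", "solution": "Restart the service. " * 100}
result = truncate_solution(sample, max_chars=50)
print(result["solution"])
```

Applying such a step before loading keeps every embedded record within a predictable size while the ticket number still leads back to the full text.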

Load the CSV and Define page_content for Embedding

As the schema shows, each call log record contains columns such as the caller’s name and the timestamp of the call. These are not relevant to the call solution, and embedding them in a vector store can increase computational cost and reduce search accuracy. It is therefore advisable to embed only the columns that matter for the end user’s query: the query itself, the application name, and the solution. These are the key pieces of information that users are interested in retrieving.

Below is a code snippet that loads the CSV file so that the three specified columns (query, application name, and solution) form the page_content to be embedded, while the remaining columns are kept as metadata.

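from pprint import pprint

from langchain_community.document_loaders.csv_loader import CSVLoader

# Columns listed in metadata_columns are stored as metadata; every other
# column (application, query, solution) becomes part of page_content.
loader = CSVLoader(file_path="./reranker_sample.csv",
                   metadata_columns=["ticket_number", "date", "caller",
                                     "responder", "timestamp"])
data = loader.load()  # one Document per CSV row
pprint(data[0])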

The output looks like this:

Document(metadata={'source': './reranker_sample.csv', 'row': 0, 'ticket_number': '000001', 'date': '2024-04-01', 'caller': 'John Doe', 'responder': 'Jane Smith', 'timestamp': '2024-04-01 10:15:23'}, page_content="application: Windows Operating System\nquery: I'm having trouble logging into my account. I keep getting an error message saying my password is incorrect, even though I know I'm entering it correctly.\nsolution: Verified the user's account information and reset their password. Provided step-by-step instructions on how to log in with the new password.")

The output shows that the page_content of the Document contains the information relevant to the solution: the application name, the query of the call, and the solution. The ticket_number, date, caller, responder, and timestamp columns are stored under metadata. The page_content will be embedded and then inserted into the vector store. By default, the source field in metadata contains the path of the original file, which indicates where the information came from, and the row field holds the zero-based index of the record.
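
To make the split between page_content and metadata concrete, the following standard-library sketch reproduces the same layout for a single row. It only approximates the shape of the loader’s output and is not CSVLoader’s actual implementation; the helper name split_row and the shortened in-memory sample row are assumptions for illustration.

```python
import csv
import io

METADATA_COLUMNS = ["ticket_number", "date", "caller", "responder", "timestamp"]

# Shortened, in-memory stand-in for one row of the CSV (illustrative values).
SAMPLE_CSV = (
    "ticket_number,date,caller,application,query,responder,solution,timestamp\n"
    "000001,2024-04-01,John Doe,Windows Operating System,"
    "Cannot log in,Jane Smith,Reset the password,2024-04-01 10:15:23\n"
)


def split_row(row: dict, metadata_columns: list) -> tuple:
    """Split one CSV row into (page_content, metadata)."""
    # Columns not listed as metadata become "key: value" lines of page_content.
    page_content = "\n".join(
        f"{key}: {value}" for key, value in row.items() if key not in metadata_columns
    )
    # The listed columns are kept aside and are not part of the embedded text.
    metadata = {key: row[key] for key in metadata_columns}
    return page_content, metadata


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
page_content, metadata = split_row(next(reader), METADATA_COLUMNS)
print(page_content)
print(metadata)
```

Only the page_content string would be passed to the embedding model; the metadata dictionary travels alongside it so that a retrieved record can still be traced back to its ticket.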