Striving to excel in algorithmic roles, especially in LLM applications.
Fall 2024 MS in Computer Science student at the University of Wisconsin–Madison.
View My LinkedIn Profile
Project Description:
This project aimed to revolutionize the way product titles are matched with category trees by integrating NLP techniques with Retrieval-Augmented Generation (RAG).
Objective:
When analyzing competitors' products, it is necessary to map each product to our internal categorization, which is crucial for competitor analysis.
When uploading new products from our suppliers, this enhanced method of finding the product category increases the accuracy and efficiency of product information, which is crucial for inventory management.
Accuracy: 96%+
Steps:
Tokenization and preprocessing: for example, performing stemming or lemmatization.
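The preprocessing step can be sketched as follows. This is a minimal, self-contained illustration: the crude suffix-stripping rules below are a toy stand-in for a real stemmer such as NLTK's PorterStemmer, not the project's actual preprocessing code.

```python
import re

def preprocess(title):
    """Lowercase, tokenize, and apply a crude suffix-stripping stem.

    A toy stand-in for a real stemmer (e.g. NLTK's PorterStemmer).
    """
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    stemmed = []
    for tok in tokens:
        # Very rough stemming rules, for illustration only
        for suffix in ("ing", "es", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        stemmed.append(tok)
    return stemmed

print(preprocess("Wireless Charging Pads for Phones"))
# → ['wirel', 'charg', 'pad', 'for', 'phon']
```

In practice a proper stemmer or lemmatizer avoids over-stripping (e.g. "wireless" → "wirel" above), but the pipeline shape is the same.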
IDF Calculation: Calculate the inverse document frequency for each term.
TF-IDF Calculation: Calculate the TF-IDF score for each term in the product title based on its TF and IDF values.
TF-IDF Vectorization: Construct a TF-IDF vector for the product title from the calculated TF-IDF scores.
Cosine Similarity Calculation: Calculate the cosine similarity between the TF-IDF vector of the product title and the category vectors in the hierarchical tree.
Prediction: Assign the category with the highest cosine similarity score.
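The TF-IDF steps above can be sketched end to end in plain Python. The category lists and product title here are hypothetical examples, and the smoothed IDF formula is one common convention, not necessarily the exact one used in the project:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (as sparse dicts) for tokenized documents."""
    n = len(docs)
    # Document frequency for each term
    df = Counter(t for doc in docs for t in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (cnt / len(doc)) * idf[t] for t, cnt in tf.items()})
    return vecs, idf

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical flattened category paths, already tokenized
categories = [
    ["electronics", "mobile", "phones", "chargers"],
    ["electronics", "audio", "headphones"],
    ["home", "kitchen", "cookware", "pans"],
]
cat_vecs, idf = tfidf_vectors(categories)

title = ["wireless", "chargers", "for", "mobile", "phones"]
tf = Counter(title)
title_vec = {t: (c / len(title)) * idf.get(t, 0.0) for t, c in tf.items()}

scores = [cosine(title_vec, cv) for cv in cat_vecs]
best = max(range(len(scores)), key=scores.__getitem__)
print(categories[best])  # the category sharing the most informative terms
```

Unseen title terms ("wireless", "for") simply get zero weight, so the prediction rests on terms shared with the category vocabulary.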
Achieved 80% accuracy within the top 5 candidates. Typical bad cases fail to capture the contextual information of the product and category.
Steps:
Proposed several chunking strategies:
Use the product title as a bridge to connect categories with new product titles.
First stem and simplify the product title, keeping only the main information, then apply the step above.
Directly use the category JSONL as chunks, then prompt the model to choose from the candidates. [Best performance and lowest cost among the strategies above.]
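The best-performing strategy, using category JSONL records as chunks and prompting the model to pick a candidate, might look like the sketch below. The JSONL records, field names (`id`, `path`), and prompt wording are all hypothetical; the actual LLM call is omitted:

```python
import json

# Hypothetical category records, one JSON object per line (JSONL)
category_jsonl = """\
{"id": 101, "path": "Electronics > Mobile Phones > Chargers"}
{"id": 102, "path": "Electronics > Audio > Headphones"}
{"id": 205, "path": "Home & Kitchen > Cookware > Pans"}
"""

def build_prompt(title, candidates):
    """Assemble a selection prompt from shortlisted category chunks."""
    lines = [f"Product title: {title}", "Choose the best category ID from:"]
    for c in candidates:
        lines.append(f"- {c['id']}: {c['path']}")
    lines.append("Answer with the ID only.")
    return "\n".join(lines)

chunks = [json.loads(line) for line in category_jsonl.splitlines()]
# In the real pipeline, cosine similarity would shortlist these candidates
prompt = build_prompt("fast wireless charger", chunks[:2])
print(prompt)
```

Constraining the answer to an ID from a short candidate list keeps the prompt compact, which is what makes this variant both cheap and accurate.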
In the multi-classification tasks involving long prompts, the language model (e.g., GPT) often generated hallucinated or irrelevant responses. This issue was particularly challenging as it compromised the accuracy and reliability of the product categorization.
Addressing Hallucination: To mitigate this issue, we integrated Retrieval-Augmented Generation (RAG) and utilized cosine similarity measures. This approach helped narrow down the classification candidates, effectively reducing the potential for hallucination by focusing the language model’s responses on more probable and relevant categories.
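The narrowing step that makes this mitigation work is just a top-k selection over the cosine similarity scores; a minimal sketch (the scores here are made up for illustration):

```python
def shortlist(scores, k=5):
    """Indices of the top-k candidates by similarity score, descending."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Hypothetical cosine similarities between a title and six categories
scores = [0.12, 0.87, 0.05, 0.64, 0.33, 0.71]
print(shortlist(scores, k=3))  # → [1, 5, 3]
```

Only these top-k categories are placed in the prompt, so the model cannot hallucinate a category outside the retrieved shortlist.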