Problem
Over 100 million people visit Quora every month, so it’s no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question and make writers feel they need to answer multiple versions of the same question.
Quora already uses a Random Forest classifier for this task, so a better-performing algorithm would directly improve the experience for both seekers and writers.
Solution
This project aims to build a classifier that can detect whether a question is a duplicate, i.e., whether an answer to the same question already exists. I modeled a Random Forest classifier and compared its results against a deep neural network (DNN) using NLP techniques.
Process
- Download and load the data. The labeled dataset can be downloaded from here. Each row contains two questions and a label indicating whether the pair is a duplicate.
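A minimal sketch of the loading step. The in-memory CSV below is a hypothetical two-row sample that mirrors the schema of the Kaggle "Quora Question Pairs" dataset (`id`, `qid1`, `qid2`, `question1`, `question2`, `is_duplicate`); in practice you would point `read_csv` at the downloaded file.

```python
import io

import pandas as pd

# Hypothetical mini sample mirroring the real dataset's columns;
# is_duplicate: 1 = same intent, 0 = different intent.
csv_data = io.StringIO(
    "id,qid1,qid2,question1,question2,is_duplicate\n"
    '0,1,2,"How do I learn Python?","What is the best way to learn Python?",1\n'
    '1,3,4,"What is the capital of France?","How do I cook pasta?",0\n'
)

df = pd.read_csv(csv_data)
print(df.shape)  # (2, 6)
print(df["is_duplicate"].value_counts())
```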
- Data cleaning. The following operations were performed to clean the question text:
  - Removing punctuation
  - Removing extra whitespace
  - Removing stop words
  - Removing symbols
  - Normalizing and stemming
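The cleaning steps above can be sketched as a single function. The stop-word list here is a small illustrative subset (the project used nltk's full English list), and a naive suffix stripper stands in for a real stemmer such as nltk's PorterStemmer.

```python
import re
import string

# Illustrative subset; the project used nltk's full English stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "in", "what", "how", "do", "i"}

def naive_stem(token: str) -> str:
    # Crude stand-in for a real stemmer (the project used nltk stemming).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def clean_question(text: str) -> str:
    """Lowercase (normalize), strip punctuation/symbols, collapse
    whitespace, drop stop words, then stem each remaining token."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in tokens)

print(clean_question("What is the best way to learn Python??"))  # → best way learn python
```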
- EDA. The exploratory analysis focused on the shape, structure, questions, and distribution of the classes. There were 255K non-duplicates and 149K duplicates in total. The longest question was 1.3K characters and 270 words long; the shortest was six characters and one word. Most questions are around 120 characters and 22 words long. In summary, the dataset is imbalanced and contains extreme question lengths that should be considered during feature engineering and modeling.
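The core of that exploration can be sketched with pandas. The two-row DataFrame below is a tiny stand-in for the real ~404K-pair dataset; the same `value_counts` and `describe` calls surfaced the 63%/37% class imbalance and the length extremes noted above.

```python
import pandas as pd

# Tiny stand-in sample; the real dataset has ~255K non-duplicates and ~149K duplicates.
df = pd.DataFrame({
    "question1": ["How do I learn Python?", "What is AI?"],
    "question2": ["What is the best way to learn Python?", "How do I cook pasta?"],
    "is_duplicate": [1, 0],
})

# Class balance check (the project found the classes imbalanced).
print(df["is_duplicate"].value_counts(normalize=True))

# Character and word lengths, used to spot extremes (1.3K-char max, 6-char min).
lengths = df["question1"].str.len()
word_counts = df["question1"].str.split().str.len()
print(lengths.describe())
print(word_counts.describe())
```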
- Feature Engineering. Using spaCy, FuzzyWuzzy, and nltk, the final modeling dataset contained the following features: question length, word count, TF-IDF word match, shared word count, stop-word ratio, shared grammar, fuzzy match, and the difference in word lengths between the two questions.
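A few of those pairwise features can be sketched with the standard library alone. Here `difflib.SequenceMatcher` stands in for FuzzyWuzzy's ratio, and a simple shared-word ratio stands in for the TF-IDF-weighted word match; the feature names are illustrative, not the project's exact column names.

```python
from difflib import SequenceMatcher

def pair_features(q1: str, q2: str) -> dict:
    """Compute a few illustrative pairwise features for two questions."""
    t1, t2 = q1.lower().split(), q2.lower().split()
    shared = set(t1) & set(t2)
    return {
        "len_diff": abs(len(q1) - len(q2)),         # difference in character lengths
        "word_count_diff": abs(len(t1) - len(t2)),  # difference in word counts
        "shared_count": len(shared),                # words common to both questions
        "word_match": 2 * len(shared) / (len(t1) + len(t2)),  # unweighted stand-in for TF-IDF match
        # difflib ratio as a stand-in for FuzzyWuzzy's fuzzy-match score
        "fuzzy_match": SequenceMatcher(None, q1.lower(), q2.lower()).ratio(),
    }

feats = pair_features("How do I learn Python?", "What is the best way to learn Python?")
print(feats)
```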
- Modeling with different classifiers. A sample of 100K rows was drawn to reduce processing time and computational cost.
- The traditional machine learning models trained were Logistic Regression (accuracy = 0.6755), Random Forest (accuracy = 0.6905), and XGBoost (accuracy = 0.698). Each model was evaluated and tuned in turn.
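The fit-and-evaluate loop can be sketched as follows. The feature matrix here is synthetic random data standing in for the engineered features (and the sizes are tiny compared with the 100K sample); XGBoost is omitted since it is a separate dependency, but it slots into the same loop.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix and duplicate labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
    ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {acc:.3f}")
```

Hyperparameter tuning (e.g., tree depth, number of estimators) would wrap this loop with a grid or randomized search over each model.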
- NLP with DNN
- A neural network with one hidden layer and an output layer was trained as a baseline. As I added more layers and activation functions, increased the number of epochs, and lowered the batch size, the model's performance improved accordingly. The final model achieved 71.1% accuracy.
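The baseline-then-deepen progression can be sketched with scikit-learn's `MLPClassifier` standing in for the actual DNN: one hidden layer mirrors the baseline, and widening `hidden_layer_sizes` while raising `max_iter` loosely mirrors adding layers and training for more epochs. The data is again a synthetic stand-in for the engineered features.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the engineered feature matrix and duplicate labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 6))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# One hidden layer as the baseline; a wider, deeper net as the tuned variant.
baseline = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=1)
deeper = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1500, random_state=1)

for name, model in [("baseline", baseline), ("deeper", deeper)]:
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```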
Results
Reflections
- Further improvement could come from training the model on more data.
- Explore LSTM models, which could yield higher accuracy.