Duplicate Question Detection in Stack Overflow: A Reproducibility Study

This page provides companion data for the paper submitted to SANER'2018 - RENE Track. This data is intended to make our results reproducible.

Authors

Rodrigo Fernandes Gomes da Silva
Klérisson Vinícius Ribeiro Paixão
Marcelo de Almeida Maia

Source Code

We make publicly available the source code for our reproductions of DupPredictor and Dupe, named DupPredictorRep and DupeRep, respectively. Follow the links above for the detailed reproduction steps.

Dataset

We provide two dumps, both containing the main tables. They differ only in the "posts" table. In Dump 1, the table data is already stemmed and has had the stop words removed. It also has the tag synonyms and code blocks already extracted. In Dump 2, the table contains the original raw content. The fastest way to reproduce DupPredictor or Dupe is to use Dump 1. If you want to run the entire process, including stemming and stop word removal, start from Dump 2 and follow the instructions in the preprocess step.
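
To give a rough idea of what separates the raw content in Dump 2 from the preprocessed content in Dump 1, the sketch below applies the three operations mentioned above (code block extraction, stop word removal, stemming) to a single post body. It is only an illustration and not the preprocess step shipped with DupPredictorRep or DupeRep: the function names, the tiny stop word list, and the crude suffix-stripping stemmer are all assumptions made here, and it assumes post bodies are HTML with code wrapped in <pre><code> elements. A real run would use a full stop word list and a proper stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop word list (assumption for illustration).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in",
              "how", "do", "i", "my", "and", "for"}

CODE_BLOCK = re.compile(r"<pre><code>(.*?)</code></pre>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")
TOKEN = re.compile(r"[a-z0-9]+")


def extract_code_blocks(body):
    """Return (body with code blocks removed, list of code block contents)."""
    code = CODE_BLOCK.findall(body)
    text = CODE_BLOCK.sub(" ", body)
    return text, code


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(body):
    """Extract code blocks, strip remaining HTML tags, drop stop words, stem."""
    text, code = extract_code_blocks(body)
    text = TAG.sub(" ", text).lower()
    tokens = [naive_stem(t) for t in TOKEN.findall(text) if t not in STOP_WORDS]
    return " ".join(tokens), code


if __name__ == "__main__":
    raw = ("<p>How do I sort the lists in Java?</p>"
           "<pre><code>Collections.sort(xs);</code></pre>")
    text, code = preprocess(raw)
    print(text)
    print(code)
```

Keeping the extracted code separate from the cleaned text mirrors the description of Dump 1, where code blocks are stored apart from the stemmed post content.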