MEDRED | Extracting medical entities from social media

Analyzing health discussions on social media offers promising possibilities for public health monitoring, detecting adverse drug reactions, and identifying and supporting individuals and communities with mental health problems. However, accurately extracting health mentions from social media is challenging: people use informal language, express the same concept in many different ways, and make spelling mistakes.

In this work, we demonstrated how to accurately extract a wide variety of medical entities, such as symptoms, diseases, and drug names, on three benchmark datasets from varied social media sources. In addition, we created a new benchmark dataset, MedRed, by annotating medical entities in 2K Reddit posts, and made it publicly available.

Using half a million Reddit posts from 18 disease-specific subreddits, we trained a machine-learning classifier to predict each post's category solely from its extracted medical entities. The average F1 score across categories was 0.87. These results open up new cost-effective opportunities for modeling, tracking, and even predicting health behavior at scale.
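
As a rough illustration of the idea of predicting a post's category from its entities alone, the sketch below trains a tiny Naive Bayes classifier on bags of entities and scores it with an F1 averaged across categories. It is not the classifier used in this work: the choice of Naive Bayes, the reading of "average F1" as an unweighted (macro) mean, the function names, and the toy posts are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(posts):
    """Count categories and entities. posts: list of (entities, category)."""
    cat_counts = Counter()
    ent_counts = defaultdict(Counter)
    vocab = set()
    for entities, category in posts:
        cat_counts[category] += 1
        for entity in entities:
            ent_counts[category][entity] += 1
            vocab.add(entity)
    return cat_counts, ent_counts, vocab


def predict(model, entities):
    """Return the most probable category for a bag of extracted entities."""
    cat_counts, ent_counts, vocab = model
    total_posts = sum(cat_counts.values())
    best_cat, best_score = None, -math.inf
    for cat in sorted(cat_counts):
        total_ents = sum(ent_counts[cat].values())
        score = math.log(cat_counts[cat] / total_posts)  # log prior
        for entity in entities:
            if entity in vocab:  # entities never seen in training are ignored
                # Laplace (add-one) smoothed log likelihood
                count = ent_counts[cat][entity] + 1
                score += math.log(count / (total_ents + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat


def macro_f1(gold, predicted):
    """Unweighted mean of per-category F1 scores."""
    pairs = list(zip(gold, predicted))
    scores = []
    for cat in sorted(set(gold) | set(predicted)):
        tp = sum(g == cat and p == cat for g, p in pairs)
        fp = sum(g != cat and p == cat for g, p in pairs)
        fn = sum(g == cat and p != cat for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy posts, already reduced to their extracted entities.
train_posts = [
    (["insulin", "blood sugar"], "diabetes"),
    (["metformin", "blood sugar"], "diabetes"),
    (["inhaler", "wheezing"], "asthma"),
    (["albuterol", "wheezing"], "asthma"),
]
test_posts = [
    (["insulin"], "diabetes"),
    (["wheezing", "inhaler"], "asthma"),
]

model = train_naive_bayes(train_posts)
gold = [category for _, category in test_posts]
predicted = [predict(model, entities) for entities, _ in test_posts]
score = macro_f1(gold, predicted)
```

Averaging per-category scores without weighting gives every category the same influence regardless of how many posts it contains, which is one common way to summarize performance across categories of uneven size.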

Publications

  • Sanja Šćepanović, Enrique Martin-Lopez, and Daniele Quercia.
    MedRed Dataset.
    Figshare (2020).

Data

  • Download models, data, and analysis code: From here