Health Forum Dataset
Release: 1.0 (15/06/2014)
_______________________________________________________________________________

Data :

Paper:
The data has been used in the following paper:

People on Drugs: Credibility of User Statements in Health Communities
Subhabrata Mukherjee, Gerhard Weikum and Cristian Danescu-Niculescu-Mizil
In KDD 2014

Contact:
_______________________________________________________________________________

---- FILES ----

- README.txt: This README file.
- Each data file is tab-delimited. For each file, the following details are shown below :
    -- source of the data
    -- usage (input to the system, ground-truth, extracted by the IE machinery)
    -- schema
    -- example record from each file

---- DATA DESCRIPTION AND FORMAT ----

1. Author-Drug-Docs.tsv (15,000 authors, 837 drug families with 2172 drugs, 620,510 docs)
    -- source : (as of 13.12.2013)
    -- usage : input
    -- schema : author-id (tab) drug (tab) doc-id-list
    -- example record : 20041 mersyndol [180917, 84869, 428058, 428080]
---------------------
2. Author-Details.tsv
    -- source : (as of 13.12.2013)
    -- usage : input
    -- schema : author-id (tab) gender (tab) location (tab) #posts (tab) membership-type (tab) #questions (tab) #replies (tab) #thanks
    -- example record : 130 female Birmingham, AL 12921 Facilitator 114 8154 817
---------------------
3. Author-Doc-Review.tsv (2.8 million posts of 15,000 authors in 620,510 docs)
    -- source : (as of 13.12.2013)
    -- usage : input
    -- schema : author-id (tab) doc-id (tab) post
    -- example record : 110550 118288 Hello everyone, hope all is as well as can be. I was dx'd 1-20-06 with MS. Starting testing less than 2 months ago....
---------------------
4. Author-Doc-Symptoms.tsv (2.8 million posts of 15,000 authors in 620,510 docs)
    -- usage : UMLS is used to extract biomedical concepts and terms in each sentence of the post. The extractions from individual sentences are delimited by #. Each extraction is of the form "word-detected:UMLS-category".
    Presence of negation (no, not, neither, nor) before any concept, within a window of 5, is indicated by the concept being prefixed with "neg:". These extractions are used to find frequently occurring n-gram patterns (max n = 5) that overlap with any expert-given side-effect (for any drug) with a similarity score greater than a threshold; such patterns are the potential side-effects used in experimentation.
    -- source : (as of 13.12.2013)
    -- schema : author-id (tab) doc-id (tab) symptoms-sentence-delimited-by-#
    -- example record : 3 10022 crap:body substance # neg:waste:disease or syndrome neg:money:idea or concept # ill:sign or symptom weeks:temporal concept #
---------------------
5. Expert-Drug-SideEffects.tsv (837 drug families with 2172 drugs)
    -- usage : Ground-truth extracted from the MayoClinic portal. SideEffect-Category is one of "More Common", "Less Common", "Rare", "Unknown" or "Overdose".
    -- source : (as of 20.1.2014)
    -- schema : drug-family (tab) SideEffect-Category (tab) Side-Effect-List (tab) UMLS-Concept-Mapping-Each-Item-delimited-by-#
    -- example record : amlodipine, norvasc Less Common [Difficult or labored breathing, dizziness ...] [difficult:Qualitative Concept labored:Occupational Activity breathing:Organism Function # dizziness:Sign or Symptom ...]
---------------------
6. Specific Drug Side Effects used in Experimentation for 6 drugs
    -- usage : These files are used to evaluate the performance of the models in identifying the true side-effects of 6 drugs from user posts. Each file on a given drug contains all posts made in the community involving that specific drug. The file also contains the different side-effects (true or false) detected in each user post, along with their frequency in that post. The expert-reported side-effects of the given drug in Expert-Drug-SideEffects.tsv are used to determine the true side-effects of the drug in a given user post (using similarity scores > threshold).
    -- Author-Xanax-SideEffects.tsv (Drugs = alprazolam, niravam, xanax ; Authors = 2785 ; Posts = 21,112)
    -- Author-Ibuprofen-SideEffects.tsv (Drugs = ibuprofen, advil, genpril, motrin, midol, nuprin ; Authors = 5657 ; Posts = 15,573)
    -- Author-Prilosec-SideEffects.tsv (Drugs = omeprazole, prilosec ; Authors = 1061 ; Posts = 3884)
    -- Author-Metformin-SideEffects.tsv (Drugs = metformin, glucophage, glumetza, sulfonylurea ; Authors = 779 ; Posts = 3562)
    -- Author-Tirosint-SideEffects.tsv (Drugs = levothyroxine, tirosint ; Authors = 432 ; Posts = 2393)
    -- Author-Flagyl-SideEffects.tsv (Drugs = metronidazole, flagyl ; Authors = 492 ; Posts = 1559)
    -- Each file is of the form
        -- schema : author-id (tab) doc-id (tab) symptom-list-with-frequency-in-post
        -- example record : 17483 426006 {change=2, pain low leg=2, pain=11, pain surgery=1, pain tenderness upp abdoman stomach=3, back pain severe=2, hear=2, upp stomach pain=2, heal=1, nausea=2, back leg pain=1, false sense well-be=2, feel=7, think=3, pain hairy area=1, anxiety=1, speak=1, problem=2, sleep=2, depression=1, gas=3, stomach pain severe=2, tenderness stomach area=2, severe nausea=3, stomach pain gas=1, attack=2, muscle ach pain=1, stomach pain severe nausea vomit=2, thing=1}
---------------------
7. Stylistic-Features.txt
    -- usage : As stylistic features in the models
---------------------
8. Affective-Features.tsv
    -- usage : As affective features in the models. For the same affective feature category, different lines indicate different word senses of the feature. However, the senses are not marked in the file.
    -- source :
    -- schema : affective-category affective-word-list
    -- example record : abhorrence abhorrence, abomination, detestation, execration, loathing, odium
_______________________________________________________________________________

BibTeX Entry:

@InProceedings{mukherjee2014peopleondrugs,
  author = {Subhabrata Mukherjee and Gerhard Weikum and Cristian Danescu-Niculescu-Mizil},
  title = {People on Drugs: Credibility of User Statements in Health Communities},
  booktitle = {Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14)},
  year = {2014},
}
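_______________________________________________________________________________

Parsing Example:

Several fields described above are not plain values: the doc-id-list is bracketed, the symptoms field is '#'-delimited per sentence, and the symptom list with frequencies is a brace-enclosed "name=count" list. The Python sketch below shows one way these might be parsed after splitting a line on tabs. It is not part of the release. All function names are hypothetical, and it assumes that each detected word is a single token without spaces, that UMLS category names contain no colon, and that symptom names contain no comma or "=".

```python
def parse_doc_id_list(field):
    """Parse a doc-id-list such as '[180917, 84869]' into a list of ints."""
    inner = field.strip().strip("[]")
    return [int(tok) for tok in inner.split(",") if tok.strip()]


def parse_symptom_sentences(field):
    """Split the '#'-delimited symptoms field into per-sentence strings."""
    return [chunk.strip() for chunk in field.split("#") if chunk.strip()]


def parse_extractions(sentence):
    """Parse one sentence of 'word:UMLS-category' extractions.

    Assumption: a whitespace token containing ':' starts a new extraction,
    and following tokens without ':' continue its (multi-word) category.
    """
    extractions = []
    for token in sentence.split():
        if ":" in token:
            parts = token.split(":")
            negated = parts[0] == "neg"  # 'neg:' prefix marks negation
            if negated:
                parts = parts[1:]
            extractions.append({
                "word": parts[0],
                "category": ":".join(parts[1:]),
                "negated": negated,
            })
        elif extractions:
            extractions[-1]["category"] += " " + token
    return extractions


def parse_symptom_counts(field):
    """Parse '{change=2, pain low leg=2}' into {'change': 2, 'pain low leg': 2}."""
    inner = field.strip().strip("{}")
    counts = {}
    for item in inner.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, freq = item.rpartition("=")
        counts[name.strip()] = int(freq)
    return counts


if __name__ == "__main__":
    # Example record from Author-Doc-Symptoms.tsv, with tabs restored.
    line = ("3\t10022\tcrap:body substance # neg:waste:disease or syndrome "
            "neg:money:idea or concept # ill:sign or symptom "
            "weeks:temporal concept #")
    author_id, doc_id, symptoms = line.split("\t", 2)
    for sentence in parse_symptom_sentences(symptoms):
        print(author_id, doc_id, parse_extractions(sentence))
```

Splitting with a maximum of two tab splits keeps the last field intact even if it should contain further tabs. The post text in Author-Doc-Review.tsv can be read the same way.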