Ensembled Machine Learning Methods and Feature Extraction Approaches for Suicide-Related Social Media

Authors

  • Merinda Lestandy, Universitas Muhammadiyah Malang
  • Abdurrahim Abdurrahim, Universitas Islam Indonesia
  • Amrul Faruq, Universitas Muhammadiyah Malang
  • M. Irfan, Universitas Muhammadiyah Malang
  • Novendra Setyawan, Universitas Muhammadiyah Malang

DOI:

https://doi.org/10.23887/janapati.v13i2.70016

Keywords:

Suicide, Social Media, Feature Extraction, Ensemble Machine Learning

Abstract

Suicide is a pressing public health concern that affects both young people and adults. The widespread use of mobile devices and social networking has made data collection easier, allowing researchers to assess the patterns, concepts, emotions, and opinions expressed on these platforms. This study aims to detect suicidal inclinations using a Reddit dataset, identifying people who express thoughts of suicide by analyzing their posts. The method evaluates different machine learning classification models, namely linear SVC, random forest, and ensemble learning, along with feature extraction approaches such as TF-IDF, Bag of Words, and VADER. The ensemble model uses a voting classifier, in which the predicted class is the one with the highest probability. The results suggest that ensemble learning with TF-IDF 2-grams yields the highest F1-score, 0.9315. The efficacy of TF-IDF 2-grams can be attributed to their capacity to capture more contextual information and preserve word order.
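As a rough illustration of the two ideas in the abstract, the following sketch shows why 2-grams retain word order that single-word counts lose, and how a probability-based voting rule picks a class. It is not the authors' implementation: the helper names `word_ngrams` and `soft_vote`, the class labels, the example sentences, and the probability values are all hypothetical, and the voting rule shown (averaging class probabilities across models) is an assumed reading of "the class with the highest probability".

```python
from collections import Counter


def word_ngrams(text, n=2):
    """Return the word n-grams of a text, in order of appearance."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def soft_vote(probabilities):
    """Average per-class probabilities across models; return (top class, averages)."""
    classes = probabilities[0].keys()
    averaged = {
        c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes
    }
    return max(averaged, key=averaged.get), averaged


# Two hypothetical sentences built from the same words in a different order.
a = "not good just tired"
b = "just good not tired"

# Single-word counts (Bag of Words style) cannot tell them apart ...
same_unigrams = Counter(a.split()) == Counter(b.split())
# ... but their 2-grams differ, because 2-grams keep local word order.
bigrams_a = word_ngrams(a)
bigrams_b = word_ngrams(b)

# Hypothetical probability outputs from two base models for one post.
model_outputs = [
    {"suicidal": 0.60, "non_suicidal": 0.40},  # e.g. first base model (assumed)
    {"suicidal": 0.30, "non_suicidal": 0.70},  # e.g. second base model (assumed)
]
label, averaged = soft_vote(model_outputs)
```

In this toy case the two models disagree, and the averaged probabilities decide the outcome. A real pipeline would weight the n-gram counts with TF-IDF and obtain the probabilities from trained classifiers.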

Published

2024-07-27

How to Cite

Merinda Lestandy, Abdurrahim Abdurrahim, Amrul Faruq, M. Irfan, & Novendra Setyawan. (2024). Ensembled Machine Learning Methods and Feature Extraction Approaches for Suicide-Related Social Media. Jurnal Nasional Pendidikan Teknik Informatika : JANAPATI, 13(2), 192–203. https://doi.org/10.23887/janapati.v13i2.70016

Section

Articles