Comparative Analysis of Multinomial Naïve Bayes and Logistic Regression Models for Prediction of SMS Spam

Pradana Ananda Raharja, Muhammad Fajar Sidiq, Diandra Chika Fransisca

Abstract


This research is motivated by a report from the United States Federal Trade Commission on fraud carried out through SMS text messages, which fraudsters use to manipulate potential victims. Scammers typically spread SMS spam as a vehicle for the crime. A supervised learning approach is applied to classify SMS messages into three categories: SMS spam, SMS fraud, and promotional SMS. Development of the prediction system is divided into several stages: data labelling, data preprocessing, modelling, and model validation. Logistic Regression achieves an accuracy of 99% with a test size of 15%, 99% with a test size of 20%, and 98% with a test size of 25%. Multinomial Naïve Bayes achieves an accuracy of 97% at all three test sizes (15%, 20%, and 25%). Logistic Regression, which has the highest accuracy, is therefore the method adopted for SMS spam prediction.
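To make the modelling stage concrete, the following is a minimal sketch of a three-class Multinomial Naïve Bayes text classifier with Laplace smoothing, written in plain Python. It is not the authors' implementation: the function names (`tokenize`, `train_mnb`, `predict_mnb`), the whitespace tokenizer standing in for the preprocessing stage, and the six toy messages are all assumptions made for illustration, and the toy data says nothing about the accuracies reported above.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Stand-in for the preprocessing stage (assumption): lowercase + whitespace split.
    return text.lower().split()


def train_mnb(messages, labels, alpha=1.0):
    # Count how often each class occurs and how often each word occurs per class.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(messages, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_counts": class_counts,
        "word_counts": word_counts,
        "vocab": vocab,
        "alpha": alpha,
        "n_docs": len(labels),
    }


def predict_mnb(model, text):
    # Pick the class maximising log P(class) + sum of log P(word | class).
    alpha = model["alpha"]
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, count in model["class_counts"].items():
        score = math.log(count / model["n_docs"])
        total_words = sum(model["word_counts"][label].values())
        for token in tokenize(text):
            if token not in model["vocab"]:
                continue  # ignore words never seen during training
            # Laplace (add-alpha) smoothing avoids zero probabilities.
            numerator = model["word_counts"][label][token] + alpha
            score += math.log(numerator / (total_words + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy messages, invented for this sketch (not the study's dataset).
train_messages = [
    "win free prize now",
    "free ringtone click now",
    "your account blocked send pin",
    "transfer money to this account urgent",
    "discount sale this weekend",
    "special discount for members",
]
train_labels = ["spam", "spam", "fraud", "fraud", "promo", "promo"]

model = train_mnb(train_messages, train_labels)
print(predict_mnb(model, "free prize click"))
print(predict_mnb(model, "send pin to account"))
print(predict_mnb(model, "weekend discount sale"))
```

In a validation stage like the one described, a labelled dataset would be split so that 15%, 20%, or 25% of the messages are held out, and accuracy would be the fraction of held-out messages whose predicted label matches the true one.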

Keywords


Fraud; SMS Spam; Supervised Learning; Model Validation






DOI: https://doi.org/10.30865/mib.v6i3.4019



Copyright (c) 2022 JURNAL MEDIA INFORMATIKA BUDIDARMA

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.



JURNAL MEDIA INFORMATIKA BUDIDARMA
Universitas Budi Darma
Secretariat: Sisingamangaraja No. 338 Telp 061-7875998
Email: mib.stmikbd@gmail.com
