Big Data: Controlling Fraud by Using Machine Learning Libraries on Spark

Authors

  • Ferhat Karataş
  • Sevcan Aytaç Korkmaz

DOI:

https://doi.org/10.18100/ijamec.2018138629

Keywords:

k-Means, Spark, Machine Learning, Anomaly Detection

Abstract

Continuous change and the high computational volume in the distribution of network data have made it more difficult to analyze the data and to detect abnormal behavior within it. For this reason, big data solutions have gained importance. With the advance of internet technologies and the digital age, cyber-attacks have increased steadily. The k-Means clustering algorithm is one of the most widely used algorithms in data mining. Clustering algorithms automatically divide data into smaller clusters or sub-clusters, placing statistically similar records in the same group. In this article, we used the k-Means method from the machine learning libraries on Spark to determine whether incoming network records represent normal behavior. The study used 400 thousand network records obtained from the KDD Cup 1999 data set. With the k-Means method, we detected 10 abnormal behaviors among these 400 thousand records.
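As a rough illustration of the idea in the abstract, the sketch below clusters records with k-Means and then flags the records that lie farthest from their nearest cluster center. It is plain Python with no Spark dependency and is not the authors' implementation. The function names, the toy two-dimensional data, the choice of k = 2, the use of the first k points as starting centers, and the "farthest n records" rule are all assumptions made for the example.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    best_i, best_d = 0, math.inf
    for i, c in enumerate(centroids):
        d = math.dist(point, c)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns a list of k centroids.

    Assumption for this sketch: the first k points are the starting
    centroids, which keeps the example deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            idx, _ = nearest(p, centroids)
            groups[idx].append(p)
        # Move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def top_anomalies(points, centroids, n):
    """Return the n points farthest from their nearest centroid."""
    return sorted(points, key=lambda p: nearest(p, centroids)[1], reverse=True)[:n]


# Hypothetical toy records: two tight groups and one isolated record.
points = [
    (0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.0, 30.0),
]
centroids = kmeans(points, k=2)
flagged = top_anomalies(points, centroids, n=1)
print(flagged)
```

On this toy data the single flagged record is the isolated point. With real network records, the features would normally need numeric encoding and scaling first, and the number of clusters and the cut-off for "abnormal" would have to be chosen for the data at hand.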

Published

31-03-2018

Section

Research Articles

How to Cite

[1]
“Big Data: Controlling Fraud by Using Machine Learning Libraries on Spark”, J. Appl. Methods Electron. Comput., vol. 6, no. 1, pp. 1–5, Mar. 2018, doi: 10.18100/ijamec.2018138629.
