Big Data Analytics: principles, trends and tasks (a survey)

O.S. Balabanov

Abstract


We review the main directions of Big Data analysis and their practical significance, as well as the problems and tasks in this field. Big Data Analytics appears to be a dominant trend in the development of modern information technologies for management and planning in business. A few examples of real applications of Big Data are briefly outlined. The aim of Big Data analysis is to extract useful meaning from collections of raw data. Big Data and Big Analytics have evolved as the computing community’s response to the challenges raised by rapid growth in data volume, variety, heterogeneity, velocity and veracity. Big Data Analytics may be seen as the current phase of the research and development known under the names ‘Data Mining’, ‘Knowledge Discovery in Data’, ‘intelligent data analysis’ etc. We suggest that there exist four modes of large-scale usage of Big Data: 1) ‘intelligent information retrieval’; 2) massive “intermediate” data processing (concentration, mining), which may be performed in one or two scans of the data; 3) model inference from data; 4) knowledge discovery in data. The stages of the data analysis cycle are outlined. Because Big Data are raw, distributed, unstructured, heterogeneous and disaggregated (vertically split), they should be prepared for deep analysis. Data preparation may comprise such jobs as data retrieval, access, filtering, cleaning, aggregation, integration, dimensionality reduction, reformatting etc. There are several classes of typical data analysis problems (tasks), including: grouping of cases (clustering), predictive model inference (regression, classification, recognition etc.), generative model inference, and extraction of structures and regularities from data. The distinction between model inference and knowledge discovery is elucidated. We offer some suggestions as to why ‘deep learning’ (one of the most attractive topics at present) is so successful and popular.
One drawback of traditional models is their inability to make predictions from an incomplete list of predictors (when some predictors are missing) or from an augmented list of predictors. This drawback may be overcome by using a causal model. Causal networks are highlighted in the survey as attractive because they appear to be expressive generative models and, simultaneously, predictive models in the strict sense. This means that they purport to explain how the object at hand operates (provided they are adequate). An adequate causal network makes it possible to predict the causal effect of a local intervention on the object.
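
The two uses of a causal network mentioned above (prediction when a predictor is missing, and prediction of the effect of an intervention) may be illustrated with a minimal sketch. It is not taken from the survey: the three-node network Z -> X, Z -> Y, X -> Y, all probability values, and the function names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical binary causal network: Z -> X, Z -> Y, X -> Y.
# All numbers below are illustrative assumptions, not data from the survey.
P_Z1 = 0.5                                   # P(Z=1)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5,   # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (1, 0): 0.3, (1, 1): 0.7}


def bernoulli(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one


def joint(z, x, y):
    """Joint probability factorized along the edges of the network."""
    return (bernoulli(P_Z1, z)
            * bernoulli(P_X1_GIVEN_Z[z], x)
            * bernoulli(P_Y1_GIVEN_XZ[(x, z)], y))


def predict_observed(x):
    """P(Y=1 | X=x) when predictor Z is missing: sum Z out of the joint."""
    numerator = sum(joint(z, x, 1) for z in (0, 1))
    denominator = sum(joint(z, x, y) for z, y in product((0, 1), repeat=2))
    return numerator / denominator


def predict_intervention(x):
    """P(Y=1 | do(X=x)): X is set from outside, so its dependence on Z is cut."""
    return sum(bernoulli(P_Z1, z) * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))


if __name__ == "__main__":
    for x in (0, 1):
        print(f"x={x}: observed {predict_observed(x):.2f}, "
              f"intervention {predict_intervention(x):.2f}")
```

With these assumed numbers, observing X=1 instead of X=0 changes the predicted probability of Y=1 much more than setting X does, because observing X also carries information about the common cause Z. A purely predictive model fitted on X alone would not separate the two cases.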

Methods used in Big Data Analytics will be reviewed in the next paper.

 


Keywords


Big Data; data analysis; model inference; knowledge discovery; statistical methods; predictive and generative models; causal networks; prediction




DOI: https://doi.org/10.15407/pp2019.02.047
