Big Data Analytics: principles, trends and tasks (a survey)
Abstract
We review the main directions of Big Data analysis and their practical significance, as well as the problems and tasks in this field. Big Data Analytics appears to be a dominant trend in the development of modern information technologies for management and planning in business. A few examples of real applications of Big Data are briefly outlined. The analysis of Big Data aims to extract useful meaning from raw data collections. Big Data and Big Analytics have evolved as the computing community’s response to the challenges raised by rapid growth in data volume, variety, heterogeneity, velocity and veracity. Big Data Analytics may be seen as the current phase of the research and development known under the names ‘Data Mining’, ‘Knowledge Discovery in Data’, ‘intelligent data analysis’, etc. We suggest that there are four modes of large-scale usage of Big Data: 1) intelligent information retrieval; 2) massive ‘intermediate’ data processing (concentration, mining), which may be performed in one or two scans of the data; 3) model inference from data; 4) knowledge discovery in data. The stages of the data analysis cycle are outlined. Because Big Data are raw, distributed, unstructured, heterogeneous and disaggregated (vertically split), they must be prepared for deep analysis. Data preparation may comprise such jobs as data retrieval, access, filtering, cleaning, aggregation, integration, dimensionality reduction, reformatting, etc. There are several classes of typical data analysis problems (tasks), including: grouping of cases (clustering); inference of predictive models (regression, classification, recognition, etc.); inference of generative models; and extraction of structures and regularities from data. The distinction between model inference and knowledge discovery is elucidated. We offer some suggestions as to why ‘deep learning’ (one of the most attractive topics at present) is so successful and popular.
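As an illustration of the data preparation jobs and the single-scan processing mode mentioned above, the following minimal Python sketch cleans, filters and aggregates raw records in one pass. The field names (`region`, `amount`), the filtering rule and the sample records are assumptions made for this example only; they are not taken from the survey.

```python
from collections import defaultdict


def one_pass_aggregate(records):
    """Clean, filter and aggregate raw records in a single scan."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]
    for rec in records:
        key = rec.get("region")   # hypothetical grouping field
        raw = rec.get("amount")   # hypothetical numeric field
        # Cleaning: drop records with a missing key or an unparsable value.
        if not key:
            continue
        try:
            amount = float(raw)
        except (TypeError, ValueError):
            continue
        # Filtering: keep only non-negative amounts (an illustrative rule).
        if amount < 0:
            continue
        # Aggregation: normalise the key, then update running count and sum.
        entry = totals[key.strip().lower()]
        entry[0] += 1
        entry[1] += amount
    return {k: {"count": c, "mean": s / c} for k, (c, s) in totals.items()}


# Made-up sample data, including dirty records that must be discarded.
raw_records = [
    {"region": "North", "amount": "10.0"},
    {"region": "north ", "amount": 30},
    {"region": "South", "amount": "n/a"},
    {"region": None, "amount": 5},
    {"region": "South", "amount": 7.5},
    {"region": "South", "amount": -1},
]
summary = one_pass_aggregate(raw_records)
print(summary)
```

Only running counts and sums are kept in memory, so the raw records never need to be stored or scanned a second time.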
One drawback of traditional models is their inability to make predictions from an incomplete list of predictors (when some predictors are missing) or from an augmented list of predictors. This drawback may be overcome by using a causal model. Causal networks are highlighted in the survey as attractive in that they appear to be expressive generative models and, simultaneously, predictive models in the strict sense. This means that they claim to explain how the object at hand operates (provided they are adequate). An adequate causal network makes it possible to predict the causal effect of a local intervention on the object.
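The difference between ordinary prediction and prediction of the effect of an intervention can be sketched on a toy causal network. The three-variable structure (Z influences X and Y, and X influences Y) and all probabilities below are assumptions chosen for illustration and are not taken from the survey. The observational prediction marginalises over the unobserved predictor Z, whereas the interventional one uses the standard truncated factorisation, in which the mechanism that generates X is removed.

```python
from fractions import Fraction as F

# Hypothetical causal network: Z -> X, Z -> Y, X -> Y (all variables binary).
p_z = {0: F(1, 2), 1: F(1, 2)}                    # P(Z = z)
p_x1_given_z = {0: F(1, 5), 1: F(4, 5)}           # P(X = 1 | Z = z)
p_y1_given_xz = {                                 # P(Y = 1 | X = x, Z = z)
    (0, 0): F(1, 10), (1, 0): F(3, 10),
    (0, 1): F(3, 5), (1, 1): F(4, 5),
}


def p_x_given_z(x, z):
    """P(X = x | Z = z)."""
    p1 = p_x1_given_z[z]
    return p1 if x == 1 else 1 - p1


def p_y1_observing(x):
    """P(Y = 1 | X = x): prediction when Z is not observed (marginalised out)."""
    joint = sum(p_z[z] * p_x_given_z(x, z) * p_y1_given_xz[(x, z)] for z in p_z)
    evidence = sum(p_z[z] * p_x_given_z(x, z) for z in p_z)
    return joint / evidence


def p_y1_intervening(x):
    """P(Y = 1 | do(X = x)): X is set externally, so P(X | Z) is dropped."""
    return sum(p_z[z] * p_y1_given_xz[(x, z)] for z in p_z)


print(p_y1_observing(1))    # 7/10
print(p_y1_intervening(1))  # 11/20
```

In this toy network, seeing X = 1 is also evidence about Z, so the observational prediction (7/10) overstates what would happen if X were set to 1 from outside (11/20).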
Methods used in Big Data Analytics will be reviewed in the next paper.
DOI: https://doi.org/10.15407/pp2019.02.047