Tasks and methods of Big Data analysis (a survey)

O.S. Balabanov

Abstract


We review the tasks and methods most relevant to Big Data analysis. The emphasis is on the conceptual and pragmatic issues of these tasks and methods, avoiding unnecessary mathematical detail. We suggest that the whole scope of work with Big Data falls into four conceptual modes (types) of large-scale usage: 1) intelligent information retrieval; 2) massive (large-scale) conveyor-style data processing (mining); 3) model inference from data; 4) knowledge extraction from data (regularity detection and structure discovery). The essence of various tasks (clustering, regression, generative model inference, structure discovery, etc.) is elucidated. We compare key methods of clustering, regression, classification, deep learning, generative model inference and causal discovery. Cluster analysis methods may be divided into those based on mean distance, those based on local distance, and those based on a model. The targeted (predictive) methods fall into two categories: methods that infer a model, and "tied to data" methods that compute predictions directly from the data. Common tasks of temporal data analysis are briefly overviewed. Among the diverse methods of generative model inference, we focus on causal network learning, because models of this class are very expressive and flexible and are able to predict the effects of interventions under varying conditions. The independence-based approach to causal network inference from data is characterized. We give a few comments on the specifics of inferring dynamic causal networks from time series. The challenges of Big Data analysis raised by data multidimensionality, heterogeneity and huge volume are presented, and some statistical issues related to these challenges are summarized.

Problems in programming 2019; 3: 58-85


Keywords


Big Data; data analysis; generative model inference; statistical methods; clustering; regression; prediction; pattern discovery; temporal data; causal networks

DOI: https://doi.org/10.15407/pp2019.03.058
