Tasks and methods of Big Data analysis (a survey)

O.S. Balabanov

Abstract


We review the tasks and methods most relevant to Big Data analysis. The emphasis is on the conceptual and pragmatic aspects of these tasks and methods, avoiding unnecessary mathematical detail. We suggest that the whole range of work with Big Data falls into four conceptual modes of large-scale usage: 1) intelligent information retrieval; 2) massive (large-scale) conveyor-style data processing (mining); 3) model inference from data; 4) knowledge extraction from data (detection of regularities and discovery of structures). The essence of various tasks (clustering, regression, generative model inference, structure discovery, etc.) is elucidated. We compare key methods of clustering, regression, classification, deep learning, generative model inference and causal discovery. Cluster analysis methods may be divided into those based on mean distance, those based on local distance and those based on a model. Targeted (predictive) methods fall into two categories: methods that infer a model, and "tied to data" methods that compute predictions directly from the data. Common tasks of temporal data analysis are briefly reviewed. Among the diverse methods of generative model inference, we focus on causal network learning, because models of this class are highly expressive and flexible and can predict the effects of interventions under varying conditions. The independence-based approach to causal network inference from data is characterized. We comment briefly on the specifics of inferring dynamic causal networks from time series. Challenges of Big Data analysis arising from the high dimensionality, heterogeneity and huge volume of data are presented, and some statistical issues related to these challenges are summarized.

Problems in programming 2019; 3: 58-85


Keywords


Big Data; data analysis; generative model inference; statistical methods; clustering; regression; prediction; pattern discovery; temporal data; causal networks


