THE EMERGENCE OF "DATA ANALYSIS" AS A SCIENTIFIC DISCIPLINE
Abstract
The article traces the evolution of data analysis from traditional statistics to data science. It starts from Peter Huber's assertion about the empirical nature of data analysis: Huber emphasizes that the current stage of development cannot be defined as a new scientific paradigm but is rather a tendency unified under the name "data science". The main focus is on the contributions of John Tukey, who was the first to express the ideas that laid the foundation for data analysis. The article explores the concepts of "confirmatory" and "exploratory" data analysis, defines their goals and differences, and emphasizes the importance of alternating between the two in the research process. Tukey's principles for contemporary data analysis, such as "maximum insight into the data" and "visualization of patterns", are considered key approaches for discovering new knowledge. Tukey's works sparked significant debate among statisticians, and his views on data analysis shocked the academic community. The article examines the impact of Tukey's works on the development of data science over half a century, drawing on comments by the renowned statistician P. Huber. Particular emphasis is placed on the influence of computational environments on the development of data analysis. The role of statistical packages and software environments such as BMDP, SPSS, SAS, Minitab, S, Stata, and R is discussed. Their impact is assessed through an analysis of word frequencies in the literature, which indicates that R is currently the dominant programming environment in academic statistics, with a large community of enthusiasts. The use of scripts to codify computation steps precisely is noted; this change is seen as altering the rules of the game and giving the expression "scientific approach to data analysis" more substance, in line with Tukey's assertion that data analysis can be studied as a science.
References
Huber P.J. Data Analysis: What Can Be Learned From the Past 50 Years. John Wiley & Sons, 2011.
Tukey J.W. The future of data analysis. Annals of Mathematical Statistics. 1962. Vol. 33. No. 1. P. 1–67.
Donoho D. 50 Years of Data Science. Journal of Computational and Graphical Statistics. 2017. Vol. 26. No. 4. P. 745–766. DOI: https://doi.org/10.1080/10618600.2017.1384734 (accessed: 08.11.2023).
Mosteller F., Tukey J.W. Data Analysis, Including Statistics. Handbook of Social Psychology / Eds. G. Lindzey, E. Aronson. Vol. 2. Reading, MA : Addison-Wesley, 1968. P. 80–203.
Chambers J.M. Greater or Lesser Statistics: A Choice for Future Research. Statistics and Computing. 1993. Vol. 3. P. 182–184.
Cleveland W.S. Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics. International Statistical Review. 2001. Vol. 69. P. 21–26.
Brillinger D.R., Fernholz L.T., Morgenthaler S. The Practice of Data Analysis: Essays in Honor of John W. Tukey. Princeton, New Jersey : Princeton University Press, 1997. 352 p.
Dempster A.P. John W. Tukey as "philosopher". The Annals of Statistics. 2002. Vol. 30. No. 6. P. 1619–1628. URL: http://surl.li/ntixf (accessed: 08.11.2023).
Kafadar K. John Tukey and Robustness. Statistical Science. 2003. Vol. 18. No. 3. P. 319–331. URL: http://surl.li/ntixn (accessed: 08.11.2023).
Kyslova O.N. Data mining: the history of the term [in Russian]. Ukrayinskyy sotsiolohichnyy zhurnal. 2011. No. 1–2. P. 83–94. URL: http://surl.li/ntixs (accessed: 08.11.2023).
Google’s N-grams viewer. URL: http://surl.li/ntiyc (accessed: 08.11.2023).
Google’s N-grams viewer. URL: http://surl.li/ntiyj (accessed: 08.11.2023).