ISSN: 2472-1956
1School of Mathematics and Statistics, University of Sydney, Australia
2Centre for Applied Finance, University of South Australia, Australia
It is a great pleasure and honour for me to be invited to work with the Foundation Editor-in-Chief and the other Members of the Editorial Board of the Journal of Informatics and Data Mining, and to contribute an Opinion article to the inaugural issue.
The statement of Editorial Intent by the Foundation Editor-in-Chief has indicated the tremendous scope and range of subject coverage provided by this exciting new Open Access journal. In this Opinion article, I will limit myself to a brief discussion of a small selection of free open-source software that might assist analysis and applications across this massive field of potential subjects in modelling, informatics, data mining, and the econometric and statistical exploration of datasets.
I commence with programs driven by drop-down menus, and then proceed to more sophisticated programs such as R, which provides a language environment. My personal preference for open source software started in recent years while teaching econometrics at an Australian university. I became increasingly frustrated by what I viewed as the excessive time and trouble involved in arranging access to commercial, and relatively expensive, software for students to use in university computing laboratories.
The ready solution that I eventually adopted was to switch to open source software, which was freely available. For teaching purposes and basic economic analysis, I found GRETL to be exceedingly useful (https://gretl.sourceforge.net/). GRETL is a cross-platform software package for econometric analysis, written in the C programming language. It is free, open-source software. Its capabilities continually expand, and now include, as taken verbatim from the GRETL website:
Easy intuitive interface (now in French, Italian, Spanish, Polish, German, Basque, Catalan, Galician, Portuguese, Russian, Turkish, Czech, Traditional Chinese, Albanian, Bulgarian, Greek, Japanese and Romanian as well as English).
A wide variety of estimators: least squares, maximum likelihood, GMM; single-equation and system methods.
Time series methods: ARIMA, a wide variety of univariate GARCH-type models, VARs and VECMs (including structural VARs), unit-root and cointegration tests, Kalman filter, etc.
Limited dependent variables: logit, probit, tobit, sample selection, interval regression, models for count and duration data, etc.
Panel-data estimators, including instrumental variables, probit and GMM-based dynamic panel models.
Output models as LaTeX files, in tabular or equation format.
Integrated powerful scripting language (known as hansl), with a wide range of programming tools and matrix operations.
GUI controller for fine-tuning Gnuplot graphs.
An expanding range of contributed function packages, written in hansl.
Facilities for easy exchange of data and results with GNU R, GNU Octave, Python, Ox and Stata.
As the list above suggests, much can be accomplished with GRETL.
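To make the first estimator on that list concrete, the following is a minimal sketch of ordinary least squares with a single regressor. It is written in plain Python purely for illustration; it is not hansl and not GRETL's implementation, and the function name ols_fit and the sample data are hypothetical.

```python
from statistics import mean


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of the regressor
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the regressor must vary across observations")
    # Sum of cross-products of deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data constructed to lie exactly on y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
intercept, slope = ols_fit(x, y)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

In a menu-driven package such as GRETL the same fit is obtained without writing any code, which is precisely its appeal for teaching.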
An alternative program that is focused predominantly on time series analysis is JMulTi (https://www.jmulti.de/). It is interactive software designed for univariate and multivariate time series analysis, with a Java graphical user interface that uses an external engine for statistical computations. It is particularly useful for VAR, VEC, SVAR, SVEC, STR, and nonparametric time series modelling.
For the purposes of data mining, it is hard to ignore Weka (Waikato Environment for Knowledge Analysis), a suite of machine learning software written in Java and developed at the University of Waikato, New Zealand (https://www.cs.waikato.ac.nz/ml/weka/).
Weka is available under the GNU General Public License. The program features a collection of visualization tools and algorithms for data analysis and predictive modelling, with graphical user interfaces that give access to a variety of functions. Another possibility for neural network applications is Neural Designer (https://www.artelnics.com/).
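For readers who wish to try Weka on their own data, its native input is a plain-text format known as ARFF, in which a header declares the attributes and a data section lists one comma-separated row per instance. The sketch below is illustrative Python, not part of Weka. The helper name to_arff and the toy "weather" data are hypothetical, and it assumes simple attribute names and values that need no quoting.

```python
def to_arff(relation, attributes, rows):
    """Render a small dataset as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of labels for a nominal attribute.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            # Numeric (or other built-in) attribute type
            lines.append("@attribute %s %s" % (name, kind))
        else:
            # Nominal attribute: permitted labels listed in braces
            lines.append("@attribute %s {%s}" % (name, ",".join(kind)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical toy dataset
arff_text = to_arff(
    "weather",
    [("temperature", "numeric"), ("play", ["yes", "no"])],
    [(30, "no"), (18, "yes")],
)
print(arff_text)
```

Saving the returned text to a file with the .arff extension should give something that can be opened from the Weka graphical interface.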
A more general and flexible environment is provided by Scilab (https://www.scilab.org/). Scilab is a French initiative: software designed for numerical computation that provides a powerful computing environment for engineering and scientific applications. Together with GNU Octave (https://gnu.org/software/octave/), it provides an open source alternative to Matlab. Scilab is used in French high schools as a mathematics tool. It has wide functionality, including mathematics and simulation, 2-D and 3-D graphics visualization, various optimization algorithms, a wide range of statistical functions, signal processing capabilities, control system design and analysis, application development features, and Xcos, a hybrid dynamic systems modeller and simulator.
My personal favourite as a general statistical computing environment is R, which can be considered a different implementation of S (https://cran.r-project.org/). R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible. The S language is often a preferred choice for research in statistical methodology, and R provides an Open Source route to participation.
One of the strengths of R is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae, where needed. R is dynamically typed, so variables need no type declarations, and it provides a wide variety of statistical and graphical techniques, including:
Linear and non-linear modelling
Univariate & Multivariate Statistics
Classical statistical tests
Time-series analysis / Econometrics
Simulation and Modelling
Data mining: classification, clustering, etc.
For computationally intensive tasks, C, C++, and Fortran code can be linked and called at run time. R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages. At the time of writing, the CRAN package repository featured 7161 available packages. The figure below, a screenshot taken from CRAN Task Views, shows the amazing range of application areas.
RStudio, an integrated development environment (IDE), is a powerful and productive user interface for R. RStudio adds various functionalities to the basic R interface, which makes it an attractive environment for running R (https://www.rstudio.com/).
In this Opinion article, I have referenced some of my personal favourites in the free open-source software environment, with potential applications to some of the subject areas covered by the Journal of Informatics and Data Mining. I hope it may be of interest to the readers of the journal.
