Rui Araújo – Datasets

Repositories
Below are links to repositories of datasets for machine learning tasks, including regression and classification. The list also includes some soft sensor datasets (soft sensors are one of my areas of research). Where applicable, the datasets are distinguished by whether they are static or dynamic and by whether they come from a soft sensor application.


General Dataset Repositories

  • Weka database - contains several regression datasets from different sources;

  • UCI database - contains several regression, classification, and clustering datasets, etc.;

  • Luís Torgo repository - contains many regression datasets, including some of the well-known regression datasets present in the UCI database;

  • Delve repository - contains some regression and classification datasets, many of which are also included in the Luís Torgo repository;

  • CoEPrA 2006 - contains high-dimensional regression and classification datasets based on the CoEPrA competition.


Soft Sensor Datasets

  • WWTP:

    • Download: here;

    • Description: Stationary; extracted from a real wastewater treatment plant (WWTP); more information can be found on page 7 here or here;

    • Data Info: Continuous; Stationary; Number of inputs: 8; Number of samples: 1000; Output: Fluorine at the effluent stage;

    • Objective: Predict fluorine at the effluent stage;

    • In case of publication please cite: Francisco Souza, Rui Araújo, Tiago Matias, Jérôme Mendes. A Multilayer-Perceptron Based Method for Variable Selection in Soft Sensor Design. Journal of Process Control, 23(10):1371-1378, November 2013. [ bib | DOI | .pdf ].

  • Debutanizer Column:

    • Download: Data for SRU Unit and Debutanizer Column (original link);

    • Description: Stationary; extracted from a real debutanizer plant; more information can be found on page XX of the Fortuna et al. Book;

    • Data Info: Continuous; Stationary; Number of inputs: 7; Number of samples: 2393; Output: Butane concentration;

    • Objective: Predict the butane concentration in a debutanizer column;

    • In case of publication, please cite the original book: Fortuna et al. Book.

  • Fluidized Catalytic Cracking Unit (FCCU) data set;

    • Description: regression dataset;

    • Further information about this dataset and its application in our research can be found in the following publication: Symone Gomes Soares and Rui Araújo. An on-line weighted ensemble of regressor models to handle concept drifts. Engineering Applications of Artificial Intelligence, 37:392–406, January 2015. [ bib | DOI | .pdf ]


Regression Datasets

  • Nissan Leaf electric vehicle regenerative braking data sets;

    • Description: regression datasets;

    • Further information about these datasets and their application in our research can be found in the following publication; in case of publication, please cite: Ricardo Maia, Marco Silva, Rui Araújo, and Urbano Nunes. Electrical vehicle modeling: A fuzzy logic model for regenerative braking. Expert Systems with Applications, 42(22):8504–8519, December 2015. [ bib | DOI | .pdf ]

  • The Toolbox of Regression Datasets for Concept Drifts;

    • Description: regression datasets for concept drift research;

    • Further information can be found in the following publication: Symone Gomes Soares and Rui Araújo. An on-line weighted ensemble of regressor models to handle concept drifts. Engineering Applications of Artificial Intelligence, 37:392–406, January 2015. [ bib | DOI | .pdf ]

    • Associated software: The OWE Toolbox;

    • In case of publication please cite: Symone Gomes Soares and Rui Araújo. An on-line weighted ensemble of regressor models to handle concept drifts. Engineering Applications of Artificial Intelligence, 37:392–406, January 2015. [ bib | DOI | .pdf ]

  • Boston Housing data set;

    • Description: regression dataset, also available here;

    • Further information about this dataset and its application in our research can be found in the following publication: Symone G. Soares, Carlos H. Antunes, and Rui Araújo. A genetic algorithm for designing neural network ensembles. In Proc. Genetic and Evolutionary Computation Conference (GECCO 2012), a recombination of the 21st International Conference on Genetic Algorithms (ICGA), and the 17th Annual Genetic Programming Conference (GP), pages 681–688, Philadelphia, USA, July 07-11 2012. ACM. [ bib | DOI | .pdf ]

  • Friedman Artificial Domain data set;

    • Description: regression dataset, also available here;

    • Further information about this dataset and its application in our research can be found in the following publication: Symone G. Soares, Carlos H. Antunes, and Rui Araújo. A genetic algorithm for designing neural network ensembles. In Proc. Genetic and Evolutionary Computation Conference (GECCO 2012), a recombination of the 21st International Conference on Genetic Algorithms (ICGA), and the 17th Annual Genetic Programming Conference (GP), pages 681–688, Philadelphia, USA, July 07-11 2012. ACM. [ bib | DOI | .pdf ]

  • Box-Jenkins gas furnace process data set;

    • Description: regression dataset;

    • Further information about this dataset and its application in our research can be found in the following publication: Jérôme Mendes, Francisco Souza, Rui Araújo, and Nuno Gonçalves. Genetic fuzzy system for data-driven soft sensors design. Applied Soft Computing, 12(10):3237–3245, October 2012. [ bib | DOI | .pdf ]
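
For readers who want a quick synthetic stand-in for the Friedman Artificial Domain data set listed above, the sketch below generates samples from the function commonly known as Friedman #1. It assumes that the listed data set follows this standard formulation (10 inputs drawn uniformly from [0, 1), of which only the first five affect the output, plus Gaussian noise); the downloadable files may differ in sample count, noise level, and random seed. The function name `friedman1_sample` and its parameters are hypothetical and are not part of any toolbox mentioned on this page.

```python
import math
import random


def friedman1_sample(n_samples, n_features=10, noise_std=1.0, seed=0):
    """Return a list of (inputs, target) pairs from the Friedman #1 function.

    Assumption: inputs are i.i.d. uniform on [0, 1) and the noise is
    zero-mean Gaussian with standard deviation `noise_std`.
    """
    if n_features < 5:
        raise ValueError("Friedman #1 needs at least 5 input features")
    rng = random.Random(seed)  # local generator, so results are reproducible
    data = []
    for _ in range(n_samples):
        x = [rng.random() for _ in range(n_features)]
        # Only x[0]..x[4] influence the target; the rest are distractors.
        y = (10.0 * math.sin(math.pi * x[0] * x[1])
             + 20.0 * (x[2] - 0.5) ** 2
             + 10.0 * x[3]
             + 5.0 * x[4]
             + rng.gauss(0.0, noise_std))
        data.append((x, y))
    return data


if __name__ == "__main__":
    samples = friedman1_sample(5)
    for inputs, target in samples:
        print([round(v, 3) for v in inputs], round(target, 3))
```

Because half of the inputs are irrelevant to the output, data generated this way is often used to check whether a variable selection or ensemble method can ignore distractor inputs.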



University of Coimbra  Department of Electrical and Computer Engineering - University of Coimbra  Institute of Systems and Robotics - University of Coimbra