Computer science
Time Series Algorithms in Machine Learning - A Graph Approach to Multivariate Forecasting
Forecasting future values of time series has long been a field with many and varied applications, from climate and weather forecasting to stock prediction and economic planning to the control of industrial processes. Many of these problems involve not just a single time series but many simultaneous series which may influence each other. This thesis provides machine learning methods for handling such problems.
We first consider single time series with both single and multiple features. We review the algorithms and unique challenges involved in applying machine learning to time series. Many machine learning algorithms, when used for regression, are designed to produce a single output value for each timestamp of interest with no measure of confidence; however, evaluating the uncertainty of the predictions is an important component of practical forecasting. We therefore discuss methods of constructing uncertainty estimates in the form of prediction intervals for each prediction. Stability over long time horizons is also a concern for these algorithms, as predictions over long time intervals are commonly generated recursively. To address this, we present methods of maintaining stability in the forecast even over large time horizons. These methods are applied to an electricity forecasting problem, where we demonstrate their effectiveness for support vector machines, neural networks and gradient boosted trees.
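One common way to construct prediction intervals of the kind described above is from empirical quantiles of held-out residuals. This is a minimal sketch of that general idea only, not the thesis's specific method; the nearest-rank quantile rule and the symmetric split of the miscoverage are illustrative choices.

```python
# Minimal sketch: wrap a point forecast in a prediction interval built from
# empirical quantiles of held-out forecast residuals (actual - predicted).

def prediction_interval(point_forecast, residuals, coverage=0.9):
    """Return (lower, upper) bounds around a point forecast."""
    r = sorted(residuals)
    lo_q = (1.0 - coverage) / 2.0   # split miscoverage evenly on both tails
    hi_q = 1.0 - lo_q

    def quantile(q):
        # Simple nearest-rank quantile of the sorted residuals.
        idx = min(int(q * len(r)), len(r) - 1)
        return r[idx]

    return (point_forecast + quantile(lo_q), point_forecast + quantile(hi_q))


# Example: symmetric residuals give a symmetric interval around the forecast.
lo, hi = prediction_interval(10.0, [-2, -1, 0, 1, 2], coverage=0.8)
```

In practice the residuals would come from a validation set, and the interval widens as the requested coverage grows.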
We next consider spatiotemporal problems, which consist of multiple interlinked time series, each of which may contain multiple features. We represent these problems using graphs, allowing us to learn relationships using graph neural networks. Existing methods of doing this generally make use of separate time and spatial (graph) layers, or simply replace operations in temporal layers with graph operations. We show that these approaches have difficulty learning relationships that contain time lags of several time steps. To address this, we propose a new layer inspired by the long short-term memory (LSTM) recurrent neural network which adds a distinct memory state dedicated to learning graph relationships while keeping the original memory state. This allows the model to consider temporally distant events at other nodes without affecting its ability to model long-term relationships at a single node. We show that this model is capable of learning the long-term patterns that existing models struggle with. We then apply this model to a number of real-world bike-share and traffic datasets where we observe improved performance when compared to other models with similar numbers of parameters.
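To make the "second memory state" idea concrete, here is an illustrative NumPy sketch of an LSTM-style cell that keeps the usual per-node memory `c` and adds a separate memory `g` fed by neighbouring nodes through a row-normalized adjacency matrix. The gate names, weight shapes and the way the two memories are combined are hypothetical; the thesis defines its own layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_lstm_step(x, h, c, g, A, W):
    """One step for all nodes at once (illustrative parametrization).
    x: (n, d_in) inputs; h, c, g: (n, d_h) states;
    A: (n, n) row-normalized adjacency; W: dict of weight matrices."""
    nbr = A @ h                                # aggregate neighbour hidden states
    z = np.concatenate([x, h, nbr], axis=1)    # shared gate input
    i = sigmoid(z @ W["i"])                    # input gate
    f = sigmoid(z @ W["f"])                    # forget gate for per-node memory
    o = sigmoid(z @ W["o"])                    # output gate
    fg = sigmoid(z @ W["fg"])                  # forget gate for the graph memory
    c = f * c + i * np.tanh(z @ W["c"])        # usual LSTM memory update
    g = fg * g + (1 - fg) * np.tanh(nbr @ W["g"])  # separate graph-fed memory
    h = o * np.tanh(c + g)                     # output draws on both memories
    return h, c, g
```

Because `g` has its own forget gate, information arriving from other nodes several steps ago can persist without overwriting the node's own long-term memory `c`, which is the intuition the abstract describes.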
Author Keywords: forecasting, graph neural network, LSTM, machine learning, neural network, time series
Characteristics of Models for Representation of Mathematical Structure in Typesetting Applications and the Cognition of Digitally Transcribing Mathematics
The digital typesetting of mathematics can present many challenges to users, especially those of novice to intermediate experience levels. Through a series of experiments, we show that two models used to represent mathematical structure in these typesetting applications, the 1-dimensional structure-based model and the 2-dimensional freeform model, cause interference with users' working memory during the process of transcribing mathematical content. This is a notable finding as a connection between working memory and mathematical performance has been established in the literature. Furthermore, we find that elements of these models allow them to handle various types of mathematical notation with different degrees of success. Notably, the 2-dimensional freeform model allows users to insert and manipulate exponents with increased efficiency and reduced cognitive load and working memory interference, while the 1-dimensional structure-based model allows for handling of the fraction structure with greater efficiency and decreased cognitive load.
Author Keywords: mathematical cognition, mathematical software, user experience, working memory
An Investigation of the Impact of Big Data on Bioinformatics Software
As the generation of genetic data accelerates, Big Data has an increasing impact on the way bioinformatics software is used. Experiments are becoming larger and more complex than originally envisioned by the software's designers. One way to deal with this problem is to use parallel computing.
Using the program Structure as a case study, we investigate ways in which to counteract the challenges created by the growing datasets. We propose an OpenMP and an OpenMP-MPI hybrid parallelization of the MCMC steps, and analyse the performance in various scenarios.
The results indicate that the parallelizations produce significant speedups over the serial version in all scenarios tested. This allows for using the available hardware more efficiently, by adapting the program to the parallel architecture. This is important because not only does it reduce the time required to perform existing analyses, but it also opens the door to new analyses, which were previously impractical.
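The key observation behind parallelizing MCMC steps of this kind is that the expensive per-individual terms inside each step are independent, so they can be split across workers and recombined. The thesis does this with OpenMP and an OpenMP-MPI hybrid in Structure's C code; the following Python sketch only illustrates the partitioning idea, and the per-individual likelihood term is a hypothetical stand-in.

```python
from concurrent.futures import ThreadPoolExecutor

def log_lik_term(datum, theta):
    # Hypothetical per-individual log-likelihood (a stand-in for Structure's
    # much more expensive per-individual computation).
    return -0.5 * (datum - theta) ** 2

def log_lik_serial(data, theta):
    # Baseline: sum the independent per-individual terms one by one.
    return sum(log_lik_term(d, theta) for d in data)

def log_lik_parallel(data, theta, workers=4):
    # Partition the individuals into chunks, evaluate chunks concurrently,
    # then sum the partial results -- the same answer, computed in parallel.
    chunk = max(1, len(data) // workers)
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        partials = ex.map(lambda c: log_lik_serial(c, theta), chunks)
    return sum(partials)
```

In the OpenMP version the chunks correspond to loop iterations distributed across threads on one node; in the hybrid version, MPI first distributes work across nodes and OpenMP splits it within each node.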
Author Keywords: Big Data, HPC, MCMC, parallelization, speedup, Structure
An Investigation of Load Balancing in a Distributed Web Caching System
With the exponential growth of the Internet, performance is an issue as bandwidth is often limited. A scalable solution for reducing the amount of bandwidth required is Web caching. Web caching (especially at the proxy level) has been shown to be quite successful at addressing this issue. However, as the number and needs of the clients grow, it becomes infeasible and inefficient to have just a single Web cache. To address this concern, the Web caching system can be set up in a distributed manner, allowing multiple machines to work together to meet the needs of the clients. Furthermore, additional efficiency could be achieved by balancing the workload across all the Web caches in the system. This thesis investigates the benefits of load balancing in a distributed Web caching environment in order to improve response times and help reduce bandwidth.
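As a concrete illustration of load sharing among cooperating caches, one simple policy routes each request to its "home" cache by URL hash (preserving locality) but diverts to the least-loaded cache when the home cache is over a load threshold. This sketch is illustrative only; the thesis evaluates adaptive load-sharing policies by simulation, and the threshold rule here is a hypothetical example.

```python
import zlib

def pick_cache(url, loads, threshold):
    """Choose a cache index for a request.
    loads: current load of each cache; threshold: diversion trigger."""
    # Hash-based "home" cache keeps the same URL on the same cache,
    # which maximizes the chance of a cache hit.
    home = zlib.crc32(url.encode()) % len(loads)
    if loads[home] <= threshold:
        return home
    # Home cache is overloaded: divert to the least-loaded cache.
    return min(range(len(loads)), key=lambda i: loads[i])
```

The tension this policy illustrates is central to distributed caching: hashing alone balances the URL space but not the request load, while pure least-loaded dispatch balances load but scatters each URL across caches and lowers hit rates.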
Author Keywords: adaptive load sharing, Distributed systems, Load Balancing, Simulation, Web Caching
ADAPT: An Automated Decision Support Tool For Adaptation To Climate Change-Driven Floods Predicted From A Multiscale And Multi-Model Framework
This thesis focuses on the design of a modelling framework consisting of the loose coupling of a sequence of spatial and process models and procedures necessary to predict future flood events for the years 2030 and 2050 in Tabasco, Mexico. Temperature and precipitation data for those future years from the Hadley Centre Coupled Model (HadCM3) were downscaled using the Statistical Downscaling Model (SDSM 4.2.9). These data were then used along with a variety of digital spatial data and models (current land use, soil characteristics, surface elevation and rivers) to parameterize the Soil and Water Assessment Tool (SWAT) model and predict flows. Flow data were then input into the Hydrologic Engineering Center's River Analysis System (HEC-RAS) model, which mapped the areas expected to be flooded based on the predicted flow values. This modelling sequence generates images of flood extents, which are then ported to an online tool (ADAPT) for display. The results of this thesis indicate that under current predictions of climate change, the city of Villahermosa, Tabasco, Mexico, and the surrounding area will experience a substantial amount of flooding. Therefore, there is a need for adaptation planning to begin immediately.
Author Keywords: Adaptation Planning, Climate Change, Extreme Weather Events, Flood Planning, Simulation Modelling
Historic Magnetogram Digitization
The conversion of historical analog images to time series data was performed using deconvolution for pre-processing, followed by custom-built digitization algorithms. These algorithms were developed to be user-friendly, with the objective of aiding in the creation of a data set from decades of mechanical observations collected at the Agincourt and Toronto geomagnetic observatories beginning in the 1840s. The algorithms follow a structure which begins with pre-processing, followed by tracing and pattern detection. Each digitized magnetogram was then visually inspected and the algorithm's performance verified, both to ensure accuracy and to allow the data to later be connected into a single long-running time series.
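The core of the tracing step can be illustrated with a deliberately simplified sketch: after pre-processing, follow the dark ink trace across the scanned image by taking the darkest pixel in each column. The thesis's algorithms are far more involved (handling crossings, gaps and pattern detection); this only shows the basic image-to-series conversion.

```python
def trace_curve(image):
    """Extract a curve from a scanned trace.
    image: list of rows of grayscale values (0 = black ink, 255 = paper).
    Returns, for each column, the row index of the darkest pixel --
    i.e. one sample of the digitized time series per column."""
    n_rows, n_cols = len(image), len(image[0])
    return [min(range(n_rows), key=lambda r: image[r][c]) for c in range(n_cols)]
```

Each column of the scan corresponds to a moment in time, so the list of row indices (after calibration against the baseline and scale marks) becomes the recovered geomagnetic time series.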
Author Keywords: Magnetograms
Augmented Reality Sandbox (Aeolian Box): A Teaching and Presentation Tool for Atmospheric Boundary Layer Airflows over a Deformable Surface
The AeolianBox is an educational and presentation tool extended in this thesis to represent atmospheric boundary layer (ABL) flow over a deformable surface in the sandbox. It is a hybrid hardware and mathematical model which helps users to visually, interactively and spatially grasp the natural laws governing ABL airflow. The AeolianBox uses a Kinect V1 camera and a short-focal-length projector to capture the Digital Elevation Model (DEM) of the topography within the sandbox. The captured DEM is used to generate a Computational Fluid Dynamics (CFD) model and project the ABL flow back onto the surface topography within the sandbox.
The AeolianBox is designed to be used in a classroom setting. This requires a low time cost for the ABL flow simulation in order to keep students engaged. Thus, the processes of DEM capture and CFD modelling were investigated to lower the time cost while maintaining key features of the ABL flow structure. A mesh-time sensitivity analysis was also conducted to investigate the tradeoff between the number of cells in the mesh and the time cost of both the meshing process and CFD modelling. This allows the user to make an informed decision regarding the level of detail desired in the ABL flow structure by changing the number of cells in the mesh.
There are infinitely many surface topographies which can be created by molding sand inside the sandbox. Therefore, in addition to keeping the time cost low while maintaining key features of the ABL flow structure, the meshing process and CFD modelling are required to be robust to a variety of surface topographies. To achieve these research objectives, this thesis parameterizes both the meshing process and the CFD modelling.
The accuracy of the CFD model for ABL flow used in the AeolianBox was qualitatively validated against airflow profiles captured in the Trent Environmental Wind Tunnel (TEWT) at Trent University using a Laser Doppler Anemometer (LDA). Three simple geometries, namely a hemisphere, a cube and a ridge, were selected since they are well studied in the literature. The CFD model was scaled to the dimensions of the grid where the airflow was captured in TEWT, and the boundary conditions were kept the same as in the model used in the AeolianBox.
The ABL flow is simulated using OpenFoam and Paraview to build and visualize the CFD model. The AeolianBox is interactive and capable of detecting hands using the Kinect camera, which allows a user to change the topography of the sandbox in real time. The AeolianBox software built for this thesis uses only open-source tools and is accessible to anyone with an existing hardware model of its predecessors.
Author Keywords: Augmented Reality, Computational Fluid Dynamics, Kinect Projector Calibration, OpenFoam, Paraview
Predicting Irregularities in Arrival Times for Toronto Transit Buses with LSTM Recurrent Neural Networks Using Vehicle Locations and Weather Data
Public transportation systems play an important role in the quality of life of citizens in any metropolitan city. However, public transportation authorities face criticism from commuters due to irregularities in bus arrival times. For example, transit bus users often complain when they miss the bus because it arrived too early or too late at the bus stop. Due to these irregularities, commuters may miss important appointments, wait too long at the bus stop, or arrive late for work. This thesis seeks to predict the occurrence of irregularities in bus arrival times by developing machine learning models that use GPS locations of transit buses provided by the Toronto Transit Commission (TTC) and hourly weather data. We found that nearly 37% of the time, buses arrive either early or late by more than 5 minutes, suggesting room for improvement in the current strategies employed by transit authorities. We compared the performance of three machine learning models, of which our Long Short-Term Memory (LSTM) [13] model outperformed the others in terms of accuracy. The error rate of the LSTM model was lower than that of both the Artificial Neural Network (ANN) and support vector regression (SVR) models. The improved accuracy achieved by the LSTM is due to its ability to adjust and update the weights of neurons while maintaining long-term dependencies when encountering new streams of data.
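The irregularity statistic quoted above implies a labelling step: an arrival is irregular when it deviates from the schedule by more than 5 minutes in either direction. The sketch below shows only that labelling rule; the thesis's actual feature pipeline (GPS traces plus hourly weather) is much richer, and the function names here are hypothetical.

```python
def label_arrival(scheduled_min, actual_min, tolerance_min=5.0):
    """Label one arrival relative to its scheduled time (both in minutes)."""
    delta = actual_min - scheduled_min
    if delta < -tolerance_min:
        return "early"
    if delta > tolerance_min:
        return "late"
    return "on-time"

def irregular_fraction(pairs, tolerance_min=5.0):
    """Fraction of (scheduled, actual) pairs that are early or late."""
    labels = [label_arrival(s, a, tolerance_min) for s, a in pairs]
    return sum(lbl != "on-time" for lbl in labels) / len(labels)
```

Applied to a full season of TTC vehicle locations, a computation of this shape is what yields the roughly 37% irregularity figure reported in the abstract.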
Author Keywords: ANN, LSTM, Machine Learning
Support Vector Machines for Automated Galaxy Classification
The Support Vector Machine (SVM) is a deterministic, supervised machine learning algorithm that has been successfully applied to many areas of research. It is heavily grounded in mathematical theory and is effective at processing high-dimensional data. This thesis models a variety of galaxy classification tasks using SVMs and data from the Galaxy Zoo 2 project. SVM parameters were tuned in parallel using resources from Compute Canada, and a total of four experiments were completed to determine whether invariance training and ensembles can be utilized to improve classification performance. It was found that SVMs performed well at many of the galaxy classification tasks examined, and that the additional techniques explored did not provide a considerable improvement.
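Tuning SVM parameters in parallel typically means evaluating a grid of candidate settings (for an RBF kernel, the regularization constant C and kernel width gamma) concurrently and keeping the best. The sketch below shows only that search structure; the scoring function is a hypothetical stand-in for cross-validated accuracy on the Galaxy Zoo 2 task, and the worker setup on Compute Canada clusters would differ.

```python
from concurrent.futures import ThreadPoolExecutor
import itertools

def grid_search(score, Cs, gammas, workers=4):
    """Evaluate score(C, gamma) over the full grid in parallel;
    return the best (C, gamma) pair and its score."""
    grid = list(itertools.product(Cs, gammas))
    with ThreadPoolExecutor(max_workers=workers) as ex:
        scores = list(ex.map(lambda p: score(*p), grid))
    best = max(range(len(grid)), key=lambda i: scores[i])
    return grid[best], scores[best]
```

Because each grid point is evaluated independently, the search parallelizes trivially, which is what makes cluster resources so effective for this step.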
Author Keywords: Compute Canada, Kernel, SDSS, SHARCNET, Support Vector Machine, SVM
Fraud Detection in Financial Businesses Using Data Mining Approaches
The purpose of this research is to apply four methods to two data sets, a synthetic dataset and a real-world dataset, and compare the results with the intention of arriving at methods to prevent fraud. The methods used are Logistic Regression, Isolation Forest, an Ensemble Method, and Generative Adversarial Networks (GANs). Results show that all four models achieve accuracies between 91% and 99%, except Isolation Forest, which gave 69% accuracy on the synthetic dataset.
The four models detect fraud well when built on a training set and tested with a test set. Logistic Regression achieves good results with less computational effort. Isolation Forest achieves lower accuracy when the data is sparse and not preprocessed correctly. Ensemble models achieve the highest accuracy for both datasets. The GAN achieves good results but overfits if a large number of epochs is used. Future work could incorporate other classifiers.
Author Keywords: Ensemble Method, GAN, Isolation forest, Logistic Regression, Outliers