Applied Modeling and Quantitative Methods
Predicting Irregularities in Arrival Times for Toronto Transit Buses with LSTM Recurrent Neural Networks Using Vehicle Locations and Weather Data
Public transportation systems play an important role in the quality of life of citizens
in any metropolitan city. However, public transportation authorities face
criticism from commuters due to irregularities in bus arrival times. For example,
transit bus users often complain when they miss the bus because it arrived too
early or too late at the bus stop. Due to these irregularities, commuters may miss
important appointments, wait too long at the bus stop, or arrive late for work.
This thesis seeks to predict the occurrence of irregularities in bus arrival times by
developing machine learning models that use GPS locations of transit buses provided
by the Toronto Transit Commission (TTC) and hourly weather data. We
found that nearly 37% of the time, buses arrive either early or late by more than
5 minutes, suggesting room for improvement in the current strategies employed by
transit authorities. We compared the performance of three machine learning models,
of which our Long Short-Term Memory (LSTM) [13] model achieved the highest
accuracy: its error rate was lower than those of the Artificial Neural Network (ANN)
and support vector regression (SVR) models. The improved accuracy achieved by the
LSTM is due to its ability to adjust and update the weights of its neurons while
maintaining long-term dependencies when encountering new streams of data.
Author Keywords: ANN, LSTM, Machine Learning
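The gated recurrence that lets an LSTM maintain long-term dependencies can be sketched in a few lines of numpy. This is an illustrative single cell with made-up per-stop features (scheduled headway, temperature, precipitation), not the thesis's trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """One step of the LSTM gated recurrence (forward pass only)."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        shape = (n_hidden, n_in + n_hidden)
        # One weight matrix and bias per gate: input, forget, output, candidate
        self.Wi, self.Wf, self.Wo, self.Wg = (rng.normal(0, 0.1, shape) for _ in range(4))
        self.bi, self.bo, self.bg = (np.zeros(n_hidden) for _ in range(3))
        self.bf = np.ones(n_hidden)  # positive forget bias keeps early memory intact

    def step(self, x, h, c):
        z = np.concatenate([x, h])
        i = sigmoid(self.Wi @ z + self.bi)  # how much new information to write
        f = sigmoid(self.Wf @ z + self.bf)  # how much old cell state to keep
        o = sigmoid(self.Wo @ z + self.bo)  # how much of the cell state to expose
        g = np.tanh(self.Wg @ z + self.bg)  # candidate values
        c = f * c + i * g                   # cell state carries the long-term memory
        h = o * np.tanh(c)                  # hidden state feeds the next step
        return h, c

# Toy sequence of per-stop features (hypothetical: headway, temperature, precipitation)
cell = LSTMCell(n_in=3, n_hidden=8)
h = c = np.zeros(8)
for x in np.random.default_rng(1).normal(size=(10, 3)):
    h, c = cell.step(x, h, c)
print(h.shape)  # final hidden state; a regression head would map it to a delay in minutes
```

The multiplicative forget gate is what allows the cell state to carry information across many time steps without vanishing, which is the property the abstract credits for the LSTM's lower error rate.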
Support Vector Machines for Automated Galaxy Classification
The Support Vector Machine (SVM) is a deterministic, supervised machine learning algorithm that has been successfully applied to many areas of research. SVMs are heavily grounded in mathematical theory and are effective at processing high-dimensional data. This thesis models a variety of galaxy classification tasks using SVMs and data from the Galaxy Zoo 2 project. SVM parameters were tuned in parallel using resources from Compute Canada, and a total of four experiments were completed to determine whether invariance training and ensembles can be utilized to improve classification performance. It was found that SVMs performed well at many of the galaxy classification tasks examined, and that the additional techniques explored did not provide a considerable improvement.
Author Keywords: Compute Canada, Kernel, SDSS, SHARCNET, Support Vector Machine, SVM
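The parameter-tuning workflow described above can be sketched with scikit-learn; the synthetic features, labels, and parameter grid below are placeholders, not the Galaxy Zoo 2 setup:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for galaxy morphology features (the real data is higher-dimensional)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5).astype(int)  # toy "class A vs. class B" label

# Tune C and the RBF kernel width over a grid; n_jobs=-1 runs folds in parallel,
# loosely mirroring the cluster-based parallel tuning described in the abstract
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=3,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

On a real cluster the grid would be far larger and each (C, gamma) cell could be dispatched as a separate job.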
Fraud Detection in Financial Businesses Using Data Mining Approaches
The purpose of this research is to apply four methods to two data sets, a Synthetic
dataset and a Real-World dataset, and compare the results to each other with the
intention of arriving at methods to prevent fraud. Methods used include Logistic Regression,
Isolation Forest, Ensemble Method and Generative Adversarial Networks.
Results show that all four models achieve accuracies between 91% and 99%, except
for Isolation Forest, which gave 69% accuracy on the Synthetic dataset.
The four models detect fraud well when built on a training set and tested with
a test set. Logistic Regression achieves good results with less computational effort.
Isolation Forest achieves lower accuracy when the data are sparse and not preprocessed
correctly. Ensemble Models achieve the highest accuracy for both datasets.
GAN achieves good results but overfits when trained for too many epochs. Future
work could incorporate other classifiers.
Author Keywords: Ensemble Method, GAN, Isolation forest, Logistic Regression, Outliers
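An Isolation Forest run of the kind compared above can be sketched as follows; the toy transaction data and the 2% contamination rate are assumptions for illustration only:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy transactions: most are normal, a few have unusually large values (the "fraud")
rng = np.random.default_rng(42)
normal = rng.normal(loc=50, scale=10, size=(980, 2))
fraud = rng.normal(loc=150, scale=5, size=(20, 2))
X = np.vstack([normal, fraud])

# contamination is the expected fraud rate; it must be set (or estimated) up front,
# which is one reason the method is sensitive to preprocessing choices
iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = iso.fit_predict(X)          # -1 = anomaly, 1 = normal
print((labels[-20:] == -1).mean())   # fraction of the injected frauds flagged
```

Isolation Forest needs no labels at fit time, unlike the Logistic Regression, Ensemble, and GAN approaches, which partly explains its different behaviour across the two datasets.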
Solving Differential and Integro-Differential Boundary Value Problems using a Numerical Sinc-Collocation Method Based on Derivative Interpolation
In this thesis, a new sinc-collocation method based upon derivative interpolation is developed for solving linear and nonlinear boundary value problems (BVPs) involving differential as well as integro-differential equations. The sinc-collocation method is chosen for its ease of implementation, exponential convergence of error, and ability to handle singularities in the BVP. We present a unique method of treating boundary conditions and introduce the concept of a stretch factor into the conformal mappings of domains. The result is a method that achieves high accuracy while reducing computational cost. In most cases, the results from the method greatly exceed the published results of comparable methods in both accuracy and efficiency. The method is tested on the Blasius problem and the Lane-Emden problem, and is generalised to cover Fredholm-Volterra integro-differential problems. The results show that the sinc-collocation method with derivative interpolation is a viable and preferable method for solving nonlinear BVPs.
Author Keywords: Blasius, Boundary Value Problem, Exponential convergence, Integro-differential, Nonlinear, Sinc
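The sinc (Whittaker cardinal) expansion underlying collocation methods of this kind can be illustrated numerically; the test function, mesh size h, and truncation levels below are illustrative choices, not those used in the thesis:

```python
import numpy as np

def sinc_interp(f, h, N, x):
    """Truncated Whittaker cardinal (sinc) expansion:
    C(f, h)(x) = sum_{k=-N}^{N} f(k*h) * sinc((x - k*h) / h)."""
    # np.sinc(t) = sin(pi*t) / (pi*t), which is exactly the sinc basis function
    return sum(f(k * h) * np.sinc((x - k * h) / h) for k in range(-N, N + 1))

f = lambda x: np.exp(-x ** 2)   # smooth, rapidly decaying test function
x = np.linspace(-1, 1, 201)

errs = []
for N in (4, 8, 16):
    h = np.pi / np.sqrt(N)      # a common h ~ 1/sqrt(N) mesh choice in sinc methods
    errs.append(float(np.max(np.abs(sinc_interp(f, h, N, x) - f(x)))))
    print(N, errs[-1])          # error shrinks rapidly as N grows
```

The fast decay of the error with N is the "exponential convergence" the abstract cites as a reason for choosing sinc methods; a collocation solver applies the same basis to the unknown solution of the BVP rather than to a known function.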
Automated Grading of UML Class Diagrams
Learning how to model the structural properties of a problem domain or an object-oriented design in the form of a class diagram is an essential learning task in many software engineering courses. Since grading UML assignments is a cumbersome and time-consuming task, there is a need for an automated grading approach that can assist instructors by speeding up the grading process, as well as ensuring consistency and fairness for large classrooms. This thesis presents an approach for automated grading of UML class diagrams. A metamodel is proposed to establish mappings between the instructor's solution and all the student solutions for a class, which allows the instructor to easily adjust the grading scheme. The approach employs a grading algorithm that uses syntactic, semantic and structural matching to match a student's solution with the instructor's solution. The efficiency of this automated grading approach has been empirically evaluated in two real-world settings: a beginner undergraduate class of 103 students required to create an object-oriented design model, and an advanced undergraduate class of 89 students elaborating a domain model. The experimental results show that the grading approach should be configurable so that it can adapt its grading strategy and strictness to the level of the students and the grading styles of different instructors. It is also important to consider multiple solution variants in the grading process. The grading algorithm and tool are proposed and validated experimentally.
Author Keywords: automated grading, class diagrams, model comparison
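The syntactic-matching stage of such an approach can be sketched with a simple string-similarity matcher. The class names and the 0.6 threshold below are hypothetical, and the thesis's algorithm additionally uses semantic and structural matching:

```python
from difflib import SequenceMatcher

def best_matches(student_classes, instructor_classes, threshold=0.6):
    """Greedy syntactic matching of class names by string similarity.
    Semantic matching (synonyms) and structural matching (associations,
    attributes) would refine these candidate pairs."""
    matches = {}
    for s in student_classes:
        scored = [(SequenceMatcher(None, s.lower(), i.lower()).ratio(), i)
                  for i in instructor_classes]
        score, best = max(scored)
        if score >= threshold:       # below threshold: leave unmatched for review
            matches[s] = best
    return matches

student = ["Costumer", "OrderItem", "Invoice"]   # note the misspelling
instructor = ["Customer", "Order", "Invoice", "Payment"]
print(best_matches(student, instructor))
```

Tolerating misspellings like "Costumer" is exactly why purely exact name matching is insufficient for grading student diagrams.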
Problem Solving as a Path to Understanding Mathematics Representations: An Eye-Tracking Study
Little is actually known about how people cognitively process and integrate information when solving complex mathematical problems. In this thesis, eye-tracking was used to examine how people read and integrate information from mathematical symbols and complex formulae, with eye fixations being used as a measure of the current focus of attention. Each participant in the studies was presented with a series of stimuli in the form of mathematical problems, and their eyes were tracked as they worked through each problem mentally. From these examinations, we were able to demonstrate differences in both comprehension and problem-solving, with the results suggesting that what information is selected, and how, is responsible for a large portion of success in solving such problems. We were also able to examine how different mathematical representations of the same mathematical object are attended to by students.
Author Keywords: eye-tracking, mathematical notation, mathematical representations, problem identification, problem-solving, symbolism
A Framework for Testing Time Series Interpolators
The spectrum of a given time series is a characteristic function describing its frequency properties. Spectrum estimation methods require time series data to be contiguous in order for robust estimators to retain their performance. This poses a fundamental challenge, especially for real-world scientific data, which is often plagued by missing values and/or irregularly recorded measurements. One area of research devoted to this problem seeks to repair the original time series through interpolation. Several algorithms have proven successful at interpolating considerably large gaps of missing data, but most are only valid for stationary time series: processes whose statistical properties are time-invariant, which is not a common property of real-world data. The Hybrid Wiener interpolator is a method that was designed for repairing nonstationary data, rendering it suitable for spectrum estimation. This thesis presents a computational framework designed for conducting systematic testing of the statistical performance of this method in light of changes to gap structure and departures from the stationarity assumption. A comprehensive audit of the Hybrid Wiener interpolator against other state-of-the-art algorithms is also presented.
Author Keywords: applied statistics, hybrid wiener interpolator, imputation, interpolation, R statistical software, time series
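A minimal version of such a testing framework, with a linear interpolator standing in as a placeholder for the Hybrid Wiener interpolator, might look like the sketch below; the series, gap structure, and error metric are illustrative assumptions:

```python
import numpy as np

def impose_gaps(x, n_gaps, gap_len, rng):
    """Return a copy of x with n_gaps runs of NaN of length gap_len (the gap structure)."""
    y = x.copy()
    for _ in range(n_gaps):
        start = rng.integers(0, len(x) - gap_len)
        y[start:start + gap_len] = np.nan
    return y

def interpolate_linear(y):
    """Baseline interpolator; any candidate (e.g. a Hybrid Wiener-style
    interpolator) with the same signature could be slotted in here."""
    idx = np.arange(len(y))
    ok = ~np.isnan(y)
    return np.interp(idx, idx[ok], y[ok])

def evaluate(interpolator, x, n_trials=50, n_gaps=5, gap_len=10, seed=0):
    """Mean squared error on the gap positions, over randomized gap placements."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_trials):
        y = impose_gaps(x, n_gaps, gap_len, rng)
        mask = np.isnan(y)
        errs.append(np.mean((interpolator(y)[mask] - x[mask]) ** 2))
    return float(np.mean(errs))

t = np.linspace(0, 10, 500)
x = np.sin(2 * np.pi * 0.5 * t)   # a stationary toy series; the thesis also varies this
print(evaluate(interpolate_linear, x))
```

Varying `n_gaps`, `gap_len`, and the generating process of `x` is how the framework probes sensitivity to gap structure and to departures from stationarity.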
The Relationship Between Precarious Employment, Behaviour Addictions and Substance Use Among Canadian Young Adults: Insights From The Quinte Longitudinal Survey
This thesis utilized a unique data-set, the Quinte Longitudinal Survey, to explore relationships between precarious employment and a range of mental health problems in a representative sample of Ontario young adults. Study 1 focused on various behavioural addictions (such as problem gambling, video gaming, internet use, exercise, compulsive shopping, and sex) and precarious employment. The results showed that precariously employed men were preoccupied with gambling and sex while their female counterparts preferred shopping. Gambling and excessive shopping diminished over time while excessive sexual practices increased. Study 2 focused on the association between precarious employment and substance abuse (such as tobacco, alcohol, cannabis, hallucinogens, stimulants, and other substances). The results showed that men used cannabis more than women, and that the non-precariously employed group abused alcohol more than individuals in the precarious group. This research has implications for both health care professionals and intervention program developers when working with young adults in precarious jobs.
Author Keywords: Behaviour Addictions, Precarious Employment, Substance Abuse, Young Adults
Representation Learning with Restorative Autoencoders for Transfer Learning
Deep Neural Networks (DNNs) have reached human-level performance in numerous tasks in the domain of computer vision. DNNs are efficient for both classification and the more complex task of image segmentation. These networks are typically trained on thousands of images, which are often hand-labelled by domain experts. This bottleneck creates a promising research area: training accurate segmentation networks with fewer labelled samples.
This thesis explores effective methods for learning deep representations from unlabelled images. We train a Restorative Autoencoder Network (RAN) to denoise synthetically corrupted images. The weights of the RAN are then fine-tuned on a labelled dataset from the same domain for image segmentation.
We use three different segmentation datasets to evaluate our methods. In our experiments, we demonstrate that through our methods, only a fraction of data is required to achieve the same accuracy as a network trained with a large labelled dataset.
Author Keywords: deep learning, image segmentation, representation learning, transfer learning
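The synthetic corruption step that produces training pairs for restorative pretraining might look like the sketch below. Additive Gaussian noise plus random pixel masking are assumed corruption types, and the training of the denoising network itself is omitted:

```python
import numpy as np

def corrupt(images, noise_std=0.1, mask_frac=0.25, rng=None):
    """Synthetically corrupt a batch of images in [0, 1]: additive Gaussian
    noise plus randomly zeroed pixels. A restorative autoencoder would be
    trained to map corrupt(x) back to x, requiring no labels at all."""
    rng = rng or np.random.default_rng(0)
    noisy = images + rng.normal(0.0, noise_std, images.shape)
    mask = rng.random(images.shape) > mask_frac   # drop roughly mask_frac of pixels
    return np.clip(noisy * mask, 0.0, 1.0)

# Toy unlabelled batch: 8 grayscale 32x32 images
clean = np.random.default_rng(1).random((8, 32, 32))
corrupted = corrupt(clean)
print(corrupted.shape, float((corrupted == 0).mean()))
```

Because the clean image is its own target, every unlabelled image yields a free (input, target) training pair; the pretrained encoder weights are then fine-tuned on the small labelled segmentation set.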
Combinatorial Collisions in Database Matching: With Examples from DNA
Databases containing information such as location points, web searches and financial transactions are becoming the new normal as technology advances. Consequently, searches and cross-referencing in big data are becoming a common problem as computing and statistical analysis increasingly allow the contents of such databases to be analyzed and dredged for data. Searches through big data are frequently done without a hypothesis formulated beforehand, and as these databases grow and become more complex, the room for error also increases. Regardless of how these searches are framed, the data they collect may lead to false convictions. DNA databases may be of particular interest, since DNA is often viewed as significant evidence; however, such evidence is sometimes not interpreted in a proper manner in the courtroom. In this thesis, we present and validate a framework for investigating various collisions within databases using Monte Carlo simulations, with examples from DNA. We also discuss how DNA evidence may be wrongly portrayed in the courtroom, and the explanation behind this. We then outline the problems which may occur when numerous types of databases are searched for suspects, and a framework to address these problems.
Author Keywords: big data analysis, collisions, database searches, DNA databases, monte carlo simulation
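A collision ("birthday problem") simulation of the kind such a framework performs can be sketched as follows; the database size and per-profile match space are illustrative, not values from the thesis:

```python
import numpy as np

def collision_rate(n_profiles, n_outcomes, n_trials=2000, seed=0):
    """Monte Carlo estimate of the chance that a database of n_profiles
    random profiles (each uniform over n_outcomes possibilities) contains
    at least one coincidental match between two distinct entries."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        profiles = rng.integers(0, n_outcomes, size=n_profiles)
        hits += len(np.unique(profiles)) < n_profiles  # any two entries identical?
    return hits / n_trials

# Even a 1-in-10,000 profile space collides often once the database is large;
# the birthday approximation 1 - exp(-n(n-1)/(2N)) gives about 0.86 here
print(collision_rate(200, 10_000))
```

This is the core of the fallacy the abstract describes: the probability that *some* pair in a large database matches by chance is far higher than the probability that two *specific* profiles match, yet the two figures are easily conflated in the courtroom.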