Graduate Theses & Dissertations

Fraud Detection in Financial Businesses Using Data Mining Approaches
The purpose of this research is to apply four methods to two data sets, a Synthetic dataset and a Real-World dataset, and to compare the results with the intention of arriving at methods to prevent fraud. The methods used are Logistic Regression, Isolation Forest, an Ensemble Method and Generative Adversarial Networks (GAN). Results show that all four models achieve accuracies between 91% and 99%, except that Isolation Forest gave 69% accuracy on the Synthetic dataset. The four models detect fraud well when built on a training set and tested with a test set. Logistic Regression achieves good results with less computational effort. Isolation Forest achieves lower accuracy when the data is sparse and not preprocessed correctly. Ensemble Models achieve the highest accuracy for both datasets. GAN achieves good results but overfits if a large number of epochs is used. Future work could incorporate other classifiers. Author Keywords: Ensemble Method, GAN, Isolation forest, Logistic Regression, Outliers
Historic Magnetogram Digitization
Historical analog images were converted to time series data by using deconvolution for pre-processing, followed by custom-built digitization algorithms. These algorithms were developed to be user-friendly, with the objective of aiding in the creation of a data set from decades of mechanical observations collected from the Agincourt and Toronto geomagnetic observatories beginning in the 1840s. The algorithms follow a structure that begins with pre-processing, followed by tracing and pattern detection. Each digitized magnetogram was then visually inspected and the algorithm performance verified, both to ensure accuracy and to allow the data to later be connected into a long-running time series. Author Keywords: Magnetograms
Support Vector Machines for Automated Galaxy Classification
Support Vector Machines (SVMs) are deterministic, supervised machine learning algorithms that have been successfully applied to many areas of research. They are heavily grounded in mathematical theory and are effective at processing high-dimensional data. This thesis models a variety of galaxy classification tasks using SVMs and data from the Galaxy Zoo 2 project. SVM parameters were tuned in parallel using resources from Compute Canada, and a total of four experiments were completed to determine whether invariance training and ensembles can be used to improve classification performance. It was found that SVMs performed well at many of the galaxy classification tasks examined, and that the additional techniques explored did not provide a considerable improvement. Author Keywords: Compute Canada, Kernel, SDSS, SHARCNET, Support Vector Machine, SVM
Predicting Irregularities in Arrival Times for Toronto Transit Buses with LSTM Recurrent Neural Networks Using Vehicle Locations and Weather Data
Public transportation systems play an important role in the quality of life of citizens in any metropolitan city. However, public transportation authorities face criticism from commuters due to irregularities in bus arrival times. For example, transit bus users often complain when they miss the bus because it arrived too early or too late at the bus stop. Due to these irregularities, commuters may miss important appointments, wait too long at the bus stop, or arrive late for work. This thesis seeks to predict the occurrence of irregularities in bus arrival times by developing machine learning models that use GPS locations of transit buses provided by the Toronto Transit Commission (TTC) and hourly weather data. We found that nearly 37% of the time, buses arrive either early or late by more than 5 minutes, suggesting room for improvement in the current strategies employed by transit authorities. We compared the performance of three machine learning models, of which our Long Short-Term Memory (LSTM) [13] model was the most accurate. The error rate of the LSTM model was lower than that of the Artificial Neural Network (ANN) and support vector regression (SVR) models. The improved accuracy achieved by LSTM is due to its ability to adjust and update the weights of neurons while maintaining long-term dependencies when encountering a new stream of data. Author Keywords: ANN, LSTM, Machine Learning
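The 5-minute criterion mentioned in the abstract above can be illustrated with a minimal sketch. The function names, the sample data, and the use of minutes as the unit are all assumptions made for illustration; the thesis's actual labelling procedure and TTC data format are not described in the abstract.

```python
def is_irregular(scheduled_min, actual_min, tolerance_min=5.0):
    """Return True when a bus arrives more than tolerance_min early or late.

    Hypothetical helper: the 5-minute default mirrors the figure quoted in
    the abstract; everything else is an illustrative assumption.
    """
    return abs(actual_min - scheduled_min) > tolerance_min


def irregular_fraction(arrivals, tolerance_min=5.0):
    """Fraction of (scheduled, actual) arrival pairs flagged as irregular."""
    if not arrivals:
        return 0.0
    flagged = sum(is_irregular(s, a, tolerance_min) for s, a in arrivals)
    return flagged / len(arrivals)


# Made-up sample: (scheduled minute, actual minute) for four stops.
sample = [(0.0, 2.0), (10.0, 17.5), (20.0, 13.0), (30.0, 30.0)]
print(irregular_fraction(sample))
```

A binary label of this kind is one plausible prediction target for the models compared in the thesis, although the abstract's reference to error rates and support vector regression suggests the models may instead predict the deviation itself.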
Exploring the Scalability of Deep Learning on GPU Clusters
In recent years, we have observed an unprecedented rise in the popularity of AI-powered systems. They have become ubiquitous in modern life, being used by countless people every day. Many of these AI systems are powered, entirely or partially, by deep learning models. From language translation to image recognition, deep learning models are being used to build systems with unprecedented accuracy. The primary downside is the significant time required to train the models. Fortunately, the time needed for training is reduced through the use of GPUs rather than CPUs. However, with model complexity ever increasing, training times even with GPUs are on the rise. One possible solution is to use parallelization to enable the distributed training of models on GPU clusters. This thesis investigates how to utilise clusters of GPU-accelerated nodes to achieve the best scalability possible, thus minimising model training times. Author Keywords: Compute Canada, Deep Learning, Distributed Computing, Horovod, Parallel Computing, TensorFlow
Cloud Versus Bare Metal
A comparison of two high performance computing clusters running on AWS and Sharcnet was done to determine which scenarios yield the best performance. Algorithm complexity ranged from O(n) to O(n^3). Data sizes ranged from 195 KB to 2 GB. The Sharcnet hardware consisted of Intel E5-2683 and Intel E7-4850 processors with memory sizes ranging from 256 GB to 3072 GB. On AWS, C4.8xlarge instances were used, which run on Intel Xeon E5-2666 processors with 60 GB per instance. AWS was able to launch jobs immediately regardless of job size. The only limiting factors on AWS were algorithm complexity and memory usage, suggesting a memory bottleneck. Sharcnet had the best performance but could be hampered by the job scheduler. In conclusion, Sharcnet is best used when the algorithm is complex and has high memory usage. AWS is best used when immediate processing is required. Author Keywords: AWS, cloud, HPC, parallelism, Sharcnet
Machine Learning Using Topology Signatures For Associative Memory
This thesis presents a technique to produce signatures from topologies generated by the Growing Neural Gas algorithm. The generated signatures have the following characteristics: The signature's memory footprint is smaller than the "real object" and it represents a point in the n x m multidimensional space. Signatures can be compared based on Euclidean distance and distances between signatures provide measurements of differences between models. Signatures can be associated with a concept and then be used as a learning step for a classification algorithm. The signatures are normalized and vectorized to be used in a multidimensional space clustering. Although the technique is generic in essence, it was tested by classifying alphabet and numerical handwritten characters and 2D figures obtaining a good accuracy and precision. It can be used for many other purposes related to shapes and abstract typologies classification and associative memory. Future work could incorporate other classifiers. Author Keywords: Associative memory, Character recognition, Machine learning, Neural gas, Topological signatures, Unsupervised learning
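The comparison of signatures by Euclidean distance described in the abstract above can be sketched as follows. This is an illustration only: the signature values are made up, scaling to unit length is an assumed form of normalization, and `nearest_concept` is a hypothetical helper rather than part of the thesis.

```python
import math


def normalize(signature):
    """Scale a signature vector to unit length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in signature))
    return [x / norm for x in signature] if norm else list(signature)


def euclidean(a, b):
    """Euclidean distance between two equal-length signature vectors."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_concept(query, labelled):
    """Return the label whose stored signature is closest to the query."""
    q = normalize(query)
    return min(labelled, key=lambda label: euclidean(q, normalize(labelled[label])))


# Made-up flattened signatures associated with two concepts.
labelled = {"A": [1.0, 0.0, 0.0, 1.0], "B": [0.0, 1.0, 1.0, 0.0]}
print(nearest_concept([0.9, 0.1, 0.2, 1.1], labelled))
```

Associating each stored signature with a concept and returning the closest one is a simple nearest-neighbour form of the associative lookup the abstract describes; the thesis may use a different classifier on top of the signatures.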
Utilizing Class-Specific Thresholds Discovered by Outlier Detection
We investigated if the performance of selected supervised machine-learning techniques could be improved by combining univariate outlier-detection techniques and machine-learning methods. We developed a framework to discover class-specific thresholds in class probability estimates using univariate outlier detection and proposed two novel techniques to utilize these class-specific thresholds. These proposed techniques were applied to various data sets and the results were evaluated. Our experimental results suggest that some of our techniques may improve recall in the base learner. Additional results suggest that one technique may produce higher accuracy and precision than AdaBoost.M1, while another may produce higher recall. Finally, our results suggest that we can achieve higher accuracy, precision, or recall when AdaBoost.M1 fails to produce higher metric values than the base learner. Author Keywords: AdaBoost, Boosting, Classification, Class-Specific Thresholds, Machine Learning, Outliers
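One way to picture class-specific thresholds derived from univariate outlier detection is the sketch below. It is not the thesis's framework: the abstract does not say which outlier rule is used or how the thresholds are applied, so the Tukey lower fence (Q1 minus 1.5 times the interquartile range), the function name, and the sample probabilities are all assumptions.

```python
import statistics


def class_thresholds(probabilities_by_class, k=1.5):
    """Compute one probability threshold per class.

    probabilities_by_class maps a class label to the probability estimates
    a base learner assigned to training instances of that class. The
    threshold is the Tukey lower fence, clipped to [0, 1]; estimates below
    it would be univariate outliers for that class (assumed rule).
    """
    thresholds = {}
    for label, probs in probabilities_by_class.items():
        q1, _, q3 = statistics.quantiles(probs, n=4)
        fence = q1 - k * (q3 - q1)
        thresholds[label] = min(1.0, max(0.0, fence))
    return thresholds


# Made-up class probability estimates for two classes.
probs = {
    "fraud": [0.55, 0.60, 0.70, 0.75, 0.80, 0.90],
    "legit": [0.90, 0.92, 0.95, 0.96, 0.97, 0.99],
}
thresholds = class_thresholds(probs)
print(thresholds)
```

In this made-up example the class with tightly clustered, high estimates receives a stricter threshold than the class with widely spread estimates, which is the sense in which a single global cut-off can be replaced by per-class ones.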
SPAF-network with Saturating Pretraining Neurons
In this work, various aspects of neural networks pre-trained with denoising autoencoders (DAE) are explored. To saturate neurons more quickly for feature learning in DAE, an activation function that offers higher gradients is introduced. Moreover, the introduction of sparsity functions applied to the hidden layer representations is studied. More importantly, a technique that swaps the activation functions of a fully trained DAE to logistic functions is studied; networks trained using this technique are referred to as SPAF-networks. For evaluation, the popular MNIST dataset as well as all 3 sub-datasets of the Chars74k dataset are used for classification purposes. The SPAF-network is also analyzed for the features it learns with a logistic, a ReLU and a custom activation function. Lastly, a future roadmap is proposed for enhancements to the SPAF-network. Author Keywords: Artificial Neural Network, AutoEncoder, Machine Learning, Neural Networks, SPAF network, Unsupervised Learning
An Investigation of Load Balancing in a Distributed Web Caching System
With the exponential growth of the Internet, performance is an issue as bandwidth is often limited. A scalable solution to reduce the amount of bandwidth required is Web caching. Web caching (especially at the proxy level) has been shown to be quite successful at addressing this issue. However, as the number and needs of the clients grow, it becomes infeasible and inefficient to have just a single Web cache. To address this concern, the Web caching system can be set up in a distributed manner, allowing multiple machines to work together to meet the needs of the clients. It is also possible that further efficiency could be achieved by balancing the workload across all the Web caches in the system. This thesis investigates the benefits of load balancing in a distributed Web caching environment in order to improve response times and help reduce bandwidth. Author Keywords: adaptive load sharing, Distributed systems, Load Balancing, Simulation, Web Caching
An Investigation of the Impact of Big Data on Bioinformatics Software
As the generation of genetic data accelerates, Big Data has an increasing impact on the way bioinformatics software is used. The experiments become larger and more complex than originally envisioned by software designers. One way to deal with this problem is to use parallel computing. Using the program Structure as a case study, we investigate ways in which to counteract the challenges created by the growing datasets. We propose an OpenMP and an OpenMP-MPI hybrid parallelization of the MCMC steps, and analyse the performance in various scenarios. The results indicate that the parallelizations produce significant speedups over the serial version in all scenarios tested. This allows for using the available hardware more efficiently, by adapting the program to the parallel architecture. This is important because not only does it reduce the time required to perform existing analyses, but it also opens the door to new analyses, which were previously impractical. Author Keywords: Big Data, HPC, MCMC, parallelization, speedup, Structure
Self-Organizing Maps and Galaxy Evolution
Artificial Neural Networks (ANN) have been applied to many areas of research. These techniques use a series of object attributes and can be trained to recognize different classes of objects. The Self-Organizing Map (SOM) is an unsupervised machine learning technique which has been shown to be successful in the mapping of high-dimensional data into a 2D representation referred to as a map. These maps are easier to interpret and aid in the classification of data. In this work, the existing algorithms for the SOM have been extended to generate 3D maps. The higher dimensionality of the map provides for more information to be made available to the interpretation of classifications. The effectiveness of the implementation was verified using three separate standard datasets. Results from these investigations supported the expectation that a 3D SOM would result in a more effective classifier. The 3D SOM algorithm was then applied to an analysis of galaxy morphology classifications. It is postulated that the morphology of a galaxy relates directly to how it will evolve over time. In this work, the Spectral Energy Distribution (SED) will be used as a source for galaxy attributes. The SED data was extracted from the NASA Extragalactic Database (NED). The data was grouped into sample sets of matching frequencies and the 3D SOM application was applied as a morphological classifier. It was shown that the SOMs created were effective as an unsupervised machine learning technique to classify galaxies based solely on their SED. Morphological predictions for a number of galaxies were shown to be in agreement with classifications obtained from new observations in NED. Author Keywords: Galaxy Morphology, Multi-wavelength, parallel, Self-Organizing Maps
