Flexible Data Reader on Distributed File Systems for Training Deep Learning Algorithms

With the fast-growing size of machine learning datasets, it has become increasingly important to store them in a reliable and distributed manner. Large-scale distributed file systems such as GFS, HDFS, and Amazon S3 can store large volumes of data reliably. However, these distributed file systems share an intrinsic shortcoming: they provide good read/write performance guarantees only for large files, and therefore cannot efficiently handle frequent read/write operations on a large number of small files. In machine learning training protocols, the ability to shuffle data points within a dataset is crucial to avoid local minima and overfitting, which requires that data points be accessed in a random order, preferably efficiently. The main focus of this project is to find a way to store machine learning datasets on distributed file systems while maintaining competitive random-read performance for shuffling data points. TO BE CONT’D
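One common way to reconcile these two requirements, sketched below under assumptions not stated in the posting, is to pack many small samples into a few large chunk files (friendly to HDFS/S3 block sizes) and then approximate a global shuffle in two levels: randomize the order in which chunks are read, and pass records through an in-memory shuffle buffer so that samples mix across chunk boundaries. All function names (`write_chunks`, `read_chunk`, `shuffled_reader`) are hypothetical illustrations, not part of the project.

```python
import os
import random
import struct
import tempfile

# Hypothetical sketch: length-prefixed records packed into large chunk
# files, read back with a two-level shuffle (chunk order + buffer).

def write_chunks(records, chunk_size, out_dir):
    """Pack byte records into chunk files of `chunk_size` records each."""
    paths = []
    for i in range(0, len(records), chunk_size):
        path = os.path.join(out_dir, f"chunk-{i // chunk_size:05d}.bin")
        with open(path, "wb") as f:
            for rec in records[i:i + chunk_size]:
                f.write(struct.pack("<I", len(rec)))  # 4-byte length prefix
                f.write(rec)
        paths.append(path)
    return paths

def read_chunk(path):
    """Yield records back from one chunk file, in order."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (n,) = struct.unpack("<I", header)
            yield f.read(n)

def shuffled_reader(chunk_paths, buffer_size, rng):
    """Level 1: shuffle chunk order. Level 2: in-memory shuffle buffer."""
    order = list(chunk_paths)
    rng.shuffle(order)
    buf = []
    for path in order:
        for rec in read_chunk(path):
            buf.append(rec)
            if len(buf) >= buffer_size:
                # Emit a random element, keeping the buffer bounded.
                yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)
    yield from buf  # drain the remaining buffered records

rng = random.Random(0)
records = [f"sample-{i}".encode() for i in range(100)]
with tempfile.TemporaryDirectory() as d:
    chunks = write_chunks(records, chunk_size=10, out_dir=d)
    out = list(shuffled_reader(chunks, buffer_size=32, rng=rng))
```

With this layout the file system only ever sees sequential reads of a handful of large files, while the shuffle quality depends on the buffer size relative to the chunk size, which is the trade-off such a reader would need to tune.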

Faculty Supervisor: Yashar Ganjali


Hongbo Fan

Uber Advanced Technologies Group

Computer science

Information and communications technologies