Flexible Data Reader on Distributed File Systems for Training Deep Learning Algorithms

With the fast-growing size of machine learning datasets, it has become increasingly important to store them in a reliable and distributed manner. Large-scale distributed file systems such as GFS, HDFS, and Amazon S3 can store large volumes of data reliably. However, these distributed file systems share an intrinsic shortcoming: they provide good read/write performance guarantees only for large files, and therefore cannot efficiently handle frequent read/write operations on a large number of small files. In machine learning training protocols, the ability to shuffle data points within a dataset is crucial to avoid local minima and overfitting, which requires that data points be accessed in a random order, preferably efficiently. The main focus of this project is to find a way to store machine learning datasets on distributed file systems while maintaining competitive random-read performance for shuffling data points. TO BE CONT’D
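One common way to reconcile these two requirements, sketched below under assumptions not stated in the posting, is to pack many small samples into a few large chunk files (friendly to HDFS/S3 block sizes) and then approximate a global shuffle in two levels: randomize the order in which chunks are read, and pass records through an in-memory shuffle buffer so that samples mix across chunk boundaries. All function names (`write_chunks`, `read_chunk`, `shuffled_reader`) are hypothetical illustrations, not part of the project.

```python
import os
import random
import struct
import tempfile

# Hypothetical sketch: length-prefixed records packed into large chunk
# files, read back with a two-level shuffle (chunk order + buffer).

def write_chunks(records, chunk_size, out_dir):
    """Pack byte records into chunk files of `chunk_size` records each."""
    paths = []
    for i in range(0, len(records), chunk_size):
        path = os.path.join(out_dir, f"chunk-{i // chunk_size:05d}.bin")
        with open(path, "wb") as f:
            for rec in records[i:i + chunk_size]:
                f.write(struct.pack("<I", len(rec)))  # 4-byte length prefix
                f.write(rec)
        paths.append(path)
    return paths

def read_chunk(path):
    """Yield records back from one chunk file, in order."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (n,) = struct.unpack("<I", header)
            yield f.read(n)

def shuffled_reader(chunk_paths, buffer_size, rng):
    """Level 1: shuffle chunk order. Level 2: in-memory shuffle buffer."""
    order = list(chunk_paths)
    rng.shuffle(order)
    buf = []
    for path in order:
        for rec in read_chunk(path):
            buf.append(rec)
            if len(buf) >= buffer_size:
                # Emit a random element, keeping the buffer bounded.
                yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)
    yield from buf  # drain the remaining buffered records

rng = random.Random(0)
records = [f"sample-{i}".encode() for i in range(100)]
with tempfile.TemporaryDirectory() as d:
    chunks = write_chunks(records, chunk_size=10, out_dir=d)
    out = list(shuffled_reader(chunks, buffer_size=32, rng=rng))
```

With this layout the file system only ever sees sequential reads of a handful of large files, while the shuffle quality depends on the buffer size relative to the chunk size, which is the trade-off such a reader would need to tune.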

Faculty Supervisor: Yashar Ganjali


Hongbo Fan

Uber Advanced Technologies Group

Computer science

Information and communications technologies