Abstract:
Advances in communication, computation, and storage have led to the creation of massive amounts of data. The ability to collect, organize, and analyze these data could lead to breakthroughs in business, science, and society. Cloud computing, as a new computing paradigm, focuses on Internet services, and Internet service providers have a growing need to store and analyze massive data sets. To perform Web-scale analysis in a cost-effective manner, several Internet companies have recently developed distributed programming systems that run on large-scale clusters of shared-nothing commodity servers, which we refer to as cloud platforms. Designing a programming model and system that enables developers to easily write reliable programs that efficiently utilize cluster-wide resources and achieve a high degree of parallelism on such platforms is a major challenge, and many difficult and exciting research problems arise when scaling systems and computations to terabyte-scale datasets. This paper reviews recent advances in programming models for massive data processing in this context. First, the unique characteristics of data-intensive computing are presented, and the fundamental issues in designing programming models for massive data processing are identified. Second, several state-of-the-art programming systems for data-intensive computing are described in detail. Third, the strengths and weaknesses of the classic programming models are compared and discussed. Finally, open issues and directions for future work in this field are explored.