Efficient feature representations for images are essential to many computer vision tasks. In this paper, we propose a hierarchical feature representation that combines image saliency with deep learning, building a feature hierarchy layer by layer. Each feature learning layer consists of three parts: sparse coding, saliency max pooling, and contrast normalization. To speed up the sparse coding process, we propose batch orthogonal matching pursuit, which, unlike the traditional per-signal method, shares precomputed quantities across all signals. Introducing salient information into the image sparse representation compresses the feature representation and strengthens its semantic content, while contrast normalization effectively reduces the impact of local variations in illumination and foreground-background contrast, enhancing the robustness of the representation. Instead of using hand-crafted descriptors, our model learns an effective image representation directly from images in an unsupervised, data-driven manner. Final image classification is performed by a linear SVM classifier on the learned image representation. We compare our method with many state-of-the-art algorithms, including convolutional deep belief networks, SIFT-based single-layer and multi-layer sparse coding methods, and several kernel-based feature learning approaches. Experimental results on two widely used benchmark datasets, Caltech 101 and Caltech 256, show that our method consistently and significantly improves classification performance.
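The batch orthogonal matching pursuit step mentioned above can be sketched as follows. This is a minimal illustrative implementation under common assumptions (unit-norm dictionary atoms, a fixed sparsity level per signal, sparsity of at least one), not the paper's exact formulation; the names `batch_omp`, `D`, and `X` are hypothetical. The speedup over traditional OMP comes from precomputing the Gram matrix `G = D.T @ D` and the initial correlations once, so per-signal iterations never touch the residual or the raw dictionary again.

```python
import numpy as np

def batch_omp(D, X, n_nonzero):
    """Sparse-code all columns of X over dictionary D.

    D : (d, K) dictionary with unit-norm columns (assumed).
    X : (d, n) batch of signals to encode.
    n_nonzero : target number of nonzero coefficients per signal (>= 1).
    """
    G = D.T @ D          # Gram matrix, computed once and shared by all signals
    alpha0 = D.T @ X     # initial correlations of every atom with every signal
    K, n = G.shape[0], X.shape[1]
    codes = np.zeros((K, n))
    for j in range(n):
        alpha = alpha0[:, j].copy()
        support = []
        for _ in range(n_nonzero):
            k = int(np.argmax(np.abs(alpha)))
            if k in support:          # numerical guard: atom already selected
                break
            support.append(k)
            # Least-squares coefficients on the current support, using only G
            Gs = G[np.ix_(support, support)]
            gamma = np.linalg.solve(Gs, alpha0[support, j])
            # Updated correlations of the residual, again without touching D or X
            alpha = alpha0[:, j] - G[:, support] @ gamma
        codes[support, j] = gamma
    return codes
```

A full Batch-OMP implementation would additionally maintain a progressive Cholesky factorization of `Gs` instead of re-solving from scratch at each iteration; the version above keeps only the core idea that all per-iteration work is expressed through the precomputed Gram matrix.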