Object detection and pose estimation are distinct tasks in computer vision, yet they are highly complementary in both research methods and practical applications. This paper presents a mixture of hierarchical tree models with three types of nodes, representing the whole object, discriminative parts, and components (i.e., semantic parts), respectively. A key property of the model is that the mid-level discriminative parts capture not only object features but also mutual information among components. The proposed model detects articulated objects and estimates their poses in parallel, which addresses the error propagation problem of previous joint models. To train the model, we use a latent structured SVM in which the discriminative nodes are treated as latent variables. We also introduce a novel learning method that automatically initializes and optimizes the parameters of the discriminative parts. In the experiments, we design two evaluation scenarios (multi-task recognition and single-task recognition) and compare the proposed model with state-of-the-art joint methods on the PASCAL VOC datasets. The results show that the hierarchical model not only achieves a higher recognition rate than other joint models but is also more time-efficient.
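To make the three-level structure concrete, the following is a minimal sketch of how a mixture of tree models could score an object and recover a pose in a single pass. It is an illustration under stated assumptions, not the formulation used in this paper: the names `Node`, `best_placement`, `infer`, and `toy_mixture`, the quadratic deformation penalty, the precomputed appearance scores, and the detection threshold are all hypothetical. Each tree has a root (whole object), mid-level discriminative parts whose positions are maximized over as latent variables, and leaf components whose best positions form the pose estimate.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Pos = Tuple[int, int]


@dataclass
class Node:
    """One tree node: whole object, discriminative part, or component (assumed layout)."""
    name: str
    unary: Dict[Pos, float]            # hypothetical precomputed appearance score per position
    deform_weight: float = 0.0         # assumed quadratic penalty on offset from the parent
    children: List["Node"] = field(default_factory=list)


def best_placement(node: Node, parent_pos: Optional[Pos]) -> Tuple[float, Dict[str, Pos]]:
    """Return the best subtree score and node positions given the parent's position."""
    best_score, best_pose = float("-inf"), {}
    for pos, appearance in node.unary.items():
        score = appearance
        if parent_pos is not None:
            dx, dy = pos[0] - parent_pos[0], pos[1] - parent_pos[1]
            score -= node.deform_weight * (dx * dx + dy * dy)
        pose = {node.name: pos}
        # Children are independent given this node's position (tree structure).
        for child in node.children:
            child_score, child_pose = best_placement(child, pos)
            score += child_score
            pose.update(child_pose)
        if score > best_score:
            best_score, best_pose = score, pose
    return best_score, best_pose


def infer(mixture: List[Node], threshold: float) -> Tuple[bool, float, Dict[str, Pos]]:
    """Pick the best-scoring tree; its score gives detection, its positions give pose."""
    score, pose = max((best_placement(root, None) for root in mixture), key=lambda sp: sp[0])
    return score >= threshold, score, pose


def toy_mixture() -> List[Node]:
    """Two-tree toy mixture with made-up scores, for illustration only."""
    head = Node("head", {(1, 0): 1.0, (3, 0): 1.5}, deform_weight=0.1)
    part = Node("disc_part", {(0, 0): 0.5, (1, 0): 2.0}, deform_weight=1.0, children=[head])
    frontal = Node("object_frontal", {(0, 0): 1.0}, children=[part])
    side = Node("object_side", {(0, 0): 0.2})
    return [frontal, side]


if __name__ == "__main__":
    detected, score, pose = infer(toy_mixture(), threshold=2.0)
    print(detected, round(score, 3), pose)
```

Because the detection score and the component positions come from the same maximization, neither task consumes the other's output, which is one way to read the claim that joint inference avoids error propagation. A practical implementation would replace the exhaustive search with dynamic programming over the tree.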