Captions in videos often play an important role in video information indexing and retrieval. In this paper, a novel video caption detection approach is presented. This approach first applies a new multiple frames integration (MFI) method to reduce the complexity of the background of the image. A time-based minimum (or maximum) pixel value search is employed and a Sobel edge map is used to determine the mode of search. Then block-based text detection is performed, i. e. a small window is used to scan the image and classified as text or non-text, using Sobel edges as features. A two-level pyramid is applied to detect various text sizes. Finally, the approach presents a new iterative text line decomposition method, and accurate text bounding boxes are extracted from the candidate text areas. Experimental results show that the proposed approach achieves a high precision and recall.