As technology scaling increases wire delay, distributed and tiled processor architectures become increasingly attractive. In a conventional stream processor, the control signals dispatched by the stream controller face growing wire delay. Each cluster consists of a variety of functional units, and the centralized communication architecture among clusters does not scale well with wire delay. This paper introduces a tiled stream processor architecture (TPA-PD) in which a distributed network connects the tiled components to address the increasing wire delay of the control signals. At the kernel level, a dataflow-like execution model, explicit data graph execution, is employed: dependence relations are encoded in the instruction set, and the centralized communication model of the clusters is replaced by a dynamic-dispatch, distributed communication model that is wire-delay scalable. The instruction set, the mapping of the stream programming model onto TPA-PD, and the microarchitecture are described. Finally, the authors use a cycle-accurate simulator to analyze the factors that affect kernel-level execution time; TPA-PD achieves an average 20% speedup over a traditional stream processor across eight benchmarks.
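As a rough illustration of the dataflow-like execution model mentioned above, the following minimal Python sketch shows instructions that name their consumers directly and fire once all their operands have arrived. It is not the TPA-PD instruction set: the `Instr` record, the `run_block` function, the two-operand restriction, and the token format are all assumptions made for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable
import operator


@dataclass
class Instr:
    """Hypothetical two-operand instruction that encodes its consumers."""
    op: Callable[[int, int], int]
    # Each target is (consumer index, operand slot); index None marks a block output.
    targets: list
    operands: dict = field(default_factory=dict)


def run_block(instrs, initial_tokens):
    """Deliver operand tokens and fire each instruction once both operands arrive."""
    outputs = []
    tokens = deque(initial_tokens)  # each token is (instr index, slot, value)
    while tokens:
        idx, slot, value = tokens.popleft()
        if idx is None:
            outputs.append(value)  # token addressed to the block output
            continue
        ins = instrs[idx]
        ins.operands[slot] = value
        if len(ins.operands) == 2:  # all operands present: the instruction fires
            result = ins.op(ins.operands[0], ins.operands[1])
            for t_idx, t_slot in ins.targets:
                tokens.append((t_idx, t_slot, result))
    return outputs


def example():
    """Evaluate (a + b) * (a - b) with dependences encoded as targets."""
    a, b = 7, 3
    instrs = [
        Instr(operator.add, [(2, 0)]),     # result goes to slot 0 of instr 2
        Instr(operator.sub, [(2, 1)]),     # result goes to slot 1 of instr 2
        Instr(operator.mul, [(None, 0)]),  # result leaves the block
    ]
    tokens = [(0, 0, a), (0, 1, b), (1, 0, a), (1, 1, b)]
    return run_block(instrs, tokens)
```

In this sketch no central controller orders the instructions; execution order follows only from operand arrival, which is the property that lets such a model be spread over tiles connected by a distributed network.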