Pipelining is a useful parallelization technique for loops that carry cross-processor data dependences. The amount of computation each node performs between communication operations is called the pipeline granularity, or block size. Choosing it well is the key to balancing computation time against communication time and obtaining good pipeline performance. Loop strip-mining and loop interchange are effective transformations for finding the optimal pipeline granularity. The optimal granularity depends on many factors, including the access pattern of the application program, the program size, the number of computing nodes, the computational capability and memory architecture of each node, the performance of the communication network, the communication mode, and the overhead of the runtime library. It is hard to estimate the block computation time with a static scheme, whereas a run-time scheme incurs additional runtime overhead and may forgo optimizations of the application. An approach is presented and implemented that computes the pipeline granularity by combining dynamic profiling with a cost model that accounts for the cache locality affected by loop transformation. How to reduce the time spent on profiling runs while preserving the precision of the cost model is also considered. Experimental results show that the pipeline granularity obtained by the dynamic profiling framework adapts well and speeds up the execution of the pipelined loop.
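
To make the trade-off behind the block size concrete, the following is a minimal sketch of a textbook pipeline cost model, not the cost model developed in this work. It assumes a one-dimensional pipeline in which every block costs the same, communication follows a linear latency/bandwidth model, and cache effects are ignored. All names (`pipeline_time`, `best_block`, `t_iter`, `alpha`, `beta`) are hypothetical and chosen only for illustration.

```python
import math


def pipeline_time(n_iters, n_procs, block, t_iter, alpha, beta):
    """Estimate the run time of a pipelined loop for one block size.

    Assumed model (illustrative only):
      n_iters : iterations of the strip-mined loop on each node
      n_procs : number of pipeline stages (nodes)
      block   : pipeline granularity, in iterations per block
      t_iter  : computation time per iteration
      alpha   : fixed per-message communication cost
      beta    : communication cost per iteration's worth of data
    """
    n_blocks = math.ceil(n_iters / block)
    # One pipeline step: compute a block, then send its boundary data.
    step = block * t_iter + alpha + beta * block
    # The last node starts after (n_procs - 1) steps, then runs n_blocks steps.
    return (n_procs - 1 + n_blocks) * step


def best_block(n_iters, n_procs, t_iter, alpha, beta):
    """Search all block sizes and return the one with the lowest estimate."""
    return min(
        range(1, n_iters + 1),
        key=lambda b: pipeline_time(n_iters, n_procs, b, t_iter, alpha, beta),
    )
```

Under these assumptions, a small block shortens the pipeline fill time but pays the per-message cost `alpha` many times, while a large block amortizes `alpha` but delays the downstream nodes. In practice `t_iter`, `alpha`, and `beta` would have to be measured, for example by profiling runs, rather than assumed.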