Energy efficiency is a great challenge in the design of future high performance computers. Since the many-core processor becomes a key choice of future high performance computers, the optimization of its micro-architecture is very important for the improvement of energy efficiency. This paper proposes a pipeline-coupled instruction loop cache for the many-core processor. The instruction loop cache is small sized so that it will provide more energy-efficient instruction storage. As an attempt of implementation-aware micro-architecture research, the loop cache is designed under constraints of hardware costs from the beginning. In order to alleviate the impact to the pipeline performance, the loop cache adopts a prefetching technique. The instruction loop cache prefetches the exit path of the loop into the cache when a loop is detected. The prefetching mechanism guarantees that the design of the loop cache in the pipeline can lead to the improvement of the energy efficiency. The instruction loop cache is implemented in the gem5 simulator. Experiments on a set of SPEC2006 benchmarks show that a typical configuration can reduce on average 27% of instruction fetching power and 31.5% power of the pipeline front-end.