As the scale of HPC systems and parallel applications keeps increasing, the startup time of large-scale parallel jobs can no longer be ignored. Various efforts have been made to improve the performance of program launching and runtime environment initialization. This paper presents experiences and results from starting MPI jobs on the Tianhe-1A supercomputer system. A detailed study of the time cost of each stage of job startup, including control message transfer, file access, and MPI environment initialization, shows that for large-scale MPI jobs, environment initialization dominates the job startup time. Based on this finding, preliminary optimization work has been done to reduce the amount of data exchanged during MPI environment initialization and to avoid unnecessary data transfers. The optimization notably improves job startup performance. An optimized process management design with a hierarchical structure for MPI environment initialization is proposed to further improve the scalability of job startup. For completeness, we also compare and analyze the job startup time of the process management mechanisms used in other mainstream MPI implementations.