    2023  Vol. 60  No. 6

    Perspective
    Abstract:

    ChatGPT has been a significant breakthrough and has drawn widespread attention. This paper examines ChatGPT's role in AI development and its future impact. We first introduce ChatGPT's exceptional dialogue generation capabilities, which enable it to handle nearly all natural language processing tasks and to serve as a data generator, a knowledge mining tool, a model dispatcher, and a natural interaction interface. We then analyze ChatGPT's limitations regarding factual errors, toxic content generation, safety, fairness, interpretability, and data privacy, and discuss the importance of clarifying its capability boundaries. After that, we analyze the concept of truth and explain, from the non-equivalence of three references, why ChatGPT cannot distinguish truth from falsehood. In discussing AI's future, we analyze mid-to-short-term technological trends and the long-term development path from the relationship among perception, cognition, emotion, and behavioral intelligence. Lastly, we explore ChatGPT's potential impact on cognitive cost, education, the understanding of the Turing test, academia's opportunities and challenges, information cocoons, energy and environmental issues, and productivity enhancement.

    Special issue on Agile Development of Processor Chips
    Abstract:

    Building a system-level prototype platform with FPGAs for hardware-software integration of a processor design under test (DUT) is an essential step in pre-silicon evaluation of a processor chip design. To meet the design requirements of open-source processors based on the emerging open RISC-V instruction set architecture while minimizing FPGA development effort, we propose a system-level platform built around a tightly coupled SoC-FPGA chip for agile hardware-software integration and evaluation of RISC-V DUT processors. Specifically, we first elaborate the interconnect between the DUT and the SoC via the existing SoC-FPGA interfaces. We then introduce a virtual inter-processor interrupt scheme to support highly efficient collaboration between the DUT and the hardcore ARM processor in the SoC-FPGA. As a result, the DUT can flexibly leverage various I/O peripherals for full-system evaluation, and the hardcore ARM processor can also be enlisted to accelerate the DUT's time-consuming software workloads. Additionally, we build a configurable cloud-based framework for flexible composition and system integration of the DUT's hardware and software components. Based on our evaluation results with a couple of target RISC-V processors, we believe the proposed platform significantly improves efficiency and shortens the iteration period when building a system-level prototype platform.
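
    To illustrate the flavor of the virtual inter-processor interrupt scheme described above, the following is a purely software analogy (not RTL): the DUT rings a "doorbell" in shared memory, and the ARM side polls it and dispatches a handler. The register layout, IRQ number, and handler names are invented for illustration and are not the paper's actual interfaces.

    # Software analogy of a virtual inter-processor interrupt (doorbell) mechanism.
    class VirtualDoorbell:
        def __init__(self):
            self.pending = {}                      # irq number -> payload address

        def ring(self, irq, payload_addr):
            """DUT side: post an interrupt by writing the doorbell 'register'."""
            self.pending[irq] = payload_addr

        def poll_and_dispatch(self, handlers):
            """ARM side: poll pending doorbells and run the registered handler."""
            for irq, addr in list(self.pending.items()):
                if irq in handlers:
                    handlers[irq](addr)
                    del self.pending[irq]

    def uart_proxy_handler(addr):
        print(f"ARM proxies UART transfer for DUT buffer at {hex(addr)}")

    bell = VirtualDoorbell()
    bell.ring(irq=3, payload_addr=0x8000_0000)            # DUT requests an I/O service
    bell.poll_and_dispatch({3: uart_proxy_handler})       # ARM serves it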

    Abstract:

    Chiplet integration is becoming a highly scalable solution for customizing deep learning chips for different scenarios, so many chip designers have started to reduce chip development cost by integrating "known-good" third-party dies, an approach that offers higher yield, greater design flexibility, and shorter time-to-market. In the conventional chip business model, the dedicated software toolchain, such as the compiler, is provided as part of the chip solution and plays an important role in chip performance and development. However, for a chip solution that assembles multiple third-party dies, the toolchain must handle situations that the die vendors' dedicated compilers could not anticipate. In such a situation, dispatching tasks to hardware resources and managing the cooperation between the interfaces provided by independent third-party dies becomes a necessity. Moreover, designing a whole new toolchain for each integrated chip is time-consuming and even deviates from the original intention of agile chip customization. In this paper, we propose Puzzle, a scalable compilation and resource management framework for integrated deep learning chips. Puzzle covers the complete flow from profiling the input workload to run-time management of chip resources, and it reduces redundant memory accesses and expensive inter-die communication through efficient and self-adaptive resource allocation and task distribution. Experimental results show that Puzzle achieves an average latency reduction of 27.5% under various chip configurations and workloads compared with state-of-the-art solutions.
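
    As a rough sketch of the kind of allocation problem such a framework faces, the snippet below greedily places workload tasks onto third-party dies, penalizing placements whose inputs live on another die (i.e., inter-die communication). The Task/Die structures, capacities, and scoring are illustrative assumptions, not Puzzle's actual algorithm.

    # Hypothetical greedy task-to-die placement that penalizes inter-die traffic.
    from dataclasses import dataclass, field

    @dataclass
    class Task:
        name: str
        compute: float                               # normalized compute demand
        inputs: list = field(default_factory=list)   # names of producer tasks

    @dataclass
    class Die:
        name: str
        capacity: float                              # remaining compute budget
        placed: set = field(default_factory=set)

    def place_tasks(tasks, dies, link_cost=1.0):
        """Assign each task to the die minimizing compute pressure plus a penalty
        for every input produced on a different die."""
        placement = {}
        for task in tasks:                           # tasks assumed in topological order
            best_die, best_score = None, float("inf")
            for die in dies:
                if die.capacity < task.compute:
                    continue
                remote_inputs = sum(1 for p in task.inputs if placement.get(p) != die.name)
                score = task.compute / die.capacity + link_cost * remote_inputs
                if score < best_score:
                    best_die, best_score = die, score
            if best_die is None:
                raise RuntimeError(f"no die can host {task.name}")
            best_die.capacity -= task.compute
            best_die.placed.add(task.name)
            placement[task.name] = best_die.name
        return placement

    # Example: a three-layer workload mapped onto two third-party dies.
    tasks = [Task("conv1", 2.0), Task("conv2", 3.0, ["conv1"]), Task("fc", 1.0, ["conv2"])]
    dies = [Die("dieA", 4.0), Die("dieB", 4.0)]
    print(place_tasks(tasks, dies))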

    Abstract:

    Digital signal processors (DSPs) commonly adopt a VLIW-SIMD architecture that relies on cooperation between scalar and vector units. As a typical VLIW-SIMD DSP, the FT-Matrix DSP reaches its peak performance only with highly optimized kernels. However, hand-crafted kernel operator development incurs heavy time and labor overhead to unleash the potential of the DSP hardware, while general-purpose compilers suffer from poor portability or performance and struggle to explore the optimization space aggressively; in particular, the optimization space arising from scalar-vector unit collaboration, which is specific to each vendor's architecture, is often overlooked. We propose Pitaya, a high-performance automatic kernel code-generation framework that incorporates the characteristics of FT-Matrix into hierarchical kernel optimizations. The framework comprises three optimization layers: loop tiling, vectorization, and instruction-level optimization. It automatically searches for the optimal tile size according to the memory hierarchy and data layout, and introduces vectorization with scalar-vector unit cooperation to improve data reuse and parallelism. Since the performance of a VLIW architecture is determined to a great extent by instruction-level parallelism (ILP), Pitaya also provides an assembly intrinsic representation for the FT-Matrix DSP to apply diverse instruction-level optimizations and exploit more ILP. Experiments show that kernels generated by Pitaya outperform those from the target DSP libraries by 3.25 times and C vector intrinsic kernels by 20.62 times on average.
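
    The tile-size search mentioned above can be pictured with the small sketch below: candidate tiles for a matrix multiplication are filtered by their local-memory footprint and ranked by arithmetic intensity. The memory size, lane count, and scoring function are assumptions for illustration, not Pitaya's published cost model or FT-Matrix parameters.

    # Illustrative tile-size selection by footprint check and arithmetic-intensity score.
    def tile_footprint(tm, tn, tk, elem_bytes=4):
        # A(tm x tk), B(tk x tn), C(tm x tn) tiles resident in local (vector) memory.
        return elem_bytes * (tm * tk + tk * tn + tm * tn)

    def search_tile(M, N, K, local_mem_bytes=768 * 1024, vector_lanes=16):
        best, best_score = None, -1.0
        for tm in range(vector_lanes, min(M, 256) + 1, vector_lanes):
            for tn in range(vector_lanes, min(N, 256) + 1, vector_lanes):
                for tk in range(8, min(K, 256) + 1, 8):
                    if tile_footprint(tm, tn, tk) > local_mem_bytes:
                        continue
                    # Favor tiles with more compute per byte kept in local memory.
                    score = (tm * tn * tk) / tile_footprint(tm, tn, tk)
                    if score > best_score:
                        best, best_score = (tm, tn, tk), score
        return best

    print(search_tile(M=1024, N=1024, K=1024))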

    Abstract:

    When developing high-performance processors, accurate and fast performance estimation is the basis for design decisions and parameter exploration. Prior work accelerates processor RTL emulation through workload sampling and architectural checkpoints for RTL, which makes it possible to estimate the performance of benchmarks such as SPEC CPU running on complex high-performance processors within a few days. However, waiting a few days for performance results is still too long for architecture iteration, and there is room to further shorten the performance measurement cycle. During RTL emulation of processors, the warm-up phase consumes a significant amount of time. To expedite the warm-up phase of performance evaluation, we develop the HyWarm framework. HyWarm analyzes the warm-up demand of each workload with a micro-architectural simulator and adaptively customizes a warm-up scheme for it. For workloads with high cache warm-up demand, HyWarm performs functional warm-up through the caches' bus protocol on RTL. For the detailed emulation part, HyWarm uses CPU clustering and LJF scheduling to reduce the maximum completion time. Compared with the best existing sampling-based RTL emulation method, HyWarm reduces the emulation completion time by 53% while achieving accuracy similar to the baseline method.
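
    The longest-job-first (LJF) idea used to shrink the maximum completion time can be shown in a few lines: dispatch the longest remaining emulation interval to the currently least-loaded host. The interval lengths and host count below are made up for illustration; the actual HyWarm scheduler also involves CPU clustering.

    # Minimal longest-job-first (LJF) dispatch to minimize makespan.
    import heapq

    def ljf_schedule(job_hours, num_hosts):
        """Assign jobs (longest first) to the currently least-loaded host;
        return per-host assignment and the resulting makespan."""
        loads = [(0.0, h) for h in range(num_hosts)]   # (accumulated hours, host id)
        heapq.heapify(loads)
        assignment = {h: [] for h in range(num_hosts)}
        for job in sorted(job_hours, reverse=True):
            load, host = heapq.heappop(loads)
            assignment[host].append(job)
            heapq.heappush(loads, (load + job, host))
        makespan = max(load for load, _ in loads)
        return assignment, makespan

    jobs = [7.5, 3.0, 4.2, 6.1, 1.8, 2.6, 5.4]          # emulated interval runtimes (hours)
    assignment, makespan = ljf_schedule(jobs, num_hosts=3)
    print(assignment, makespan)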

    Distributed Computing
    Abstract:

    As cloud computing technology continues to advance, a growing number of enterprises and organizations are choosing the inter-cloud approach for IT delivery. Inter-cloud environments can efficiently solve problems of traditional single-cloud environments such as low resource utilization, resource limitations, and vendor lock-in, and they manage cloud resources in an integrated model. However, the heterogeneity of resources in the inter-cloud environment complicates the scheduling of inter-cloud tasks. Given this situation, how to schedule user tasks sensibly and allocate them to the most suitable inter-cloud resources for execution has become an important issue to be solved in the inter-cloud environment. From the perspective of the inter-cloud environment, we discuss the progress and future challenges of research on task scheduling algorithms in this setting. Firstly, in light of the characteristics of inter-cloud environments, cloud computing is divided into federated cloud and multi-cloud environments, and both are introduced in detail; meanwhile, existing task scheduling types are reviewed and their advantages and disadvantages are analyzed. Secondly, based on this classification and the current state of research, representative studies are selected to analyze inter-cloud task scheduling algorithms. Finally, shortcomings of research on inter-cloud task scheduling algorithms and future research trends are discussed, providing a reference for further research on inter-cloud task scheduling.

    Abstract:

    With the increasing demand for edge intelligence, federated learning (FL) has drawn great attention from industry. Compared with traditional centralized machine learning, which is mostly based on cloud computing, FL collaboratively trains a neural network model over a large number of edge devices in a distributed way without sending large amounts of local data to the cloud for processing, which pushes compute-intensive learning tasks down to the network edge close to the user. Consequently, users' data can be trained locally to meet the needs of low latency and privacy protection. In mobile edge networks, because communication and computing resources are limited, the performance of FL is jointly constrained by the computation and communication resources available during wireless networking as well as by the data quality on mobile devices. Targeting edge intelligence applications, we analyze the key challenges in achieving high-efficiency FL. We then summarize the research progress on client selection, model training, and model updating in FL. Specifically, typical work on data offloading, model segmentation, model compression, model aggregation, gradient descent algorithm optimization, and wireless resource optimization is comprehensively analyzed. Finally, future research trends of FL for edge intelligence are discussed.

    Abstract:

    Edge computing is widely applied in emerging fields such as the Internet of things, the Internet of vehicles, and online gaming. By deploying computing resources at network edges, edge computing provides low-latency computing services for terminal devices. How to offload tasks so as to balance execution time and communication time, and how to schedule tasks with different deadlines so as to minimize total tardiness, are challenging problems. In this paper, a task offloading and scheduling framework is proposed for heterogeneous edge computing. The framework consists of five components: edge-node sequencing, offloaded-task sequencing, task offloading strategies, task scheduling, and solution improvement. Multiple task offloading and task scheduling strategies are designed and embedded, and ANOVA (multi-factor analysis of variance) is used to calibrate the algorithmic components and parameters over a large number of random instances, yielding the algorithm with the best component combination. Based on the EdgeCloudSim simulation platform, several variants of the proposed algorithm are compared with it from the perspectives of the number of edge nodes, the number of tasks, the distribution of tasks, and the range of deadlines. Experimental results show that the proposed algorithm outperforms all compared variants in all cases.
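
    For readers unfamiliar with the tardiness objective, the tiny sketch below schedules tasks on a single edge node by earliest due date (EDD) and sums the tardiness. This is only one possible scheduling component under invented task data, not the paper's calibrated algorithm.

    # Total tardiness of an earliest-due-date (EDD) schedule on one node.
    def total_tardiness(tasks):
        """tasks: list of (processing_time, deadline); schedule by EDD and sum tardiness."""
        t, tardiness = 0.0, 0.0
        for proc, due in sorted(tasks, key=lambda x: x[1]):
            t += proc
            tardiness += max(0.0, t - due)
        return tardiness

    tasks = [(3, 4), (2, 2), (4, 7), (1, 5)]     # (execution + offloading time, deadline)
    print(total_tardiness(tasks))                # -> 5.0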

    Abstract:

    Deep neural networks (DNNs) have been widely used in many areas of human society. Increasing the size of a DNN model significantly improves its accuracy; however, training such a model on a single GPU requires considerable time. Hence, how to train large-scale DNN models in parallel on GPU clusters with distributed deep learning (DDL) technology has received much attention from industry and academia. Motivated by this, we propose a dynamic resource scheduling (DRS) method for heterogeneous GPU clusters in which the bandwidth among GPUs differs, aiming to solve the multi-DNN scheduling problem under deadline constraints. Specifically, we first construct a resource-time model based on the Ring-AllReduce communication architecture to estimate the running time of DDL tasks under different resource schemes. We then build a resource-performance model based on the deadline requirement to achieve efficient resource utilization. Finally, DRS makes resource scheme decisions for DDL tasks based on the above models and the resource layout. Tasks are selected for actual resource allocation according to the nearest-deadline-first principle, and a migration mechanism is introduced to reduce the impact of resource fragmentation during scheduling. Experiments on a heterogeneous GPU cluster with four NVIDIA GeForce RTX 2080 Ti GPUs show that DRS improves the deadline guarantee rate by 39.53% compared with the baseline algorithms, and the resource utilization of the GPU cluster reaches 91.27% during scheduling.
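
    Two of the ingredients above can be sketched briefly: a Ring-AllReduce iteration-time estimate bottlenecked by the slowest link, and nearest-deadline-first task selection. All constants, task names, and the cost formula's simplifications are assumptions for illustration, not the DRS models themselves.

    # Ring-AllReduce time estimate plus nearest-deadline-first selection (illustrative).
    def iteration_time(compute_s, model_mb, n_gpus, slowest_link_mbps):
        """Ring-AllReduce moves about 2*(n-1)/n of the model per iteration,
        bottlenecked by the slowest link in a heterogeneous ring."""
        if n_gpus == 1:
            return compute_s
        comm_s = 2 * (n_gpus - 1) / n_gpus * model_mb * 8 / slowest_link_mbps
        return compute_s + comm_s

    def pick_next(tasks, now):
        """Choose the pending DDL task whose deadline is nearest."""
        pending = [t for t in tasks if not t["done"]]
        return min(pending, key=lambda t: t["deadline"] - now) if pending else None

    tasks = [
        {"name": "resnet", "deadline": 3600, "done": False},
        {"name": "bert",   "deadline": 1800, "done": False},
    ]
    print(pick_next(tasks, now=0)["name"])                                         # -> bert
    print(iteration_time(compute_s=0.30, model_mb=100, n_gpus=4, slowest_link_mbps=10000))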

    Computer System Architecture
    Abstract:

    Time-sensitive networking (TSN) guarantees real-time behavior and determinism for critical traffic through spatio-temporal resource planning. When allocating temporal resources, the planning tool takes the maximum switching delay of each chip under heavy load as an input parameter. To satisfy the low-delay requirements of TSN applications, TSN chip designers should therefore treat minimizing the maximum switching delay as an important goal. Current commercial TSN chips generally adopt a single-pipeline switching architecture, which is prone to "complete frame blocking" at the entrance of the pipeline and thus makes it hard to reduce the maximum switching delay. We therefore propose a multi-pipeline switching architecture named nPSA based on a time-division multiplexing mechanism, which turns the "complete frame blocking" problem into a "slice blocking" problem. Moreover, a weighted round-robin slot allocation algorithm (WRRSA) is proposed for the time-division multiplexing mechanism to compute slot allocation schemes for different port types. The nPSA architecture and the WRRSA algorithm have been applied in the OpenTSN open-source chip and the "HX-DS09" ASIC chip. Actual test results show that the maximum switching delays experienced by a 64 B critical frame in the OpenTSN chip and the "HX-DS09" chip are 1648 ns and 698 ns, respectively. Compared with the theoretical values of TSN switching chips based on the single-pipeline architecture, these delays are reduced by about 88% and 95%, respectively.
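
    The general weighted round-robin idea behind slot allocation can be pictured with the short sketch below, where higher-rate ports earn proportionally more pipeline slots per frame. The port weights and frame length are invented, and the real WRRSA constraints for different port types are only in the paper.

    # Smooth weighted round-robin slot allocation over a fixed frame of slots.
    def wrr_slots(port_weights, total_slots):
        """Spread slots so each port gets a share proportional to its weight."""
        schedule, credits = [], {p: 0.0 for p in port_weights}
        total_w = sum(port_weights.values())
        for _ in range(total_slots):
            for p, w in port_weights.items():
                credits[p] += w / total_w
            winner = max(credits, key=credits.get)   # port with most accumulated credit
            credits[winner] -= 1.0
            schedule.append(winner)
        return schedule

    # Example: one 10GE port weighted 10x against two 1GE ports over a 12-slot frame.
    print(wrr_slots({"10GE": 10, "1GE-a": 1, "1GE-b": 1}, total_slots=12))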

    Abstract:

    Branch prediction is an essential optimization for both the performance and the power of modern processors, enabling instructions beyond branches to be executed speculatively in parallel. Unlike general branch prediction, procedure returns can be handled with a return-address stack (RAS). By speculatively emulating the call stack according to the last-in-first-out discipline of procedure calls and returns, the RAS predicts return addresses accurately. However, because of wrong-path corruption under the speculative execution of real processors, the RAS needs a repair mechanism to maintain the accuracy of its storage. Especially for area-sensitive embedded processors, a careful trade-off between accuracy and the overhead of repair mechanisms is necessary. To address the redundancy of RAS storage, we introduce the hybrid RAS, a return-address predictor based on a persistent stack. By integrating the classical stack, the persistent stack, and a backup prediction with overflow detection, our proposal eliminates wrong-path corruption and redundancy at the same time, so the return misprediction rate is reduced effectively and efficiently. In addition, the classical stack is decoupled from the persistent stack to further optimize the area. With benchmarks from the SPEC CPU 2000 suite, experiments show that the proposed RAS reduces MPKI (mispredictions per kilo instructions) to 2.4×10⁻³ with a design area of only 1.1×10⁴ μm² under Design Compiler, reducing misses by over 96% compared with the state-of-the-art RAS.
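
    To see why wrong-path calls and returns corrupt a RAS and why a repair mechanism is needed, the following is a simplified software model of a baseline RAS with checkpoint-based repair. It is only the textbook baseline behavior under invented addresses; the hybrid/persistent-stack design in the paper is considerably more involved.

    # Simplified software model of a return-address stack with checkpoint repair.
    class SimpleRAS:
        def __init__(self, depth=16):
            self.depth = depth
            self.stack = []

        def on_call(self, return_addr):
            if len(self.stack) == self.depth:     # overflow: drop the oldest entry
                self.stack.pop(0)
            self.stack.append(return_addr)

        def on_return(self):
            return self.stack.pop() if self.stack else None   # predicted return address

        def checkpoint(self):
            return list(self.stack)               # snapshot taken at a predicted branch

        def restore(self, snapshot):
            self.stack = list(snapshot)           # undo wrong-path pushes/pops

    ras = SimpleRAS()
    ras.on_call(0x400100)
    snap = ras.checkpoint()                        # speculate past a branch
    ras.on_call(0x400200)                          # wrong-path call corrupts the stack
    ras.restore(snap)                              # repair on misprediction recovery
    print(hex(ras.on_return()))                    # -> 0x400100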

    Artificial Intelligence
    Abstract:

    Federated learning is an emerging distributed machine learning method that enables mobile phones and IoT devices to learn a shared machine learning model while transferring only model parameters, thereby protecting private data. However, traditional federated learning models usually assume that training data samples are independent and identically distributed (IID) on the local devices, which is not realistic because data distributions differ across devices. Hence, existing federated learning models cannot achieve satisfactory performance on mixed distributions of IID and non-IID data. In this paper, we propose a novel federated adaptive interaction model (FedAIM) for mixed-distribution data that can jointly learn from IID and non-IID data at the same time. FedAIM introduces, for the first time, the earth mover's distance (EMD) to measure the degree of distribution bias of different clients. An extremely-biased server and a non-extremely-biased server are then built to separately process clients with different degrees of bias. Finally, a new aggregation mechanism based on information entropy is designed to aggregate and exchange model parameters, reducing the number of communication rounds between servers. Experimental results show that FedAIM outperforms state-of-the-art methods on the real-world image datasets MNIST, CIFAR-10, Fashion-MNIST, SVHN, and FEMNIST.
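
    Two ingredients mentioned above can be sketched concretely: measuring a client's label-distribution bias with the 1-D earth mover's distance against a uniform reference, and weighting aggregation by label entropy. The uniform reference, thresholds, and the exact aggregation rule are assumptions here, not FedAIM's definitions.

    # Illustrative EMD-based bias measure and entropy-based aggregation weights.
    import numpy as np

    def label_distribution(labels, num_classes):
        counts = np.bincount(labels, minlength=num_classes).astype(float)
        return counts / counts.sum()

    def emd_to_uniform(dist):
        """1-D EMD between a client's label distribution and the uniform distribution."""
        uniform = np.full_like(dist, 1.0 / len(dist))
        return np.abs(np.cumsum(dist - uniform)).sum()

    def entropy_weights(client_dists):
        """Clients with higher label entropy (closer to IID) get larger weights."""
        ent = np.array([-(d[d > 0] * np.log(d[d > 0])).sum() for d in client_dists])
        return ent / ent.sum()

    clients = [np.array([0, 0, 1, 1, 2, 3]), np.array([0, 0, 0, 0, 0, 1])]
    dists = [label_distribution(c, num_classes=4) for c in clients]
    print([round(emd_to_uniform(d), 3) for d in dists])   # larger value = more biased client
    print(entropy_weights(dists))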

    Abstract:

    Generative adversarial imitation learning (GAIL) is an inverse reinforcement learning (IRL) method based on the generative adversarial framework that imitates expert policies from expert demonstrations. In practical tasks, expert demonstrations are often generated by multi-modal policies. However, most existing GAIL methods assume that the expert demonstrations come from a single-modal policy, which leads to the mode collapse problem in which only some of the modal policies are learned, greatly limiting the applicability of such methods to multi-modal tasks. To address the mode collapse problem, we propose the multi-modal imitation learning method with cosine similarity (MCS-GAIL). The method introduces an encoder and a policy group, extracts the modal features of the expert demonstrations with the encoder, computes the cosine similarity between the features of policy-sampled trajectories and those of the expert demonstrations, and adds it to the loss function of the policy group to help each policy learn the expert policy of the corresponding modality. In addition, MCS-GAIL uses a new min-max game formulation for the policy group so that different modal policies are learned in a complementary way. Under the stated assumptions, we prove the convergence of MCS-GAIL by theoretical analysis. To verify the effectiveness of the method, MCS-GAIL is implemented on the Grid World and MuJoCo platforms and compared with existing methods that address mode collapse. Experimental results show that MCS-GAIL can effectively learn multiple modal policies in all environments with high accuracy and stability.
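
    A minimal sketch of the cosine-similarity term described above: encoded features of a policy's samples are compared with encoded expert modal features, and the (negated) similarity is added to that policy's loss. The encoder output, the stand-in GAIL loss, and the weighting are illustrative assumptions, not the paper's exact objective.

    # Adding a cosine-similarity term to a policy's imitation loss (illustrative).
    import numpy as np

    def cosine_similarity(a, b, eps=1e-8):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

    def policy_loss_with_similarity(gail_loss, policy_feat, expert_feats, weight=0.1):
        """Subtract the best cosine similarity against expert modal features,
        encouraging each policy in the group to align with one expert modality."""
        best_sim = max(cosine_similarity(policy_feat, e) for e in expert_feats)
        return gail_loss - weight * best_sim

    expert_feats = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]   # two modalities
    policy_feat = np.array([0.9, 0.1, 0.0])                                  # encoder output
    print(policy_loss_with_similarity(gail_loss=1.25, policy_feat=policy_feat,
                                      expert_feats=expert_feats))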

    Abstract:

    Asynchronous advantage actor-critic (A3C) builds a parallel deep reinforcement learning framework composed of one Learner and multiple Workers. However, A3C produces high-variance solutions, and the Learner does not obtain the globally optimal policy. Moreover, it is difficult to transfer and deploy the large-scale parallel network to low-power end platforms. To address these problems, we propose a compression and knowledge extraction model based on supervised exploration, called Compact_A3C. In the proposed model, we freeze the Workers of a pre-trained A3C model, measure their performance on common states, and map the performance scores to probabilities with softmax. The Learner is updated according to these probabilities so as to obtain the globally optimal sub-model (Worker) and improve resource utilization. The updated Learner then serves as a Teacher Network that supervises a Student Network in its early exploration stage, and a linearly decaying factor reduces the Teacher Network's guidance to encourage free exploration by the Student Network. Two types of Student Network are built to demonstrate the effectiveness of the proposed model. On popular environments including Gym Classic Control and Atari 2600, the Student Network reaches the level of the Teacher Network. The code of the proposed model is published at https://github.com/meadewaking/Compact_A3C.
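
    Two of the mechanisms above are easy to sketch: softmax over frozen Workers' evaluation returns to pick the best sub-model, and a linearly decaying coefficient on the Teacher's guidance during the Student's early exploration. The returns, temperature, decay horizon, and loss composition are illustrative assumptions, not values from the paper or repository.

    # Softmax Worker weighting and linearly decaying teacher-guidance coefficient.
    import numpy as np

    def worker_probabilities(returns, temperature=1.0):
        """Softmax over per-Worker evaluation returns on common states."""
        z = np.array(returns, dtype=float) / temperature
        z -= z.max()                                  # numerical stability
        p = np.exp(z)
        return p / p.sum()

    def distill_coefficient(step, decay_steps):
        """Teacher guidance weight decays linearly from 1 to 0 over the early stage."""
        return max(0.0, 1.0 - step / decay_steps)

    def student_loss(rl_loss, kl_to_teacher, step, decay_steps=10_000):
        return rl_loss + distill_coefficient(step, decay_steps) * kl_to_teacher

    print(worker_probabilities([120.0, 95.0, 130.0]))               # third Worker most likely "best"
    print(student_loss(rl_loss=0.8, kl_to_teacher=0.3, step=2_500)) # -> 0.8 + 0.75*0.3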

    Abstract:

    The linear bandits model is one of the most foundational online learning models, where a linear function parametrizes the mean payoff of each arm. The linear bandits model encompasses various applications with strong theoretical guarantees and practical modeling ability. However, existing algorithms suffer from the data irregularity that frequently emerges in real-world applications, as the data are usually collected from open and dynamic environments. In this paper, we are particularly concerned with two kinds of data irregularities: the underlying regression parameter could be changed with time, and the noise might not be bounded or even not sub-Gaussian, which are referred to as model drift and heavy-tailed noise, respectively. To deal with the two hostile factors, we propose a novel algorithm based on upper confidence bound. The median-of-means estimator is used to handle the potential heavy-tailed noise, and the restarting mechanism is employed to tackle the model drift. Theoretically, we establish the minimax lower bound to characterize the difficulty and prove that our algorithm enjoys a no-regret upper bound. The attained results subsume previous analysis for scenarios without either model drift or heavy-tailed noise. Empirically, we additionally design several online ensemble techniques to make our algorithm more adaptive to the environments. Extensive experiments are conducted on synthetic and real-world datasets to validate the effectiveness.
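
    The median-of-means estimator mentioned above is simple to state: split the samples into groups, average each group, and take the median of the group means, which is robust to heavy-tailed outliers. The group count and the synthetic reward stream below are illustrative, not the paper's setting.

    # Median-of-means estimate versus the plain mean under heavy-tailed rewards.
    import numpy as np

    def median_of_means(samples, num_groups=5):
        samples = np.asarray(samples, dtype=float)
        rng = np.random.default_rng(0)
        shuffled = rng.permutation(samples)
        groups = np.array_split(shuffled, num_groups)
        return float(np.median([g.mean() for g in groups]))

    # A few huge outliers wreck the plain mean but barely move the MoM estimate.
    rewards = np.concatenate([np.random.default_rng(1).normal(1.0, 1.0, 95),
                              [500, -400, 300, 250, -350]])
    print(round(rewards.mean(), 3), round(median_of_means(rewards), 3))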

    Network and Information Security
    Abstract:

    With deep learning generation models being applied in various fields, the authenticity of the multimedia files they generate has become increasingly difficult to judge; against this backdrop, deepfake technology was born and has developed rapidly. Using deep-learning-related techniques, deepfake technology can tamper with the facial identity information, expressions, and body movements in videos or pictures and can generate fake voices of a specific person. Since 2018, when Deepfakes sparked a wave of face swapping on social networks, a large number of deepfake methods have been proposed, demonstrating potential applications in education, entertainment, and other fields. At the same time, however, the negative impact of deepfakes on public opinion, judicial and criminal investigation, and other areas cannot be ignored. As a consequence, more and more countermeasures, such as deepfake detection and watermarking, have been proposed to prevent deepfakes from being exploited by criminals. Firstly, deepfake technologies of different modal types and the corresponding detection technologies are reviewed and summarized, and existing research is analyzed and classified according to research purpose and method. Secondly, the video and audio datasets widely used in recent studies are summarized. Finally, the opportunities and challenges for future development in this field are discussed.

    Abstract:

    Single sign-on (SSO) schemes avoid the resource waste and information leakage caused by redundant authentication modules, and anonymous single sign-on further enables anonymous authentication and authorization while protecting personal privacy. However, existing anonymous single sign-on schemes do not consider accountability for fraud that user anonymity makes possible. To address this problem, a traceable anonymous single sign-on scheme over lattices is proposed. The proposed scheme uses a lattice-based identity-based cryptosystem to alleviate the burden of public key certificate management, and realizes anonymous authentication of the user through authorized authentication tags and pseudonyms. Strong designated verifier techniques are then used to achieve directed verification of user service requests, and a trusted authority is introduced to recover the user's identity through the public key and hold the user accountable. The proposed scheme is proved to achieve unlinkability, unforgeability, and traceability under the security model. The security and performance analysis shows that, under PARMS II and PARMS III, our scheme can generate access service tickets for 4 service requests in about 75 ms and 108 ms, respectively, and reaches quantum security strengths of 230 b and 292 b.