• A China Top-Quality Science and Technology Journal
  • A CCF Class A Recommended Chinese Journal
  • A T1-Class High-Quality Science and Technology Journal in the Computing Field

2025  Vol. 62  No. 4

Artificial Intelligence
Abstract:

As the scale of available data increases, the importance and impact of machine learning grow. Quantum computing can be realized with the principles of quantum mechanics, and quantum machine learning algorithms, which combine quantum computing with machine learning, can in theory achieve exponential speedups over their classical counterparts. Quantum versions of many classical algorithms have been proposed, and they may solve problems that are difficult for classical computers. At present, the number of controllable qubits, noise, and other hardware factors restrict the development of quantum computers. Since quantum computing hardware is unlikely to reach the level needed for universal quantum computers in the short term, current research focuses on algorithms that can run on noisy intermediate-scale quantum (NISQ) computers. Variational quantum algorithms (VQAs) are hybrid quantum-classical algorithms suitable for current quantum computing devices, and related research is one of the hotspots in the field of quantum machine learning. Variational quantum circuits (VQCs) are the parameterized quantum circuits (PQCs) used in variational quantum algorithms to solve quantum machine learning tasks; they are also called ansätze or quantum neural networks (QNNs). The framework of a variational quantum algorithm mainly contains five steps: 1) designing the loss function according to the task, designing a parameterized quantum circuit as the model, and initializing its parameters; 2) embedding classical data, which is pre-processed and encoded into a quantum state (quantum data used as input only needs pre-processing, not encoding); 3) computing the loss function through the parameterized quantum circuit, the step where the quantum advantage comes in; 4) measuring and post-processing, where the quantum measurement collapses the superposition state into a classical state and post-processing yields classical data; 5) optimizing the parameters, that is, updating them with classical optimization algorithms and returning to step 3 until the loss function converges after several iterations, which yields a set of optimal parameters; the final result is the output of the optimal model. This paper reviews the basic theory of quantum computing and the basic framework of variational quantum algorithms, and introduces the application and progress of variational quantum algorithms in the field of quantum machine learning. It then reviews in detail supervised quantum machine learning, including quantum classifiers; unsupervised quantum machine learning, including the quantum circuit Born machine, the variational quantum Boltzmann machine, and the quantum autoencoder; semi-supervised quantum learning, including quantum generative adversarial networks; quantum reinforcement learning; and quantum circuit architecture search. Next, this paper compares the models, analyzes their advantages and disadvantages, and briefly discusses the related datasets and simulation platforms that can reproduce the introduced models. Finally, it puts forward the challenges and future research trends of quantum machine learning algorithms based on variational quantum circuits.
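To make the five-step loop concrete, the following minimal sketch (not from the paper) trains a single-qubit variational circuit by simulating it in NumPy: the RY ansatz, angle encoding, the expectation-of-Z loss, and the learning rate are all illustrative assumptions, and the parameter-shift rule stands in for the gradient a real device would estimate from repeated measurements.

```python
import numpy as np

# Step 1: a one-parameter ansatz U(theta) = RY(theta); loss = <Z> (illustrative choice).
def expval_z(theta, x):
    # Step 2: angle-encode the classical input x, then apply the variational rotation.
    state = np.array([np.cos((x + theta) / 2), np.sin((x + theta) / 2)])
    # Steps 3-4: "measure" <Z> analytically (a simulator stands in for hardware).
    return state[0] ** 2 - state[1] ** 2

def parameter_shift_grad(theta, x, s=np.pi / 2):
    # Exact gradient of <Z> w.r.t. theta via two shifted circuit evaluations.
    return (expval_z(theta + s, x) - expval_z(theta - s, x)) / 2

# Step 5: classical gradient-descent outer loop.
theta, lr, x = 0.1, 0.4, 0.7
for step in range(100):
    theta -= lr * parameter_shift_grad(theta, x)
print(theta, expval_z(theta, x))  # converges toward the minimum <Z> = -1
```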

Abstract:

In complex environments and under sudden background noise, speech enhancement is extremely challenging because existing methods capture spectrogram features only to a limited extent, especially the local information of the spectrogram. Previous Transformer-based work primarily focused on the global information of the audio while neglecting the importance of local information, and many models used only the magnitude information while ignoring the phase information after the short-time Fourier transform (STFT), resulting in suboptimal feature capture and unsatisfactory enhancement results. To address this, we propose a dual-branch speech enhancement neural network with convolutional enhancement window attention. The model adopts a U-Net architecture and models the magnitude and phase information of the audio simultaneously through its dual-branch structure, with a complex computation module introduced for information interaction between the two branches. The convolutional enhancement window attention module is employed in the skip connections between encoder and decoder layers; it performs self-attention within non-overlapping windows, significantly reducing the computational complexity of the model while capturing local contextual information. The proposed model is evaluated on the publicly available Voicebank-Demand dataset. Compared with the baseline models DCUNET16 and DCUNET20, it achieves improvements of 0.51 and 0.47, respectively, in the PESQ (perceptual evaluation of speech quality) metric, and other evaluation metrics also show significant gains. Compared with various existing speech enhancement models, the proposed model outperforms them across metrics, with particularly notable improvements in PESQ.
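The complexity saving of window attention is easy to see in isolation: restricting self-attention to non-overlapping windows of length w reduces the cost from O(N^2) to O(N*w). Below is a minimal PyTorch sketch of this restriction; the dimensions, window size, and plain MultiheadAttention layer are illustrative, and the paper's convolutional enhancement is omitted.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Self-attention within non-overlapping windows: O(N*w^2) instead of O(N^2)."""
    def __init__(self, dim, window, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):            # x: (batch, seq_len, dim), seq_len % window == 0
        b, n, d = x.shape
        w = self.window
        x = x.reshape(b * n // w, w, d)        # split the sequence into windows
        out, _ = self.attn(x, x, x)            # attention only inside each window
        return out.reshape(b, n, d)

x = torch.randn(2, 64, 32)
print(WindowAttention(32, window=8)(x).shape)  # torch.Size([2, 64, 32])
```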

Abstract:

The problem of topological imbalance in graphs, arising from the non-uniform and asymmetric distribution of nodes in the topological space, significantly hampers the performance of graph neural networks. Current research predominantly focuses on labeled nodes, with relatively little attention given to unlabeled nodes. To address this challenge, we propose a self-supervised learning method based on random-walk paths that tackles the issues posed by topological imbalance, including the constraints imposed by the homogeneity assumption, topological distance decay, and annotation attenuation. Our method introduces the concept of multi-hop paths within the subgraph neighborhood to comprehensively capture the relationships and local features among nodes. First, through an inter-path aggregation strategy, we learn both homogeneous and heterogeneous features within multi-hop paths, thereby preserving the nodes' original attributes as well as their initial structural connections in the random-walk sequences. Additionally, by combining a multi-path subgraph-sample aggregation strategy with a structured contrastive loss, we maximize the intrinsic features of local subgraphs for the same node, enhancing the expressive power of the graph representations. Experimental results validate the effectiveness and generalization of our method across various imbalance scenarios. This research provides a novel approach and perspective for addressing topological imbalance.
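As an illustration of the structured contrastive objective, the following sketch applies an InfoNCE-style loss to two path-aggregated views of the same nodes; the temperature value and the absence of a projection head are simplifying assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def structured_contrastive_loss(z1, z2, tau=0.5):
    """InfoNCE-style loss: two path-aggregated views z1, z2 (num_nodes, dim)
    of the same nodes are positives; all other nodes act as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                     # pairwise cosine similarities
    targets = torch.arange(z1.size(0))          # view i matches view i
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 16), torch.randn(8, 16)
print(structured_contrastive_loss(z1, z2))
```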

Abstract:

Because producing paired images is expensive, unpaired low-light image enhancement methods, which do not rely on paired image data, are more practical. However, the lack of detailed supervision signals leads to visual degradation such as globally inconsistent exposure, color distortion, and heavy noise in the output image, which makes these methods challenging to apply in practice. We propose an unpaired low-light enhancement method based on global consistency (GCLLE) to meet practical needs. First, we remodel and fuse same-scale features of the encoder and decoder through the global consistency preserving module (GCPM) to correct contextual information at different scales, ensuring globally consistent exposure adjustment and global structural consistency of the output image, making the light distribution uniform and avoiding distortion. The local smoothing and modulation module (LSMM) learns a set of local low-order curve mappings, which provide an extended dynamic range and further improve image quality, achieving realistic and natural enhancement. The proposed deep feature enhancement module (DFEM) uses two-way pooling to fuse deep features, compressing irrelevant information and highlighting the more discriminative encoded features, which reduces inaccuracies, makes it easier for the decoder to capture low-intensity signals in the image, and retains more details. Unlike paired enhancement, which focuses on the one-to-one mapping between pixels of paired images, GCLLE enhances by reducing the stylistic difference between low-light images and unpaired normal-light images. Extensive experiments on the MIT and LSRW datasets show that the proposed method outperforms classical low-light enhancement algorithms in several objective metrics, demonstrating its effectiveness and superiority.
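As a hint of what a "local low-order curve mapping" can look like, the sketch below iterates the quadratic curve LE(x) = x + αx(1−x) familiar from curve-based enhancement methods such as Zero-DCE; the constant α is a stand-in for the spatially varying parameters a module like LSMM would learn.

```python
import numpy as np

def low_order_curve(x, alpha):
    """One quadratic curve-mapping step: LE(x) = x + alpha * x * (1 - x).
    Maps [0,1] -> [0,1] monotonically for alpha in [-1,1]; alpha > 0 brightens."""
    return x + alpha * x * (1.0 - x)

def enhance(x, alphas):
    # Iterating low-order curves extends the dynamic range of the adjustment.
    for a in alphas:
        x = low_order_curve(x, a)
    return x

pixel = np.array([0.05, 0.2, 0.5])             # a dark pixel's RGB values
print(enhance(pixel, alphas=[0.8, 0.8, 0.8]))  # noticeably brightened
```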

Abstract:

Dynamic functional connections (dFCs) can be regarded as a process of dynamic change across multiple time windows, used to explore how the functional connections of the brain vary over time. They have been widely used in resting-state functional magnetic resonance imaging (rs-fMRI) analysis, providing a new perspective and strategy for diagnosing brain diseases. However, common dynamic brain-network analysis methods cannot effectively exploit the potential correlation and temporal ordering of dynamic data, and they ignore the uncertainty caused by inconsistent data quality across windows. Therefore, we propose a brain-network analysis algorithm based on dynamic evidence neural networks (DE-NNs). The algorithm designs a multi-view evidence acquisition module that treats each time window of the dynamic brain network as a view: three different convolution filters extract feature maps from each time window, fully acquiring evidence at the dynamic level. A dynamic evidence fusion mechanism is then designed to make full use of this evidence: a dynamic trust function is constructed according to the temporal order of the dFC data based on the synthesis rules of evidence theory, the evidence generated by multiple windows is fused at the decision level of classification, and uncertainty is fully taken into account, significantly improving classification performance. To verify the effectiveness of DE-NNs, experiments are conducted on three schizophrenia datasets against existing state-of-the-art algorithms. The results show that the accuracy and F1 scores of DE-NNs on the three brain-disease diagnosis tasks are significantly improved.
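For intuition about decision-level evidence fusion, the following sketch combines per-window evidence with the reduced Dempster's rule used in evidential deep learning; the evidence values are made up, and the paper's dynamic trust function, which additionally weights windows by temporal order, is not reproduced.

```python
import numpy as np

def dirichlet_to_mass(evidence):
    """Evidence (non-negative, per class) -> belief masses b and uncertainty u."""
    alpha = evidence + 1.0
    S = alpha.sum()
    return evidence / S, evidence.shape[0] / S   # b_k = e_k / S, u = K / S

def dempster_combine(b1, u1, b2, u2):
    """Reduced Dempster's rule for two (belief, uncertainty) opinions."""
    conflict = np.sum(np.outer(b1, b2)) - np.sum(b1 * b2)  # mass on disagreeing classes
    scale = 1.0 / (1.0 - conflict)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    return b, scale * u1 * u2

# Two windows voting for class 0, plus one noisy window with high uncertainty.
windows = [np.array([8.0, 1.0]), np.array([6.0, 2.0]), np.array([0.5, 0.5])]
b, u = dirichlet_to_mass(windows[0])
for ev in windows[1:]:
    b2, u2 = dirichlet_to_mass(ev)
    b, u = dempster_combine(b, u, b2, u2)
print(b, u)   # fused beliefs favor class 0; uncertainty shrinks with agreement
```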

Abstract:

In traditional question-answering tasks, models generally require extensive training data, which entails considerable time and labor costs for annotation. Unsupervised question generation is an effective solution to the scarcity of training data in question answering. However, questions generated this way currently suffer from being difficult to answer, lacking variety, and having unclear semantics. To address these issues, we propose an adaptive multi-module pipeline model named ADVICE, whose modules improve existing methods in answerability, question diversity, and grammatical correctness. In the answerability module, we employ coreference resolution and named entity recognition to improve the answerability of questions. For question diversity, we design specific rules for various question types to enhance the diversity of question and answer types. In the grammatical correctness module, a grammar error correction model targeted at questions is trained based on the T5 model, and a filtering module is designed to refine the generated question-answer data. Finally, a classifier is trained to automatically select the necessary modules. Experiments demonstrate that the improved question generation method enhances the performance of downstream question-answering models on the SQuAD dataset, with the EM (exact match) score increasing by an average of 2.9% and the F1 score by an average of 4.4%.

Abstract:

Aspect sentiment triplet extraction (ASTE) is a challenging subtask within aspect-based sentiment analysis. It aims to extract triplets consisting of aspect terms, opinion terms, and sentiment polarities from text. Recently, generative extraction techniques have demonstrated remarkable efficacy by sequentially concatenating the target triplets and generating them autoregressively. However, this concatenation may introduce sequential dependencies among unrelated triplets and accumulate errors during decoding. To address this issue, we propose a term-prompted and dual-path text generation (TePDuP) method. It first uses machine reading comprehension (MRC) to extract aspect and opinion terms in parallel, and then uses them as prompt prefixes to guide conditional triplet generation, forming a dual-path text generation framework. During training, we incorporate scheduled sampling as a corrective measure to mitigate the bias stemming from MRC extraction. Furthermore, to improve performance even further, we use the generation probabilities to merge the outcomes guided by aspect and opinion terms, increasing the robustness of the model. Experimental results on the ASTE-DATA-V2 dataset show that the proposed method significantly outperforms the baseline models, and case studies demonstrate that it alleviates the aforementioned problem to some extent.
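As a toy illustration of probability-guided merging, the sketch below unions the triplets produced by the two paths and keeps each with its best generation probability; the union-max rule and the threshold are hypothetical stand-ins for the paper's exact merging formula.

```python
def merge_dual_path(aspect_guided, opinion_guided, threshold=0.5):
    """Merge triplets from the aspect-prompted and opinion-prompted paths.
    Inputs map triplet -> generation probability; the union-max rule and
    the threshold are illustrative, not the paper's exact formula."""
    merged = {}
    for path in (aspect_guided, opinion_guided):
        for triplet, prob in path.items():
            merged[triplet] = max(merged.get(triplet, 0.0), prob)
    return {t for t, p in merged.items() if p >= threshold}

a = {("battery", "long", "POS"): 0.9, ("screen", "dim", "NEG"): 0.4}
o = {("battery", "long", "POS"): 0.8, ("price", "fair", "POS"): 0.7}
print(merge_dual_path(a, o))  # battery/long/POS and price/fair/POS survive
```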

Architecture
Abstract:

Ensuring deadlock-free data transmission in the network-on-chip (NoC) is a prerequisite for providing reliable communication services for multi-processor system-on-chip (MPSoC), directly determining the availability of the NoC and even the MPSoC. Existing general-purpose deadlock-free strategies target arbitrary topologies, making it difficult to exploit the features and advantages of a specific topology; they may even increase network latency, power consumption, and hardware complexity. In addition, because routing-level and protocol-level deadlocks differ significantly in regular networks, existing solutions struggle to address both types simultaneously, affecting MPSoC reliability. We propose a deadlock-free strategy with synchronous Hamiltonian rings based on the inherent Hamiltonian characteristics of the triplet-based many-core architecture (TriBA). This method uses the topology's symmetry axes and Hamiltonian edges to allocate independent store-and-forward buffers for data transmission, preventing protocol-level deadlocks and improving data transfer speed. Additionally, we design a direction-determination method for data transmission within the same buffer using cyclic linked-list techniques; it ensures data independence and synchronous forward transmission, eliminates routing-level deadlocks, and reduces data transfer latency. Building on an optimization of redundant calculations in look-ahead routing algorithms, we propose a deadlock-free routing mechanism called Hamiltonian shortest path routing (HamSPR) based on the synchronous Hamiltonian ring. GEM5 simulation results show that, compared with existing solutions on TriBA, HamSPR reduces average packet latency and power consumption under synthetic traffic patterns by 18.78%−65.40% and 6.94%−34.15%, respectively, while improving throughput by 8.00%−59.17%. In the PARSEC benchmark, HamSPR achieves maximum reductions of 16.51% in application runtime and 42.75% in average packet latency. Moreover, compared with a 2D-Mesh, TriBA demonstrates an application performance improvement of 1%−10% in the PARSEC benchmark.
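For background, the classic Hamiltonian-path routing idea behind such schemes works as follows: label the nodes along a Hamiltonian path and let a packet move only through monotonically increasing (or decreasing) labels, so no cyclic channel dependency, and hence no routing-level deadlock, can form. Below is a generic sketch on a 4-node ring; the TriBA-specific synchronous rings and buffer allocation are not reproduced.

```python
def hamiltonian_route(label, src, dst, neighbors):
    """Greedy monotone routing over Hamiltonian labels: ascend in the 'up'
    subnetwork or descend in the 'down' one, so label cycles cannot occur."""
    path, cur = [src], src
    while cur != dst:
        if label[dst] > label[cur]:   # up subnetwork: labels strictly increase
            candidates = [n for n in neighbors[cur] if label[cur] < label[n] <= label[dst]]
        else:                         # down subnetwork: labels strictly decrease
            candidates = [n for n in neighbors[cur] if label[dst] <= label[n] < label[cur]]
        cur = min(candidates, key=lambda n: abs(label[dst] - label[n]))
        path.append(cur)
    return path

# A 4-node ring 0-1-2-3-0, labeled along the Hamiltonian path 0,1,2,3.
label = {0: 0, 1: 1, 2: 2, 3: 3}
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(hamiltonian_route(label, 0, 2, neighbors))  # [0, 1, 2] (monotone ascent)
```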

Abstract:

With the advancement of electronic design automation, continuous-flow microfluidic biochips have become one of the most promising platforms for biochemical experiments. These chips manipulate fluid samples at the milliliter or nanoliter scale with internal microvalves and microchannels, and can thus automatically perform basic biochemical operations such as mixing and detection. To achieve correct bioassay functionality, the microvalves deployed inside the chip are usually managed by multiplexer-based control logic, and valves receive control signals from a core input through control channels for accurate switching. Since biochemical reactions typically require high sensitivity, the length of the control paths connecting each valve must be reduced to ensure immediate signal propagation and thus a low signal propagation delay. In addition, to reduce fabrication cost, a vital issue in logic architecture design is how to effectively reduce the total channel length within the control logic. To address these issues, we propose a deep reinforcement learning-based control logic routing algorithm that minimizes signal propagation delay and total control channel length, thereby automatically constructing an efficient control channel network. The algorithm employs a dueling deep Q-network as the agent of the deep reinforcement learning framework to evaluate the tradeoff between signal propagation delay and total channel length. Moreover, diagonal channel routing is implemented for the first time for control logic, fundamentally improving the efficiency of valve switching and reducing the fabrication cost of the chip. The experimental results demonstrate that the proposed algorithm can effectively construct a high-performance, low-cost control logic architecture.
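The dueling deep Q-network itself is a standard architecture: a shared trunk splits into a state-value stream V(s) and an advantage stream A(s,a), recombined as Q = V + A - mean(A). A minimal PyTorch sketch with illustrative dimensions follows; the routing-specific state encoding and reward are the paper's own and are not reproduced here.

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Dueling DQN head: a shared trunk feeds a state-value stream V(s) and an
    advantage stream A(s,a); Q = V + A - mean(A) keeps the decomposition identifiable."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)

# One routing state in, Q-values for the candidate channel directions out.
q = DuelingDQN(state_dim=32, n_actions=8)(torch.randn(1, 32))
print(q.shape)  # torch.Size([1, 8])
```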

Abstract:

Constructing a software-hardware system-level prototype platform for accelerating data center services requires considering high computing power, scalability, flexibility, and low cost. To enhance data center capabilities, we study heterogeneous computing innovation from the perspective of software-hardware synergy, covering cloud platform architecture, hardware implementation, high-speed interconnection, and applications. A reconfigurable and composable software-hardware acceleration prototype system is designed and built to simplify existing processor-centric methods of constructing system-level computing platforms, enabling rapid deployment and system-level prototype validation of target software-hardware designs. To achieve these objectives, techniques such as decoupled reconfigurable-architecture device virtualization and remote mapping are used to unlock the potential of independent computing units. An ISOF (independent system of FPGA) software-hardware computing platform is constructed that goes beyond conventional server designs, enabling low-cost and efficient expansion of computing units while allowing clients to flexibly use peripheral resources. To address system-level communication challenges, a communication hardware platform and an interaction mechanism between computing units are designed. Additionally, to enhance the agility of the software-hardware system-level platform, ISOF provides a flexible, unified invocation interface. Finally, analysis and evaluation of the platform's system-level objectives verify that it meets current computing and acceleration requirements, providing high-speed, low-latency communication, good throughput, and efficient elastic scaling. In addition, the congestion avoidance and packet recovery mechanisms built on high-speed communication have been improved to meet the stability requirements of communication at data center scale.

Abstract:

Continuous-flow microfluidic biochips (CFMBs) have become a hot research topic in recent years due to their ability to perform biochemical assays automatically and efficiently. PathDriver+ was the first work to take actual fluid transportation requirements into account in the design process of CFMBs, implementing actual fluid transport and removal and planning separate flow paths for each transport task, all of which had been neglected in previous work. However, PathDriver+ does not take full advantage of the routing flexibility of CFMBs, because it considers flow-channel length optimization only in global routing on a mesh model, not in detailed routing. In addition, PathDriver+ considers only the X-architecture, while existing work shows that any-angle routing can use routing resources more efficiently and shorten the flow-channel length. To address these issues, we propose a flow-path-driven any-angle routing algorithm that improves the utilization of routing resources and reduces flow-channel length while considering actual fluid transportation requirements. The proposed algorithm constructs a search graph based on a constrained Delaunay triangulation to improve the search efficiency of routing solutions while ensuring routing quality. Then, a Dijkstra-based flow-path routing method on the constructed search graph quickly generates routing results with short channel lengths. Moreover, a channel reuse strategy and an intersection optimization strategy are proposed for the flow-path reuse and intersection-count optimization problems, respectively, to further improve the quality of the routing results. The experimental results show that, compared with the state-of-the-art work PathDriver+, the channel length, the number of ports used, and the number of channel intersections are reduced by 33.21%, 11.04%, and 44.79%, respectively, the channel reuse rate is improved by 26.88% on average, and the total number of valves introduced at intersections is reduced by 42.01% on average, demonstrating the effectiveness of our algorithm.
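The Dijkstra step is standard once the search graph exists. Below is a minimal sketch on a hand-made weighted graph; in the paper the nodes and edges would come from the constrained Delaunay triangulation, with edge weights given by Euclidean channel lengths.

```python
import heapq, math

def dijkstra(graph, src, dst):
    """Shortest flow path over a search graph; graph[u] = [(v, weight), ...]."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, math.inf):
            continue                       # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], dist[dst]

# Tiny example; edge weights stand in for Euclidean channel lengths.
g = {"A": [("B", 1.0), ("C", 2.5)], "B": [("C", 1.0), ("D", 3.0)],
     "C": [("D", 1.2)], "D": []}
print(dijkstra(g, "A", "D"))  # (['A', 'B', 'C', 'D'], 3.2)
```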

Abstract:

With the advancement of modern computer technology, the memory wall problem is becoming increasingly severe. Against this background, the last-level cache in the multi-level memory hierarchy has become a key resource affecting system performance. In recent years, various studies have optimized the last-level cache through capacity expansion and dynamic resource management. Way-partitioning is the main method of cache resource management: it partitions the cache by ways and allocates them to each application to optimize system performance. However, way-partitioning is coarse-grained and requires all cache sets to follow the same partitioning strategy. In fact, applications may have different space demands on different sets, so way-partitioning restricts the space utilization of the cache and wastes cache resources. In this paper, we propose GroupUCP, an on-demand fine-grained cache resource management technique. Its design idea is to aggregate individual cache sets into groups based on each application's space demand on each set, using dynamic grouping and real-time evaluation. Each group can be allocated space on demand independently, improving cache utilization and overall system performance. Experiments demonstrate that GroupUCP achieves finer-grained on-demand resource allocation with less hardware than the traditional UCP approach and delivers higher system performance on cache-sensitive application combinations that exhibit imbalanced space demands across sets.
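For context, UCP-style partitioning allocates ways greedily by marginal utility, that is, the extra hits gained per additional way; GroupUCP applies the same idea per set-group rather than per whole cache. A sketch of the greedy core with made-up utility curves:

```python
def greedy_allocate(utility, total_ways):
    """Greedily hand each cache way to the application (or set-group) with the
    highest marginal utility, as in UCP-style partitioning.
    utility[i][w] = hits of application i given w ways (illustrative numbers)."""
    alloc = [0] * len(utility)
    for _ in range(total_ways):
        gains = [utility[i][alloc[i] + 1] - utility[i][alloc[i]]
                 for i in range(len(utility))]
        winner = max(range(len(gains)), key=gains.__getitem__)
        alloc[winner] += 1
    return alloc

# App 0 saturates after 2 ways; app 1 keeps benefiting from more space.
u0 = [0, 50, 60, 61, 62, 62, 62, 62, 62]
u1 = [0, 30, 55, 75, 90, 100, 108, 114, 118]
print(greedy_allocate([u0, u1], total_ways=8))  # [2, 6]
```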

Abstract:

With run-time configurable hardware, coarse-grained reconfigurable arrays (CGRAs) are a potential platform that provides both program flexibility and energy efficiency for data-intensive applications. To exploit the access parallelism of multi-bank memory, memory partitioning is usually introduced to CGRAs. However, existing work on memory partitioning for CGRAs either achieves the optimal partitioning at the cost of expensive addressing overheads or achieves area- and energy-efficient hardware at the cost of consuming more banks. To this end, we propose an efficient memory partitioning approach for loop pipelining on CGRAs via access-pattern morphing. By co-optimizing memory partitioning and scheduling on multi-dimensional arrays, a partition-friendly access pattern is formed in the data domain such that it can be partitioned with a minimized number of all-one partitioning hyperplanes, yielding both an optimized partition factor and reduced addressing overhead. To solve the partitioning problem, we first propose a backtracking-based scheduling algorithm that finds the partition-friendly pattern with a minimized initiation interval. Then, based on the partitioning result, we propose an energy- and area-efficient CGRA architecture that simplifies the address generators in the load-store units. The experimental results show that, compared with the state-of-the-art method, our approach achieves 1.25× energy efficiency while keeping a moderate compilation time.
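An all-one partitioning hyperplane is simply the mapping bank(i, j, ...) = (i + j + ...) mod N, which needs only an addition and a modulo in the address generator. A tiny sketch showing why it suits stencil-like patterns; the access pattern and bank count are illustrative:

```python
N_BANKS = 3

def bank(i, j):
    # All-one partitioning hyperplane: bank = (i + j) mod N.
    # The intra-bank offset would be a simple linearization, omitted here.
    return (i + j) % N_BANKS

# A 3-point stencil a[i][j-1], a[i][j], a[i][j+1] read in one pipeline cycle:
accesses = [(4, 3), (4, 4), (4, 5)]
print([bank(i, j) for i, j in accesses])  # [1, 2, 0] -- no bank conflict
```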

Software Technology
Abstract:

As the fundamental computing components in large-scale supercomputing systems, GPUs are undergoing architectural diversification, and GPU accelerators from different chip manufacturers vary significantly in architectural design. Accelerator diversity and programming-model diversity are important technical trends in building large-scale supercomputing systems. Diverse accelerators require developers to provide high-performance software for multiple hardware platforms, resulting in duplicated software effort. To reduce this cost, the unified programming model SYCL (system-wide compute language) targets multiple hardware platforms, but SYCL's performance on a given platform is not yet as good as that of the platform's native programming model and needs further optimization. To apply mature CUDA (compute unified device architecture) programming ideas and high-performance programs to SYCL, it is necessary to study the performance of high-performance CUDA programs ported to SYCL on multiple platforms and the avenues for further optimization. Based on software-hardware co-design, we propose paraTRANS, a common operator optimization system for code migration across the heterogeneous programming model SYCL, and present optimization methods for migrated SYCL GEMM (general matrix multiplication) kernels in different scenarios. We evaluate the performance of SYCL GEMM optimized by paraTRANS: it achieves 96.95% of CUDA's FLOPS on the original NVIDIA RTX 3090, and 100.47% of CUDA's percentage of hardware peak performance on AMD MI100, both close to the pre-migration level. This paper offers ideas for porting high-performance CUDA code to SYCL and for further optimization.

Abstract:

Timing anomalies are counter-intuitive behaviors observed in worst-case execution time (WCET) analysis, the key aspect being that a locally faster execution does not necessarily shorten the overall program execution time. WCET analysis must therefore conservatively examine all possible execution states to guarantee the safety of its results, which makes the process extremely expensive. Conversely, if it can be guaranteed that the program and platform under analysis exhibit no timing anomalies, the number of states and the time required for WCET analysis can be reduced significantly. Addressing timing anomalies is thus a critical challenge in WCET analysis. However, despite more than 20 years of research, the academic community has not reached a unified definition of or consensus on timing anomalies. This article reviews the various perspectives in the literature since the concept was first introduced, classifies these viewpoints by their definitions and descriptions, and evaluates their respective strengths and weaknesses. Additionally, we investigate the causes of timing anomalies and identify three main factors: scheduling strategies, cache behavior, and component interactions. Furthermore, we survey current efforts to detect and eliminate timing anomalies, highlighting their issues and limitations. Finally, we suggest that future research on timing anomalies should be integrated with WCET analysis methods to address these challenges more effectively.

Network and Information Security
Abstract:

With the continuous development of cloud computing, reversible data hiding in encrypted images (RDHEI) has received increasing attention. However, most RDHEI schemes are designed for grey-scale images, which greatly limits their application scenarios compared with color images. Moreover, since current reversible data hiding methods in the encrypted domain mainly focus on grey-scale images and are rarely optimized for the characteristics of color images, applying them directly to color images yields unsatisfactory performance, so further investigation of reversible data hiding algorithms for color encrypted images is of high value. In this paper, we propose, for the first time, a high-performance RDHEI algorithm for color images based on color-channel correlation and entropy encoding (RDHEI-CE) for cloud computing. First, the RGB channels of the color image are separated and their prediction errors are derived separately. Next, the embedding space is generated through adaptive entropy encoding and the prediction-error histogram. The correlation between color channels is then used to further expand the embedding space and embed the secret message in the encrypted image. Finally, the marked encrypted image is scrambled to resist ciphertext-only attacks. Experimental results show that, compared with most state-of-the-art RDHEI methods, RDHEI-CE provides a higher embedding rate and better security and broadens the application scenarios of reversible data hiding in the cloud.
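As an example of per-channel prediction errors, the sketch below uses the median edge detector (MED) predictor common in lossless image coding; whether RDHEI-CE uses this exact predictor is not stated in the abstract, so treat it as a representative choice.

```python
import numpy as np

def med_predict(channel):
    """Median edge detector (MED) prediction errors for one color channel;
    the exact predictor used by RDHEI-CE may differ (this is a common choice)."""
    c = channel.astype(np.int32)
    err = np.zeros_like(c)
    for i in range(1, c.shape[0]):
        for j in range(1, c.shape[1]):
            a, b, d = c[i, j - 1], c[i - 1, j], c[i - 1, j - 1]  # left, top, top-left
            if d >= max(a, b):
                pred = min(a, b)
            elif d <= min(a, b):
                pred = max(a, b)
            else:
                pred = a + b - d
            err[i, j] = c[i, j] - pred
    return err  # sharply peaked around 0 on natural images -> compressible

rng = np.random.default_rng(0)
r = rng.integers(0, 256, (4, 4), dtype=np.uint8)  # stand-in for the R channel
print(med_predict(r))
```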

Abstract:

Traceable attribute-based signature (TABS) inherits the merits of attribute-based signature and can trace the real identity of the signer through a trusted third party, avoiding the abuse of the anonymity of attribute-based signatures. At present, very few signature-policy attribute-based signature (SP-ABS) schemes support traceability in one-to-many authentication scenarios, and most existing schemes suffer from efficiency and security deficiencies; for example, the computational complexity of the verification phase is often linear in the number of attributes, which is inefficient, and the policy being provided directly by the verifier to the signer can easily leak policy privacy. To solve these problems, this paper proposes a traceable attribute-based signature scheme based on SM9 that supports policy hiding. The scheme constructs the access structure with a linear secret sharing scheme (LSSS) that splits attribute names and attribute values, supports partial hiding of policies, and protects the verifier's policy privacy while protecting the signer's identity privacy and attribute privacy. In the verification phase, the scheme requires only a constant number of bilinear pairing and exponentiation operations, achieving efficient fine-grained access control. Finally, the scheme is proved unforgeable in the random oracle model under the q-strong Diffie-Hellman (q-SDH) hardness assumption.

Abstract:

WiFi-based respiratory monitoring has become a hotspot in the sensing layer of the IoT, benefiting from its contactless operation, low cost, and strong privacy protection. However, current WiFi-based respiratory monitoring methods rely on sensitive channel state information (CSI) samples, requiring a single monitoring target who remains static, with no moving non-target persons nearby, and who stays close to the WiFi transceiver. These requirements limit large-scale application. Therefore, we propose FDRadio, a respiratory-monitoring range extension method that works under dynamic interference. In FDRadio, we improve the accuracy and robustness of respiratory monitoring from three aspects: separating dynamic interference sources, eliminating ambient noise, and enhancing the power of the dynamic reflected signal. Specifically, we first expand the channel bandwidth by combining multiple WiFi channels to improve the spatial resolution of WiFi sensing, and use a wired direct channel to remove the accumulated hardware noise caused by channel combining. Second, we analyze the relationship between monitoring range and ambient noise, and adopt time-diversity techniques to design a two-stage ambient noise reduction process. In addition, we design a novel weight allocation algorithm that maximizes the power of the dynamic reflected signal, enhancing the ability to sense the weak chest movements caused by breathing. Finally, the processed CSI samples are converted into a power delay profile (PDP) in the time domain, from which the respiratory signal of the target person can be directly extracted using the path-length difference. We implement FDRadio on a commercial embedded device and conduct a series of experiments. The results show that the detection error is less than 0.5 bpm within a 7 m monitoring range, even when multiple moving non-target persons are present.
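To illustrate the last step, the following toy pipeline converts CSI to a power delay profile with an IFFT across subcarriers, picks the most time-varying delay tap as the chest reflection, and reads the breathing rate off its spectral peak; the synthetic CSI is an illustrative stand-in for real, noise-cleaned samples.

```python
import numpy as np

def breathing_rate_bpm(csi, fs=20.0):
    """Toy pipeline in the spirit of FDRadio's final step: CSI (packets x
    subcarriers) -> power delay profile (IFFT over frequency) -> pick the most
    time-varying delay tap (the chest reflection) -> spectral peak = rate.
    Real CSI would first need the paper's noise-removal stages."""
    pdp = np.abs(np.fft.ifft(csi, axis=1)) ** 2   # (packets, delay taps)
    tap = pdp.var(axis=0).argmax()                # dynamic (breathing) path
    sig = pdp[:, tap] - pdp[:, tap].mean()        # chest motion over time
    spec = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    band = (freqs >= 0.1) & (freqs <= 0.5)        # 6-30 bpm plausible range
    return 60.0 * freqs[band][spec[band].argmax()]

# Synthetic CSI: a static path at tap 0 plus a chest reflection at tap 5
# whose strength swells and shrinks at 0.25 Hz (15 breaths per minute).
t = np.arange(0, 60, 1 / 20.0)
k = np.arange(30)                                 # 30 subcarriers
amp = 0.3 + 0.1 * np.sin(2 * np.pi * 0.25 * t)
csi = np.ones((t.size, k.size), complex) + amp[:, None] * np.exp(-2j * np.pi * 5 * k / 30)
print(breathing_rate_bpm(csi))  # ~15.0
```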