Recent Publications

Read the latest research from our department


Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees

Jonathan Brophy, Zayd Hammoudeh, Daniel Lowd

Influence estimation analyzes how changes to the training data can lead to different model predictions; this analysis can help us better understand these predictions, the models making those predictions, and the data sets they are trained on. However, most influence-estimation techniques are designed for deep learning models with continuous parameters. Gradient-boosted decision trees (GBDTs) are a powerful and widely used class of models; however, these models are black boxes with opaque decision-making processes. In the pursuit of better understanding GBDT predictions and generally improving these models, we adapt recent and popular influence-estimation methods designed for deep learning models to GBDTs. Specifically, we adapt representer-point methods and TracIn, denoting our new methods TREX and BoostIn, respectively; source code is available online. We compare these methods to LeafInfluence and other baselines using 5 different evaluation measures on 22 real-world data sets with 4 popular GBDT implementations. These experiments give us a comprehensive overview of how different approaches to influence estimation work in GBDT models. We find BoostIn is an efficient influence-estimation method for GBDTs that performs as well as or better than existing work while being four orders of magnitude faster. Our evaluation also suggests the gold-standard approach of leave-one-out (LOO) retraining consistently identifies the single most influential training example but performs poorly at finding the most influential set of training examples for a given target prediction.

Journal of Machine Learning Research, Volume 24, Issue 154, May 2023
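
For context, TracIn (the deep-learning method that BoostIn adapts) scores the influence of a training example z on a test example z' by summing learning-rate-weighted dot products of loss gradients over saved training checkpoints. The formula below is the checkpoint approximation from the original TracIn work, where w_{t_i} are the parameters at checkpoint i and \eta_i is the learning rate at that checkpoint; it is not BoostIn's GBDT-specific formulation, which the abstract does not spell out:

```latex
\mathrm{TracInCP}(z, z') = \sum_{i=1}^{k} \eta_i \,
  \nabla_w \ell(w_{t_i}, z) \cdot \nabla_w \ell(w_{t_i}, z')
```

A positive score means a gradient step on z would have reduced the loss on z' (a helpful, "proponent" example); a negative score marks an "opponent" example.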


Instance-Based Uncertainty Estimation for Gradient-Boosted Regression Trees

Jonathan Brophy, Daniel Lowd

Gradient-boosted regression trees (GBRTs) are hugely popular for solving tabular regression problems, but provide no estimate of uncertainty. We propose Instance-Based Uncertainty estimation for Gradient-boosted regression trees (IBUG), a simple method for extending any GBRT point predictor to produce probabilistic predictions. IBUG computes a non-parametric distribution around a prediction using the k-nearest training instances, where distance is measured with a tree-ensemble kernel. The runtime of IBUG depends on the number of training examples at each leaf in the ensemble, and can be improved by sampling trees or training instances. Empirically, we find that IBUG achieves performance similar to or better than the previous state of the art across 22 benchmark regression datasets. We also find that IBUG can achieve improved probabilistic performance by using different base GBRT models, and can more flexibly model the posterior distribution of a prediction than competing methods. Finally, we find that previous methods suffer from poor probabilistic calibration on some datasets, which can be mitigated using a scalar factor tuned on the validation data. Source code is available online.

Advances in Neural Information Processing Systems 35, December 2022
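
To illustrate the idea behind IBUG, the sketch below assumes the tree-ensemble kernel is simply the number of trees in which a training instance lands in the same leaf as the test instance, and that per-tree leaf indices have already been extracted from a fitted model (for example, the array returned by scikit-learn's `GradientBoostingRegressor.apply`). It keeps the GBRT point prediction as the location and uses the spread of the k highest-affinity training targets as the scale. The function names are hypothetical, and IBUG's actual kernel, calibration, and output distributions may differ; this is a simplified sketch, not the authors' implementation.

```python
from statistics import pstdev


def leaf_affinity(train_leaves, test_leaves):
    """For each training instance, count the trees where it shares a leaf with the test instance.

    Assumption: this shared-leaf count stands in for the tree-ensemble kernel.
    """
    return [sum(1 for a, b in zip(row, test_leaves) if a == b) for row in train_leaves]


def knn_uncertainty(train_leaves, train_targets, test_leaves, point_prediction, k):
    """Return (location, scale): the GBRT point prediction and the population
    standard deviation of the targets of the k highest-affinity training instances."""
    if not 1 <= k <= len(train_targets):
        raise ValueError("k must be between 1 and the number of training instances")
    affinity = leaf_affinity(train_leaves, test_leaves)
    # Sort by descending affinity; break ties by index so the result is deterministic.
    order = sorted(range(len(affinity)), key=lambda i: (-affinity[i], i))
    neighbor_targets = [train_targets[i] for i in order[:k]]
    return point_prediction, pstdev(neighbor_targets)


if __name__ == "__main__":
    # Toy ensemble of 3 trees: each row lists the leaf index of one training instance per tree.
    train_leaves = [[0, 1, 2], [0, 1, 3], [4, 5, 2], [4, 5, 3]]
    train_targets = [1.0, 3.0, 10.0, 12.0]
    print(knn_uncertainty(train_leaves, train_targets, [0, 1, 2], 2.0, k=2))  # (2.0, 1.0)
```

Because only instances that co-occur in leaves with the test instance receive nonzero affinity, the cost of this step grows with the number of training examples per leaf, which matches the runtime behavior described in the abstract.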

Efficient, out-of-memory sparse MTTKRP on massively parallel architectures

Andy Nguyen, Ahmed E Helal, Fabio Checconi, Jan Laukemann, Jesmin Jahan Tithi, Yongseok Soh, Teresa Ranadive, Fabrizio Petrini, Jee W Choi

Tensor decomposition (TD) is an important method for extracting latent information from high-dimensional (multi-modal) sparse data. This study presents a novel framework for accelerating fundamental TD operations on massively parallel GPU architectures. In contrast to prior work, the proposed Blocked Linearized Coordinate (BLCO) format enables efficient out-of-memory computation of tensor algorithms using a unified implementation that works on a single tensor copy. Our adaptive blocking and linearization strategies not only meet the resource constraints of GPU devices, but also accelerate data indexing, eliminate control-flow and memory-access irregularities, and reduce kernel launching overhead. To address the substantial synchronization cost on GPUs, we introduce an opportunistic conflict resolution algorithm, in which threads collaborate instead of contending on memory access to discover and resolve their conflicting updates on the fly, without keeping any auxiliary information or storing non-zero elements in specific mode orientations. As a result, our framework delivers superior in-memory performance compared to the prior state of the art, and is the only framework capable of processing out-of-memory tensors. On the latest Intel and NVIDIA GPUs, BLCO achieves a 2.12–2.6× geometric-mean speedup (with up to 33.35× speedup) over the state-of-the-art mixed-mode compressed sparse fiber (MM-CSF) format on a range of real-world sparse tensors.

ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing, Article 26, Pages 1–13, June 2022

Dynamic Scheduling of Approximate Telemetry Queries

Chris Misa, Walt O'Connor, Ramakrishnan Durairajan, Reza Rejaie and Walter Willinger

Network telemetry systems provide critical visibility into the state of traffic flowing through modern computer networks. While significant progress has been made by leveraging programmable switch hardware to scale these systems to high and time-varying traffic workloads, less attention has been paid to efficiently utilizing limited hardware resources in the face of dynamics such as traffic composition and the number and types of queries running at a given time. To efficiently handle traffic and query dynamics, we develop DynATOS, the first scheduling system for running network traffic queries on constrained switch hardware while adapting to changing query and resource requirements. DynATOS leverages a novel time-division approach to approximation and multiplexes switch hardware resources among submitted queries using an optimization formulation. We prototype and evaluate DynATOS on a runtime-programmable switch hardware telemetry module.

Proceedings of the USENIX Symposium on Networked Systems Design and Implementation, Renton, WA, April 2022

AI in 5G: The Case of Online Distributed Transfer Learning over Edge Networks

Yulan Yuan, Lei Jiao, Konglin Zhu, Xiaojun Lin, Lin Zhang

This paper conducts a first-of-its-kind study of realizing online transfer learning in distributed cloud-edge networks, addressing crucial challenges of online model training, uncertain network environments, time-coupled decision making, and the balance between resource consumption and model accuracy.



SYMBIOMON: A High Performance, Composable Monitoring Service

S. Ramesh, R. Ross, M. Dorier, A. Malony, P. Carns, and K. Huck

High Performance Computing (HPC) software is evolving to support an increasingly diverse set of applications and heterogeneous hardware architectures. As part of this evolution, the construction of scientific software has shifted from a traditional monolithic message passing interface (MPI) executable model to a coupled, services-style model in which simulations run alongside a host of distributed HPC data services within the same batch job allocation. Microservices have emerged as a powerful new way to build these distributed data services through a composition model. However, performance analysis of composed microservices is a daunting challenge. It requires collecting, monitoring, aggregating, and exporting performance data from multiple sources. To be effective, such a monitoring solution must integrate seamlessly into HPC applications and distributed services alike, be scalable, operate with low overhead, and take advantage of the HPC platform. We propose SYMBIOMON, a monitoring service that is built by composing high-performance microservices. We describe its design and implementation within the context of the Mochi framework. SYMBIOMON combines a time-series data model with existing Mochi data services to collect, aggregate, and export performance metrics in a distributed manner. SYMBIOMON enables seamless, low-overhead monitoring and analysis of data services and HPC applications alike. Using HEPnOS, a production-quality Mochi data service, we demonstrate the use of SYMBIOMON to identify better service configurations.

28th IEEE International Conference on High Performance Computing, Data, and Analytics, December 2021

LEAP: Leakage-Abuse Attack on Efficiently Deployable, Efficiently Searchable Encryption with Partially Known Dataset

Jianting Ning, Xinyi Huang, Geong Sen Poh, Jiaming Yuan, Yingjiu Li, Jian Weng, Robert H. Deng

Efficiently deployable, efficiently searchable encryption (EDESE) enables private queries to be executed on encrypted documents in a practical manner. We propose LEAP, a new leakage-abuse attack on EDESE schemes that can accurately recover the underlying keywords of query tokens based on partially known documents. We conduct extensive experiments under varying levels of the attacker's background knowledge to demonstrate the effectiveness of our attack.

ACM Conference on Computer and Communications Security, pages 2307–2320, Seoul, South Korea, November 14-19, 2021

Crosslingual Transfer Learning for Relation and Event Extraction via Word Category and Class Alignments

Minh Van Nguyen, Tuan Ngo Nguyen, Bonan Min and Thien Huu Nguyen

We propose a novel crosslingual alignment method that leverages class information of Relation and Event Extraction (REE) tasks for representation learning. In particular, we propose to learn two versions of representation vectors for each class in an REE task based on either source or target language examples. Representation vectors for corresponding classes are then aligned to achieve class-aware alignment for crosslingual representations. In addition, we propose to further align representation vectors for language-universal word categories (i.e., parts of speech and dependency relations). To this end, a novel filtering mechanism based on adversarial learning is presented to facilitate the learning of word category representations from contextualized representations of input texts. We conduct extensive crosslingual experiments with English, Chinese, and Arabic over REE tasks. The results demonstrate the benefits of the proposed method, which significantly advances state-of-the-art performance in these settings.

In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, November 7-11, 2021

Minimizing Development Costs for Efficient Many-Core Visualization Using MCD3

K. Moreland, R. Maynard, D. Pugmire, A. Yenpure, A. Vacanti, M. Larsen, and H. Childs

Scientific visualization software increasingly needs to support many-core architectures. However, development time is a significant challenge due to the breadth and diversity of both visualization algorithms and architectures. With this work, we introduce a development environment for visualization algorithms on many-core devices that extends the traditional data-parallel primitive (DPP) approach with several existing constructs and an important new construct: meta-DPPs. We refer to our approach as MCD3: Meta-DPPs, Convenience routines, Data management, DPPs, and Devices. The twin goals of MCD3 are to reduce developer time and to deliver efficient performance on many-core architectures, and our evaluation considers both of these goals. For development time, we study 57 algorithms implemented in the VTK-m software library and determine that MCD3 leads to significant savings. For efficient performance, we survey ten studies looking at individual algorithms and determine that the MCD3 hardware-agnostic approach leads to performance comparable to hardware-specific approaches: sometimes better, sometimes worse, and better in the aggregate. In total, we find that MCD3 is an effective approach for scientific visualization libraries to support many-core architectures.

Parallel Computing, 108:102834, September 29, 2021

Performance-Portable Sparse Tensor Decomposition Kernels on Emerging Parallel Architectures

Sean Isaac Geronimo Anderson, Keita Teranishi, Daniel M. Dunlavy, Jee W. Choi

We leverage the Kokkos library to study the performance portability of parallel sparse tensor decompositions on CPU and GPU architectures. Our results show that, with a single implementation, Kokkos can deliver performance comparable to hand-tuned code for the simple array operations that make up tensor decomposition kernels on a wide range of CPU and GPU systems, and superior performance on CPUs for the matricized tensor times Khatri-Rao product, a key performance bottleneck in many tensor algorithms.

The 25th Annual IEEE Conference on High Performance Extreme Computing, September 21-23, 2021

Unleash GPT-2 Power for Event Detection

Amir Pouran Ben Veyseh, Viet Dac Lai, Franck Dernoncourt and Thien Huu Nguyen

Event Detection (ED) aims to recognize mentions of events (i.e., event triggers) and their types in text. Recently, several ED datasets in various domains have been proposed. However, the major limitation of these resources is the lack of sufficient training data for individual event types, which hinders the efficient training of data-hungry deep learning models. To overcome this issue, we propose to exploit the powerful pre-trained language model GPT-2 to generate training samples for ED. To prevent the noise inevitable in automatically generated data from hampering the training process, we propose to exploit a teacher-student architecture in which the teacher is supposed to learn anchor knowledge from the original data. The student is then trained on a combination of the original and GPT-generated data while being led by the anchor knowledge from the teacher. Optimal transport is introduced to facilitate the anchor knowledge-based guidance between the two networks. We evaluate the proposed model on multiple ED benchmark datasets, gaining consistent improvement and establishing state-of-the-art results for ED.

In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, August 2021

Evaluating Adaptive and Predictive Power Management Strategies for Optimizing Visualization Performance on Supercomputers

S. Brink, M. Larsen, H. Childs, and B. Rountree

Power is becoming an increasingly scarce resource on the next generation of supercomputers, and should be used wisely to improve overall performance. One strategy for improving power usage is hardware overprovisioning, i.e., systems with more nodes than can be run at full power simultaneously without exceeding the system-wide power limit. With this study, we compare two strategies for allocating power throughout an overprovisioned system (adaptation and prediction) in the context of visualization workloads. While adaptation has been suitable for workloads with more regular execution behaviors, it may not be as suitable for visualization workloads, since they can have variable execution behaviors. Our study considers a total of 104 experiments, which vary the rendering workload, power budget, allocation strategy, and node concurrency, including tests processing data sets of up to 1 billion cells and using up to 18,432 cores across 512 nodes. Overall, we find that prediction is a superior strategy for this use case, improving performance by up to 27% compared to an adaptive strategy.

Parallel Computing, 104-105:102782, July 2021

Artemis: Automatic Runtime Tuning of Parallel Execution Parameters Using Machine Learning

C. Wood, G. Georgakoudis, D. Beckingsale, D. Poliakoff, A. Gimenez, K. Huck, A. Malony, T. Gamblin

Portable parallel programming models offer the potential for high performance and productivity; however, they come with a multitude of runtime parameters that can significantly affect execution performance. Selecting the optimal set of those parameters, so that HPC applications perform well in different system environments and on different input data sets without time-consuming parameter exploration or major algorithmic adjustments, is non-trivial. Artemis is a method for online, feedback-driven, automatic parameter tuning using machine learning that is generalizable and suitable for integration into high-performance codes. Artemis monitors execution at runtime and creates adaptive models for tuning execution parameters, while being minimally invasive in application development and runtime overhead. The effectiveness of Artemis is demonstrated by optimizing the execution times of three HPC proxy applications: Cleverleaf, LULESH, and Kokkos Kernels SpMV. Evaluation shows that Artemis selects the optimal execution policy with over 85% accuracy, has a modest monitoring overhead of less than 9%, and increases execution speed by up to 47% despite its runtime overhead.

36th ISC High Performance Computing, June 24-July 2, 2021

Investigating In Situ Reduction via Lagrangian Representations for Cosmology and Seismology Applications

S. Sane, C. R. Johnson, and Hank Childs

Although many types of computational simulations produce time-varying vector fields, subsequent analysis is often limited to single time slices due to excessive costs. Fortunately, a new approach using a Lagrangian representation can enable time-varying vector field analysis while mitigating these costs. With this approach, a Lagrangian representation is calculated while the simulation code is running, and the result is explored after the simulation. Importantly, the effectiveness of this approach varies based on the nature of the vector field, requiring in-depth investigation for each application area. With this study, we evaluate its effectiveness for previously unexplored cosmology and seismology applications. We do this by considering encumbrance (on the simulation) and accuracy (of the reconstructed result). To inform encumbrance, we integrated in situ infrastructure with two simulation codes and evaluated it on representative HPC environments, performing Lagrangian in situ reduction using GPUs as well as CPUs. To inform accuracy, our study conducted a statistical analysis across a range of spatiotemporal configurations as well as a qualitative evaluation. In all, we demonstrate effectiveness for both cosmology and seismology: time-varying vector fields from these domains can be reduced to less than 1% of the total data via Lagrangian representations, while maintaining accurate reconstruction and requiring under 10% of total execution time in over 80% of our experiments.

Best Paper Award, Main Track (650 submissions), International Conference on Computational Science, pages 436–450, Krakow, Poland, June 16-18, 2021

Machine Unlearning for Random Forests

Jonathan Brophy and Daniel Lowd

Random forests are a widely used machine learning method. We show how to update them by removing training examples and their impact on the model. This is useful when people want their personal data removed from a machine learning system. Our method is much faster than retraining the models from scratch but yields the exact same results.

In Proceedings of the International Conference on Machine Learning, June 11, 2021

High Performance Streaming Tensor Decomposition

Yongseok Soh, Patrick Flick, Xing Liu, Shaden Smith, Fabio Checconi, Fabrizio Petrini, Jee Choi

In this study, we develop a new algorithm for computing tensor decomposition on streaming data that achieves up to 102× speedup over the state-of-the-art CP-stream algorithm through lower computational complexity and performance optimization. Sparse tensor decomposition is a popular method for analyzing multi-way data in applications such as signal processing, topic monitoring, and trend analysis. In many of these areas, data arrives in a streaming fashion over time (e.g., new updates on social media), and this poses significant challenges in performance and scalability for existing algorithms. To address this challenge, we devise a new algorithmic formulation that greatly reduces the computational complexity of existing algorithms, and apply hybrid synchronization, data blocking, and operation fusion to achieve significant speedup and scalability on 56 cores over prior state-of-the-art.

35th IEEE International Parallel and Distributed Processing Symposium, May 17-21, 2021

SYMBIOSYS: A Methodology for Performance Analysis of Composable HPC Data Services

S. Ramesh, A. Malony, P. Carns, R. Ross, M. Dorier, J. Soumagne, and S. Snyder

We propose SYMBIOSYS, a methodology for integrated performance analysis of HPC microservices frameworks and applications. We describe its design and implementation within the context of the Mochi framework. This integration is achieved by combining distributed callpath profiling and tracing with a performance data exchange strategy that collects fine-grained, low-level metrics from the RPC communication library and network layers. The result is a portable, low-overhead performance analysis setup that provides a holistic profile of the dependencies among microservices and how they interact with the Mochi RPC software stack. Using HEPnOS, a production-quality Mochi data service, we demonstrate the low-overhead operation of SYMBIOSYS at scale and use it to identify the root causes of poorly performing service configurations.

IEEE International Parallel and Distributed Processing Symposium, pp. 35–45, May 17-21, 2021

Learning for Learning: Predictive Online Control of Federated Learning with Edge Provisioning

Yibo Jin, Lei Jiao, Zhuzhong Qian, Sheng Zhang, Sanglu Lu

Summary: This paper designs novel online algorithms to operate federated learning over distributed cloud-edge networks, managing data transfer from user devices to edge clouds, resource provisioning at edge clouds, and federated learning between edge and central clouds, based only on predicted inputs about the dynamic and uncertain system environment.

IEEE INFOCOM, May 10-13, 2021

ALTO: Adaptive Linearized Storage of Sparse Tensors

Ahmed E. Helal, Jan Laukemann, Fabio Checconi, Jesmin Jahan Tithi, Teresa Ranadive, Fabrizio Petrini, Jee W. Choi

The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive Linearized Tensor Order (ALTO) format, a novel mode-agnostic (general) representation that keeps neighboring nonzero elements in the multidimensional space close to each other in memory for better data reuse, eliminates workload imbalance, and greatly reduces the synchronization overhead of tensor computations. ALTO achieves a geometric mean speedup of 8× over the best mode-agnostic (coordinate and hierarchical coordinate) formats, while delivering a geometric mean compression ratio of 4.3× relative to the best mode-specific (compressed sparse fiber) formats.

The 35th ACM International Conference on Supercomputing, April 27, 2021
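
The core idea behind a linearized format like ALTO is to map each multidimensional nonzero coordinate to a single integer whose bits interleave the bits of the mode indices, so that nonzeros that are close in the tensor tend to be close in memory once sorted by that integer. The sketch below is a simplified, hypothetical illustration: it assumes each mode receives just enough bits for its length and interleaves them round-robin from the least significant bit, skipping modes whose bits are exhausted. ALTO's actual adaptive bit allocation and ordering may differ, and the function names are illustrative.

```python
def bits_needed(n):
    """Bits required to represent indices 0..n-1 (at least one bit)."""
    return max(1, (n - 1).bit_length())


def linearize(coord, dims):
    """Interleave the bits of each mode index into one linear index (hypothetical scheme)."""
    if any(not 0 <= c < d for c, d in zip(coord, dims)):
        raise ValueError("coordinate out of range")
    widths = [bits_needed(d) for d in dims]
    index, out_bit = 0, 0
    for b in range(max(widths)):
        for mode, width in enumerate(widths):
            if b < width:  # this mode still has bits to contribute
                if (coord[mode] >> b) & 1:
                    index |= 1 << out_bit
                out_bit += 1
    return index


def delinearize(index, dims):
    """Invert linearize by scattering the interleaved bits back to each mode."""
    widths = [bits_needed(d) for d in dims]
    coord = [0] * len(dims)
    in_bit = 0
    for b in range(max(widths)):
        for mode, width in enumerate(widths):
            if b < width:
                if (index >> in_bit) & 1:
                    coord[mode] |= 1 << b
                in_bit += 1
    return tuple(coord)


if __name__ == "__main__":
    dims = (4, 4)
    # Sorting nonzeros by their linear index gives a single mode-agnostic ordering.
    nonzeros = [((3, 3), 1.5), ((0, 1), 2.0), ((1, 0), 0.5)]
    nonzeros.sort(key=lambda nz: linearize(nz[0], dims))
    print([coord for coord, _ in nonzeros])  # [(1, 0), (0, 1), (3, 3)]
```

Because every mode's index is recoverable from the single linear index, one sorted copy of the nonzeros can serve computations along any mode, which is the sense in which such a format is mode-agnostic.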

Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing

Minh Van Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh and Thien Huu Nguyen

We introduce Trankit, a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks in over 100 languages, and 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines in sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing while maintaining competitive performance for tokenization, multi-word token expansion, and lemmatization over 90 Universal Dependencies treebanks. Despite the use of a large pretrained transformer, our toolkit is still efficient in memory usage and speed. This is achieved by our novel plug-and-play mechanism with Adapters, in which a multilingual pretrained transformer is shared across pipelines for different languages. Our toolkit, along with pretrained models and code, is publicly available online, as are a demo website and a demo video for Trankit.

In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, April 19-23, 2021

On the Resilience of Internet Infrastructures in the Pacific Northwest to Earthquakes

Juno Mayer, Valerie Sahakian, Emilie Hooft, Douglas Toomey and Ramakrishnan Durairajan

The U.S. Pacific Northwest (PNW) is one of the largest Internet infrastructure hubs for several cloud and content providers, research networks, colocation facilities, and submarine cable deployments. Yet this region lies within the Cascadia Subduction Zone, and there is currently no quantitative understanding of the resilience of its Internet infrastructure to seismic forces. The main goal of this work is to assess the resilience of critical Internet infrastructure in the PNW to shaking from earthquakes. To this end, we have developed a framework called ShakeNet to understand the levels of risk that earthquake-induced shaking poses to wired and wireless infrastructures in the PNW.

Proceedings of Passive and Active Measurements (PAM), Virtual, March 2021

Cross-Task Instance Representation Interactions and Label Dependencies for Joint Information Extraction with Graph Convolutional Networks

Minh Van Nguyen, Viet Dac Lai and Thien Huu Nguyen

This paper presents a novel deep learning model that simultaneously solves the four tasks of IE (i.e., entity mention recognition, relation extraction, event trigger detection, and argument extraction) in a single model (called FourIE). Compared to the few prior works on jointly performing the four IE tasks, FourIE features two novel contributions to capture inter-dependencies between tasks. First, at the representation level, we introduce an interaction graph between instances of the four tasks that is used to enrich the prediction representation for one instance with those from related instances of other tasks. Second, at the label level, we propose a dependency graph for information types in the four IE tasks that captures connections between types expressed in an input sentence. A new regularization mechanism is introduced to enforce consistency between the gold and predicted type dependency graphs to improve representation learning. We show that the proposed model achieves state-of-the-art performance for joint IE in both monolingual and multilingual learning settings with three different languages.

In Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics, March 26, 2021

Differential Training: A Generic Framework to Reduce Label Noises for Android Malware Detection

Jiayun Xu, Yingjiu Li, Robert Deng

A common problem in machine learning-based malware detection is that training data may contain noisy labels, and it is challenging to make the training data noise-free at a large scale. To address this problem, we propose a generic framework to reduce the noise level of training data for any machine learning-based Android malware detection approach. In our experiments with three different Android malware detection approaches, our framework detects significant portions of wrong labels in different training datasets at different noise ratios and improves the performance of the Android malware detection approaches.

Presented at the 28th Network and Distributed System Security Symposium, February 21-24, 2021

Duality in action

P. Downen and Z. M. Ariola

We show how the concept of duality can be put to use in the theory and practice of programming languages and their implementations. Starting from a foundation of constructive logic as dialogues, we illustrate how it describes a symmetric language for computation, and survey several applications of the dualities found therein.

In International Conference on Formal Structures for Computation and Deduction, Buenos Aires, Argentina, 2021


What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation

Amir Pouran Ben Veyseh, Franck Dernoncourt, Quan Hung Tran and Thien Huu Nguyen

Acronyms are the short forms of phrases that facilitate conveying lengthy sentences in documents and serve as one of the mainstays of writing. Due to their importance, identifying acronyms and their corresponding phrases, i.e., acronym identification (AI), and finding the correct meaning of each acronym, i.e., acronym disambiguation (AD), are crucial for text understanding. Despite recent progress on these tasks, some limitations in the existing datasets hinder further improvement. More specifically, the limited size of manually annotated AI datasets and the noise in automatically created acronym identification datasets obstruct the design of advanced, high-performing acronym identification models. Moreover, the existing datasets are mostly limited to the medical domain and ignore other domains. To address these two limitations, we introduce a large, manually annotated AI dataset for the scientific domain. This dataset contains 17,506 sentences, substantially more than previous scientific AI datasets. We also present an AD dataset for the scientific domain with 62,441 samples, significantly larger than the previous scientific AD dataset. Our experiments show that existing state-of-the-art models fall far behind human-level performance on both datasets proposed in this work. In addition, we propose a new deep learning model that utilizes the syntactic structures of sentences to expand ambiguous acronyms. The proposed model outperforms the state-of-the-art models on the new AD dataset, providing a strong baseline for future research in this area.

In Proceedings of the 28th International Conference on Computational Linguistics, December 8-13, 2020

CCAMP: An Integrated Translation and Optimization Framework for OpenACC and OpenMP

J. Lambert, S. Lee, A. Malony, and J. Vetter

Heterogeneous computing and exploration into specialized accelerators are inevitable in current and future supercomputers. Although this diversity of devices is promising for performance, the array of architectures presents programming challenges. High-level programming strategies have emerged to face these challenges, such as the OpenMP offloading model and OpenACC. However, the varying levels of support for these standards within vendor-specific and open-source tools, as well as the lack of performance portability across devices, have prevented the standards from achieving their goals. To address these shortcomings, we present CCAMP, an OpenMP and OpenACC interoperable framework. CCAMP provides two primary facilities: language translation between the two standards and device-specific directive optimization within each standard. We show that by using the CCAMP framework, programmers can easily transplant non-portable code into new ecosystems for new architectures. Additionally, by using CCAMP's device-specific directive optimizations, users can achieve optimized performance across architectures using a single source code.

Supercomputing Conference, November 2020

MEPHESTO: Modeling Energy-Performance in Heterogeneous SoCs and Their Trade-Offs

M. Monil, M. Belviranli, S. Lee, J. Vetter, and A. Malony

Summary: This paper presents MEPHESTO, a novel and holistic approach for managing the balance of performance and energy in heterogeneous systems. The authors characterize applications and processing units (PUs) in terms of two memory contention factors (time factors and power factors) to achieve the desired trade-off between energy and performance for collocated kernel execution on heterogeneous systems. The authors believe that this investigation is the first to combine all of these factors, and they present a simple knob-based approach that expresses the target trade-off. The approach is evaluated on a diverse integrated shared-memory heterogeneous system with a CPU, GPU, and programmable vision accelerator. By using an empirical model for memory contention that provides up to 92% accuracy, the kernel collocation approach can provide a near-optimal ordering and placement based on the user-defined energy-performance trade-off parameter. Moreover, the dynamic programming-based heuristics provide up to 30% energy or 20% performance improvements compared with the greedy approaches commonly employed by previous studies.

ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 413–425, September 2020

Compiling With Classical Connectives

P. Downen and Z. M. Ariola

The study of polarity in computation has revealed that an "ideal" programming language combines both call-by-value and call-by-name evaluation; each of the two calling conventions is ideal for half of the types in a programming language. But this binary choice leaves out call-by-need, which is used in practice to implement lazy-by-default languages like Haskell. We show how the notion of polarity can be extended beyond the value/name dichotomy to include call-by-need by adding a sharing mechanism, which is enough to compile a Haskell-like functional language with user-defined types.

Logical Methods in Computer Science, Volume 16, Issue 3, pages 13:1-13:57, August 28, 2020

Kinds are calling conventions

P. Downen, Z. M. Ariola, S. Peyton Jones, and R. Eisenberg

A language supporting polymorphism is a boon to programmers: they can express complex ideas once and reuse functions in a variety of situations. However, polymorphism is a pain for compilers tasked with producing efficient code that manipulates concrete values. The paper presents a new intermediate language that allows efficient static compilation while still supporting flexible polymorphism. The key insight is to encode information about a value’s calling convention in the kind of its type, rather than in the type itself.

International Conference on Functional Programming, Jersey City, New Jersey, August 2020

Designing Leakage-Resilient Password Entry on Head-Mounted Smart Wearable Glass Devices

Yan Li, Yao Cheng, Weizhi Meng, Yingjiu Li, Robert H. Deng

With the boom of Augmented Reality (AR) and Virtual Reality (VR) applications, head-mounted smart wearable glass devices are becoming a popular way for users to access services such as email. However, most existing password entry schemes on smart glasses rely on additional computers or mobile devices connected to the glasses, which require users to switch between different systems and devices. This can greatly reduce the practicality and usability of smart glasses. In this research, we address this challenge by designing three anti-eavesdropping password entry schemes for stand-alone smart glasses and validating their practicality and usability in various situations.

Institute of Electrical and Electronics Engineers Transactions on Information Forensics and Security, July 30, 2020

Scheduling DDoS Cloud Scrubbing in ISP Networks via Randomized Online Auctions

Wencong You, Lei Jiao, Jun Li, Ruiting Zhou

This paper designs an online repetitive auction mechanism to enable Internet service providers to outsource malicious traffic scrubbing to third-party security service providers with scrubbing centers, to mitigate massive, distributed denial-of-service attacks with less economic cost and better network performance.

IEEE INFOCOM, July 6–9, 2020

Opportunities for Cost Savings with In-Transit Visualization

J. Kress, M. Larsen, J. Choi, M. Kim, M. Wolf, N. Podhorszki, S. Klasky, H. Childs, and D. Pugmire

We analyze the opportunities for in-transit visualization to provide cost savings compared to in-line visualization. We begin by developing a cost model that includes factors related to both in-line and in-transit visualization, which allows the two methods to be compared. We then run a series of studies to create a corpus of data for our model, using two different visualization algorithms, one computation-heavy and one communication-heavy, at concurrencies of up to 32,768 cores. Our primary results come from exploring the cost model within the context of this corpus. Our findings show that in-transit visualization consistently achieves significant cost efficiencies by running visualization algorithms at lower concurrency, and that in many cases these efficiencies are enough to offset the other costs (transfer, blocking, and additional nodes), making it cost-effective overall. Finally, this work informs future studies, which can focus on choosing ideal configurations for in-transit processing that consistently achieve cost efficiencies.

In ISC High Performance, pages 146–165, Frankfurt, Germany, June 2020

A First Comparative Characterization of Multi-cloud Connectivity in Today's Internet

Bahador Yeganeh, Ramakrishnan Durairajan, Reza Rejaie and Walter Willinger

Today’s enterprises are adopting multi-cloud strategies at an unprecedented pace. However, little is known about the performance aspects, routing issues, and topological features associated with currently available multi-cloud connectivity options. To shed light on the trade-offs between these connectivity options, we take a cloud-to-cloud perspective and present the results of a cloud-centric measurement study of a coast-to-coast multi-cloud deployment that a typical modern enterprise located in the US may adopt.

Proceedings of the Passive and Active Measurement Conference (PAM), Oregon, March 2020

Privacy-Preserving Network Path Validation

Binanda Sengupta, Yingjiu Li, Kai Bu, Robert Deng

For better quality of service in network communications, the source node often opts for a superior (or premium) network path to send packets to the destination node. Network path validation schemes enable each node on a network path to validate whether each packet has followed the specified path so far. In this work, we introduce two notions of privacy, path privacy and index privacy, in the context of network path validation. We design PrivNPV, a privacy-preserving network path validation protocol that satisfies both path privacy and index privacy. We discuss several attacks related to network path validation and how PrivNPV defends against them.

Association for Computing Machinery Transactions on Internet Technology, February 2020

Learning from Positive and Unlabeled Data with Arbitrary Positive Shift

Zayd Hammoudeh and Daniel Lowd

Positive-unlabeled learning is a method for building machine learning models from only positive examples and unlabeled examples, such as emails that are known to be spam and unknown emails, some of which are spam and some of which are not. An additional challenge with domains like spam is that spam is always changing – a spam filter trained a year ago may not work as well on today’s spam. We show how you can take positive and unlabeled data from the past, combine it with unlabeled data from the present, and build a model that works well today. Our key assumption is that the negative class (e.g., non-spam) doesn’t change significantly, even though the positive class (e.g., spam) may change arbitrarily.

In Proceedings of the 34th Conference on Neural Information Processing Systems, Virtual, 2020