PACT'18 Paper #43 Reviews and Comments
===========================================================================
Paper #43 Massively Parallel Skyline Computation For Processing-In-Memory
Architectures


Review #43A
===========================================================================

Overall merit
-------------
3. Weak accept

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
This paper proposes a skyline implementation for PIM architectures and shows that it performs better than traditional solutions.

Comments for author
-------------------
The paper is well written and I enjoyed reading about skyline computation as well as its mapping on PIM architectures. It was difficult for me to determine why the PIM architecture under consideration is (logically) different from distributed systems except for the fact that it might have better latency/bandwidth charcteristics. That said, the authors have done a reasonable job of proposing a solution suitable for PIM architectures and showing that it is better than traditional architectures. It was also not clear why the PIM architecture did not have measurements, appears to still be projected data although with some validation on FPGA implementation.

Overall, the paper was well written and could serve a great reference for those interested in the intersection of skyline computation and PIM architectures. Not clear however whether there are any generalized findings that are more broadly applicable and still novel/insightful.


Review #43B
===========================================================================

Overall merit
-------------
3. Weak accept

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
The paper describes the DSky algorithm to parallelize the data intensive Skyline computation on PIM architectures. The experimental results report comparison wrt previous algorithms used for the Skyline computation [21], [11] and [7]. The reported results prove higher throughput, good scalability and better energy consumption compared to CPUs and GPUs.

Comments for author
-------------------
The paper is well written and well structured.
The addressed topic od parallelizing the Skyline computation on PIM architecture is quite narrow. I'm not sure if this topic could match the general interests of the PACT community.

Questions to the authors:

The Skyline operator was first introduced in 2001 [9] as a new relational algebra operator.
The authors should specify more formally the similarities and differences between the Skyline operator and the concepts of Pareto dominance, Pareto curve and Pareto Optimality.

Please explain clearly how is the Pareto optimality related to the Skyline Operator.

When and why a user should take advantage of choosing the Skyline operator with respect to the Pareto analysis? When/why it has been used extensively in the database community?

Please refer to some previous literature on Pareto analysis for multi-objective DSE of computer architectures such as

1) Gianluca Palermo, Cristina Silvano, Vittorio Zaccaria:
ReSPIR: A Response Surface-Based Pareto Iterative Refinement for Application-Specific Design Space Exploration. IEEE Trans. on CAD of Integrated Circuits and Systems 28(12): 1816-1829 (2009)
2) Giovanni Mariani, Aleksandar Brankovic, Gianluca Palermo, Jovana Jovic, Vittorio Zaccaria, Cristina Silvano: A correlation-based design space exploration methodology for multi-processor systems-on-chip. DAC 2010: 120-125
3) Vittorio Zaccaria, Gianluca Palermo, Fabrizio Castro, Cristina Silvano, Giovanni Mariani:
Multicube Explorer: An Open Source Framework for Design Space Exploration of Chip Multi-Processors. ARCS Workshops 2010: 325-331

Some concepts in the Intro should be clarified such as the sentence: "Modern processors leverage the integration of many compute cores on a single chip to mitigate the effects of processing large dataset".

Why do you never mention in the Intro the presence of cache memories in the memory hierarchy?


The experimental results report comparison wrt previous algorithms used for the Skyline computation published in [21], [11] and [7]. 
However, the paper lacks of describing the application of the DSky algorithm to a practical use case. 
Please present a practical use case of application of the Skyline operator for the multi-objective optimization of manyore architectures.


Review #43C
===========================================================================

Overall merit
-------------
3. Weak accept

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
This paper proposes a massively parallel skyline algorithm optimized for a
specific commercial PIM system (i.e., UPMEM) as a representative example.
Skyline practically represents an operator to extract pareto optimal points in
multi-dimensional datasets, and therefore, has an important set of applications.
The idea is based on bulk-synchronous processing, enhanced data partitioning
strategies for load balancing.

+ The paper is well-written, clearly explaining major design trade-offs and
challenges.

+ The proposed framework can be extended to map similar data-intensive
applications to PIM efficiently.

- The PIM solution is restricted to the target platform, UPMEM, which comes with
  major restrictions regarding the execution model (i.e., very limited
  communication between processors). In fact, UPMEM conforms more to "processing
  near-memory (PNM)" than true "processing in-memory (PIM)". It is not clear how
  the proposed algorithm can be generalized to different PNM/PIM designs.

Comments for author
-------------------
- "PIM" these days is no more just tied to the DRAM technology.
 
- The logic in PIM systems can come in all different variations, where the
  processors do not necessarily have to operate in isolation, as the paper
  claims in the abstract. This, however, appears to be the case for the target
  PIM system of the paper, UPMEM.

- Please provide more detail on how the energy is calculated/measured. 

- Please keep white spaces below Figure captions.


Review #43D
===========================================================================

Overall merit
-------------
3. Weak accept

Reviewer expertise
------------------
1. No familiarity

Paper summary
-------------
This paper presents a parallel "skyline" algorithm for Processor-In-Memory architectures. The skyline operation is typically both data and compute intensive and this work addresses the challenges of executing it in parallel on a PIM system using a new load balancing strategy. Experiments using a skyline dataset generator mentioned in reference[9], demonstrate 2x to 14x throughput improvement relative to state-of-the-art CPU & GPU architecture based algorithms.

Comments for author
-------------------
The problem domain of PIM architectures and leveraging them for accelerating the performance of massive datasets processing applications is of current interest in the industry and research community.

Not knowing anything about skyline operations and PIM architectures, I found this paper written reasonably well enough to be educational and interesting.

As this paper is all about the skyline operation, a motivating example of the practical application of this operation in a modern and popular real world application/scenario would add more value. It will make this work appear to be less of a mere academic exercise.

Related to the previous comment, this work seems to rely on knowledge of prior work mentioned in reference [9] "S.Borszony et.al The skyline operator." A brief description of this prior work in the Introduction or Related Work section would have been helpful to reviewers who are unfamiliar with this area.

Again, in the Evaluation section 6.2 there's reference to a "standard skyline dataset generator[9]". Not having studied the prior work, it is not clear to me how "standard" or popular this input dataset is in the community.

Last but not the least, the last sentence of the abstract states "Despite our focus on the skyline problem, our work provides also the skeleton for a general parallel framework for .... PIM systems". I did not get this information from this paper and suggest removal of this claim.


Response by Vasileios Zois <vzois001@ucr.edu> (Author) (983 words)
---------------------------------------------------------------------------
We would like to thank all the reviewers for their very constructive comments. Below we discuss thoroughly the issues raised and present our replies:

**(Review43B) How is pareto optimality related to the skyline? When and why use the skyline operator?**  
Finding the skyline is the same problem as finding the Pareto optimal front in economics. The term skyline has been introduced in [9] (based on the Manhattan skyline example) and has since been used extensively in the database community. Works in economics rely on numerical methods to find the solution (i.e. maximize the Lagrangian) while database research utilizes pairwise comparisons called Dominance Tests (DTs) to discover the qualifying tuples. 

**(Review43B) Could you present a practical use case of the skyline operator?**  
**(Review43D) Can you provide a motivating example for computing the skyline?**  
Computing the skyline is important in multi-feature search queries: the user has to pick among products with competing characteristics (e.g. price, condition, age, quality) that cannot produce a strict ordering. A classic example is picking a hotel, given the hotel’s price and its distance to the beach. Although users prefer affordable hotels, those close to the beach are likely expensive. In this case, the skyline operator would present hotels that are no worse than any other in both price and distance to the beach. 

**(Review43A) How is PIM logically different from a distributed system?**  
	A typical PIM model [16, 30] does not support direct node-to-node communication: every data exchange needs to pass through the host CPU before reaching the target DPU. This is an important distinction to current distributed systems (e.g. following the mapreduce programming model) which do not provide opportunities to overlap communication with processing. 

**(Review43C) Relationship between PIM and PNM systems?**  
**(Review43C) PIM is not tied to the memory technology**  
	PIM [16,30] promotes the concept of seamlessly integrating processing units directly within main memory chips while maintaining the plug and play capabilities for CPU-based systems. PNM [4] assumes the existence of a logic chip 3D-stacked under one or more DRAM modules. 
We acknowledge that recent technological developments favor PIM configuration which are loosely integrated with DRAM technology (i.e. PNM). This does not affect our paper’s motivation and conclusions (see next).

**(Review43C) PIM processors do not necessarily operate in isolation.**  
	We would like to point out that different configurations of PIM systems exhibit varying levels of isolation. UPMEM’s PIM is an example of physical isolation, not allowing direct communication between compute nodes. Alternate PIM configurations (including PNM systems) offer logical isolation [4] but still need to deal with the cost of accessing remote memory partitions [4]. Our solution is structured to provide opportunities to mask the communication cost associated with either physical or logical isolation apparent in PIM processors. 

**(Review43A) What are the generalizable findings and insights of this work?**  
**(Review43C) Is DSky generalizable for different PIM/PNM configuration?**  
The proposed work is generalizable across two dimensions: (1) It provides a roadmap advocating for intelligent partitioning strategies suitable for other complex database operators (i.e. Join, Top-k selection). Complex database operators exhibit varying performance characteristics under different distributions. This work demonstrates that even under the tight architectural constraints of PIM systems, intelligent partitioning allows for avoiding expensive merging operations, hiding communication latency, and enabling massive parallelism. These principles are applicable to other complex operators that we plan to address in future work.
(2) The work is also applicable for different configurations of fully programmable PIM (including PNM) systems. Different PIM configurations exhibit varying levels of isolation, quantified by the communication overhead.  While the overhead maybe lower or higher under different architectural constraints (including different memory technologies), it is not negligible (even for PNM). This fact prioritizes local parallel processing over frequent data exchanges. As we focus on masking communication latency, our algorithmic optimizations are still applicable to other PIM configurations. Note that UPMEM’s PIM configuration and the skyline operator present the worst-case scenario encountered in the intersection of PIM processing and data management (the combination of UPMEM’s physical isolation with the inherently sequential skyline problem).

**(Review43A) Why the PIM architecture did not have measurements?**  
Since we did not have access to the actual PIM  (proprietary) hardware, all measurements were produced using UPMEM’s Cycle Accurate Simulator (CAS) which however also models the pipeline delays associated with DPU data access. UPMEM’s FPGA implementation emulates 8 DPUs at 200MHz and their DRAM memory (i.e. 64MB each) communicating through PCIe. CAS produces comparable latency measurements to the ones achieved on the FPGA. In addition, it simulates up to 4096 DPUs allowing for clock frequency adjustments up to 750MHz. Communication through the host was modelled after the speed of the PCIe interface, although UPMEM’s PIM would support direct access from the host.

**(Review43B) Why do you never mention cache hierarchies in the intro?**  
The skyline operator exhibits limited spatial and temporal locality because each point in the candidate set is accessed with varying frequency since it might dominate only few other points. As a result cache hierarchies will not be beneficial when processing large amounts of data. 

**(Review43C) Provide more information regarding energy calculations?**  
Our energy calculation was based on the Thermal Design Power (TDP) values from the corresponding spec sheets of Intel and NVIDIA (i.e. CPU=105 W, GPU=600 W). According to UPMEM the TDP for each 16 GB DIMM (256 PIM cores) is 10 W. For a system with 4096 PIM cores we calculated 160 W. The energy was calculated using the number of DTs 2^26 tuples on 16 dimension queries for all implementations.

**(Review43D) Elaborate on the standard skyline dataset generator.**  
The skyline generator proposed in [9] is the one commonly used in the database research community for generating synthetic data for skyline algorithmic comparisons (for example [7,8,11,25,26,31]). It provides parameters to create three types of distributions: correlated, independent, and anticorrelated.

**(Review43D) Removal of claim related to presenting the skeleton of a general parallel framework?**  
	We are happy to remove this claim from the paper.


Comment @A1 by Reviewer B
---------------------------------------------------------------------------
After TPC discussion, the reviewers recommend the authors to address the reviewers' comments in the camera ready version of the paper. In particular:
- Please explain clearly how is the Pareto optimality related to the Skyline Operator and when/why Skyline Operator has been used extensively in the database community;
- Please refer to some previous literature on Pareto analysis for multi-objective DSE of computer architectures and compilers such as RESPIR TCAD2009;
- Please present a practical use case of application of the Skyline operator for the multi-objective optimization of manycore architectures;