Checkpoint Report
project checkpoint URL: https://parallelpandas.weebly.com/checkpoint.html
Notes on Modification of Project Idea
At the initial stage of the project, we decided to implement parallel versions of selected heavily used functions of the Pandas library using CUDA on the GPU. However, after doing extensive research and initial tests, we realized that there are many limitations that constrain the amount of parallel optimization we could do. At the same time, while researching the topic, we found that K-means and K-means++ might be a more interesting idea to pursue. Therefore, we modified the project proposal to reflect what we will be working on from now on. In this checkpoint report, we give a detailed explanation of what we have done so far, both before and after we changed topics.
ORIGINAL project idea: Parallelize Pandas Functions
- Work completed so far:
- Set up the data pipeline for Python data to run on multiprocessors using both PyCUDA and Cython;
- Used PyCUDA to implement several Pandas reduction functions, such as SUM and MEAN, on the GPU;
- Used Cython to inject OpenMP code into the original Pandas GroupBy code to allow parallel processing.
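For context on the reduction pattern involved in functions like SUM and MEAN, here is a pure-Python sketch of the tree (pairwise) reduction a GPU reduction kernel performs across threads. This is an illustration of the pattern only, not our PyCUDA code:

```python
def tree_sum(values):
    """Pairwise (tree) reduction: the pattern a GPU SUM kernel uses.

    On a GPU, each level of the tree runs in parallel across threads,
    giving O(log n) parallel steps; here the levels run sequentially.
    """
    vals = list(values)
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        # "thread" i combines element i with element i + half, if it exists
        vals = [vals[i] + vals[i + half] if i + half < len(vals) else vals[i]
                for i in range(half)]
    return vals[0]

print(tree_sum([1, 2, 3, 4, 5]))  # 15
```

MEAN follows the same pattern: reduce to a sum, then divide by the element count.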
- Issues we faced:
- After using PyCUDA to implement the reduction functions on CUDA, we found that PyCUDA already supports most of these reduction functions, and its implementations are quite up to date and efficient. Therefore, our work in this area might not be very useful.
- Initially, our idea for the next step was to use CUDA on the GPU to parallelize the non-trivial Pandas functions, such as GroupBy and Join. However, we found that most of these functions require either hash tables or string operations, neither of which is well supported by CUDA on the GPU. It would be too much of an overhead to transform the data and data structures in order to use the GPU for the computations.
- We then realized that the fundamental issue with our project is that most Pandas functions are not computationally heavy, which makes the GPU a poor choice. We turned instead to OpenMP, which is supported by Cython. However, after injecting OpenMP code into the original Pandas GroupBy code, we found that the parallel version is much slower than the original sequential one. Also, the source code of these functions is quite complicated, given the design of the core data structures, so there is limited room for us to bypass most of the original implementation and parallelize the code effectively.
NEW project idea: Parallelize Clustering Algorithm
- Work completed so far:
- Modified the K-means and K-means++ source code to inject time-profiling code;
- Implemented an accuracy-evaluation script to calculate the sum of squared errors (SSE);
- Cleaned and customized clustering datasets with various sizes and parameter sets to be used in our project;
- Ran and recorded a baseline for the sequential algorithm;
- Learned to use the GNU gprof tool to profile the time spent in each function of the script.
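The SSE metric used by our accuracy-evaluation script can be sketched as follows. This is a minimal NumPy version for illustration; the function and array names here are our own, not those of the actual script:

```python
import numpy as np

def sum_squared_error(points, labels, centroids):
    """Sum, over all points, of the squared distance to the assigned centroid.

    points:    (n, d) array of data points
    labels:    (n,)   array of cluster assignments in [0, k)
    centroids: (k, d) array of cluster centers
    """
    diffs = points - centroids[labels]   # (n, d) residual of each point
    return float(np.sum(diffs * diffs))

# Tiny usage example: two clusters, each point 1 unit from its center.
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
lab = np.array([0, 0, 1, 1])
cen = np.array([[0.0, 1.0], [10.0, 1.0]])
print(sum_squared_error(pts, lab, cen))  # 4.0
```

Lower SSE means tighter clusters, which is why we use it to compare the parallel output against the sequential baseline.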
- Baseline Result (sequential version):
3. A New List of Goals:
Plan to achieve
- Make use of MPI to effectively speed up the K-means algorithm;
- Use MPI and other performance-optimization strategies to speed up the sequential K-means++ initialization algorithm;
- Vary the input size, such as the dimension of the data and the number of clusters, and ensure that the algorithm achieves good speedup in each scenario;
- Since the calculation of pairwise distances is quite computationally heavy and largely independent per point, we would like to look into incorporating CUDA on the GPU into our MPI-optimized algorithm to speed up these computations.
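To make the target of these goals concrete, here is a minimal NumPy sketch of K-means++ seeding (D^2 sampling). The names are our illustration, not our actual code base; the per-point distance update marked below is the heavy, independent work we would hand to MPI ranks or CUDA threads:

```python
import numpy as np

def kmeans_pp_init(points, k, rng):
    """Pick k initial centers by D^2 sampling (K-means++).

    The dominant cost is the per-point squared-distance update below,
    which is independent across points and thus parallelizable.
    """
    n = points.shape[0]
    centers = [points[rng.integers(n)]]                  # first center: uniform
    d2 = np.sum((points - centers[0]) ** 2, axis=1)      # dist^2 to nearest center
    for _ in range(k - 1):
        probs = d2 / d2.sum()                            # D^2 weighting
        centers.append(points[rng.choice(n, p=probs)])
        # heavy step: refresh each point's distance to its nearest center
        d2 = np.minimum(d2, np.sum((points - centers[-1]) ** 2, axis=1))
    return np.stack(centers)

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centers = kmeans_pp_init(pts, 2, rng)
# With two tight, well-separated blobs, D^2 sampling strongly favors
# placing the two seeds in different blobs.
```

The sequential dependency between rounds (each new center depends on the previous `d2`) is exactly why K-means++ is harder to parallelize than the Lloyd iterations themselves, and why we treat it as a separate goal.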
4. Issues and Concerns:
- Latedays was recently reconfigured to kill jobs that run for more than 5 minutes. This might limit the problem size our program can tackle.
- The current dataset we found has a limited number of real clusters. Therefore, even if we increase the number-of-clusters parameter, many of the clusters end up empty, and the accuracy measurement is the same as for a smaller K. This might affect us when we try to make tradeoffs between performance and accuracy.
5. Updated Schedules:
- Nov 20-25: Finish basic K-means MPI optimization, and start optimizing K-means++;
- Nov 26-30: Finish basic K-means++ MPI optimization, and start modifying strategies for different input sizes;
- Dec 1-5: Make progress in customizing optimization strategies for small and moderate data sizes and various K values (number of clusters);
- Dec 6-10: Finish customizing optimization for small and moderate data sizes, and make progress on optimization for large datasets;
- Dec 12-16: Finish optimization for large datasets; finish report and poster.
6. What to show at the poster session:
- We will display a performance graph showing the difference in performance between our parallel implementation and the original sequential version of the algorithm under different problem sizes;
- We would like to provide a live demo of how the clustering algorithm performs. Hopefully we can provide a visual representation (such as a series of graphs) showing how the algorithm evolves on a 2-dimensional dataset.