

Open Source Project Ideas


This is a list of projects and the ideas associated with each. Each of these can be developed into a class, senior, or summer project. If you are interested in any of these projects, please read this on how to get involved. If you have any questions, please visit our Gitter channel: https://gitter.im/uccross/gsoc

ANNOUNCING - Summer Undergraduate Research Program in Open Source Projects at UCSC.


LiveHD

Projects for LiveHD.

   
Title: Floorplanner for LiveHD
Description: Implement a state-of-the-art floorplanner with hierarchy and regularity for LiveHD. The goal is to typically target designs of 10-1000 blocks.
Mentor(s): Jose Renau
Skills: C++17
Difficulty: Medium
Link: https://github.com/masc-ucsc/livehd/blob/master/docs/projects.md#floorplaner

Title: LNAST Syntax Check Pass
Description: A LiveHD pass to check that the LNAST tree structure is valid.
Mentor(s): Jose Renau
Skills: C++17
Difficulty: Medium
Link: https://masc.soe.ucsc.edu/lnast-doc/#introduction

Title: CHISEL/FIRRTL Checks
Description: Create tests and sample CHISEL designs to test the FIRRTL passes in LiveHD.
Mentor(s): Jose Renau
Skills: CHISEL, Scala, C++17
Difficulty: Medium
Link: https://github.com/masc-ucsc/livehd/blob/master/docs/projects.md#firrtl-2-lnast-hunter-coffman

Title: Mockturtle Synthesis
Description: Add more blocks to the Mockturtle pass, add test cases, and parallelize calls.
Mentor(s): Jose Renau
Skills: C++17
Difficulty: Medium
Link: https://github.com/masc-ucsc/livehd/blob/master/docs/projects.md#parallel-and-hierarchical-synthesis-with-mockturtle

Title: Tree-sitter Pyrope Grammar
Description: Build a tree-sitter grammar for Pyrope.
Mentor(s): Jose Renau
Skills: JavaScript, C
Difficulty: Medium
Link: https://github.com/masc-ucsc/livehd/blob/master/docs/projects.md#tree-sitter-pyrope

Title: Pyrope Language Server (vim/atom interface)
Description: The Language Server Protocol is the new standard for editor feedback. The goal is to implement a language server for Pyrope.
Mentor(s): Jose Renau
Skills: JavaScript, C++17
Difficulty: Medium
Link: https://langserver.org/

Title: Graph Partition/Coloring
Description: Traverse LGraph and interface with KaHyPar to partition graphs.
Mentor(s): Jose Renau
Skills: JavaScript, C++17
Difficulty: Medium
Link: https://github.com/masc-ucsc/livehd/blob/master/docs/projects.md#lgraph-partitiondecompositioncoloring, https://github.com/SebastianSchlag/kahypar

Title: Competitive Analysis: Vivado, Altera, ABC, Mockturtle
Description: Create a test suite and scripts to benchmark synthesis results.
Mentor(s): Jose Renau
Skills: JavaScript, TCL
Difficulty: Medium
Link: https://github.com/masc-ucsc/livehd/blob/master/docs/projects.md#synthesis-asicfpga-competitive-analysis

Title: Useful LGraph Passes
Description: Create several typical, useful netlist checks, such as combinational loop detection, cross-clock checks, and LGraph consistency checks.
Mentor(s): Jose Renau
Skills: C++17
Difficulty: Low
Link: https://github.com/masc-ucsc/livehd/blob/master/docs/projects.md#useful-lgraph-passes

Title: Performance Regression Infrastructure
Description: Website and reporting system using prometheus.io to monitor performance.
Mentor(s): Jose Renau
Skills: JavaScript
Difficulty: Low
Link: https://github.com/masc-ucsc/livehd/blob/master/docs/projects.md#performance-monitoring-infrastructure

CAvSAT

Inconsistent relational databases are those that violate one or more integrity constraints defined over their schema. We are developing CAvSAT, which aims to be a scalable and comprehensive system for query answering over inconsistent databases. A consistent answer to a query is one that holds in every possible repair of the inconsistent database, i.e., in every consistent database obtainable by deleting a minimal set of conflicting tuples.
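To make this concrete, here is a toy Python sketch with hypothetical data. Enumerating all repairs like this is exponential in general, which is why CAvSAT reduces the problem to satisfiability solving:

from itertools import product

# Relation emp(name PRIMARY KEY, dept) with a key violation on "alice".
facts = [("alice", "cs"), ("alice", "math"), ("bob", "cs")]

# Group conflicting facts by key; a repair keeps exactly one fact per key.
by_key = {}
for fact in facts:
    by_key.setdefault(fact[0], []).append(fact)
repairs = [set(choice) for choice in product(*by_key.values())]

# Query: which names appear in dept "cs"? A consistent answer is one
# that holds in every repair of the inconsistent database.
answers = [{name for name, dept in r if dept == "cs"} for r in repairs]
print(set.intersection(*answers))  # {'bob'} -- "alice" is not a consistent answer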

   
Title: CAvSAT (Consistent Answers via Satisfiability Solving)
Mentor(s): Akhil Dixit, Phokion G. Kolaitis
Skills: Java and SQL (required), Node.js and React (optional)
Description: In general, computing consistent answers over inconsistent databases is an intractable problem. However, for certain classes of SQL queries and integrity constraints, there is an efficient algorithm to compute consistent answers. The first task of this project is for the student to become familiar with the literature and implement this algorithm. Second, the student will conduct experiments with both synthetic and real-world datasets to compare the performance of this algorithm with other methods. If the student is interested in front-end or full-stack development, they may work on developing some of CAvSAT's user interface components using technologies such as React, and write code that connects to CAvSAT's backend via RESTful APIs.
Papers for Reference: https://dl.acm.org/doi/10.1145/3299869.3300095, https://link.springer.com/chapter/10.1007/978-3-030-24258-9_8, https://dl.acm.org/doi/10.1145/303976.303983, https://dl.acm.org/doi/10.1145/3068334
Project Link: https://github.com/uccross/cavsat
Difficulty: Medium

SkyhookDM

The Skyhook Data Management project extends object storage with data management functionality for tabular data. SkyhookDM enables storing and querying database tables in Ceph distributed object storage, and supports multiple data formats including Google Flatbuffers and Apache Arrow. SkyhookDM partitions and formats data as objects, and we utilize Ceph's object class extension mechanism ('cls') to develop data management methods that can be executed directly within storage.
Methods include offloading processing to storage (e.g., SELECT, PROJECT) as well as physical design methods including indexing and data layouts. Please see the current project ideas listed below. SkyhookDM on Github.


Ingest data via Python, convert to PyArrow tables, horizontally partition, and write to SkyhookDM

This project will create a Python client that reads raw input data in CSV and JSON form (no nested data), converts it to PyArrow tables, partitions the tables horizontally (creating an individual Arrow table for each partition), and writes each partition to SkyhookDM as an independent object, formatted with our Flatbuffer metadata wrapper.
Partitioning should be done with JumpConsistentHash on the specified key columns, and will augment our Python client writer that currently performs vertical (column-based) partitioning. Github issue.
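A minimal sketch of the intended flow, using the published Jump Consistent Hash algorithm (Lamping & Veach, 2014); the input file, key column, and the final write step are illustrative, since the actual SkyhookDM writer API is not shown here:

import pyarrow as pa
import pyarrow.csv

def jump_consistent_hash(key: int, num_buckets: int) -> int:
    # Jump Consistent Hash, ported from the paper's reference C++.
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) % (1 << 64)
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

def partition_horizontally(table: pa.Table, key_column: str, num_partitions: int):
    # Assign each row to a bucket, then materialize one Arrow table per bucket.
    buckets = [[] for _ in range(num_partitions)]
    for row, value in enumerate(table.column(key_column)):
        # Python's hash() stands in for a stable key hash here.
        key = hash(value.as_py()) & ((1 << 64) - 1)
        buckets[jump_consistent_hash(key, num_partitions)].append(row)
    return [table.take(rows) for rows in buckets if rows]

table = pa.csv.read_csv("input.csv")  # no nested data
for i, part in enumerate(partition_horizontally(table, "id", 16)):
    pass  # wrap `part` with the Flatbuffer metadata wrapper; write as object i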


Compaction of formatted database partitions within objects

This project will develop object class methods that merge (or, conversely, split) formatted data partitions within an object. Self-contained partitions are written (appended) to objects, so over time an object may contain a sequence of independent formatted data structures, e.g., a sequence of Arrow tables each representing a sub-partition. A compaction request will invoke a method that iterates over these data structures, combining (or splitting) them into a single larger data structure representing the complete data partition. In essence, this method performs a read-modify-write operation on an object's local data. Github issue.
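As a rough client-side illustration of the merge direction (the actual project implements this as a 'cls' method running inside storage, operating on the object's local data):

import pyarrow as pa

def compact(sub_partitions):
    # Merge a sequence of independently appended Arrow tables into a
    # single table representing the complete data partition.
    merged = pa.concat_tables(sub_partitions)
    # Collapse the per-append chunks into contiguous columns.
    return merged.combine_chunks()

Splitting would go in the opposite direction, e.g., writing back table.slice() ranges as separate self-contained structures.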


Port wiki to ReadTheDocs or other documentation platform

SkyhookDM's documentation is currently written as Github Wiki pages. We would like to move it to another platform such as ReadTheDocs, and to reorganize it and rewrite some sections as part of this effort. Github issue.


Database statistics collection on partitioned data


Add support for a relevant subset of operations on List data types

Array data is currently stored as lists within Arrow tables inside SkyhookDM. This project will investigate and implement a small subset of operations on list data types that can be offloaded ("pushed down") into storage for query processing. Common list manipulations that perform data reduction, such as filters or summary/aggregation methods (min, max, first, in), will be most useful to apply within storage, since these reduce the network IO transferred back to the client from the storage layer. We can look to Awkward Array as a reference for common operations on scientific array data. Github issue.
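For instance, data-reducing list operations look like the following in client-side PyArrow (the data here is illustrative); the project would push equivalent logic down into the storage-side 'cls' methods:

import pyarrow as pa
import pyarrow.compute as pc

# A column of variable-length arrays stored as an Arrow list type.
table = pa.table({"samples": pa.array([[1.2, 3.4], [0.5], [7.8, 9.0, 2.1]])})

lengths = pc.list_value_length(table["samples"])  # per-row list lengths
flat = pc.list_flatten(table["samples"])          # flatten before reducing
print(pc.max(flat))                               # summary over all values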


SkyhookDM/HDF5

HDF5 is a unique technology suite that makes possible the management of extremely large and complex data collections.



HDF5 - Apache Arrow Integration

Apache Arrow creates in-memory column stores that can be used to manage streamed data. Accessing this data through the HDF5 API would allow applications to take advantage of transient, column-oriented data streams, such as real-time data from high-speed scientific instruments and cameras. Bridging the gap between science applications and analytics tools that use HDF5 and Apache Arrow data streams could bring new kinds of tools and data together. This project will create a standalone HDF5 VOL connector that allows applications to make HDF5 calls to access Apache Arrow data.


HDF5 - Ceph/RADOS Integration

The Ceph distributed storage system provides object, block, and file system layer interfaces. A prototype HDF5 VOL connector has been developed to access the RADOS object storage layer, enabling HDF5 objects (datasets, groups, etc.) to be directly stored as RADOS objects. This project would expand the capabilities of this VOL connector, enabling HDF5 applications to store data directly in RADOS pools.


Column-storage in HDF5

Column-oriented storage provides efficient access to fields within records, across many rows. Adding this storage method to HDF5 would dramatically improve performance for applications that primarily access subsets of the fields in an HDF5 dataset.


Sparse data storage in HDF5

Sparse matrices have applications in many fields within science and mathematics. Storing and accessing them in HDF5 is currently inefficient, though, as HDF5 is optimized for storing dense arrays. Adding efficient storage of sparse data to HDF5 would dramatically improve performance for applications that wish to store and access sparse data. This could extend beyond sparse matrices proper to include any form of sparsely populated array or table.
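For reference, a common workaround today is to store the components of a compressed sparse format as separate dense datasets, as in this h5py sketch (dataset names and sizes are illustrative); native sparse storage in HDF5 would make such schemes unnecessary:

import h5py
from scipy import sparse

m = sparse.random(10000, 10000, density=0.001, format="csr")
with h5py.File("matrix.h5", "w") as f:
    g = f.create_group("sparse_matrix")
    g.create_dataset("data", data=m.data)        # nonzero values
    g.create_dataset("indices", data=m.indices)  # column indices
    g.create_dataset("indptr", data=m.indptr)    # row pointers
    g.attrs["shape"] = m.shape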


Metadata search in HDF5 with Database Solutions

Relational databases excel at many tasks, one of which is content queries. HDF5 does not currently have good methods for indexing and searching available to user applications, although prototyping work has been performed in a git branch. Instead of adding index and query operations directly to HDF5, this project would connect a database package, such as RocksDB or VoltDB, with HDF5, performing query and index operations in the database and array-oriented I/O with HDF5.

SkyhookDM/Proactive Data Containers (PDC)

Proactive Data Containers (PDC) are containers within a locus of storage (memory, NVRAM, disk, etc.) that store science data in an object-oriented manner. Managing data as objects enables powerful optimization opportunities for data movement and transformations, as well as storage mechanisms that take advantage of the deep storage hierarchy and enable automated performance tuning.


PDC - Ceph/RADOS Integration

The Ceph distributed storage system provides object, block, and file system layer interfaces. PDC has pluggable storage mechanisms, and the RADOS object storage layer within Ceph is an ideal target for storing PDC objects. This project would extend PDC to store its objects in RADOS pools.

Popper

Popper is a container-native workflow execution engine. A list of ideas for projects related to Popper is below. For any questions, please visit the Gitter channel or Slack channel.


CI Starter Workflows

Create example workflows for using Popper to run CI for projects in distinct languages. This would involve creating an examples/ folder in the Popper repo, and then one YAML file per example, where every example targets a distinct language. For instance, take this example of the Github Actions starter workflow; we would write the same workflow, but in the syntax of Popper. In this case, there would be an examples/ci/python-app.yml file whose syntax is not that of Github Actions but that of Popper. As part of this, How-To guides will be added to the documentation (docs/sections/guides) for each distinct language.
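A sketch of what examples/ci/python-app.yml could look like, using the uses/runs step fields shown in the Boombox section below; the image tags and commands are illustrative:

steps:
- id: install-deps
  uses: docker://python:3.8-slim
  runs: [pip, install, -r, requirements.txt]
- id: test
  uses: docker://python:3.8-slim
  runs: [python, -m, pytest]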


Machine Learning Performance Validation Workflows

Take a machine learning library and create a workflow for reproducing results presented in an academic article, a blog post, or a repository README file. For example, take the xLearn main README.md, where charts about the performance of the library are shown. For this project, we would create a workflow that obtains the data, installs the library, runs the benchmarks, and produces the charts in PDF/PNG format.

Other examples for which the same can be done are Horovod, LightGBM, Wordbatch, Open Graph Benchmark, or any other blog post, README, or article whose results you would like to reproduce.


Systems Performance Validation Workflows

Same as above but for reproducing performance reports for high-performance systems frameworks. For example, we can take the performance report that is periodically generated for SPDK, and create a workflow that builds SPDK, prepares the storage drives (SSDs), runs the performance benchmarks, and lastly creates a PDF containing the resulting charts.

Other projects for which this can be done as well are DPDK, Seastar, Scylla, or any other you would like to work on.


Boombox - Container-native Task and Workflow Specification

Extract the Workflow Spec from the Popper repository and place it in a standalone repository. As part of this, we will also extend the definition so that it can also be used to define generic, container-native tasks. For example:

tasks:
  install-deps:
    uses: docker://node:12-alpine
    runs: [yarn, install]
  build:
    uses: docker://node:12-alpine
    runs: [yarn, build]
  test:
    uses: docker://node:12-alpine
    runs: [yarn, test]

A task has the same structure as a Popper workflow step, with the difference that no ordering is implied in its definition. In other words, a workflow YAML file represents a sequence of steps, whereas a task definition file doesn't. In YAML terms, tasks is a dictionary whereas steps is a list.


Breakdancer - YAML-based Task Automation

Create a tool called breakdancer (installed as bd) in Go that takes a YAML file containing Boombox task definitions (see above) and executes them on demand. This allows users to easily automate tasks without having to code them in a programming language: they only need to provide a YAML file describing which images to run, which arguments to pass, and, if needed, Bash scripts defined inline. Another way of looking at it: breakdancer turns a YAML file into an on-the-fly CLI tool.

Given a YAML file with tasks like the one shown above, assuming it is stored in a tasks.yml file, the following runs the install-deps and test tasks:

bd install-deps
bd test
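Behaviorally, bd would do something like the following sketch (shown in Python for brevity, while the tool itself is proposed in Go; the docker invocation assumes the uses/runs fields from the Boombox spec above):

import subprocess
import sys

import yaml  # PyYAML

def run_task(name, tasks_file="tasks.yml"):
    # Look up the named task in the Boombox definitions file.
    with open(tasks_file) as f:
        task = yaml.safe_load(f)["tasks"][name]
    image = task["uses"].replace("docker://", "", 1)
    # Run the task's command inside the container image it names.
    return subprocess.call(["docker", "run", "--rm", image] + task["runs"])

if __name__ == "__main__":
    sys.exit(run_task(sys.argv[1]))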

For this project, we will create a brand new repository, containing all the usual elements of an open source project.


Podman Container Engine

Using the Podman Python API, add support for building and running containers with the Podman engine, similar to how it is currently done for Docker (see code here).


Kaniko support

Kaniko is a tool that makes it possible to build Docker images without root privileges. The Kaniko builder runs as a normal, unprivileged container inside a Docker container and pushes images directly to a container registry. In addition, it can be used to speed up image builds in scenarios where the Docker cache is cold (e.g., in CI systems).

This project consists of adding support for Kaniko in Popper. The main idea is to expose a --kaniko flag that enables the use of Kaniko when building images. For this, the user needs to specify a registry account in the configuration options.


Port website and documentation to Hugo Docsy

Consolidate the documentation and landing page of the project by migrating to Hugo, and use the Docsy template. This page will reside in its own repository.


Add Introduction Section to Documentation

Add an introduction section to the documentation. With the aid of diagrams, this section will explain the concepts of: workflows, containers, container runtimes, container engines, container-native workflows, and resource managers.


Computational Research Guide

Create a guide on how to run Python- and R-based computational projects using Docker, Popper, Travis, and Zenodo, in order to address the "long tail" of computational research. One way to guide this work is to look at codeocean.com and how they address these issues of usability. We would have a list of predefined workflows that someone using Jupyter (R or Python) for computational science research can copy into their projects, so that they can launch a Jupyter notebook in interactive mode and also run it in non-interactive mode. This guide would also explain how to download/upload datasets from/to Zenodo.

You can think of computational projects as R or Python code that makes use of Jupyter and downloads data from Zenodo. For example, we can create a Github repository template like https://drivendata.github.io/cookiecutter-data-science/ that people can check out; we would modify it so that it includes a Popper workflow to run the distinct steps. We can create two templates, one for Python and another for R (see also https://arxiv.org/abs/1401.200). The guide should read as presenting Popper as an alternative to codeocean.com.


Dockerfile Generator Application

Create a cross-platform GUI that aids in the generation of Dockerfiles, similar to what https://phpdocker.io/generator does, but for creating images based on Debian (or Ubuntu) or Python, with support for specifying a list of apt and pip packages to include in the Dockerfile. To implement this in a cross-platform manner, a library such as Kivy can be used.
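At its core, the generator is simple templating over the user's selections; a minimal sketch (the base image and package lists are illustrative):

def generate_dockerfile(base="debian:buster", apt_pkgs=(), pip_pkgs=()):
    # Render a Dockerfile from the selections made in the GUI.
    lines = ["FROM " + base]
    if apt_pkgs:
        lines.append("RUN apt-get update && apt-get install -y "
                     + " ".join(apt_pkgs))
    if pip_pkgs:
        lines.append("RUN pip install " + " ".join(pip_pkgs))
    return "\n".join(lines) + "\n"

print(generate_dockerfile(apt_pkgs=["git"], pip_pkgs=["numpy"]))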


Improve Presentation of Underlying Concepts in Documentation

Provide introductory material that serves as an overview of the different concepts and technologies involved in executing container-native workflows: operating system (Linux), language runtimes (e.g., Python), containers (Docker, Singularity), and resource managers (Kubernetes, SLURM). The goal is to explain clearly what Popper does and what users can accomplish by using it. Currently, the documentation assumes expertise in all of these topics, but we would like to lower the entry barrier for new users.