This is a list of projects and the ideas associated with each. Each of these can be developed into a class, senior, or summer project. If you are interested in any of these projects, please read this on how to get involved. If you have any questions, please visit our Gitter channel.
Projects for LiveHD.
|Title||Floorplanner for LiveHD|
|Description||Implement a state-of-the-art floorplanner with hierarchy and regularity for LiveHD. The typical target is 10-1000 blocks.|
|Title||LNAST Syntax Check Pass|
|Description||LiveHD pass to check that the LNAST tree structure is valid.|
|Description||Create tests and sample CHISEL tests to test the FIRRTL passes in LiveHD|
|Skills||CHISEL, scala, C++17|
|Description||Add more blocks to the Mockturtle pass, add test cases, and parallelize calls|
|Title||Tree-sitter Pyrope Grammar|
|Description||Build a tree-sitter Pyrope grammar|
|Title||Pyrope Language Server (vim/atom interface)|
|Description||Language Server is the new standard for editor feedback. The goal is to implement a language server for Pyrope.|
|Description||Traverse LGraph and interface with KaHyPar to partition graphs|
|Title||Competitive Analysis: Vivado, Altera, ABC, Mockturtle|
|Description||Create a test suite and scripts to benchmark synthesis results|
|Title||Useful LGraph Passes|
|Description||Create several useful netlist checks, such as combinational-loop detection, cross-clock checks, and LGraph consistency checks|
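As an illustration of the first check, combinational-loop detection can be framed as cycle detection on the netlist graph, where sequential cells (flops) break paths. The sketch below is a minimal, hypothetical example: it does not use the LGraph API, and the `fanout` dictionary and the `sequential` set are assumed stand-ins for a real netlist traversal.

```python
from typing import Dict, FrozenSet, List, Optional


def find_combinational_loop(
    fanout: Dict[str, List[str]],
    sequential: FrozenSet[str] = frozenset(),
) -> Optional[List[str]]:
    """Return one combinational loop as a list of cell names, or None.

    `fanout` maps each cell to the cells it drives (hypothetical format).
    Cells listed in `sequential` are treated as path breakers.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for succ in fanout.get(node, ()):
            if succ in sequential:
                continue  # a flop ends the combinational path
            state = color.get(succ, WHITE)
            if state == GREY:
                # Back edge: the loop is the path segment from succ onward.
                return path[path.index(succ):] + [succ]
            if state == WHITE:
                loop = visit(succ)
                if loop:
                    return loop
        path.pop()
        color[node] = BLACK
        return None

    for node in fanout:
        if node in sequential or color.get(node, WHITE) != WHITE:
            continue
        loop = visit(node)
        if loop:
            return loop
    return None


netlist = {"a": ["b"], "b": ["c"], "c": ["a"]}
loop_without_flop = find_combinational_loop(netlist)
loop_with_flop = find_combinational_loop(netlist, frozenset({"c"}))
```

The recursive traversal keeps the sketch short; a pass over large designs would more likely use an iterative traversal to avoid recursion limits.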
|Title||Performance Regression Infrastructure|
|Description||Website and reporting system using prometheus.io to monitor performance|
Inconsistent relational databases are the ones that violate one or more integrity constraints defined over their schema. We are developing CAvSAT, which aims to be a scalable and comprehensive system for query answering over inconsistent databases.
|Title||CAvSAT (Consistent Answers via Satisfiability Solving)|
|Mentors||Akhil Dixit, Phokion G. Kolaitis|
|Skills||Java and SQL (required), Node.js and React (optional)|
|Description||In general, computing consistent answers over inconsistent databases is an intractable problem. However, for certain classes of SQL queries and integrity constraints, there is an efficient algorithm to compute consistent answers. The first task of this project is to make the student familiar with the literature and implement this algorithm. Second, the student will conduct experiments with both synthetic and real-world datasets, to compare the performance of this algorithm with other methods. If the student is interested in front-end or full-stack development, they may work on developing some of CAvSAT’s user interface components using technologies such as React and write code that connects to CAvSAT’s backend via RESTful APIs.|
|Papers for Reference||https://dl.acm.org/doi/10.1145/3299869.3300095, https://link.springer.com/chapter/10.1007/978-3-030-24258-9_8, https://dl.acm.org/doi/10.1145/303976.303983, https://dl.acm.org/doi/10.1145/3068334|
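To make the notion of a consistent answer concrete, the sketch below applies the textbook definition to a toy relation with a primary-key constraint: a repair keeps exactly one tuple per key value, and an answer is consistent if it is returned in every repair. This is a brute-force illustration only. It is neither the efficient algorithm mentioned in the description nor CAvSAT's SAT-based approach, and all names in it are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Hashable, Iterable, Iterator, List, Set, Tuple

Row = Tuple[str, str]


def repairs(rows: Iterable[Row], key: Callable[[Row], Hashable]) -> Iterator[List[Row]]:
    """Enumerate every repair: one tuple chosen per key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    for choice in product(*groups.values()):
        yield list(choice)


def consistent_answers(
    rows: Iterable[Row],
    key: Callable[[Row], Hashable],
    query: Callable[[List[Row]], Iterable[Hashable]],
) -> Set[Hashable]:
    """Intersect the query result over all repairs (exponential in general)."""
    result = None
    for repair in repairs(rows, key):
        answers = set(query(repair))
        result = answers if result is None else result & answers
    return result or set()


# employee(name, dept) with primary key `name`; "alice" violates the key.
employees = [("alice", "sales"), ("alice", "hr"), ("bob", "hr")]


def in_hr(db: List[Row]) -> List[str]:
    return [name for name, dept in db if dept == "hr"]


hr_answers = consistent_answers(employees, key=lambda r: r[0], query=in_hr)
```

Here `bob` is in `hr` in both repairs, whereas `alice` is in `hr` in only one of them, so only `bob` is a consistent answer. The number of repairs grows exponentially with the number of conflicting keys, which is why this approach does not scale.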
The Skyhook Data Management project (SkyhookDM) extends object storage with data
management functionality for tabular data. SkyhookDM enables storing and querying
database tables in Ceph distributed object storage, and supports multiple data
formats including Google Flatbuffers
and Apache Arrow. SkyhookDM partitions and formats
data as objects, and we utilize Ceph’s object class extension mechanism ‘cls’
to develop data management methods that can be executed directly within storage.
Methods include offloading processing to storage (e.g., SELECT, PROJECT) as well as physical design methods including indexing and data layouts. Please see current project ideas listed below. SkyhookDM on Github.
This project will create a Python client to read raw input data
in CSV and JSON form (no nested data), convert it to
pyarrow tables, partition the tables horizontally (creating an individual Arrow
table for each partition), and write each partition to SkyhookDM as
an independent object, formatted with our
Flatbuffer metadata wrapper.
Partitioning should be done with JumpConsistentHash on the specified key columns, and will augment our Python client writer, which currently performs vertical (column-based) partitioning. Github issue.
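The horizontal partitioning step can be sketched as follows. The hash function follows the published jump consistent hash algorithm (Lamping and Veach); everything else is an assumption made for illustration: rows are plain dictionaries rather than pyarrow tables, and `key_to_uint64` is a hypothetical way to derive a stable 64-bit key from the key columns.

```python
import hashlib
from typing import Any, Dict, List, Sequence

MASK64 = 0xFFFFFFFFFFFFFFFF


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit key to a bucket in [0, num_buckets)."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, wrapped as in unsigned C arithmetic.
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def key_to_uint64(row: Dict[str, Any], key_columns: Sequence[str]) -> int:
    """Hypothetical helper: hash the key column values to a stable 64-bit int."""
    raw = "\x1f".join(str(row[c]) for c in key_columns).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(raw, digest_size=8).digest(), "big")


def partition_rows(
    rows: Sequence[Dict[str, Any]],
    key_columns: Sequence[str],
    num_partitions: int,
) -> List[List[Dict[str, Any]]]:
    """Assign every row to exactly one horizontal partition."""
    partitions: List[List[Dict[str, Any]]] = [[] for _ in range(num_partitions)]
    for row in rows:
        bucket = jump_consistent_hash(key_to_uint64(row, key_columns), num_partitions)
        partitions[bucket].append(row)
    return partitions


rows = [{"id": i, "site": f"s{i % 3}", "value": i * 0.5} for i in range(100)]
parts = partition_rows(rows, key_columns=["id", "site"], num_partitions=4)
```

A useful property of this hash is that growing from `n` to `n + 1` partitions only moves rows into the new partition, so existing objects never exchange rows with each other.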
This project will develop object class methods that will merge (or conversely split) formatted data partitions within an object. Self-contained partitions are written (appended) to objects, and over time objects may contain a sequence of independent formatted data structures, e.g., a sequence of Arrow tables each representing a sub-partition. A compaction request will invoke this method, which will iterate over the data structures, combining (or splitting) them into a single larger data structure representing the complete data partition. In essence, this method will perform a read-modify-write operation on an object’s local data. Github issue.
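The merge and split logic can be sketched independently of Ceph. In the hypothetical example below, a dictionary of column lists stands in for one Arrow table, and a list of such dictionaries stands in for the sequence of sub-partitions stored in an object. A real implementation would run as a `cls` method inside storage and operate on the actual formatted data.

```python
from typing import Any, Dict, List

Table = Dict[str, List[Any]]  # column name -> column values (stand-in for an Arrow table)


def compact(tables: List[Table]) -> Table:
    """Merge a sequence of sub-partitions into one table, preserving row order."""
    if not tables:
        return {}
    columns = list(tables[0])
    merged: Table = {name: [] for name in columns}
    for table in tables:
        if list(table) != columns:
            raise ValueError("schema mismatch between sub-partitions")
        for name in columns:
            merged[name].extend(table[name])
    return merged


def split(table: Table, max_rows: int) -> List[Table]:
    """Split one table into consecutive chunks of at most `max_rows` rows."""
    num_rows = len(next(iter(table.values()), []))
    return [
        {name: values[start:start + max_rows] for name, values in table.items()}
        for start in range(0, num_rows, max_rows)
    ]


sub_partitions = [
    {"id": [1, 2], "value": [0.1, 0.2]},
    {"id": [3], "value": [0.3]},
]
merged = compact(sub_partitions)
chunks = split(merged, max_rows=2)
```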
SkyhookDM’s documentation is currently written as Github Wiki pages. We would like to move it to another platform such as ReadTheDocs, to reorganize it and rewrite some sections as part of this effort. Github issue.
Array data is currently stored as lists within Arrow tables inside SkyhookDM. This project will investigate and implement a small subset of operations on list data types that can be offloaded (“pushed down”) into storage for query processing. Common list manipulations that perform data reduction, such as filters or summary/aggregation methods (min, max, first, in), will be most useful to apply within storage, since these will reduce the network IO transferred back to the client from the storage layer. We can look to awkward array for reference of common operations on scientific array data. Github issue.
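The data reduction that such operations provide can be illustrated with a small sketch. It assumes a list column is represented as a Python list of lists; the function names are hypothetical and do not correspond to existing SkyhookDM methods.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

# Reductions that turn each list into a single scalar before it leaves storage.
REDUCERS: Dict[str, Callable[[Sequence[Any]], Any]] = {
    "min": min,
    "max": max,
    "first": lambda values: values[0],
}


def reduce_list_column(column: List[List[Any]], op: str) -> List[Optional[Any]]:
    """Apply a reduction to every list in the column; empty lists yield None."""
    reducer = REDUCERS[op]
    return [reducer(values) if values else None for values in column]


def rows_containing(column: List[List[Any]], needle: Any) -> List[int]:
    """The 'in' filter: indices of rows whose list contains `needle`."""
    return [i for i, values in enumerate(column) if needle in values]


column = [[3.0, 1.5, 2.0], [], [7.0, 9.0]]
minima = reduce_list_column(column, "min")
matches = rows_containing(column, 9.0)
```

In both cases the client receives one scalar or one row index per row instead of the full lists, which is where the reduction in network IO comes from.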
HDF5 is a unique technology suite that makes possible the management of extremely large and complex data collections.
The HDF5 technology suite includes:
Apache Arrow creates in-memory column stores that can be used to manage streamed data. Accessing this data through the HDF5 API would allow applications to take advantage of transient, column-oriented data streams, such as realtime data from high-speed scientific instruments and cameras. Bridging the gap between science applications and analytics tools that use HDF5 and Apache Arrow data streams could bring new kinds of tools and data together. This project will create a standalone HDF5 VOL connector that allows applications to make HDF5 calls to access Apache Arrow data.
The Ceph distributed storage system provides object, block, and file system layer interfaces. A prototype HDF5 VOL connector has been developed to access the RADOS object storage layer, enabling HDF5 objects (datasets, groups, etc) to be directly stored as RADOS objects. This project would expand the capabilities of this VOL connector, enabling HDF5 applications to store data directly in RADOS pools.
Column-oriented storage provides efficient access to fields within records, across many rows. Adding this storage method to HDF5 would dramatically improve performance for applications that primarily access subsets of the fields in an HDF5 dataset.
Sparse matrices have applications in many fields within science and mathematics. Storing and accessing them in HDF5 is inefficient though, as HDF5 is currently optimized for storing dense arrays. Adding efficient storage of sparse data in HDF5 would dramatically improve performance for applications that wish to store and access sparse data. This could extend beyond sparse matrices proper, and include any form of sparsely populated array or table.
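As a rough illustration of why a sparse layout helps, the sketch below converts a dense 2-D array to coordinate (COO) form, which stores only the non-fill entries together with their row and column indices. It is a generic example of one sparse representation, not a description of HDF5's storage format or of any HDF5 API, and the function names are hypothetical.

```python
from typing import Any, List, Tuple

Coo = Tuple[Tuple[int, int], List[int], List[int], List[Any]]


def dense_to_coo(dense: List[List[Any]], fill: Any = 0) -> Coo:
    """Keep only entries that differ from `fill`, with their coordinates."""
    rows: List[int] = []
    cols: List[int] = []
    values: List[Any] = []
    for r, row in enumerate(dense):
        for c, value in enumerate(row):
            if value != fill:
                rows.append(r)
                cols.append(c)
                values.append(value)
    shape = (len(dense), len(dense[0]) if dense else 0)
    return shape, rows, cols, values


def coo_to_dense(shape: Tuple[int, int], rows: List[int], cols: List[int],
                 values: List[Any], fill: Any = 0) -> List[List[Any]]:
    """Rebuild the dense array from its COO representation."""
    dense = [[fill] * shape[1] for _ in range(shape[0])]
    for r, c, v in zip(rows, cols, values):
        dense[r][c] = v
    return dense


dense = [[0, 0, 5], [0, 0, 0], [7, 0, 0]]
shape, rows, cols, values = dense_to_coo(dense)
```

For this 3x3 example the dense form holds nine elements, while the COO form holds two values plus two pairs of indices; the saving grows as the array becomes larger and sparser.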
Relational databases excel at many tasks, one of which is content queries. HDF5 does not currently have good methods for indexing and searching available to user applications, although prototyping work has been performed in a git branch. Instead of adding index and query operations directly to HDF5, this project would connect a database package, such as RocksDB or VoltDB, with HDF5, performing query and index operations in the database and array-oriented I/O with HDF5.
Proactive Data Containers (PDC) are containers within a locus of storage (memory, NVRAM, disk, etc.) that store science data in an object-oriented manner. Managing data as objects enables powerful optimization opportunities for data movement and transformations, and storage mechanisms that take advantage of the deep storage hierarchy and enable automated performance tuning.
The Ceph distributed storage system provides object, block, and file system layer interfaces. PDC has pluggable storage mechanisms and the RADOS object storage layer within Ceph is an ideal target for storing PDC objects. This project would extend PDC to store its objects in RADOS pools.
Create example workflows for using Popper for continuous integration (CI) of projects in
distinct languages. This would create an
examples/ folder in the Popper
repo, with one YAML file per example, where every example
covers a distinct language. For example,
take this example of the Github Actions starter workflow.
We would write the same workflow using the syntax of Popper. In this
case, there would be an
examples/ci/python-app.yml file, but the
syntax is not the one for Github Actions, it is the syntax for Popper.
As part of this, How-To guides will be added to the documentation
(docs/sections/guides) for each distinct language.
Take a machine learning library and create a workflow for reproducing
results presented in an academic article, a blog post, or a repository
README file. For example, take the
they show some charts about the performance of the library. For this
project, we would create a workflow that obtains data, installs the
library, runs the benchmarks, and produces the charts in PDF/PNG format.
Other examples for which the same can be done are Horovod, LightGBM, Wordbatch, Open Graph Benchmarks, or any other blog post, README, or article whose results you would like to reproduce.
Same as above but for reproducing performance reports for high-performance systems frameworks. For example, we can take the performance report that is periodically generated for SPDK, and create a workflow that builds SPDK, prepares the storage drives (SSDs), runs the performance benchmarks, and lastly creates a PDF containing the resulting charts.
Extract the Workflow Spec from the Popper repository and place it in a standalone repository. As part of this, we will also extend the definition so that it can also be used to define generic, container-native tasks. For example:
tasks:
  install-deps:
    uses: docker://node:12-alpine
    runs: [yarn, install]
  build:
    uses: docker://node:12-alpine
    runs: [yarn, build]
  test:
    uses: docker://node:12-alpine
    runs: [yarn, test]
A task has the same structure as a Popper Workflow step,
with the difference that no ordering is implied in its definition. In
other words, a workflow YAML file represents a sequence of steps,
whereas a task definition file doesn’t. In YAML terms,
tasks is a mapping, whereas
steps is a list.
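The structural difference can be shown with the parsed form of each file. The sketch below uses Python literals in place of parsed YAML, and it assumes that each workflow step carries an `id` field that can serve as the task name; `steps_to_tasks` is a hypothetical helper, not part of Popper.

```python
from typing import Any, Dict, List

IMAGE = "docker://node:12-alpine"

# A workflow: an ordered list of steps, executed first to last.
steps: List[Dict[str, Any]] = [
    {"id": "install-deps", "uses": IMAGE, "runs": ["yarn", "install"]},
    {"id": "build", "uses": IMAGE, "runs": ["yarn", "build"]},
    {"id": "test", "uses": IMAGE, "runs": ["yarn", "test"]},
]

# A task file: a mapping from task name to definition, with no implied order.
tasks: Dict[str, Dict[str, Any]] = {
    "install-deps": {"uses": IMAGE, "runs": ["yarn", "install"]},
    "build": {"uses": IMAGE, "runs": ["yarn", "build"]},
    "test": {"uses": IMAGE, "runs": ["yarn", "test"]},
}


def steps_to_tasks(workflow_steps: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Drop the ordering: index each step by its id."""
    return {
        step["id"]: {k: v for k, v in step.items() if k != "id"}
        for step in workflow_steps
    }


converted = steps_to_tasks(steps)
```

A task is looked up by name (`tasks["test"]`), whereas a step is reached by its position in the sequence (`steps[2]`).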
open source project from scratch
Create a tool called
breakdancer (installed as
bd) in Go that
takes a YAML file containing Boombox task definitions (see above), and
executes them on-demand. This allows users to easily automate tasks
without having to code them in a programming language, as they only
need to provide a YAML file describing which images to run, what
arguments to pass, and define Bash scripts inline if needed. Another
way of looking at this: breakdancer is an on-the-fly CLI tool.
Given a YAML file with tasks like the one shown above, assuming it is
stored in a
tasks.yml file, the following runs the
install-deps and test tasks:
bd install-deps
bd test
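The core of such a tool is the translation from a task definition to a container invocation. The project calls for Go; the sketch below uses Python only to illustrate the idea. It assumes that `runs` is the command executed inside the image and that the tool shells out to `docker run`, neither of which is specified above. The function only builds the argument list and does not execute anything.

```python
from typing import Any, Dict, List

SCHEME = "docker://"

# Parsed contents of the hypothetical tasks.yml shown earlier.
TASKS: Dict[str, Dict[str, Any]] = {
    "install-deps": {"uses": "docker://node:12-alpine", "runs": ["yarn", "install"]},
    "build": {"uses": "docker://node:12-alpine", "runs": ["yarn", "build"]},
    "test": {"uses": "docker://node:12-alpine", "runs": ["yarn", "test"]},
}


def docker_command(tasks: Dict[str, Dict[str, Any]], name: str) -> List[str]:
    """Build the `docker run` argument list for one named task."""
    if name not in tasks:
        raise KeyError(f"unknown task: {name}")
    task = tasks[name]
    uses = task["uses"]
    if not uses.startswith(SCHEME):
        raise ValueError(f"unsupported image reference: {uses}")
    image = uses[len(SCHEME):]
    # --rm removes the container once the task finishes.
    return ["docker", "run", "--rm", image, *task.get("runs", [])]


test_command = docker_command(TASKS, "test")
```

A complete tool would also need to mount the working directory into the container and forward the exit code, which this sketch leaves out.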
For this project, we will create a brand new repository, containing all the usual elements of an open source project.
Kaniko is a tool that makes it possible to build docker images without root privileges. The Kaniko builder runs inside a docker container as a normal container and pushes images directly to a container registry. In addition, it can be used to speed up the build of images in scenarios where the Docker cache is cold (e.g. in CI systems).
This project consists of adding support for Kaniko in Popper. The main
idea is to expose a
--kaniko flag that enables the use of Kaniko
when building images. For this, the user needs to specify a registry
account in the configuration options.
Add an introduction section to the documentation. With the aid of diagrams, this section will explain the concepts of: workflows, containers, container runtimes, container engines, container-native workflows, and resource managers.
Create a guide on how to run Python- and R-based computational projects using Docker, Popper, Travis and Zenodo in order to address the “long tail” of computational research. One way in which we can guide this work is by looking at codeocean.com and how they address these issues of usability. We would have a list of predefined workflows that someone using Jupyter (R or Python) for computational science research can copy to their projects so that they can launch a Jupyter notebook in interactive mode and also run it in non-interactive mode. This guide would explain how to download/upload datasets to Zenodo as well.
You can think of computational projects as R or Python code that makes use of Jupyter and downloads data from Zenodo. For example, we can create a github repository template like this one https://drivendata.github.io/cookiecutter-data-science/ that people can check out. We would modify that one so that it includes a Popper workflow to run the distinct steps. We can create two templates, one for Python and another for R. For example this one too: https://arxiv.org/abs/1401.200. The guide should present Popper as an alternative to codeocean.com.
Create a cross-platform GUI to aid in the generation of Dockerfiles.
Similar to what https://phpdocker.io/generator does, but for creating
images that are based on Debian (or Ubuntu) or Python, with support
for specifying a list of
pip packages to include in the
Dockerfile. To implement this in a cross-platform manner, a library
such as Kivy can be used.
Provide introductory material that serves as an overview of the different concepts and technologies involved in executing container-native workflows: Operating System (Linux), Language Runtimes (e.g. Python), Containers (Docker, Singularity), Resource Managers (Kubernetes, SLURM). The goal is to explain clearly what Popper does and what users can accomplish by using it. Currently, the documentation assumes expertise in all those topics, but we would like to lower the entry barrier for new users.