DataHut’s Data Pipeline

Diagram of the flow in the DataHut pipeline.

Datahut.ai is a site we launched this spring. It combines manual curation and automated analytics to provide insight into open source projects for data science and big data. We currently cover 128 different projects, from Airflow to XGBoost.

The code for DataHut started last year with a single Python script to download Git commits and a Jupyter notebook for data analysis and visualization. This allowed us to gain insight into our data. Over time, this evolved into a more complex data pipeline, as shown above. In this post, I describe the pipeline we’ve built and the lessons we learned.

Git Metadata Downloading

Most of our analytics are created using Git commit histories. We currently download this data via GitHub’s API. We do not rely on data aggregations specific to GitHub, so that we can accommodate other hosting services and direct repository clones in the future. The download code is somewhat complex, due to the need to accommodate multiple request types, rate limiting, and potential failures. The main download script runs a specific number of requests and then exits. When run again, typically at the start of the next hour, it picks up where it left off. We save all the original responses we receive as compressed JSON files partitioned by time range (currently over 1M files). Under each repository, we append to the most recent file until a given size is reached, and then we split on a date boundary. This creates a time-based partitioning of the data. After capturing the raw data, we post-process it into a single large compressed CSV file containing only the fields we need for our analytics.
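
To make this concrete, here is a stripped-down sketch of that pattern: a fixed request budget plus a persisted cursor. The endpoint usage, file layout, state file, and budget below are illustrative only, and the real script also handles multiple request types, incremental updates, and transient failures.

    import gzip
    import json
    from pathlib import Path

    import requests  # assumes the 'requests' library is installed

    API = "https://api.github.com/repos/{repo}/commits"
    STATE_FILE = Path("download_state.json")  # hypothetical per-repo page cursor
    RAW_DIR = Path("raw")                     # hypothetical directory for raw responses
    REQUEST_BUDGET = 4000                     # stay under GitHub's hourly rate limit

    def run_once(repos, token):
        """Spend the hourly request budget, persisting a cursor so the next run resumes."""
        state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
        session = requests.Session()
        session.headers["Authorization"] = f"token {token}"
        budget = REQUEST_BUDGET

        for repo in repos:
            page = state.get(repo, 1)      # resume from the last page we fetched
            while budget > 0:
                resp = session.get(API.format(repo=repo),
                                   params={"page": page, "per_page": 100})
                budget -= 1
                if resp.status_code in (403, 429):  # rate limited: stop until next hour
                    budget = 0
                    break
                resp.raise_for_status()
                commits = resp.json()
                if not commits:            # no more history for this repository
                    break
                out = RAW_DIR / repo.replace("/", "_") / f"page-{page:06d}.json.gz"
                out.parent.mkdir(parents=True, exist_ok=True)
                with gzip.open(out, "wt") as f:
                    json.dump(commits, f)  # keep the raw response for later re-processing
                page += 1
                state[repo] = page

        STATE_FILE.write_text(json.dumps(state))

Run hourly, each invocation spends its budget and exits, and the saved cursor lets the next run continue where this one stopped.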

Lessons Learned: Initial data extraction/ingestion should be restartable and idempotent. It is useful to save the raw response data in case you want to re-process it later (e.g. due to a bug or additional fields you want to include). Finally, when managing raw time series data in files, only the most recent files should be writable. After compaction, splitting, and accounting for late arrivals, files become read-only.
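
As a sketch of that last point (the size threshold, file naming, and directory layout here are made up), the "append to the newest partition, roll over when full" policy fits in a few lines:

    import gzip
    import json
    from datetime import date
    from pathlib import Path

    MAX_BYTES = 64 * 1024 * 1024   # hypothetical size threshold for one partition file

    def append_records(repo_dir: Path, records: list) -> None:
        """Append records to the newest partition; start a new date-named file when full."""
        repo_dir.mkdir(parents=True, exist_ok=True)
        parts = sorted(repo_dir.glob("*.jsonl.gz"))   # ISO date names sort chronologically
        latest = parts[-1] if parts else None
        if latest is None or latest.stat().st_size >= MAX_BYTES:
            # older partitions are never written again, so they can be treated as read-only
            latest = repo_dir / f"{date.today().isoformat()}.jsonl.gz"
        with gzip.open(latest, "at") as f:            # gzip files can be appended to
            for rec in records:
                f.write(json.dumps(rec) + "\n")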

Graph Database Creation

Hand-curated data, including project names, categories, descriptions, project backers, and associated external links, is entered into a series of Google Sheets. These are then downloaded as CSV files and stored in a Git repository. From this data, we create a knowledge graph — a representation of concepts (e.g. projects, categories, backers) linked by their relationships (e.g. category-subcategory, projects and their backers). This graph is stored in the Neo4j graph database, where it can be used for analytics and content generation.
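
As a rough sketch of this step, loading the curated CSV data into Neo4j with the official Python driver might look like the following. The node labels, relationship types, CSV columns, and connection details are assumptions rather than our actual schema.

    import csv

    from neo4j import GraphDatabase  # official Neo4j Python driver

    # Connection details and graph schema are illustrative, not DataHut's actual layout.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def load_projects(csv_path):
        with open(csv_path, newline="") as f, driver.session() as session:
            for row in csv.DictReader(f):
                session.run(
                    """
                    MERGE (p:Project {name: $name})
                    SET p.description = $description
                    MERGE (c:Category {name: $category})
                    MERGE (p)-[:IN_CATEGORY]->(c)
                    """,
                    name=row["name"],
                    description=row["description"],
                    category=row["category"],
                )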

Lessons Learned: Hand-entered data can contain many errors and inconsistencies. The process of converting it to a structured data store (e.g. a graph database or a relational database) can include validation to find these problems before they propagate downstream. With the current DataHut functionality, a relational database would be sufficient. However, we plan to add links between projects (e.g. due to dependencies, historical relationships, or common developers). With these links, a graph database will enable more sophisticated analyses (e.g. PageRank or various graph similarity measures).
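
The kind of validation we have in mind is simple cross-checks run before anything is loaded into the database. This sketch (with assumed file and column names) flags duplicate project names and dangling category references:

    import pandas as pd

    # Illustrative validation pass over the curated CSVs; file and column names are assumptions.
    projects = pd.read_csv("projects.csv")
    categories = pd.read_csv("categories.csv")

    errors = []

    dupes = projects["name"][projects["name"].duplicated()]
    if not dupes.empty:
        errors.append(f"duplicate project names: {sorted(dupes)}")

    unknown = set(projects["category"]) - set(categories["name"])
    if unknown:
        errors.append(f"projects reference unknown categories: {sorted(unknown)}")

    if errors:
        # fail fast, before bad rows propagate into the graph and the site
        raise SystemExit("\n".join(errors))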

Data Analysis and Visualization

Although our data analysis started as a self-contained Jupyter Notebook, written using Pandas and Matplotlib, we eventually refactored the code, putting most of the functionality into reusable Python modules that can be tested externally and then called from the Notebook. Finally, all the computed metrics are added to the graph database.

Lessons Learned: We found it helpful to create reusable abstractions for time series analysis on top of the lower-level Pandas functionality. These abstractions include a time series sequence, time-based aggregations of a time series, category-based aggregations of a time series, and metrics over all of these data sets. This approach is less error-prone and easier to code than raw Pandas, and easier to test than inline Notebook code. Although most of the code is now outside of the Notebook, the overall sequencing is still performed by the Notebook. This makes it easy to try out and debug any changes to our analytics calculations, since we can see the results of each step in tabular and/or graphical form.
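
To give a flavor of these abstractions, here is a minimal sketch of a time series wrapper; the class and method names are hypothetical rather than our actual API.

    import pandas as pd

    class CommitSeries:
        """Hypothetical wrapper over a datetime-indexed Series of commit counts."""

        def __init__(self, series: pd.Series):
            self.series = series.sort_index()

        @classmethod
        def from_commits(cls, df: pd.DataFrame) -> "CommitSeries":
            # df is assumed to have one row per commit and a 'date' column
            return cls(pd.Series(1, index=pd.to_datetime(df["date"])))

        def monthly(self) -> "CommitSeries":
            # aggregate commit counts by calendar month
            return CommitSeries(self.series.resample("MS").sum())

        def growth_ratio(self, months: int = 12) -> float:
            """Activity in the last `months` months relative to the `months` before that."""
            m = self.monthly().series
            recent = m.iloc[-months:].sum()
            prior = m.iloc[-2 * months:-months].sum()
            return recent / prior if prior else float("nan")

Because a class like this lives in an ordinary module, it can be unit-tested against small synthetic DataFrames, while the Notebook just sequences these calls and displays the results.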

Website Generation

To generate the page content for the website, Python scripts read the graph database and output reStructuredText (RST) files for each category, community, and project page. We use a Bootstrap template and Sphinx, the documentation generation tool, to convert the RST pages to HTML and provide the overall layout and navigation.
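
A sketch of that generation step is shown below. The Cypher query, node properties, and RST layout are assumptions, and only project pages are shown for brevity.

    from pathlib import Path

    from neo4j import GraphDatabase

    # Illustrative page generation; the query, properties, and template are assumptions.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    PAGE_TEMPLATE = """{title}
    {underline}

    {description}

    Commits in the last 12 months: {commits_12m}
    """

    def write_project_pages(out_dir: Path) -> None:
        out_dir.mkdir(parents=True, exist_ok=True)
        with driver.session() as session:
            result = session.run(
                "MATCH (p:Project) RETURN p.name AS name, "
                "p.description AS description, p.commits_12m AS commits_12m")
            for record in result:
                page = PAGE_TEMPLATE.format(
                    title=record["name"],
                    underline="=" * len(record["name"]),  # RST section underline
                    description=record["description"],
                    commits_12m=record["commits_12m"],
                )
                (out_dir / f"{record['name'].lower()}.rst").write_text(page)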

Orchestration

The analytics for the site are updated monthly. We download the latest data after the start of the month. The download script is wrapped in a simple Bash shell script that runs hourly (to keep within the throttle rate). Since we only do this monthly, we start the script manually and stop it once there is no more data to download.

The rest of the workflow (up through HTML generation) is orchestrated through Snakemake, an open source workflow engine originally from the bioinformatics community. Like the original make tool, it uses file dependencies and modification times to automatically determine which steps in a workflow need to be (re)run. The Snakemake workflow is started manually after the data download has completed. It may need to be run several times to debug and test any changes to the site or analytics.
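
For readers unfamiliar with Snakemake, a workflow is a set of rules whose declared inputs and outputs form the dependency graph. The Snakefile below is purely illustrative (our actual rules, paths, and scripts differ), but it shows the overall shape:

    # Illustrative Snakefile only: rule names, paths, and scripts are made up.
    rule all:
        input: "site/html/index.html"

    rule postprocess:
        input: "raw"                                # directory of compressed JSON responses
        output: "data/commits.csv.gz"
        shell: "python scripts/postprocess.py {input} {output}"

    rule analytics:
        input: "data/commits.csv.gz", "curated/projects.csv"
        output: touch("data/metrics_loaded.flag")   # marker: metrics written to Neo4j
        shell: "python scripts/compute_metrics.py {input}"

    rule generate_rst:
        input: "data/metrics_loaded.flag"
        output: directory("site/rst")
        shell: "python scripts/generate_pages.py {output}"

    rule build_html:
        input: "site/rst"
        output: "site/html/index.html"
        shell: "sphinx-build -b html site/rst site/html"

Running snakemake (for example, snakemake --cores 1) then rebuilds only the targets whose inputs have changed since the previous run.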

Lessons Learned: Given our update schedule, manually starting the download phase and the Snakemake phase is sufficient and avoids unnecessary complication of our pipeline. If we updated more frequently, scheduled executions of our pipeline would make sense. This would also require more monitoring and alerting capabilities. The use of Snakemake has made it easy to regenerate the site after changes and is much cleaner than ad hoc shell or Python scripts.

Deployment

We use a static site model with no backend database. This is for loading speed, user privacy, and ease of deployment. Once we have generated the site HTML and other content via Sphinx, we use CloudFlare Pages to deploy and host the site. We simply copy the content into a separate deployment Git repository and push to GitHub. CloudFlare polls for updates to the repository and updates the site.

Lessons Learned: This approach allows us to avoid managing servers. That is a big win in terms of reducing the initial and ongoing workload. Plus, we have the peace of mind that comes from a simple runtime stack with few potential security issues.

Find Out More

In May, I gave a talk at SF Python about DataHut, its data pipeline, and insights from the analytics we produce. The video is now up on YouTube: https://youtu.be/u6A_Hj55bT0.

If you would like Benedat to build a data pipeline for you, we are available for contract work. Contact us at info@benedat.com.