Data Engineering

A Map-reduce Example in Ray

Last fall, we published several blog posts on Ray, a new distributed infrastructure focused on machine learning and data engineering. In this post, I explore a map-reduce example that uses Ray for large-scale applications. Although Ray’s documentation already has a simple map-reduce example, I want to look at a more complex problem to better understand Ray’s capabilities. The map-reduce algorithm is one of the most…

Diagram of the flow in the DataHut pipeline.

Data Hut’s Data Pipeline

Datahut.ai is a site we launched this spring. It combines manual curation and automated analytics to provide insight into open source data science and big data projects. We currently cover 128 different projects, from Airflow to XGBoost. The code for DataHut started last year as a single Python script to download Git commits, plus a Jupyter notebook for data analysis and visualization. This allowed us to gain insight into our…