Today we are pleased to announce datahut.ai, a free web-based directory that provides statistics and analysis on the most popular data science and data engineering projects, from Apache AGE to XGBoost. By combining machine learning techniques with expert knowledge, we help you navigate the open source landscape and pick the best software for your needs.
Data Hut organizes over 100 projects by category and community, visually comparing size and trends across similar projects. For each individual project, we include basic information, statistics, and trends. You can explore the site by hierarchical category, community, project name, or keyword search.
The site offers concise information without unnecessary clutter. Data Hut loads blazingly fast and allows you to quickly find what you need. We do not track any personal information in our analytics, as your privacy is important to us.
Project Metrics and Scores
In general, we consider two types of metrics: “absolute” metrics, which are compared across projects, and “trend” metrics, which are measured relative to a project’s own history. For example, the total lines of code committed to a project and the total number of individuals who have contributed to it are absolute metrics. The year-over-year rate of change in lines of code added and the change in active contributors are both trend metrics. Both types of metrics are needed to assess a project’s maturity and outlook.
There are hundreds of projects out there, and usually several contenders in each category. To help you quickly spot the right projects in a given category, we combine a number of these metrics into an absolute size score and a relative trend score for each project. Because each score draws on several metrics, an outlier value in any single metric is less likely to skew our picture of a project’s lifecycle.
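This post does not spell out how the metrics are combined, so the following is only a minimal sketch of one plausible approach, not Data Hut's actual formula. It assumes each metric is first converted to a percentile rank across projects and the ranks are then averaged. The function names `percentile_ranks` and `composite_scores` and the sample metric names are hypothetical.

```python
from statistics import mean


def percentile_ranks(values):
    """Map each value to a rank in [0, 1]; tied values share the average rank."""
    n = len(values)
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of equal values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / (n - 1)
        i = j + 1
    return ranks


def composite_scores(metrics):
    """Average the per-metric percentile ranks for each project.

    `metrics` maps a metric name to a list of values, one per project,
    with projects in the same order in every list (an assumed layout).
    """
    per_metric = [percentile_ranks(v) for v in metrics.values()]
    return [mean(column) for column in zip(*per_metric)]


# Made-up numbers for three hypothetical projects.
example = {
    "total_lines": [100, 1_000_000, 200],
    "contributors": [5, 10, 20],
}
scores = composite_scores(example)
```

Ranking before averaging is one way to limit the influence of outliers: the second project's very large line count earns it the top rank on that metric, but no more than that.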
Visualizations
The picture below shows three example visualizations from the site. The chart in the upper left plots the trend score against the size score for 51 projects in the Data Science category. This chart type is used for top-level categories. Projects above the trend line (e.g., Hugging Face Transformers) are growing quickly for their size. Projects below the trend line (e.g., Caffe2) are growing slowly, or not at all, for their size.
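To make the "above or below the trend line" reading concrete, here is a small sketch. It assumes the trend line is an ordinary least-squares fit of trend score on size score, which this post does not confirm. The helper names and the sample scores are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def trend_residuals(projects):
    """Return each project's vertical distance from the fitted line.

    `projects` maps a project name to (size_score, trend_score).
    A positive residual means the project sits above the line.
    """
    sizes = [size for size, _ in projects.values()]
    trends = [trend for _, trend in projects.values()]
    slope, intercept = fit_line(sizes, trends)
    return {
        name: trend - (slope * size + intercept)
        for name, (size, trend) in projects.items()
    }


# Made-up scores for three hypothetical projects.
residuals = trend_residuals({
    "proj_a": (1, 1),
    "proj_b": (2, 5),
    "proj_c": (3, 3),
})
```

In this toy data, only `proj_b` has a positive residual, so it would be the one drawn above the line.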
The upper right chart shows the project age versus total lines committed for the Deep Learning sub-category. This chart type is used for sub-categories and communities. The size of the circles and the numbers next to the names correspond to the number of lines committed in the last year.
Finally, the bottom-middle chart shows the commit history of the TensorFlow project. For each project, we compute the 12-month rolling average of lines committed, which smooths out month-to-month swings and gives a good sense of the trend in the project’s activity.
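The rolling average can be sketched as follows. This version assumes a trailing window over monthly totals and emits no value until a full window is available. Those details, along with the function name `rolling_average`, are assumptions rather than a description of the site's code.

```python
from collections import deque


def rolling_average(monthly_lines, window=12):
    """Trailing average over `window` months.

    Months that do not yet have a full window behind them yield None.
    """
    averages = []
    recent = deque()
    total = 0
    for lines in monthly_lines:
        recent.append(lines)
        total += lines
        if len(recent) > window:
            # Drop the month that just fell out of the window.
            total -= recent.popleft()
        averages.append(total / window if len(recent) == window else None)
    return averages


# Made-up monthly totals, with a short window to keep the example small.
smoothed = rolling_average([1, 2, 3, 4], window=2)
```

With the default `window=12`, the first eleven entries are `None` and each later entry is the mean of that month and the eleven before it.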
