| Data ingestion |
Data ingestion is fundamental, and every programming system provides it to some degree. |
Robust ingestion of JSON, CSV, and CBOR files; XML and other formats can be ingested, but not robustly. |
Various (improved) packages for working with JSON, CSV, markup, images, PDF, etc. An umbrella ingestion function for them. |
✓ |
★★★ (2.5) |
Data::Importers,
Data::Translators,
Omni-slurping with LLMing [post]
|
| Data wrangling facilitation |
Slicing, splitting, combining, aggregating, and summarizing data can be difficult and time-consuming. |
No serious efforts, especially in terms of streamlining data wrangling workflows. |
Two major efforts for streamlining data wrangling workflows: one using “pure” Raku (good for exploration), and another interfacing with “outside” systems. |
✓ |
★★★ (3.25) |
Data::Reshapers,
Dan,
Introduction to data wrangling with Raku [post]
|
| Statistics for data exploration |
This includes descriptive statistics (mean, median, five-point summary), summarization, outlier identification, and statistical distribution functions. |
Various attempts: some are basic and “plain Raku” (e.g. Stats); some connect to GSL (and do not work on macOS). |
A couple of major efforts exist: one is an all-in-one package, the other is spread across various packages. |
✓ |
★★★ (2.5) |
Statistics,
Statistics::Distributions,
Statistics::OutlierIdentifiers,
Data::Summarizers,
Data::TypeSystem
|
| Machine Learning (ML) algorithms (both unsupervised and supervised) |
Unsupervised ML is often used for Exploratory Data Analysis (EDA); supervised ML is used to leverage data patterns in some way, but also for certain types of EDA. |
A few packages for unsupervised ML (like Text::Markov). |
At least one supervised ML package that binds to external systems, and a set of unsupervised ML packages for clustering, association rule learning, fitting, tries with frequencies, and Recommendation Systems (RS). The RS and the tries with frequencies can also be used as classifiers. |
✓ |
★★★ (2.5) |
Algorithm::XGBoost,
ML::* packages,
Fast and compact classifier of DSL commands [post],
Chebyshev Polynomials and Fitting Workflows [post]
|
| Data visualization facilitation |
Most Data Science work relies on insightful plots of the data. |
A few small packages for plotting, at least one of which connects to external systems (like Gnuplot); none of them is particularly useful for Data Science. |
There are two “solid” packages for Data Science visualizations, JavaScript::D3 and JavaScript::Google::Charts. There is also an ASCII-plots package, Text::Plot, which is useful when basic, coarse plots are sufficient. |
✓ |
★★★★ |
JavaScript::D3,
JavaScript::Google::Charts,
Text::Plot,
The Raku-ju hijack hack for D3.js [video],
Geographics data in Raku demo [video]
|
| Interactive computing environment(s) |
Data exploration is done interactively, with repeated changes to the data and to the analysis or pattern-finding workflows. |
The (basic) Raku REPL, the related Emacs major mode, and the notebook environment Jupyter::Kernel. |
In addition to the pre-2021 work, there are RakuMode for Wolfram Notebooks and Jupyter::Chatbook for seamless integration with LLMs. |
✓ |
★★★★ |
Connecting Mathematica and Raku [post],
Exploratory Data Analysis with Raku [video]
|
| Literate programming (LP) |
LP is very important for Data Science (DS) because DS requires Reproducible Research. |
None, except Jupyter::Kernel, which was not useful because it lacked good graphics. |
LP is fully supported thanks to multiple LP solutions, strong graphics capabilities, LLM integration, and computational document converters. |
✓ |
★★★★★ (4.5) |
Notebook transformations [post],
Raku Literate Programming via command line pipelines [video],
Conversion and evaluation of Raku files [video]
|
| Data generation and retrieval |
For didactic and development purposes, random data generation and retrieval of well-known datasets are needed. |
Nothing more than the built-in Raku random generators (pick and roll). |
Generators of random strings, words, pet names, date-times, distribution variates, and tabular datasets. Popular datasets from the R ecosystem can be downloaded and cached. |
𖧋 |
★★★★ (3.5) |
Data::Generators,
Data::ExampleDatasets,
Data::Geographics,
Geographics data in Raku demo [video]
|
| External Data Science (DS) and Machine Learning (ML) orchestration |
Orchestrating external systems is an effective way to do DS and ML and to move the developed computations easily to other systems. It allows reuse and provides confidence that the DS or ML algorithms used are properly implemented and fast. |
Various projects connecting to database systems (e.g. MySQL). |
The project Dan provides bindings to the data wrangling library Polars. The project H2O::Client aims to provide both data wrangling and ML orchestrations to H2O.ai. |
𖧋 |
★★★ (2.5) |
Dan,
Proc::ZMQed,
WWW::WolframAlpha,
H2O::Client
|
| Interactive interfaces to parameterized workflows (dashboards) |
Very useful for getting data insights by dynamically changing different statistics based on parameters. |
None. |
One effort, Air::Examples, brings interactivity via HTMX, using the Cro package set and templates. Since Google Charts provides interactivity, JavaScript::Google::Charts can be extended to offer such controls and dashboards. |
𖧋 |
★ |
Air::Examples,
JavaScript::Google::Charts
|