Stack Explorer

Dask

data-processing library

Library for parallel computing in Python

Official site

Supported languages

Pros and Cons

Ventajas

  • + Easy parallelization
  • + Familiar API (pandas/numpy)
  • + Scales to clusters
  • + Lazy evaluation
  • + API compatible with pandas, NumPy and scikit-learn
  • + Scales from laptop to clusters
  • + Lazy evaluation for optimization
  • + Dashboard for monitoring
  • + Integration with PyData ecosystem
  • + Supports larger-than-memory data

Desventajas

  • - Overhead
  • - Complex debugging
  • - Overhead for small datasets
  • - Complex distributed debugging
  • - Not all pandas functions supported
  • - Cluster configuration can be difficult
  • - Variable performance depending on operation

Casos de Uso

  • Big data processing
  • Parallel computing
  • ETL
  • Distributed ML
  • Processing larger-than-RAM data
  • Parallelizing pandas workflows
  • Distributed ETL
  • Feature engineering at scale
  • Big data exploratory analysis
  • Distributed machine learning

Related Technologies