
tweag / Porcupine

Express parametrable, composable and portable data pipelines

Programming Languages: haskell

Projects that are alternatives to or similar to Porcupine

Canvasxpress
JavaScript Visualization Tools
Stars: ✭ 247 (+252.86%)
Mutual labels:  analytics, reproducible-research
galaksio
An easy-to-use way for running Galaxy workflows.
Stars: ✭ 19 (-72.86%)
Mutual labels:  reproducible-research, workflows
panoptes
Monitor computational workflows in real time
Stars: ✭ 45 (-35.71%)
Mutual labels:  reproducible-research, workflows
Amplify Js
A declarative JavaScript library for application development using cloud services.
Stars: ✭ 8,539 (+12098.57%)
Mutual labels:  analytics
Weeklypedia
A weekly email update of all the most popular wikipedia articles
Stars: ✭ 50 (-28.57%)
Mutual labels:  analytics
Nozzle
Nozzle is a report generation toolkit for data analysis pipelines implemented in R.
Stars: ✭ 59 (-15.71%)
Mutual labels:  reproducible-research
Flyte
Flyte binds together the tools you use into easily defined, automated workflows
Stars: ✭ 67 (-4.29%)
Mutual labels:  workflows
Dashblocks
Enable Analytics in your Apps
Stars: ✭ 48 (-31.43%)
Mutual labels:  analytics
Eventql
Distributed "massively parallel" SQL query engine
Stars: ✭ 1,121 (+1501.43%)
Mutual labels:  analytics
Kindmetrics
Kind metrics analytics for your website
Stars: ✭ 57 (-18.57%)
Mutual labels:  analytics
Drake Examples
Example workflows for the drake R package
Stars: ✭ 57 (-18.57%)
Mutual labels:  reproducible-research
Monocle
Detect anomalies in your GitHub/Gerrit projects
Stars: ✭ 50 (-28.57%)
Mutual labels:  analytics
Data Science Best Resources
Carefully curated resource links for data science in one place
Stars: ✭ 1,104 (+1477.14%)
Mutual labels:  analytics
Batchman
This library for Android will take any set of events and batch them up before sending it to the server. It also supports persisting the events on disk so that no event gets lost because of an app crash. Typically used for developing any in-house analytics sdk where you have to make a single api call to push events to the server but you want to optimize the calls so that the api call happens only once per x events, or say once per x minutes. It also supports exponential backoff in case of network failures
Stars: ✭ 50 (-28.57%)
Mutual labels:  analytics
Dashboard Extension Online Map Item
⛔ DEPRECATED. This project was moved to a new repository. Visit https://github.com/DevExpress/dashboard-extensions to find an updated version.
Stars: ✭ 65 (-7.14%)
Mutual labels:  analytics
Nsdb
Natural Series Database
Stars: ✭ 49 (-30%)
Mutual labels:  analytics
Angle Grinder
Slice and dice logs on the command line
Stars: ✭ 1,118 (+1497.14%)
Mutual labels:  analytics
Evalai
☁️ 🚀 📊 📈 Evaluating state of the art in AI
Stars: ✭ 1,087 (+1452.86%)
Mutual labels:  reproducible-research
River Admin
🚀 A shiny admin interface for django-river built with DRF, Vue & Vuetify
Stars: ✭ 55 (-21.43%)
Mutual labels:  workflows
Singular Skadnetwork App
Sample apps demonstrating the logic needed to implement SKAdNetwork as an ad network, publisher and advertiser.
Stars: ✭ 59 (-15.71%)
Mutual labels:  analytics

Join the chat: https://gitter.im/tweag/porcupine

Porcupine is a tool aimed at people who want to express general data manipulation and analysis tasks in Haskell,

  1. in a way that is agnostic to the source of the input data and to the destination of the end results,
  2. so that a pipeline can be re-executed in a different environment and on different data without recompiling, just by a change in its configuration,
  3. while facilitating code reuse (any task can always be reused as part of a bigger pipeline).

Porcupine specifically targets teams whose skills range from data science to data/software engineering.
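
As a taste of what this looks like in practice, here is a minimal sketch loosely adapted from porcupine's bundled examples. Treat it as illustrative rather than authoritative: the exact signatures (notably FullConfig's arguments) may differ between porcupine versions, and User, Analysis and computeAnalysis are stand-ins of ours.

```haskell
{-# LANGUAGE DeriveGeneric, OverloadedStrings #-}

import Control.Arrow ((>>>), arr)
import Data.Aeson (FromJSON, ToJSON)
import qualified Data.Text as T
import GHC.Generics (Generic)
import Porcupine

data User = User { userName :: T.Text, userAge :: Int }
  deriving (Generic)
instance FromJSON User

newtype Analysis = Analysis { summary :: T.Text }
  deriving (Generic)
instance ToJSON Analysis

-- The pipeline declares *virtual* resources; the actual locations
-- (local file, S3 object, HTTP endpoint...) are bound later via the
-- YAML/JSON config file or the CLI, without recompiling.
userFile :: DataSource User
userFile = dataSource ["Inputs", "User"] (somePureDeserial JSONSerial)

analysisFile :: DataSink Analysis
analysisFile = dataSink ["Outputs", "Analysis"] (somePureSerial JSONSerial)

computeAnalysis :: User -> Analysis
computeAnalysis u =
  Analysis (userName u <> " is " <> T.pack (show (userAge u)))

-- PTasks compose with the usual Arrow combinators:
analyseOneUser :: (LogThrow m) => PTask m () ()
analyseOneUser =
  loadData userFile >>> arr computeAnalysis >>> writeData analysisFile

main :: IO ()
main = runPipelineTask (FullConfig "example" "porcupine.yaml" "data" ())
                       (baseContexts "")
                       analyseOneUser ()
```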

Resources

Porcupine's development

Porcupine's development happens mainly inside NovaDiscovery's internal codebase, where a fork of porcupine resides. We regularly synchronise this internal repo with porcupine's GitHub repo, which is why commits tend to appear in batches on GitHub.

Lately, a lot of effort has been invested in developing Kernmantle, which should provide the new task representation (see Future plans below).

Participating in porcupine's development

Issues and MRs are welcome :)

Future plans

These features are being developed and should land soon:

  • porcupine-servant: a servant app can directly serve porcupine's pipelines as routes, and expose a single configuration for the whole server
  • enhancement of the API to run tasks: runPipelineTask would remain in place but become a thin wrapper over a slightly lower-level API, making it easier to run pipelines in different contexts (like that of porcupine-servant)
  • common configuration representation: for now, porcupine can only handle configuration via a YAML/JSON file plus the CLI. Some applications require other configuration sources (GraphQL, several config files that override one another, etc.). We want a common tree format that every configuration source gets translated to, so that we can simply merge all these trees afterwards; each config source is thereby fully decoupled from the others and can be activated at will (see the sketch after this list)
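
To make that last idea concrete, here is a hypothetical sketch of such a merge over aeson Values (assuming aeson < 2, where Object wraps a HashMap). Nothing here is porcupine API; it only illustrates the "translate each source to a tree, then merge" design, with later sources overriding earlier ones.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (Value(..), object, (.=))
import qualified Data.HashMap.Strict as HM

-- Merge two config trees: objects merge recursively, and for any
-- other kind of value the second (overriding) tree wins.
mergeConfig :: Value -> Value -> Value
mergeConfig (Object base) (Object override) =
  Object (HM.unionWith mergeConfig base override)
mergeConfig _ override = override

-- Fold a list of sources (defaults, config file, CLI, ...) into one
-- tree; sources later in the list take precedence.
resolveConfig :: [Value] -> Value
resolveConfig = foldl mergeConfig (Object HM.empty)

main :: IO ()
main = print (resolveConfig [defaults, fromFile, fromCLI])
  where
    defaults = object ["root" .= ("." :: String), "verbose" .= False]
    fromFile = object ["root" .= ("/data" :: String)]
    fromCLI  = object ["verbose" .= True]
```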

The following are things we'd like to start working on:

  • switch to cas-store: porcupine's dependency on funflow is mainly for the purpose of caching. Now that cas-store is a separate project, porcupine can directly depend on it. This will simplify the implementation of PTask and make it easier to integrate PTasks with other libraries.
  • implement PTask over a Kernmantle Rope: this is the main reason we started the work on Kernmantle, so that it could become a uniform pipeline API, independent of the effects the pipeline performs (caching, collecting options or required resources, etc.). Both porcupine and funflow would become collections of Kernmantle effects and handlers, and would therefore be seamlessly interoperable. Developers would also be able to add their own custom effects to a pipeline. This would probably mean the death of reader-soup, as the LocationAccessors could be embedded directly as Kernmantle effects.
  • package porcupine's VirtualTree separately: all the code that is not strictly speaking related to tasks would become usable on its own (for instance in Kernmantle effect handlers).

F.A.Q.

How are Porcupine and Funflow related?

Porcupine uses funflow internally to provide caching. Funflow's API is centered around the ArrowFlow class, and PTask (porcupine's main unit of computation) implements ArrowFlow too, so the usual funflow operations are also usable on PTasks.
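
Because PTask is an Arrow, the standard combinators from Control.Arrow apply to it. The toy sketch below shows those combinators on plain functions, the simplest Arrow instance; the same (>>>), (&&&), etc. compose PTasks.

```haskell
import Control.Arrow ((>>>), (&&&), arr)

-- Two toy "tasks" as plain functions:
clean :: String -> String
clean = filter (/= ' ')

count :: String -> Int
count = length

-- (>>>) sequences, (&&&) fans out; with PTasks, funflow's caching
-- operations slot into the same pipeline shape.
pipeline :: String -> (String, Int)
pipeline = arr clean >>> (arr id &&& arr count)

main :: IO ()
main = print (pipeline "hello world")  -- ("helloworld",10)
```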

Aside from that, funflow and porcupine don't operate at the same level of abstraction: funflow is for software developers building applications the way they want, while porcupine is higher-level and more featureful, targeting modelers and data analysts as well as software developers. However, porcupine makes no choices about computation or visualization libraries; that part is still up to the user.

The main goal of Porcupine is to be a tool to structure your app: a backbone that helps you kickstart, say, a data pipeline/analytics application while keeping the boilerplate (config, I/O) to a minimum, and a common framework for sharing code (tasks, serialization functions) between several applications of that type. And since the arrow and caching APIs are the same in funflow and porcupine, as a software developer you can start out with porcupine, and if you realize you don't actually need the high-level features (config, rebinding of inputs, logging, etc.), drop the dependency and transition to funflow's level.

Can the tasks run in a distributed fashion?

Funflow provides a worker daemon to which the main pipeline can distribute Docker-containerized tasks. For pure Haskell functions there is funflow-jobs, but it's experimental.

So distributed execution could be achieved with funflow-jobs, but for now porcupine has only ever been used for parallel execution of tasks. We recently started thinking about how the funflow/porcupine model could be adapted to run a pipeline on a cluster in a decentralized fashion, and we have some promising ideas, so that feature may appear in the future.

Another solution (the one used by our client) is an external job queue (like Celery) that starts porcupine pipeline instances. This is made easy by the fact that porcupine exposes a pipeline instance's entire configuration, which can therefore be set by the program that puts the jobs in the queue (as one JSON file).

I like the idea of tasks that automatically maintain and merge their requirements as they compose, but I want to deal with configuration, the CLI and everything else myself. Can I do that?

Of course! That means replacing the call to runPipelineTask with custom code. You will want to have a look at the splitTask lens. It separates a task into its two components: its VirtualTree of requirements (which you can process however you please, the goal being to turn it into a DataAccessTree) and a RunnableTask, which you can feed to execRunnableTask once you have composed a DataAccessTree. Note though that this part of the API might change a bit in future versions. A hypothetical sketch of that plumbing follows.
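
The sketch below is purely illustrative: the splitTask and execRunnableTask names come from the answer above, but their exact types and argument order are assumptions on our part, so check porcupine's haddocks before copying anything.

```haskell
import Control.Lens (view)
import Porcupine

-- Hypothetical plumbing: buildAccessTree is *your* code, turning the
-- task's requirements (a VirtualTree) into concrete data accessors
-- (a DataAccessTree) using whatever config mechanism you prefer.
runWithMyOwnConfig
  :: (LogThrow m)
  => (VirtualTree -> m DataAccessTree)  -- assumed shape of your config logic
  -> PTask m i o
  -> i
  -> m o
runWithMyOwnConfig buildAccessTree task input = do
  let (virtualTree, runnable) = view splitTask task  -- split off the requirements
  accessTree <- buildAccessTree virtualTree
  execRunnableTask accessTree runnable input         -- argument order is a guess
```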

Is Porcupine related to Hedgehog?

We can see where that comes from ^^, but nope, not all R.O.U.S.s are related. (Also, hedgehogs aren't rodents.)

Although we do have a few tests using Hedgehog (and will possibly add more).
