Reuben Cummings is a data scientist and software developer skilled in business development, entrepreneurship, and programming. He is Founder & Managing Director of Nerevu Development and has worked with clients including the UN Humanitarian Data Exchange (HDX), Moringa School, and Africa’s Talking. Reuben specializes in data integration and analysis, visualization, API development, and workflow automation.

Reuben previously served in IT and Business Development roles at social enterprise Global Cycle Solutions (GCS) in Arusha, Tanzania; and was an analyst at MIDIOR Consulting in Cambridge, Massachusetts. He holds a degree in Chemical Engineering from the Massachusetts Institute of Technology, and is Lead Organizer of the Arusha Coders programmers meetup in Tanzania.

Accepted Talks:

Stream processing made easy with riko

AUDIENCE

  • data scientists (current and aspiring)
  • those who want to know more about data processing
  • those who are intimidated by "big data" (Java) frameworks and are interested in a simpler, pure Python alternative
  • those interested in async and/or parallel programming

DESCRIPTION

Big data processing is all the rage these days. Heavyweight frameworks such as Spark, Storm, Kafka, Samza, and Flink have taken the spotlight despite their complex setup, Java dependency, and heavy resource usage.

Those interested in simple, pure Python solutions have limited options. Most alternative software is synchronous, doesn't perform well on large data sets, or is poorly documented.

This talk aims to explain stream processing and its uses, and introduce riko: a pure Python stream processing library built with simplicity in mind. Complete with various examples, you’ll get to see how riko lazily processes streams via its synchronous, asynchronous, and parallel processing APIs.

OBJECTIVES

Attendees will learn what streams are, how to process them, and the benefits of stream processing. They will also see that most data isn't "big data" and therefore doesn't require complex (Java) systems (*cough* Spark and Storm *cough*) to process it.

DETAILED ABSTRACT

Stream processing?

What are streams?

A stream is a sequence of data. The sequence can be as simple as a list of integers or as complex as a generator of dictionaries.
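
For instance, both of the following count as streams (a minimal illustration; the field names are made up):

    # a simple stream: a list of integers
    numbers = [1, 2, 3, 4, 5]

    # a more complex stream: a generator of dictionaries
    feed = ({'title': 'item %d' % i, 'id': i} for i in range(5))
    print(next(feed))  # {'title': 'item 0', 'id': 0}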

How do you process streams?

Stream processing is the act of taking a data stream through a series of operations that apply a (usually pure) function to each element in the stream. These operations are pipelined so that the output of one function is the input of the next one. By using pure functions, the processing becomes embarrassingly parallel: you can split the items of the stream across separate processes (or threads), which then perform the operations simultaneously, without any need for communication between processes/threads. [1-4]
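
As a plain Python sketch of this idea (the functions here are made up for illustration):

    from multiprocessing import Pool

    def is_even(n):   # pure predicate
        return n % 2 == 0

    def square(n):    # pure function: output depends only on its input
        return n * n

    if __name__ == '__main__':
        # lazily pipelined: each element flows through filter, then map
        stream = map(square, filter(is_even, range(10)))
        print(list(stream))  # [0, 4, 16, 36, 64]

        # because the functions are pure, the same operations can run
        # simultaneously across worker processes
        with Pool(4) as pool:
            print(pool.map(square, filter(is_even, range(10))))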

What can stream processing do?

Stream processing allows you to efficiently manipulate large data sets. Through the use of lazy evaluation, you can process data streams too large to fit into memory all at once.
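
For example, a chain of generators can tally matching lines in a file of any size while holding only one line in memory at a time (the file name is hypothetical):

    def read_lines(path):
        # lazily yield one line at a time; the file is never fully in memory
        with open(path) as f:
            for line in f:
                yield line.rstrip('\n')

    lines = read_lines('huge_log.txt')
    errors = (line for line in lines if 'ERROR' in line)
    print(sum(1 for _ in errors))  # nothing is read until sum() pulls items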

Additionally, stream processing has several real world applications including:

  • parsing RSS feeds (RSS readers, think Feedly)
  • combining different types of data from multiple sources in innovative ways (mashups, think Trendsmap)
  • taking data from multiple sources, manipulating the data into a homogeneous structure, and storing the result in a database (extracting, transforming, and loading data; aka ETL, data wrangling, etc.; see the sketch below)
  • aggregating similarly structured data from siloed sources and presenting it via a unified interface (aggregators, think Kayak)

[5, 6]
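
To make the ETL item above concrete, here is a minimal sketch in plain Python, assuming a hypothetical sales.csv with name and amount columns:

    import csv
    import sqlite3

    def extract(path):
        # lazily read raw CSV rows as dictionaries
        with open(path) as f:
            for row in csv.DictReader(f):
                yield row

    def transform(rows):
        # normalize each row into a homogeneous structure
        for row in rows:
            yield (row['name'].strip().title(), int(row['amount']))

    def load(records):
        # store the transformed records in a database
        con = sqlite3.connect('example.db')
        con.execute('CREATE TABLE IF NOT EXISTS sales (name TEXT, amount INTEGER)')
        con.executemany('INSERT INTO sales VALUES (?, ?)', records)
        con.commit()
        con.close()

    load(transform(extract('sales.csv')))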

Stream processing frameworks

If you've heard anything about stream processing, chances are you've also heard about frameworks such as Spark, Storm, Kafka, Samza, and Flink. While popular, these frameworks have a complex setup and installation process, and are usually overkill for the amount of data typical Python users deal with. Using a few examples, I will show basic Storm usage and how it stacks up against Bash.

Introducing riko

Supporting both Python 2 and 3, riko is the first pure Python stream processing library to support synchronous, asynchronous, and parallel processing. It's built using a functional programming methodology and lazy evaluation by default.

Basic riko usage

Using a series of examples, I will show basic riko usage. Examples will include counting words, fetching streams, and RSS feed manipulation. I will highlight the key features that make riko a better stream processing alternative to Storm and the like.
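
As a taste of what those examples look like, here is a sketch adapted from the patterns in riko's README (module paths and item fields may differ between riko versions, and the feed URL is just an example):

    from riko.modules import fetch

    # lazily fetch an RSS feed; items are dicts yielded one at a time
    stream = fetch.pipe(conf={'url': 'https://news.ycombinator.com/rss'})
    item = next(stream)
    print(item['title'])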

riko's many paradigms

Depending on the type of data being processed, a synchronous, asynchronous, or parallel processing method may be ideal. Fetching data from multiple sources is suited to asynchronous or thread-based parallel processing. Computationally intensive tasks are suited to processor-based parallel processing. And synchronous processing is best suited to debugging or small data sets.

riko is designed to support all of these paradigms using the same API. This means switching between paradigms requires only trivial code changes, such as adding a yield statement or changing a keyword argument.
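
For instance, here is a sketch based on riko's documented SyncPipe interface (the import path and keyword names may vary by version; the asynchronous AsyncPipe variant additionally requires Twisted and yield-based callbacks):

    from riko.collections import SyncPipe

    conf = {'url': 'https://news.ycombinator.com/rss'}

    # synchronous (the default): simplest to reason about and debug
    stream = SyncPipe('fetch', conf=conf).list

    # parallel: the same chain of operations, toggled by a keyword argument
    par_stream = SyncPipe('fetch', conf=conf, parallel=True).list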

Using a series of examples, I will show each of these paradigms in action.

Data Mining and Processing for fun and profit

AUDIENCE

  • data scientists (current and aspiring)
  • those who want to know more about data mining, analysis, and processing
  • those interested in functional programming

DESCRIPTION

Data mining is a key skill that involves transforming data found online and elsewhere from a hodgepodge of numbers into actionable information. Using examples drawn from RSS feeds, open data portals, and web scraping, this tutorial will show you how to efficiently obtain and transform data from disparate sources.

ABSTRACT

Data mining is a key skill that any self-proclaimed data scientist should possess. It involves transforming a hodgepodge of numbers from disparate sources into actionable information. Tabular data, e.g., CSV/Excel files, is very common in data mining and greatly benefits from Python's functional programming idioms. For better or for worse, the leading Python data libraries, NumPy and Pandas, eschew the functional programming style for object-oriented programming.

Using examples drawn from RSS feeds, the South Africa Data Portal API, raw Excel files, and basic web scraping, this tutorial will show how to efficiently locate, obtain, transform, and remix data from the web. These examples will prove that you can do a lot with functional programming and without the need for NumPy or Pandas.
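
As a taste of that functional style, here is a minimal sketch using only the standard library (the rows are made up; the tutorial pulls real ones from files and APIs):

    from itertools import groupby
    from operator import itemgetter

    rows = [
        {'region': 'north', 'sales': 120},
        {'region': 'south', 'sales': 80},
        {'region': 'north', 'sales': 45},
    ]

    # group by region and sum sales; no NumPy or Pandas required
    key = itemgetter('region')
    totals = {
        region: sum(row['sales'] for row in group)
        for region, group in groupby(sorted(rows, key=key), key=key)
    }
    print(totals)  # {'north': 165, 'south': 80}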

Finally, it will introduce meza: a pure Python, functional data analysis library and alternative to Pandas.
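
A minimal sketch of meza in action, modeled on its README (the file name is hypothetical, and the exact API may differ between versions):

    from meza import io

    # lazily read a CSV file into a stream of records (dicts)
    records = io.read('data.csv')
    print(next(records))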

IPython notebooks and sample data files will be distributed beforehand on GitHub to facilitate code distribution.

OBJECTIVES

Attendees will learn what data and data mining are and why they are important. They will learn some basic functional programming idioms and see how well suited they are to data mining. They will also see in what areas the 800lb gorilla (Pandas) shines and when a lightweight alternative (meza) is more practical.

ADDITIONAL INFO

Level

Intermediate

Prerequisites

Students should have at least basic knowledge of Python's itertools and functional programming paradigms, e.g., map, filter, reduce, and list comprehensions.

Laptops should have Python 3 and the following PyPI libraries installed: bs4, requests, and meza.

Format

Students will work through a series of exercises exploring the use of Python for data mining. The tutorial will involve lessons to introduce concepts; demos that implement the concepts using meza, Beautiful Soup, and requests; and exercises for students to apply the concepts.

OUTLINE

  • [10 min] Part I
    • [2 min] Intro (lecture)
      • Who am I?
      • Topics to cover
      • Format
    • [8 min] Definitions (lecture)
      • What is data?
      • What is data mining?
      • Why is data mining important?
  • [35 min] Part II
    • [15 min] You might not need pandas (demo)
      • Obtaining data
      • Analyzing and Transforming data
    • [20 min] Interactive data gathering (exercise)
  • [45 min] Part III
    • [10 min] Introducing meza (demo)
    • [20 min] Interactive data processing (exercise)
    • [15 min] Q&A