Data Mining and Processing for fun and profit

Speaker: Reuben Cummings

Type: Tutorial

Room: Tugela Room

Time: Oct 07 (Fri), 13:45

Duration: 1:30

AUDIENCE

  • data scientists (current and aspiring)
  • those who want to know more about data mining, analysis, and processing
  • those interested in functional programming

DESCRIPTION

Data mining is a key skill that involves transforming data found online and elsewhere from a hodgepodge of numbers into actionable information. Using examples ranging from RSS feeds and open data portals to web scraping, this tutorial will show you how to efficiently obtain and transform data from disparate sources.

ABSTRACT

Data mining is a key skill that any self-proclaimed data scientist should possess. It involves transforming a hodgepodge of numbers from disparate sources into actionable information. Tabular data, e.g., CSV and Excel files, is very common in data mining and benefits greatly from Python's functional programming idioms. For better or for worse, the leading Python data libraries, NumPy and Pandas, eschew the functional programming style for object-oriented programming.

Using examples that include RSS feeds, the South Africa Data Portal API, raw Excel files, and basic web scraping, this tutorial will show how to efficiently locate, obtain, transform, and remix data from the web. These examples will demonstrate that you can do a lot with functional programming, without the need for NumPy or Pandas.
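As a rough illustration of the functional style described above, the following sketch totals a column of tabular data per group using only the standard library. The sample data, the column names, and the helper names `read_records` and `total_by` are made up for this example and are not taken from the tutorial materials.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Made-up sample data standing in for a downloaded CSV file
RAW = "region,amount\nnorth,10\nsouth,5\nnorth,7\nsouth,3\neast,1\n"


def read_records(text):
    """Lazily yield one dict per CSV row (hypothetical helper)."""
    return csv.DictReader(io.StringIO(text))


def total_by(records, key, field):
    """Sum an integer field for each distinct value of key (hypothetical helper)."""
    keyfunc = itemgetter(key)
    # groupby only merges adjacent rows, so sort by the grouping key first
    ordered = sorted(records, key=keyfunc)
    return {
        name: sum(int(row[field]) for row in group)
        for name, group in groupby(ordered, keyfunc)
    }


totals = total_by(read_records(RAW), "region", "amount")
print(totals)
```

Each step is a plain function over an iterable of dicts, so the steps can be recombined freely without a DataFrame.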

Finally, it will introduce meza: a pure-Python, functional data analysis library and alternative to Pandas.

IPython notebooks and sample data files will be made available on GitHub beforehand.

OBJECTIVES

Attendees will learn what data and data mining are and why they are important. They will learn some basic functional programming idioms and see how well these idioms suit data mining. They will also see in which areas the 20lb gorilla (Pandas) shines and when a lightweight alternative (meza) is more practical.

ADDITIONAL INFO

Level

Intermediate

Prerequisites

Students should have at least basic knowledge of Python's itertools module and functional programming constructs, e.g., map, filter, reduce, and list comprehensions.
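As a quick self-check, the constructs named above look roughly like this in Python 3. The list of numbers and the variable names are invented for illustration only.

```python
from functools import reduce
from itertools import count, islice

# Invented sample values
readings = [3, 8, 15, 4, 23, 42]

# map: apply a function to every item
doubled = list(map(lambda x: x * 2, readings))

# filter: keep only the items that pass a test
large = list(filter(lambda x: x > 10, readings))

# reduce: fold the items into a single value (lives in functools in Python 3)
total = reduce(lambda acc, x: acc + x, readings, 0)

# list comprehension: map and filter in one expression
even_squares = [x ** 2 for x in readings if x % 2 == 0]

# itertools: take a finite slice of an infinite, lazily evaluated stream
multiples_of_7 = list(islice((n for n in count(1) if n % 7 == 0), 3))

print(doubled, large, total, even_squares, multiples_of_7)
```

If each line reads naturally, the prerequisite is met.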

Laptops should have Python 3 and the following PyPI packages installed: bs4, requests, and meza.

Format

Students will be guided through a series of exercises that explore using Python for data mining. The tutorial will combine lessons that introduce concepts; demos that implement the concepts using meza, Beautiful Soup, and requests; and exercises in which students apply the concepts.

OUTLINE

  • [10 min] Part I
    • [2 min] Intro (lecture)
      • Who am I?
      • Topics to cover
      • Format
    • [8 min] Definitions (lecture)
      • What is data?
      • What is data mining?
      • Why is data mining important?
  • [35 min] Part II
    • [15 min] You might not need pandas (demo)
      • Obtaining data
      • Analyzing and transforming data
    • [20 min] Interactive data gathering (exercise)
  • [45 min] Part III
    • [10 min] Introducing meza (demo)
    • [20 min] Interactive data processing (exercise)
    • [15 min] Q&A