I400/590 : Large-Scale Social Phenomena - Data Mining Demo
Introduction
For your mid-term hack-a-thons, you will be expected to quickly acquire, analyze and draw conclusions from some real-world datasets. The goal of this tutorial is to provide you with some tools that will hopefully enable you to spend less time debugging and more time generating and testing interesting ideas.
Here, I chose to focus on Python. It is a beautiful language that is quickly developing an ecosystem of powerful and free scientific computing and data mining tools (see, e.g., "The homogenization of scientific computing, or why Python is steadily eating other languages' lunch"). For this reason, as well as my own familiarity with it, I encourage (though certainly do not require) you to use it for your mid-term hack-a-thons. In my experience, getting comfortable with these tools pays off by making many future data analysis projects (including perhaps your final projects) easier and more enjoyable.
Hopefully you already have Python installed. If you are new to it, search around for good introductory tutorials -- I'd say it has a forgiving learning curve, comparatively speaking.
IPython
IPython is an add-on for Python that brings several improvements. Most important for us are its interactive, graphical notebooks, which provide a great way to quickly develop and share code. A gallery of interesting notebooks is provided here, including:
Libraries
In this demo, we will demonstrate the basic functionality of several useful toolboxes, including:
You can find more useful links at Python for data analysis: the landscape of tutorials.
Data
Even simple analysis can extract interesting results from good data, but nothing can make up for bad data. There are a few places to find potential data sets, including publicly available data sources (Tableau software provides some data sets, and there is a directory of APIs and data sources at ProgrammableWeb).
Another common strategy is to scrape data from the web. Kimono has built an automatic tool for this (which I haven't used, but it looks impressive). Python also has several scraping tools (many discussed here). We will focus on the combination of mechanize (which simulates a browser to download HTML) and BeautifulSoup (which parses the downloaded HTML). There are some good tutorials showing how to use mechanize and BeautifulSoup here and here.
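As a minimal sketch of that two-step workflow (download with mechanize, parse with BeautifulSoup), the snippet below pulls every link out of a page. The helper names fetch_html and extract_links, the sample HTML, and the choice of the built-in "html.parser" backend are illustrative assumptions rather than part of the original demo. The download helper is defined but not called, so the example runs offline.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed


def fetch_html(url):
    """Download a page with mechanize (hypothetical helper; not called below)."""
    import mechanize  # imported here so the parsing demo runs without mechanize

    browser = mechanize.Browser()
    response = browser.open(url)
    return response.read()


def extract_links(html):
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


# Made-up HTML standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="/a">First</a>
  <p>Some text</p>
  <a href="/b"> Second </a>
  <a>No link here</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
for text, href in links:
    print(text, href)
```

To scrape a live page, you would pass the result of fetch_html(url) to extract_links instead of SAMPLE_HTML. Check a site's terms of use before scraping it.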
Installation
There are two ways to install the aforementioned tools. The first is to use a Python distribution that already comes with all of these included, such as Anaconda.
The second option is to use the Python installer. Once you have Python installed, run the following on the command line:
You may want to run this as administrator to install them system-wide. If this option doesn't work for you, I recommend trying Anaconda.
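Whichever route you take, it can help to confirm that the packages are importable before the hack-a-thon starts. The following is a small sketch using only the standard library. The helper name missing_packages and the REQUIRED list are assumptions for illustration; the list covers only the two scraping tools named earlier, so extend it with whatever else you plan to use.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for the two scraping tools mentioned above; "bs4" is the
# import name of BeautifulSoup. This list is an illustrative assumption.
REQUIRED = ["mechanize", "bs4"]

missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages were found.")
```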
Demo