This was a live summary and has not been reviewed by the speakers.
Guido Kollerie: Data Analysis with Pandas
Guido works for a company that deals with a lot of questionnaires and other big datasets. They used to process this data in Oracle, with a lot of complex queries that didn’t work very well for them.
As they were using Python, Guido started looking at Pandas, a library that helps you handle data. It’s built on top of NumPy and supports Python 3. Its dependencies are not declared very well, so installation is not entirely trivial.
Series is one of Pandas’ data structures. Guido recommends using the IPython notebook for easy interaction with the data. A series is basically similar to a Python dict. Noteworthy is that in Pandas, values are always strongly typed.
You can select a row by its key, like in a dictionary, but also by its index or a slice, like in a list. In a similar way, you can also run more advanced queries against your series. A powerful feature of Pandas is that you can easily apply operations to your entire dataset, like multiplying each row by a particular integer, or something more flexible with a custom Python function. You can also use this to combine multiple series. Pandas also handles cases like series of different lengths very well, reducing the error handling code needed.
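A minimal sketch of these series operations, using made-up student counts rather than the data from the talk:

```python
import pandas as pd

# Hypothetical data: number of students per study.
students = pd.Series([120, 200, 80], index=["physics", "biology", "history"])

by_key = students["biology"]        # select by key, like a dict
by_position = students.iloc[0]      # select by position, like a list
first_two = students.iloc[:2]       # slice, like a list
large = students[students > 100]    # query: keep only values above 100

doubled = students * 2              # apply an operation to every value
labels = students.apply(lambda n: "big" if n > 100 else "small")

# Combining two series of different length: values are aligned on their
# keys, and a key missing from one side gives NaN instead of an error.
staff = pd.Series([10, 20], index=["physics", "biology"])
ratio = students / staff
```

The last line is where the reduced error handling shows: there is no staff count for history, so its ratio simply ends up as a missing value.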
Dataframes are basically a combination of series, similar to a matrix type. Again, everything is strongly typed. Like with series, rows can be indexed by plain integers, or you can provide your own indexes, like in a dict.
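As an illustration, assuming two small made-up series, a dataframe can be built like this:

```python
import pandas as pd

# Hypothetical data: men and women per study.
men = pd.Series([70, 60], index=["physics", "biology"])
women = pd.Series([50, 140], index=["physics", "biology"])

# Each series becomes a column; the shared keys become the row index.
df = pd.DataFrame({"men": men, "women": women})

# Without an explicit index, rows are simply numbered 0, 1, 2, ...
plain = pd.DataFrame({"men": [70, 60], "women": [50, 140]})

# Every column has a single type.
print(df.dtypes)
```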
Reading input and running queries
Dataframes can be constructed manually, but Pandas can also read CSV files or other formats directly into dataframes. Recently, support for reading dataframes from SQL was added as well. Any dataframe can be written out as CSV or Excel, and it’s also possible to push data back into SQL.
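A small sketch of the CSV round trip. The column names and values are invented, and an in-memory string stands in for a real file. Pandas also has read_sql, to_sql and to_excel, but those need a database connection or extra packages, so they are left out here.

```python
import io
import pandas as pd

# Hypothetical CSV content; a file path works in read_csv as well.
csv_text = "University,Study,Men,Women\nA,Physics,70,50\nB,Biology,60,140\n"
df = pd.read_csv(io.StringIO(csv_text))

# Write the dataframe back out as CSV, without the row index.
out = io.StringIO()
df.to_csv(out, index=False)
round_trip = out.getvalue()
```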
On a dataframe, you can pick out a particular column or row quite easily. It’s similar to selections on series, but a lot more powerful. The result of an operation on series can also be added as a new column in a dataframe.
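Roughly, and again with invented numbers, selecting and adding columns looks like this:

```python
import pandas as pd

# Hypothetical data: men and women per study.
df = pd.DataFrame(
    {"men": [70, 60, 30], "women": [50, 140, 50]},
    index=["physics", "biology", "history"],
)

men = df["men"]                  # one column, as a series
biology = df.loc["biology"]      # one row, by its key
first = df.iloc[0]               # one row, by its position
mostly_women = df[df["women"] > df["men"]]   # rows matching a condition

# The result of an operation on two series becomes a new column.
df["total"] = df["men"] + df["women"]
```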
Dataframes are somewhat similar to tables, and Pandas also supports some operations that can be done on SQL tables. For example, dataframes can be joined.
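The summary does not record which call the talk used for joins. One way to do it in Pandas is merge, sketched here with two hypothetical tables and made-up numbers:

```python
import pandas as pd

# Hypothetical tables that share a "code" column.
students = pd.DataFrame(
    {"code": ["uva", "rug", "tud"], "students": [30000, 28000, 21000]}
)
cities = pd.DataFrame({"code": ["uva", "rug"], "city": ["Amsterdam", "Groningen"]})

# Like an INNER JOIN: only codes present in both tables.
inner = pd.merge(students, cities, on="code")

# Like a LEFT JOIN: every row of students, with NaN where no city is known.
left = pd.merge(students, cities, on="code", how="left")
```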
Guido shows us an example using the number of students at various universities. Reading from CSV, he first uses rename to rename all the columns to their lowercase form, and filters a subselection of columns. Pandas also supports groupby on dataframes, similar to GROUP BY in SQL, including aggregate functions.
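A sketch of those steps on a tiny invented CSV. The column names are assumptions, and agg is just one possible way to apply aggregate functions:

```python
import io
import pandas as pd

# Hypothetical CSV with uppercase column names.
csv_text = """UNIVERSITY,STUDY,MEN,WOMEN,YEAR
A,Physics,70,50,2013
A,Biology,60,140,2013
B,Physics,40,20,2013
"""
df = pd.read_csv(io.StringIO(csv_text))

df = df.rename(columns=str.lower)          # lowercase every column name
df = df[["university", "men", "women"]]    # keep a subselection of columns

# Like SELECT university, SUM(men), SUM(women) ... GROUP BY university
totals = df.groupby("university").sum()

# A different aggregate function per column.
stats = df.groupby("university").agg({"men": "sum", "women": "mean"})
```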
To make a nice graph, the data will need to be sorted. Dataframes have a fairly flexible sort() function. You define the direction of the sorting in Pandas with a number: 0 is for rows, 1 is for columns. This can be a little confusing at first. It takes a total of about six lines to get from a CSV file to nice-looking graphs.
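A note for anyone trying this with a current Pandas: the sort() method mentioned in the talk has since been removed in favour of sort_values and sort_index. The sketch below uses those, on invented data, and assumes the 0 and 1 refer to the axis argument:

```python
import pandas as pd

# Hypothetical data: men and women per study.
df = pd.DataFrame(
    {"women": [50, 140, 50], "men": [70, 60, 30]},
    index=["physics", "biology", "history"],
)

# Sort the rows on the values of one column, largest first.
by_value = df.sort_values("men", ascending=False)

# axis=0 sorts the row labels, axis=1 sorts the column labels.
by_row_label = df.sort_index(axis=0)
by_col_label = df.sort_index(axis=1)
```

Calling plot() on the sorted result would draw the graph, provided matplotlib is installed.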
Next, we look at the differences between men and women in various studies. This also takes only a handful of lines: group by the name of the study, e.g. psychology, calculate the difference between the sum of men and women, sort on the biggest difference, take the top and bottom five, and plot. And we have a plot with the ratio of men to women for the most extreme differences.
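The steps above can be sketched as follows. The numbers are invented, the column names are assumptions, and the plotting step is left out:

```python
import pandas as pd

# Hypothetical data: one row per study per university.
df = pd.DataFrame({
    "study": ["psychology", "psychology", "physics", "physics", "law"],
    "men": [20, 30, 70, 40, 50],
    "women": [80, 90, 30, 10, 55],
})

# Sum men and women per study, then calculate the difference.
per_study = df.groupby("study")[["men", "women"]].sum()
per_study["difference"] = per_study["men"] - per_study["women"]

# Sort on the difference and keep both extremes.
ranked = per_study.sort_values("difference")
n = 1  # the talk used five; this toy dataset only has three studies
extremes = pd.concat([ranked.head(n), ranked.tail(n)])
```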
The example Guido shows uses just a very tiny dataset, but Pandas also performs very well on larger datasets. Queries on even millions of rows are nearly instant.
Other examples Guido shows, each in just a handful of lines, are listing all the different computer science studies and how many universities offer them, and pivoting the angle and grouping of the data. This last feature can also be used for stacking and unstacking data. There are many more features that make working with your data much simpler and much faster, and save a lot of boilerplate and error handling code.
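A rough idea of what counting, pivoting and stacking can look like, on a made-up table. The talk's actual code is not in this summary, so the calls below are only one possible way:

```python
import pandas as pd

# Hypothetical data: students per study per university.
df = pd.DataFrame({
    "university": ["A", "A", "B", "B"],
    "study": ["cs", "ai", "cs", "ai"],
    "students": [100, 40, 80, 10],
})

# How many universities offer each study.
offered = df.groupby("study")["university"].nunique()

# Pivot: one row per university, one column per study.
wide = df.pivot(index="university", columns="study", values="students")

# Stack the columns back into rows, and unstack them again.
long = wide.stack()
back = long.unstack()
```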
The documentation can be a little brief. Guido recommends “Python for Data Analysis: Data Wrangling with Pandas, NumPy and IPython”.
Sasha Romijn: Clean Python code
I have yet to master the skill of making live summaries of my own sessions. However, the gist is quite simple: use your brain, don’t blindly follow rules and buy this awesome book.
Besma Mcharek: On learning how to combine flask, open data, raspberry pi and google-maps
Besma started as a Python developer, but also picked up Rails later. She was inspired to look at Flask after the last PyLadies Amsterdam meetup. Flask can easily run on a Raspberry Pi, making it very simple for a newbie to create a basic web application, hosted on affordable hardware.
Besma also teaches Python to random citizens who are eager to learn. As she is on a limited budget, Raspberry Pis are great test platforms for her students. She was also inspired to build this application by a new Google Maps extension for Flask, and by her work on making technology and code more understandable to civil servants working on publishing open data.
In her application, she uses the flask-googlemaps package, along with Flask itself. The extension makes it very simple to include a basic Google map inside the application, without requiring all the Google Maps boilerplate. She wanted to use data about something that really matters to her. She has a four-year-old daughter, and will have to choose a school for her soon. Using the existing open data for this was non-trivial, as the encoding was not UTF-8. In the end, she managed to turn this into a basic table in a Flask application as a first step.
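The summary does not say which encoding the open data actually used. Purely as an illustration, assuming Latin-1 and a semicolon-separated file with invented column names, reading such data with the standard library could look like this:

```python
import csv
import io

# Hypothetical raw bytes of a school CSV, encoded as Latin-1 (an assumption).
# "\u00e9" is an e with an acute accent.
raw = "naam;plaats\nCaf\u00e9 School;Amsterdam\n".encode("latin-1")

# Decode with the right encoding first; decoding these bytes as UTF-8
# would raise a UnicodeDecodeError.
text = raw.decode("latin-1")

# Parse the decoded text into one dict per row.
rows = list(csv.DictReader(io.StringIO(text), delimiter=";"))
```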
Next was to display the data on a Google map. With the Google Maps extension, this wasn’t very difficult. There are, however, 448 schools, including universities. So the first step is to filter on primary schools, of which there are 353. That’s still a lot, so selecting nearby schools is a good next step. That’s not supported by Google Maps though, so she had to do it manually. As this only covers a very small area, she could stick to a simple algorithm for this. That filters it down far enough to be usable.
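Which algorithm Besma used is not recorded here. One simple option for a small area is to treat the map as flat, as in this sketch with invented coordinates and hypothetical function names:

```python
import math

def distance_km(lat1, lon1, lat2, lon2):
    """Approximate distance in km, treating the area as flat.

    Good enough within a single city; not suitable for long distances.
    """
    km_per_degree = 111.32  # approximate length of one degree of latitude
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = (lon2 - lon1) * km_per_degree * math.cos(mean_lat)
    dy = (lat2 - lat1) * km_per_degree
    return math.hypot(dx, dy)

def nearby(schools, home, radius_km):
    """Return the schools within radius_km of the (lat, lon) pair home."""
    return [
        s for s in schools
        if distance_km(home[0], home[1], s["lat"], s["lon"]) <= radius_km
    ]

# Hypothetical schools and home location.
schools = [
    {"name": "School A", "lat": 52.370, "lon": 4.895},
    {"name": "School B", "lat": 52.400, "lon": 4.950},
]
home = (52.372, 4.900)
close = nearby(schools, home, radius_km=1.0)
```

Only the schools left in close would then be passed on as markers for the map.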
She really enjoyed writing this, and it made choosing a school a lot simpler, which would have been impossible with just the CSV.