Dave Guarino on technology, government, and change

Just Dropped In: Using interactive breakpoints in Ruby, Python and Javascript to make life better

I pushed my soul in a deep dark hole and then I followed it in
I watched myself crawlin' out as I was a-crawlin' in
I got up so tight I couldn't unwind
I saw so much I broke my mind
I just dropped in to see what condition my condition was in

- Kenny Rogers

Sometimes, when you're early in programming in a new language, darkness washes over you — you've written 70% of the code to do some thing, but then you don't know the syntax to do that next step.

In these cases, it's great to use a breakpoint — a line of code that stops your code there, and lets to interactively type lines with the full context of the code above, but with immediate feedback (this interactive prompt is called a "REPL").

You've probably used a REPL before — it's what happens when you go to a terminal and type irb (Ruby), python, or node without any arguments. In a REPL, you type code, and immediate see what that code returns.

By adding breakpoints, we can enter that almost zen-like immediate feedback loop at any point we want in our code.

Because I love using interactive breakpoints to debug, and often find myself working across languages, here's a quick guide on using them across Python, Ruby, and Javascript (Node).

Python: IPython embed

In Python, we can use a breakpoint with IPython's embed function.

First, make sure you have IPython installed:

pip install ipython

Then we can add a breakpoint by importing embed, and using the embed() function:

from IPython import embed
# Misc code
embed()

Ruby: Pry

In Ruby, we can add a breakpoint with the nifty pry gem.

First, make sure you have pry installed:

gem install pry

Now, require pry in your code, and drop in using binding.pry:

require 'pry'
# Misc code
binding.pry

Node.js: Debugger

In Node.js, we can add a breakpoint with the debug feature of Node.js and adding a debugger line.

The process is slightly trickier than for Ruby or Python, buuut we don't have to install anything this time!

Let's walk through how to use debugger. In our Javascript, we add debugger to create a breakpoint. So say we have simple file called hello.js with the following:

var x = 'hello';
debugger;

We can go to that breakpoint by first running the file with debug added in, i.e. node debug hello.js:

$~: node debug hello.js
< debugger listening on port 5858
connecting... ok
break in hello.js:1
  1 var x = 'hello';
  2 debugger;
  3
debug >

Now, at the debug prompt, we do two things:

  1. Type c and hit enter
  2. Type repl and hit enter
debug> c
break in hello.js:2
  1 var x = 'hello';
  2 debugger;
  3
  4 });
debug> repl
Press Ctrl + C to leave debug repl
>

Congrats! Now you're at an interactive prompt at the breakpoint.

An example: learning how to deal with GeoJSON in Python

Let's say you're exploring the wild and crazy world of open geo, and you're super stoked to play around with GeoJSON — a JSON data format for storing geographic data — and you want to do it in Python.

You're given a GeoJSON file with a list of bars in Oakland where you might get to hear the Descendents, and you want write a Python script to print out the name of each.

Since you're new to this whole geo game, you don't really know how GeoJSON nests its attributes, and want to be able to play with the JSON data as a Python dictionary to figure that out.

So you've written a script that loads the GeoJSON and turns it into a dictionary. (You can get this script and data from GitHub with git clone https://gist.github.com/9a6fc83d1a67939c5110.git)

from IPython import embed
import json

geodata_json_string = open('good_oakland_bars.geojson').read()
geodata_dict = json.loads(geodata_json_string)

embed()

Now, when we run python py_interactive_breakpoint.py, we'll be dropped to an interactive breakpoint to play around with the geodata_dict variable.

By playing around in the REPL — most simply by typing geodata_dict and seeing what it looks like — we figure out that geodata_dict['features'] is an array of the bars, and that GeoJSON stores the non-geographic attributes in a properties dictionary.

So we realize we can access the name of the first bar by doing geodata_dict['features'][0]['properties']['name'] and this opens the door to a nifty for loop to print out those names.

We can even write try writing that for loop in the REPL first, and if it works, we can then migrate it over to our file.

Ta-da! REPL-driven-development, yo.

Parting thought

When coding, always remember the words of The Stranger — sometimes you eat the bar, and sometimes the bar, well, it eats you.

So when you feel like the bar's eating you, try using a breakpoint.

And if that doesn't help, you can always say "fuck it" and write a Lebowski-themed blog post.

If you found this helpful, spotted a problem, or have additional thoughts, I'd love to hear from you at @allafarce or by e-mail.

ETL for America

Now a few months out the intense tunnel of my Code for America fellowship year, I've had a bit more time and mental space to sip coffee by Lake Merritt and reflect on issues of technology and government.

The "experience problem"

A mantra oft-repeated at CfA is that we're "building interfaces to government that are simple, beautiful, and easy to use."

And that should be a core concern: bringing human-centered design and user experience thinking to the interfaces that government presents is important work. Governments all too often tend to privilege legal accuracy over creating an accessible, enjoyable experience for its constituents.

But this "experience problem" is really only one piece of the civic tech puzzle.

Data integration: a whole other can of worms

Many of the problems government confronts with technology are fundamentally about data integration: taking the disparate data sets living in a variety of locations and formats (SQL Server databases, exports from ancient ERP systems, Excel speadsheets on people's desktops) and getting them into a place and shape where they're actually usable.

Among backend engineers, these are generically referred to as ETL problems, or extract-transform-load operations. The notion is that integrating data involves three distinct steps:

  1. Extract: getting data out of some system where it is stored, and where updates are made

  2. Transform: reformatting and reshaping the data in ways that make it usable

  3. Load: putting the transformed data into another system, generally something where analyzing it or combining it with other data is easy for end-users

Let's look at an example:

The Mayor's staff wants to put a simple dashboard on the City web site with building permits data. They'd like a map and some simple counts to provide a lens on local economic development to residents.

  • Building permits are put into software procured in 2002 called PermitManager. IT staff write a script that nightly runs permit_manager_export.exe, which dumps the data (permits.csv) to a shared drive. (extract)

  • The permit data system only contains addresses in raw text, but to map it it needs latitude and longitudes. The GIS team writes a script that every morning takes permits.csv and adds latitude and longitude columns based on the address text. (transform)

  • The City has an open data portal that can generate a web map for any data set on it containing latitude and longitude. Staff write a script that uploads permits-with-latitude-and-longitude.csv to the open data portal every afternoon, and embed the auto-generated web map into the city's web site. (load)

I've explained ETL in the above way plenty of times, and the thing is, almost everyone I talk to finds it easy to understand. They just hadn't thought about it too much.

And one of the foibles here is that many government staff -- particularly those at the high level -- lack the basic technical language to be able to understand the structure of the ETL problem and find and evaluate the resources out there.

The fact that I can go months hearing about "open data" without a single mention of ETL is a problem. ETL is the pipes of your house: it's how you open data.

ETL: a hard ^#&@ing problem

Did you notice in the above example that I have three mentions of city staff writing scripts? Wasn't that weird? Why didn't they use some software that automatically does this? If you Google around about ETL, perhaps the most common question is whether one should use some existing ETL software/framework or just write one's own ETL code from scratch.

This is at the core of the ETL issue: because the very problem of data integration is about bringing together disparate, heterogeneous systems, there isn't really a clear-winner, "out-of-the-box" solution.

Couple with the fact that governments seem to have an almost vampiric thirst for clearly market-dominating, "enterprise" solutions -- c.f., the old adage that "no one ever got fired for hiring IBM" -- and you find yourself confronting a scary truth.

What's more, ETL is actually just an intrinsically hard technical problem. Palantir, a company which is very good with data, essentially solves the problem by throwing engineers at it. They have a fantastic analytic infrastructure at their core, and they pay large sums of money to very smart people to do one thing: write scripts to get clients' data into that infrastructure.

What is to be done?

First a note on what not to do: do not try to buy your way out of this. There is no single solution, no single piece of software you can buy for this. Anyone who tells you otherwise is is being disingenuous. And if you pay someone external to integrate 100% your data right now, you will be paying them in 11 days when you change one tiny part of your system. And I bet it will be at a mark-up.

Here are a few paths forward I see as promising:

  • Build internal capacity: Hire smart, intellectually curious people who learn what they need to know to solve a problem. In fact, don't even start by hiring. Many of these people probably already work with you, but are hobbled by the inability to, say, install new software on their desktop, or by cultural norms that make trying out new things unacceptable.

Because data integration is a hard problem with new challenges each time you approach it, the best way to tackle them is to have motivated people who love to solve new problems, and give them the tools they need (whether that's training, software, or a supervisorial mandate.) To borrow and modify a phrase: let them run Linux.

As Andrew Clay Shafer has said: if you're not building a learning organization, you're losing to one. And I can tell you, governments are, for the most part, not building learning organizations at the moment.

  • Explore the resources out there: I've started putting together a list of ETL resources for government. I'd love contributions. With just the knowledge of the acronym "ETL" and the basics of what it means you can start to think about how you can solve your own data problems with smaller tools (Windows Job Scheduler is analogous to In-N-Out's secret sauce.)

Because it's a generic (not just government) problem, there's also plenty of other resources out there. The data journalism folks have done a great job of writing tutorials that make existing tools accessible, and we need to follow suit (and work with them.)

  • Collaborate for God's sake!: EVERY organization dealing with data is dealing with these problems. And governments need to work together on this. This is where open source presents invaluable process lessons for government: working collaboratively, and in the open, can float all boats much higher than they currently are.

Whether it's putting your scripts on GitHub, asking and answering questions on the Open Data StackExchange, or helping out others on the Socrata support forums, collaboration is a key lever for this government technology problem.

Wanted: government data plumbers

In a forthcoming post, I hope to document some of the concrete experiences and lessons I've had in doing data plumbing in government, most recently some of the exciting experiments we're running in Oakland, California.

But I'm also writing this blog post -- perhaps a bit audaciously -- as a call to arms: all of us doing data work inside government need to start writing more publicly about our processes, hacks, and tools, and collaborating across boundaries.

From pure policy wonks who know just enough VBA to get stuff done to the Unix geeks whose awk knowledge strikes fear into the hearts of most sysadmins, we need to communicate more, and more publicly.

I've coded for America. It was hard, hard work, but incredibly fulfilling. So to my fellow plumbers I say: let's ETL for America.

If these ramblings piqued your interest, I'd love to hear from you at @allafarce or by e-mail.

Hello, world.

This world is made of paper.
-Thunderbirds Are Now!

This is a test of the GitHub Pages Broadcast System. Please do not be alarmed.