Thoughts on joining Cloudera

After some unanticipated media leaks (here and here), I'm very excited to finally share that my team and I are joining Cloudera. You can find all the concrete details in those articles, but I wanted to give a more personal perspective on the move and what we see in the future inside Cloudera Engineering.

Chang She and I conceived DataPad in 2012 while we were building out pandas and helping the PyData ecosystem get itself off the ground. I was writing a book, and every 6 weeks or so we were cranking out another pandas release and watching the analytics ecosystem evolve. We saw a clear need for a next-generation business intelligence / visual analytics product, and set about getting the resources to build it. Many BI products in the ecosystem are designed to be a visualization and reporting layer for a database of some kind, and typically interact with the data store via SQL. Ten years ago this was perfectly adequate for most users, but now, in the 2010s, more and more businesses are grappling with business data spread across many different cloud silos, and the amount of planning and ETL work needed to make an existing BI system “fit” is frequently prohibitive.

There are droves of data visualization and reporting companies building “new BI” for the web / cloud, but in many cases using some variant of the “it runs SQL queries against your database” architecture. We made an early decision to design a new architecture that enables the BI process to be more agile and iterative, sort of a “bring all your data, and figure things out as you go” approach. In addition to our visual web interface (you can think of it as a sort of “Google Docs for Visual Analytics”), this led us to build substantial novel backend infrastructure for data management and analytics. We also decided early on that we wanted to push the limits of speed and interactivity when working with larger data sets. The system we built, code-named Badger, delivered an interactive analytics experience that exceeded our own expectations in performance and truly delighted our users.

I’ve been following Cloudera’s engineering efforts with great interest over the last few years, especially after the launch of Impala in 2012, and as I predicted they are continuing to lead the way in high performance analytics in the Hadoop ecosystem. As Cloudera had been a long-time supporter and fan of our work on pandas and DataPad, it became clear in our periodic catch-ups that we had been thinking about and tackling similar backend and distributed systems problems.

As a major inflection point in DataPad’s lifespan approached (forthcoming GA, paying customers, more venture capital), we took a hard look at the pressing technology problems and where we could make the most impact. As big a deal as collaboration and beautiful, functional design for visual analytics are, it was clear to us that what’s holding back next-generation BI/visual analytics is much more on the data management and systems side. So the question came down to: continue building a standalone data product and duke it out in the marketplace, or use our systems expertise to accelerate the rising tide that will lift all ships (i.e. benefit all BI / analytics vendors). In the latter case, Cloudera is clearly the place to be.

Systems engineering problems aside, we have done a lot of work on improving developer tooling for data work, and productivity and user happiness with these new technologies continue to be a major interest for us. The Cloudera team recognizes the importance of better tooling and has already made substantial open source investments in this area (Crunch, Oryx, and Impyla, to name a few). On this front, we’re really looking forward to working together; there is a lot of work to be done to enable developers and data scientists to get value out of their data faster and more intuitively.

Strata NYC 2013 and PyData 2013 Talks

I was excited to speak at two recent data-centric conferences in New York. The talks touch on related subjects, with the PyData talk being much more technical, digging into low-level pandas architecture and the engineering work I've been doing this year at DataPad.

Before anyone yells at me: I'm going to revisit the PostgreSQL benchmarks in my PyData talk at some point, as the performance would be a lot better with the data stored in fully normalized form (a single fact table containing only primitive data types, with auxiliary tables for all of the categorical / string data; laborious, but it's the way a DBA would set it up, rather than a single table full of VARCHAR columns).
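As a rough pandas-side illustration of what that normalization buys you, pandas.factorize splits a column of repeated strings into integer codes plus a table of unique values, much like a fact table paired with a string lookup table (the column here is made up):

    import pandas as pd

    # A denormalized column of repeated strings (made-up example)
    parties = pd.Series(['Democrat', 'Republican', 'Democrat',
                         'Independent', 'Republican'])

    # codes plays the role of the fact-table column of primitive ints;
    # uniques is the auxiliary lookup table of distinct strings
    codes, uniques = pd.factorize(parties)
    print(codes)
    print(uniques)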

Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)

Building Better Analytics Workflows (Strata-Hadoop World 2013)

PyCon Singapore 2013

I was graciously invited to give the keynote presentation at this year's PyCon Singapore. Luckily, I love to hack on long plane rides. See the slides from the talk below. I showed some analytics on Python posts on Stack Overflow during the talk; here is the IPython notebook. The raw data is right here.

I also gave a half-day pandas tutorial; here is the IPython notebook. You will need the data to work through it yourself; here's a download link.

PyCon Singapore 2013 Keynote

I’m moving to San Francisco. And hiring

PyCon and PyData 2013 were a blast this last week. Several people noted that my GitHub activity on pandas hasn’t quite been the same lately and wondered if I was getting a little burned out. I’m happy to say quite the opposite; I’ll still be involved with pandas development (though not the 80 hours/week of the last 2 years), but I’m starting an ambitious new data project that I’m looking forward to sharing later this year. This endeavor is also taking me from New York to San Francisco. I’m sad to be leaving the (vibrant and growing) New York data community, but also looking forward to spending more time with my data hacking friends in the Bay Area.

While I can’t share too many details about the new startup, anyone who knows me knows my passion for innovation in data tooling and making people and organizations more productive in their data analysis. pandas was Act 1! So, I’m building a world class team of engineers, designers, and forward thinkers to join me in this effort. Are you one of them? If so I look forward to hearing from you (my e-mail address can be easily located on GitHub):

Front-end and Data Visualization Engineer

You will be building a richly featured web application that will stretch the capabilities of modern web browsers. You will build core UI components and work with the UX designer and backend team to make everything work seamlessly.

  • Extensive JavaScript and CSS experience.
  • Experience with one or more SVG or Canvas-based visualization toolkits (e.g., D3.js), or a keen interest in learning one. Maybe you've spent a lot of time on http://bl.ocks.org. Extra points if you have opinions about the Grammar of Graphics (or its implementations, like ggplot2).
  • Know the ins and outs of WebSockets and AJAX communication with various backend data services.
  • Understand data binding and have used MV* frameworks enough to be dangerous.
  • Prior data analysis experience (even at the Excel level) is very useful.

Data Engineer

You are a pragmatic, performance-motivated cruncher of bytes who knows what it means to ship code on tight deadlines. You have high standards but are willing to make tradeoffs to get stuff done. You and I will spend a lot of time at the whiteboard talking about the nuts and bolts of data processing. Some of these things may describe you:

  • Experience building performance- and latency-sensitive, data-driven analytical applications.
  • Knowledge of standard data structures and algorithms for data processing, their implementation details, and their performance tradeoffs (hash tables, vectors, binary trees, sorting algorithms, etc.). Maybe you have already enjoyed reading my blog.
  • Knowledge of binary data formats, serialization schemes, compression, and other IO performance considerations. Familiarity with a variety of database technologies.
  • Python and C/C++ experience preferred. Extra points if you have programmed in an APL dialect (J or K/Q/Kona) or solved 100 or more problems on Project Euler.
  • You are a firm believer in unit testing and continuous integration.
  • Experience building distributed data systems and analytics tools such as Spark, Crunch, or Pig is a plus. Maybe you loved the Google Dremel white paper.
  • Experience with code generation (e.g. LLVM) or compiler technology a big plus.

UX Designer

You will be responsible for web design and crafting a compelling user experience. You will work intimately with the front-end / data visualization team to create a consistent look and feel throughout the product. We would prefer a designer who can also do her/his own JS/CSS implementation work if necessary. Any experience with data analysis tools, from Excel to Spotfire to Matlab (with opinions about what makes each of them easy, or terrible, to use), would be strongly preferred.

Full Stack Web Engineer

You will play a critical role in building a scalable, reliable web application, and generally keeping the trains running on time. You should be a jack of many trades with an interest in learning many more. Here are some desirable qualities:

  • Extensive experience using Python, JavaScript (Node.js), etc. to build scalable, data-intensive web applications, along with strong opinions about the right technology to use.
  • Comfort with managing continuous (or highly frequent) deployments.
  • Experience using and managing SQL (e.g., Postgres) and NoSQL (e.g., MongoDB) databases.
  • Experience building applications on EC2 or other cloud computing services.

Product Lead

You will run product at the company, working to gain a deep understanding of customer use cases and partnering with the engineering team to drive product-market fit. Prior experience in the data analytics or business intelligence space would be very helpful.

Whirlwind tour of pandas in 10 minutes

10-minute tour of pandas from Wes McKinney on Vimeo.

Update on upcoming pandas v0.10, new file parser, other performance wins

We’re hard at work as usual getting the next major pandas release out. I hope you’re as excited as I am! An interesting problem came up recently with the ever-popular FEC Disclosure database used in my book and in many pandas demos. The powers that be decided it would be cool to put commas at the end of each line, fooling most CSV readers into thinking there are empty fields at the end of each line.

By default, pandas's file parsers will treat the first column as the DataFrame's row index if the data have one too many columns, which is very useful in a lot of cases, but not here. So I made it possible to pass index_col=False, which results in the last (empty) column being dropped as desired. The FEC data file is now about 900MB and takes only 20 seconds to load on my spinning-rust box.
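A minimal sketch of what this looks like; the tiny trailing-comma file below is a made-up stand-in for the FEC data:

    import pandas as pd
    from io import StringIO

    # The header declares three columns, but every data row ends with a
    # trailing comma and tokenizes into four fields (a made-up stand-in
    # for the FEC file's layout)
    data = StringIO('a,b,c\n1,2,3,\n4,5,6,\n')

    # index_col=False: don't treat the extra field as an implicit row
    # index; per the behavior described above, the trailing empty
    # column is dropped instead
    df = pd.read_csv(data, index_col=False)
    print(df)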

For reference, it's more difficult to load this file in R (2.15.2), both because of its size and its malformedness (hopefully an R guru can tell me how to deal with this trailing delimiter crap). Setting row.names=NULL causes incorrect column labelling but at least gives us a parsing + type inference performance number: about 10x slower, or faster if you specify all 18 column data types.

If you know much about this data set, you know most of these columns are not interesting to analyze. New in pandas v0.10, you can specify a subset of columns directly in read_csv, which results in both much faster parsing and lower memory usage (since we throw away the data from the other columns after tokenizing the file).
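A sketch of what that looks like using the usecols keyword; the column names here are recalled from the FEC file and may not match exactly:

    import pandas as pd

    # Only the requested columns are converted and kept after tokenizing
    fec = pd.read_csv('P00000001-ALL.csv', index_col=False,
                      usecols=['cand_nm', 'contbr_st', 'contb_receipt_amt'])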

Outside of file parsing, a huge amount of work has been done elsewhere on pandas (aided by Chang She, Yoval P, Jeff Reback, and others), and performance has improved in many other critical operations (check out the groupby numbers!). Here's the output of a recent vbench run showing the latest dev version versus version 0.9.0 (numbers less than 1 indicate that the current pandas version is faster on average by that ratio):

A new high performance, memory-efficient file parser engine for pandas

TL;DR: I've finally gotten around to building the high performance parser engine that pandas deserves. It hasn't been released yet (it's in a branch on GitHub), but it will be after I give it a month or so for any remaining buglets to shake out.

A project I’ve put off for a long time is building a high performance, memory efficient file parser for pandas. The existing code up through and including the imminent pandas 0.9.0 release has always been makeshift; the development focus has been on parser features over the more tedious (but actually much more straightforward) issue of creating a fast C table tokenizer. It’s been on the pandas roadmap for a long time:

http://github.com/pydata/pandas/issues/821

pandas.read_csv from pandas 0.5.0 onward is actually very fast (faster than R and much faster than numpy.loadtxt), but it uses a lot of memory. I wrote about some of the implementation issues about a year ago here. The key problem with the existing code is this: all of the existing parsing solutions in pandas, as well as in NumPy, first read the file data into pure Python data structures, a list of tuples or a list of lists. If you have a very large file, a list of 1 million or 10 million Python tuples has an extraordinary memory footprint, significantly greater than the size of the file on disk (5x or more, which is far too much). Some people have pointed out the large memory usage without correctly explaining why, but this is the one and only reason: too many intermediate Python data structures.
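To get a rough sense of the overhead (a toy sketch, not the actual parser code), compare the footprint of a list of tuples against the equivalent NumPy array:

    import sys
    import numpy as np

    n = 1000000

    # What a pure Python tokenizer accumulates: one tuple of boxed floats per row
    rows = [(float(i), float(i) * 2.0, float(i) * 3.0) for i in range(n)]
    # What we ultimately want: a contiguous float64 array
    arr = np.zeros((n, 3), dtype=np.float64)

    list_bytes = sys.getsizeof(rows)
    list_bytes += sum(sys.getsizeof(t) for t in rows)              # tuple objects
    list_bytes += sum(sys.getsizeof(v) for t in rows for v in t)   # boxed floats

    print(list_bytes)   # well over 100 MB of pure Python overhead
    print(arr.nbytes)   # 24000000 bytes: 1M rows * 3 columns * 8 bytes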

Building a good parser engine isn’t exactly rocket science; we’re talking optimizing the implementation of dirt simple O(n) algorithms here. The task is divided into several key pieces:

  • File tokenization: read bytes from the file, identify where fields begin and end and which column each belongs to. Python’s csv module is an example of a tokenizer. Things like quoting conventions need to be taken into account. Doing this well in C is about picking the right data structures and making the code lean and mean. To be clear: if you design the tokenizer data structure wrong, you’ve lost before you’ve begun.
  • NA value filtering: detect NA (missing) value sentinels and convert them to the appropriate NA representation. Examples of NA sentinels are NA, #N/A, or other bespoke sentinels like -999. Practically speaking, this means keeping a hash set of strings considered NA and checking whether each parsed token is in the set (and you can have different NA sets for each column, too!). If the number of sentinel values is small, you could use an array of C strings instead of a hash set. A conceptual Python sketch of this step, together with type inference, appears after this list.
  • Tolerating “bad” rows: Can aberrant rows be gracefully ignored with your consent? Is the error message informative?
  • Type inference / conversion: Converting the tokens in the file to the right C types (string, date, floating point, integer, boolean).
  • Skipping rows: Ignore certain rows in the file, or at the end of the file.
  • Date parsing / value conversion: Convert one or more columns into timestamps. In some cases concatenate date/time information spread across multiple columns.
  • Handling of “index” columns: Handle row names appropriately, yielding a DataFrame with the expected row index.
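To make the NA filtering and type inference steps concrete, here is a conceptual sketch in pure Python; the real engine does this in C over raw bytes, so the names and structure here are illustrative only:

    import numpy as np

    NA_SENTINELS = {'', 'NA', '#N/A', 'NULL'}   # can differ per column

    def convert_column(tokens, na_values=NA_SENTINELS):
        """Try integer, then float, then fall back to an object column."""
        has_na = any(tok in na_values for tok in tokens)
        for caster, dtype in ((int, np.int64), (float, np.float64)):
            if caster is int and has_na:
                continue  # integer columns can't hold NaN; promote to float
            try:
                values = [np.nan if tok in na_values else caster(tok)
                          for tok in tokens]
                return np.array(values, dtype=np.float64 if has_na else dtype)
            except ValueError:
                continue
        # Leave strings as Python objects, with NaN for the NA sentinels
        return np.array([np.nan if tok in na_values else tok for tok in tokens],
                        dtype=object)

    print(convert_column(['1', '2', 'NA', '4']))   # float64 array with NaN
    print(convert_column(['a', 'b', '', 'd']))     # object array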

None of this is that hard; it's made much more time-consuming by the proliferation of fine-grained options (and the resulting “parameter hell”). Anyway, I finally mustered the energy to hack it out over a few intense days in late August and September. I'm hoping to ship it in a quick pandas 0.10 release (“version point-ten”) toward the end of October if possible. It would be nice to push this code upstream into NumPy to improve loadtxt and genfromtxt's performance as well.

Benchmarks against R, NumPy, Continuum's IOPro

Outside of parser features (i.e. “can the tool read my file correctly?”), there are two performance areas of interest:

  • CPU Speed: how long does it take to parse the file?
  • Memory utilization: what's the maximum amount of RAM used while the file is being parsed (including the final returned table)? There's really nothing worse than your computer starting to swap when you try to parse a large file.

I'll compare the new pandas parser engine against several tools that you can use to do the same job:

  • R's venerable read.csv and read.table functions
  • numpy.loadtxt (a pure Python parser, to be clear)
  • The new pandas engine, via pandas.read_csv and read_table
  • IOPro, a new commercial library from my good friends at Continuum Analytics

To do the performance analysis, I'll look at 5 representative data sets:

  • A 100,000 x 50 CSV matrix of randomly generated 0s and 1s.
  • A 1,000,000 x 10 CSV matrix of randomly generated, normally distributed data.
  • The Federal Election Commission (FEC) data set as a CSV file, one of my favorite example data sets for talking about pandas.
  • Wikipedia page count data used for benchmarks in this blog post; it's delimited by single spaces and has no column header.
  • A large numerical astronomy data set used for benchmarks in this blog post.

Here's a link to an archive of all the datasets (warning: about 500 megabytes): Table datasets

I don't have time to compare features (which vary greatly across the tools).

Oh, and my rig:

  • Core i7 950 @ 3.07 GHz
  • 24 GB of RAM (so we won't get close to swapping)
  • OCZ Vertex 3 SATA 3 SSD

Because I have an SSD, I would expect the benchmarks on spinning rust to differ by a roughly constant amount, based on the read time for slurping the bytes off the disk. In my case, the disk reads aren't a major factor. In corporate environments with NFS servers under heavy load, you would expect similar reads to take a bit longer.

CPU Performance benchmarks

numpy.loadtxt is really slow, so I'm excluding it from the benchmarks; on the smallest and simplest file here, it's more than 10 times slower than the new pandas parser.

Here are the results for everybody else (see the code at the end of the post). The raw numbers, in seconds:

    In [30]: results
    Out[30]:
                       iopro    pandas       R
    astro          17.646228  6.955254  37.030
    double-matrix   3.377430  1.279502   6.920
    fec             3.685799  2.306570  18.121
    wikipedia      11.752624  4.369659  42.250
    zero-matrix     0.673885  0.268830   0.616

IOPro vs. the new pandas parser: a closer look

But hey, wait a second. If you are intimately familiar with IOPro and pandas, you will already be saying that I am not making an apples-to-apples comparison. True. Why not?

  • IOPro does not check for and substitute common NA sentinel values (I believe you can give it a list of values to check for; the documentation was a bit hard to work out in this regard).
  • IOPro returns NumPy arrays with a structured dtype. The pandas DataFrame has a slightly different internal format, and strings are boxed as Python objects rather than stored in NumPy string-dtype arrays.

To level the playing field, I'll disable the NA filtering logic in pandas (passing na_filter=False) and instruct the parser to return a structured array instead of a DataFrame (as_recarray=True). I'll also look only at the numerical datasets (excluding wikipedia and fec for now) to remove the impact of string handling.
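For concreteness, here's a minimal sketch of the leveled pandas call using the two options just described (the filename is a stand-in for the astronomy dataset):

    import pandas as pd

    # Skip NA sentinel checking and return a NumPy structured array
    # instead of a DataFrame, mirroring what IOPro produces
    # ('astro.csv' is a stand-in for the astronomy dataset's filename)
    arr = pd.read_csv('astro.csv', na_filter=False, as_recarray=True)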

Looking at the resulting graph (with relative timings), the savings from not passing all the tokens through the NA filter appear to be balanced by the cost of transferring the column arrays into the structured array (which is a raw array of bytes interpreted as a table by NumPy). This could very likely be made faster (more cache-efficient) than it currently is with some effort.

Memory usage benchmarks

Profiling peak memory usage is a tedious process. The canonical tool for the job is Massif from the Valgrind suite. I'm not yet done obsessing over memory allocation and data management inside the parser system, but here's what the numbers look like compared with R and IOPro. I'm using the following valgrind commands (plus ms_print) to get this output (if this is not correct, someone please tell me):

I'll use the largest file in this post, the astro numerical dataset.

First, IOPro advertises a very low memory footprint. It does not, however, avoid having 2 copies of the data set in memory (I don't either; it's actually very difficult, and costly, to avoid this). Here is the final output of ms_print showing peak memory usage at the very end, when the structured array is created and returned:

Let's look at R. Peak memory allocation comes in slightly under IOPro, at 903MM bytes vs. 912MM:

For the new pandas parser, I'll look at 2 things: memory allocation by the parser engine before creation of the final DataFrame (which causes data doubling, as with IOPro), and the user-facing read_csv. First, using read_csv (which also creates a simple integer Index for the DataFrame) peaks at 1014MM bytes, about 10% more than either of the above:

The parser engine alone (which returns a dict of arrays, i.e. no data doubling) uses only 570MM bytes:

Memory usage with non-numerical data depends on a lot of issues surrounding the handling of string data. Let's consider the FEC data set, where pandas does pretty well out of the box, using only 415MM bytes at peak (I realized why it was so high while writing this article… I'll reduce it soon):

IOPro out of the box uses 3 times more, which would obviously be completely undesirable:

What about R? It may not be fast, but it uses the least memory again:

You might be wondering why IOPro uses so much memory. The problem is fixed-width string types:
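The structured array comes back with dtypes like S38 and S76 for the string fields. A quick toy illustration of what a fixed-width string dtype costs:

    import numpy as np

    # Every entry in an S76 field occupies 76 bytes, no matter how short
    # the actual string is (the values here are made up)
    occupations = np.array(['ENGINEER', 'RETIRED', 'N/A'], dtype='S76')
    print(occupations.dtype.itemsize)   # 76
    print(occupations.nbytes)           # 3 * 76 = 228 bytes for three short strings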

Oof. Dtypes of S38 or S76 mean that every entry in those fields occupies 38 or 76 bytes. This is not good, so let's set a bunch of these fields to use Python objects, like pandas does:

    import iopro

    # Treat selected string columns as Python objects rather than
    # fixed-width NumPy strings
    adap = iopro.text_adapter('P00000001-ALL.csv')
    adap.set_field_types({2: object, 3: object, 4: object,
                          7: object, 8: object, 13: object})
    arr = adap[:]

Here's the Massif peak usage, which is reasonably in line with pandas:

Conclusions

I'm very happy to finally see this project through to completion. Python users have been suffering for years from parsers that 1) have few features, 2) are slow, and 3) use a lot of memory. In pandas I focused first on features, then on speed, and now on both speed and memory. I'm very pleased with how it turned out, and I'm excited to see the code hopefully pushed upstream into NumPy when I can get some help with the integration and plumbing (and parameter hell).

It will be a month or so before this code appears in a new release of pandas (we are about to release version 0.9.0), as I want to let folks on the bleeding edge find any bugs before releasing it to the masses.

Future work and extensions

Several things could (and should) be added to the parser with comparatively little effort:

  • Integrate a regular expression engine to tokenize lines with multi-character or regular-expression delimiters.
  • Code up the fixed-width-field version of the tokenizer.
  • Add on-the-fly decompression of gzipped files.

Code used for performance and memory benchmarks

R code (I just copy-pasted the output I got from each command). Version 2.14.0

Requirements for EuroSciPy 2012 pandas tutorial

Much belatedly, here are the requirements for the EuroSciPy tutorial on pandas in Brussels tomorrow. They are the same as for the SciPy 2012 (Austin) tutorial:

  • NumPy 1.6.1 or higher
  • matplotlib 1.0 or higher
  • IPython 0.13 or higher, plus the HTML notebook dependencies
  • pandas 0.8.1 (or, better, the GitHub master revision) and its dependencies (dateutil, pytz)

One of the easiest ways to get started from scratch is with EPDFree, then installing pandas 0.8.1 or higher. If you're using 0.8.1, we may run into a few minor bugs that I have fixed since the last release.

Finally, GitHub listens! The importance of attention to detail in UX

I like using GitHub, as do apparently a lot of other people. When doing pull requests, I've been annoyed for a long time by the amount of clicking necessary to get the git URL to the contributor's repository (which must be added as a remote, etc.):

  • Click on the new contributor
  • Locate their pandas fork in their list of repositories
  • Copy the git link from their fork, git remote add, and I'm in business

But GitHub is about using git, so why should I have to go fishing for the git link? Earlier, the Pull Request UI had a help box containing the git URL (which I used), but then they took it away! So fishing it was. This might seem very minor, but after adding many dozens of remotes like that, the annoyance had accumulated. It seemed obvious to me that the branch names should just be links:

Each time GitHub tweaked the UI, I complained on Twitter, like so:


    “Fffffff @ changes the pull request UI again but the branch names still are not links to the repository?” (Wes McKinney, @wesmckinn)

Finally, someone noticed:


    “@ Great idea. Links added as of a few minutes ago.” (Cameron McEfee, @cameronmcefee)

Small victories, I guess. Enjoy.

Latest Table of Contents for Python for Data Analysis

Making some progress on Python for Data Analysis