Don’t sell on Amazon

Selling your stuff on Amazon is a losing game, and I don’t recommend that you do it.

Let me tell you what happened to me recently.

I wasn’t using my iPad so I decided to sell it on Amazon.

Boom, sold. February 10 it sells, I ship with Shyp same day.

23 days later, the buyer requests to return it to me with the comment: “Perfectly described and in great condition. Was shipped fast. But had to return since I received another one from work.”

After a couple of days we agree to a 10% restocking fee. Then I get the following note from the buyer:

Thanks for authorizing. That’s good news. However, I totally messed up the return. I tried Shyp since that’s how you sent it and Shyp seemed like a great service. So I clicked on the Online Purchase Return on Shyp, entered the order number for Amazon, and then a Shyp Hero came to pick up the iPad Mini 2. Off it went.

After a while, I realized that they probably don’t know to send it to YOU. So I checked with them and lo behold, they sent it to the Amazon Return Center in Las Vegas. I contacted Amazon about it, and Amazon said that they understand and they’re going to return the item back to me, the sender, but it will take 4 to 6 weeks and I’ll have to contact you to see if this is okay.

Sorry I thought this would be easier. Please let me know if this is okay with you.



Frustrated, and knowing I have zero leverage as a seller on Amazon, I consent and state that I’ll refund the buyer as soon as I receive the iPad back a couple months later.

Lo and behold, 2 weeks later, the buyer files an A-to-z Guarantee claim demanding a refund, and I receive this notice:

We have received a claim under the A-to-z Guarantee program for the order [ORDER NUMBER REDACTED] because the buyer in this transaction believes the item(s) sent were materially different from how they were listed. This could mean that the item(s) are defective or damaged, that the wrong item(s) were sent, or that the item condition and/or details are different from what was represented in the product listing

The buyer’s comments were “the customer requested a return mailing label and the seller approved the request and the customer sent the ipad to our returns center in Las Vegas and it was received on 3/11. The customer called and was told that we would send the ipad back to him in 3-5 days and he has not received the ipad back from us. He shipped the ipad with a service called Shyp which is affiliated with Fed Ex and the tracking number is [TRACKING NUMBER REDACTED].
the customer will be moving in the next week and would like to confirm that the ipad is sent to the correct address:”

I wrote back, “Yes, I will refund as soon as I receive the iPad back.” A week passes and Amazon goes ahead and issues a refund out of my pocket.

Needless to say, this is insanity, and nothing can stop unscrupulous Amazon Marketplace buyers from exploiting you as a seller and conning you out of money they pay you. In this case, I doubt I will ever receive the iPad back, so Amazon/the buyer effectively just stole it from me.

Thank you, Amazon. Our business together on the Amazon Marketplace is complete, and I encourage all readers not to sell their stuff on Amazon either, if they actually want to get paid for it.

What’s changed

I turned 30 on Friday.

I threw a party. Nearly all my favorite people in the Bay Area (with some exceptions, travel and life and so forth) came to wish me well and be merry. It was wonderful.

The decade I just completed has been quite the personal journey, with growth and change in my perspective on life. I’ve been incredibly lucky in more ways than I can count. And it’s probably true that, as Thomas Jefferson said, “I am a great believer in luck, and I find the harder I work, the more I have of it.”

And I have worked hard. That MIT thing. Three years at the financial data mines during the financial crisis. Countless lines of code. Writing a book. Dozens of public speaking engagements. Two startups (one failed, one moderately successful). Raising venture capital. Lawyers, paperwork, legal settlements, negotiations, deals, and so forth. Hundreds of plane flights. I’ve lived in 10 distinct addresses since college graduation.

I went after the volatile life path, and I sure as hell found it. On May 15, 2011, I had just finished the first year of a statistics PhD at Duke; less than 3 weeks later I had surreptitiously upended my life, found an apartment in the East Village, and was back in New York, hacking away on pandas and figuring out what to do with myself for the next days, months, and years.

I try to live a life without regrets. If I feel I might regret not doing something, I usually do it, professionally or personally or otherwise. I like to be available to serendipity and spontaneity whenever possible. So, despite all the stress, turmoil, heartache, lost sleep, and so forth, I regret very little of it. In fact, I feel like I needed to put myself through these trials to learn about myself and the role of work and other vocational activities in my own happiness and long-term life goals.

Having lived in SF and New York for 7 years collectively, it’s been interesting being surrounded by a money- and achievement-oriented culture and seeing how those values resonate with my own experience. In New York, you’re walking around and the city’s screaming at you: “you’re not making enough money! Oh, and that person next to you makes more than you do!” In SF, the city says: “you’re not changing the world enough! And you’re not cool enough either! Oh, and you need a couple of successful exits to not suck at life!” Add to that the fact that, if you stick around these places long enough with a decent memory, you accumulate all the stories of “dummies” around you making way too much money and living way too charmed and glamorous lives. It’s easy to get bitter or cynical and let these stories drag you down.

Reminds me of a story:

At a party given by a billionaire on Shelter Island, the late Kurt Vonnegut informs his pal, the author Joseph Heller, that their host, a hedge fund manager, had made more money in a single day than Heller had earned from his wildly popular novel Catch-22 over its whole history. Heller responds, “Yes, but I have something he will never have . . . Enough.”

Perspective on these things is key if being happy matters to you. It’s easy to get caught up in that and forget about what’s important.

When I just got out of college, and certainly continuing for many years after that, I was burning with passion to make my mark on the world. Other people’s naysaying and casting doubts on my efforts only fueled me further. And so I worked, and worked, and worked. Often at the expense of my own health, personal relationships, and emotional well-being.

I can hardly claim that it turned out badly. I learned a lot, mastered some skills, made some objectively awesome stuff, and met a lot of wonderful people that I probably never would have met otherwise. I wouldn’t trade those things for anything. But the more surprising thing was how I would feel, at least now in retrospect, about my professional accomplishments.

I might have thought that achieving things professionally, or making money (as was the cultural value where I lived), would be more emotionally rewarding or a more significant source of happiness. That turned out to not be true at all, at least for me. And it worries me sometimes about the people in my life who are still running full-speed on their own respective hamster wheels, in hopes that “the future” built by their present and near-term efforts will bring them sustainable happiness and the kind of life they want. As they say, how you spend your time is how you spend your life.

Sometimes people (ahem, like me) overload themselves so much that it’s not even possible to be making decisions that will necessarily lead to long-term happiness in the future.

I haven’t exactly figured it all out, and can’t claim to. To cope with my own stresses, responsibilities, and so forth, several years ago I started making an active effort to understand my own mind and emotional processes and start working through feelings and experiences closer to when they happen rather than keeping them bottled up inside (like the fine Scots-Irish stock I come from). I gave up some of my 20-year-old stoicism and got a lot more touchy-feely, and let the tears flow when they need to. A fair trade, I think.

Anyone who knows me well knows that hard work and making an impact on the world are an integral part of who I am, and it’s unlikely that I’ll deviate too much from that path. I have learned, however, that love and friendship, and being available, emotionally and otherwise, to the kindred spirits in one’s life, are the most reliable and long-lasting sources of happiness that I know of. We all end up in the ground eventually, and at the end of the day, for me at least, the lives of the living we touch matter more than the paper trail of intellectual accomplishments we leave behind us. One of my favorite authors said it best:

“A purpose of human life, no matter who is controlling it, is to love whoever is around to be loved.”

Kurt Vonnegut, The Sirens of Titan

It’s good that I learned this relatively early in life; the perspective is useful in making decisions about how to spend our time, one of the things there’s no way to get more of. As Lao Tzu wrote, “Time is a created thing. To say ‘I don’t have time,’ is like saying, ‘I don’t want to.’”

Anyway, here’s to the next decade. I reckon it will be a good one.

Thoughts on joining Cloudera

After some unanticipated media leaks (here and here), I was very excited to finally share that my team and I are joining Cloudera. You can find out all the concrete details in those articles, but I wanted to give a bit more intimate perspective on the move and what we see in the future inside Cloudera Engineering.

Chang She and I conceived DataPad in 2012 while we were building out pandas and helping the PyData ecosystem get itself off the ground. I was writing a book and every 6 weeks or so we were cranking out another pandas release and watching the analytics ecosystem evolve. We saw a clear need for a next-generation business intelligence / visual analytics product, and set about getting the resources to build it. Many BI products in the ecosystem are designed to be a visualization and reporting layer for a database of some kind, and typically interact with the data store via SQL. 10 years ago, this was perfectly adequate for most users, but now in the 2010s, more and more businesses are grappling with business data spread across many different cloud silos, and the amount of planning and ETL work needed to make an existing BI system “fit” is frequently prohibitive.

There are droves of data visualization and reporting companies building “new BI” for the web / cloud, but in many cases using some variant of the “it runs SQL queries against your database” architecture. We made an early decision to design a new architecture to enable the BI process to be more agile and iterative, a sort of “bring all your data, and figure things out as you go” approach. In addition to our visual web interface (you can think of it as a sort of “Google Docs for Visual Analytics”), this resulted in our building substantial novel backend infrastructure for data management and analytics. We also decided early that we wanted to push the limits of speed and interactivity of working with larger data sets. The system we built, code-named Badger, delivered an interactive analytics experience that exceeded our own expectations in performance and truly delighted our users.

I’ve been following Cloudera’s engineering efforts with great interest over the last few years, especially after the launch of Impala in 2012, and as I predicted they are continuing to lead the way in high performance analytics in the Hadoop ecosystem. As Cloudera had been a long-time supporter and fan of our work on pandas and DataPad, it became clear in our periodic catch-ups that we had been thinking about and tackling similar backend and distributed systems problems.

As a major inflection point in DataPad’s lifespan approached (forthcoming GA, paying customers, more venture capital), we took a hard look at the pressing technology problems and where we could make the most impact. As big a deal as collaboration and beautiful, functional design for visual analytics are, it was clear to us that what’s holding back next-generation BI/visual analytics is much more on the data management and systems side. So the question came down to: continue building a standalone data product and duke it out in the marketplace, or use our systems expertise to accelerate the rising tide that will lift all ships (i.e. benefit all BI / analytics vendors). In the latter case, Cloudera is clearly the place to be.

Systems engineering problems aside, we have done a lot of work on improving developer tooling for data work, and productivity and user happiness with these new technologies continue to be a major interest for us. The Cloudera team recognizes the importance of better tooling and has already made substantial open source investments in this area (Crunch, Oryx, and Impyla, to name a few). On this front, we’re really looking forward to working together; there is a lot of work to be done to enable developers and data scientists to get value out of their data faster and more intuitively.

Strata NYC 2013 and PyData 2013 Talks

I was excited to be able to talk at two recent data-centric conferences in New York. They touch on some related subjects; the PyData talk is a lot more technical, having to do with low-level architecture in pandas and the engineering work I’ve been doing this year at DataPad.

Before anyone yells at me: I’m going to revisit the PostgreSQL benchmarks in my PyData talk at some point, as the performance would be a lot better with the data stored in fully-normalized form (a single fact table with all primitive data types, plus auxiliary tables for all of the categorical/string data; laborious, but the way a DBA would set it up, rather than a single table with VARCHAR columns).

PyCon Singapore 2013

I was graciously invited to give the keynote presentation at this year’s PyCon Singapore. Luckily, I love to hack on long plane rides. See the slides from the talk below. I showed some analytics on Python posts on Stack Overflow during the talk; here is the IPython notebook. The raw data is right here.

I also gave a half-day pandas tutorial; here is the IPython notebook. You will need the data to do it yourself; here’s a download link.

I’m moving to San Francisco. And hiring

PyCon and PyData 2013 were a blast this last week. Several people noted that my GitHub activity on pandas hasn’t quite been the same lately and wondered if I was getting a little burned out. I’m happy to say quite the opposite; I’ll still be involved with pandas development (though not the 80 hours/week of the last 2 years), but I’m starting an ambitious new data project that I’m looking forward to sharing later this year. This endeavor is also taking me from New York to San Francisco. I’m sad to be leaving the (vibrant and growing) New York data community, but also looking forward to spending more time with my data hacking friends in the Bay Area.

While I can’t share too many details about the new startup, anyone who knows me knows my passion for innovation in data tooling and making people and organizations more productive in their data analysis. pandas was Act 1! So, I’m building a world class team of engineers, designers, and forward thinkers to join me in this effort. Are you one of them? If so I look forward to hearing from you (my e-mail address can be easily located on GitHub):

Front-end and Data Visualization Engineer

You will be building a richly featured web application that will stretch the capabilities of modern web browsers. You will build core UI components and work with the UX designer and backend team to make everything work seamlessly.

  • Extensive Javascript and CSS experience.
  • Experience with one or more SVG or Canvas-based visualization toolkits, e.g. D3.js, or a keen interest in learning one. Extra points if you have opinions about the Grammar of Graphics (or its implementations, like ggplot2).
  • Know the ins and outs of websockets and AJAX communications with various backend data services.
  • Understand data binding and have used MV* frameworks enough to be dangerous.
  • Prior data analysis experience (even at the Excel level) very useful.

Data Engineer

You are a pragmatic, performance-motivated cruncher of bytes who knows what it means to ship code on tight deadlines. You have high standards but are willing to make tradeoffs to get stuff done. You and I will spend a lot of time at the whiteboard talking about the nuts and bolts of data processing. Some of these things may describe you:

  • Experience building performance and latency-sensitive, data-driven analytical applications.
  • Knowledge of standard data structures and algorithms for data processing, their implementation details, and performance tradeoffs (hash tables, vectors, binary trees, sorting algorithms, etc.). Maybe you have already enjoyed reading my blog.
  • Knowledge of binary data formats, serialization schemes, compression, and other IO performance considerations. Familiar with a variety of database technology.
  • Python and C/C++ experience preferred. Extra points if you have programmed in an APL dialect (J or K/Q/Kona) or solved 100 or more problems on Project Euler.
  • You are a firm believer in unit testing and continuous integration.
  • Experience building distributed data systems and analytics tools such as Spark, Crunch, or Pig is a plus. Maybe you loved the Google Dremel white paper.
  • Experience with code generation (e.g. LLVM) or compiler technology a big plus.

UX Designer

You will be responsible for web design and crafting a compelling user experience. You will work intimately with the front-end / data visualization team to make a consistent look and feel throughout the product. We would prefer a designer who can also do her/his own JS/CSS implementation work if necessary. Any experience with data analysis tools, from Excel to Spotfire to Matlab (with opinions about what makes each of them easy–or terrible–to use), would be strongly preferred.

Full Stack Web Engineer

You will play a critical role in building a scalable, reliable web application, and generally keeping the trains running on time. You should be a jack of many trades with an interest in learning many more. Here are some desirable qualities:

  • Extensive experience using Python, Javascript (Node.js), etc. to build scalable, data-intensive web applications, and strong opinions about the right technology to use.
  • Comfort with managing continuous (or highly frequent) deployments.
  • Experience with using and managing SQL (e.g. Postgres) and NoSQL (e.g. MongoDB) databases.
  • Experience building applications on EC2 or other cloud computing services.

Product Lead

You will run product at the company, working to gain a deep understanding of customer use cases and working with the engineering team to drive product-market fit. Prior experience in the data analytics or business intelligence space would be very helpful.

Whirlwind tour of pandas in 10 minutes

10-minute tour of pandas from Wes McKinney on Vimeo.

Update on upcoming pandas v0.10, new file parser, other performance wins

We’re hard at work as usual getting the next major pandas release out. I hope you’re as excited as I am! An interesting problem came up recently with the ever-popular FEC disclosure database used in my book and in many pandas demos. The powers that be decided it would be cool to put commas at the end of each line, fooling most CSV readers into thinking there are empty fields at the end of each line:

pandas’s file parsers will, by default, treat the first column as the DataFrame’s row names if the data rows have one too many columns, which is very useful in a lot of cases. Not so much here. So I made it possible to pass index_col=False, which results in the last column being dropped as desired. The FEC data file is now about 900MB and takes only 20 seconds to load on my spinning-rust box:
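Here is a minimal sketch of that behavior, using an inline string in place of the FEC file (the trailing comma on each data row is the malformation in question):

```python
import io

import pandas as pd

# Each data row ends with a stray comma, so a naive reader sees one more
# field than the header declares.
data = "a,b,c\n1,2,3,\n4,5,6,\n"

# By default pandas would treat the first column as the row index here;
# index_col=False suppresses that and drops the spurious trailing field.
df = pd.read_csv(io.StringIO(data), index_col=False)
print(df.shape)  # (2, 3)
```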

For reference, it’s more difficult to load this file in R (2.15.2), both because of its size and its malformedness (hopefully an R guru can tell me how to deal with this trailing delimiter crap). Setting row.names=NULL causes incorrect column labelling but at least gives us a parsing + type inference performance number (about 10x slower; faster if you specify all 18 column data types):

If you know much about this data set, you know most of these columns are not interesting to analyze. New in pandas v0.10 you can specify a subset of columns right in read_csv which results in both much faster parsing time and lower memory usage (since we’re throwing away the data from the other columns after tokenizing the file):
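As a sketch of the column-subset feature (the column names below are modeled loosely on the FEC file, purely for illustration):

```python
import io

import pandas as pd

data = (
    "cand_nm,contb_receipt_amt,contbr_city,contbr_st\n"
    "Obama,250.0,Boston,MA\n"
    "Romney,500.0,Chicago,IL\n"
)

# Only the requested columns survive past tokenization, which reduces both
# parse time and peak memory usage.
df = pd.read_csv(io.StringIO(data), usecols=["cand_nm", "contb_receipt_amt"])
print(list(df.columns))  # ['cand_nm', 'contb_receipt_amt']
```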

Outside of file reading, a huge amount of work has been done elsewhere on pandas (aided by Chang She, Yoval P, Jeff Reback, and others). Performance has improved in many critical operations outside of parsing too (check out the groupby numbers!). Here’s the output of a recent vbench run showing the latest dev version versus version 0.9.0 (numbers less than 1 indicate that the current pandas version is faster on average by that ratio):

A new high performance, memory-efficient file parser engine for pandas

TL;DR I’ve finally gotten around to building the high performance parser engine that pandas deserves. It hasn’t been released yet (it’s in a branch on GitHub) but will be after I give it a month or so for any remaining buglets to shake out:

A project I’ve put off for a long time is building a high performance, memory efficient file parser for pandas. The existing code up through and including the imminent pandas 0.9.0 release has always been makeshift; the development focus has been on parser features over the more tedious (but actually much more straightforward) issue of creating a fast C table tokenizer. It’s been on the pandas roadmap for a long time:

pandas.read_csv from pandas 0.5.0 onward is actually very fast (faster than R and much faster than numpy.loadtxt), but it uses a lot of memory. I wrote about some of the implementation issues about a year ago here. The key problem with the existing code is this: all of the existing parsing solutions in pandas, as well as in NumPy, first read the file data into pure Python data structures: a list of tuples or a list of lists. If you have a very large file, a list of 1 million or 10 million Python tuples has an extraordinary memory footprint, significantly greater than the size of the file on disk (5x or more is common, far too much). Some people have pointed out the large memory usage without correctly explaining why, but this is the one and only reason: too many intermediate Python data structures.
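A back-of-the-envelope illustration of that overhead, comparing the Python-object footprint of parsed rows against the raw bytes they came from:

```python
import sys

# 100,000 parsed "rows", each a tuple of 10 short strings, standing in for
# the intermediate structure a naive parser builds.
rows = [tuple(str(i + j) for j in range(10)) for i in range(100_000)]

# Approximate size of the data as it would appear in a CSV file.
raw_bytes = sum(len(",".join(r)) + 1 for r in rows)

# Footprint of the list, the tuples, and every string object (duplicates
# counted per reference; a rough estimate, not an exact accounting).
py_bytes = sys.getsizeof(rows) + sum(
    sys.getsizeof(r) + sum(sys.getsizeof(f) for f in r) for r in rows
)

print(py_bytes / raw_bytes)  # several times the raw file size
```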

Building a good parser engine isn’t exactly rocket science; we’re talking optimizing the implementation of dirt simple O(n) algorithms here. The task is divided into several key pieces:

  • File tokenization: read bytes from the file, identify where fields begin and end and which column each belongs to. Python’s csv module is an example of a tokenizer. Things like quoting conventions need to be taken into account. Doing this well in C is about picking the right data structures and making the code lean and mean. To be clear: if you design the tokenizer data structure wrong, you’ve lost before you’ve begun.
  • NA value filtering: detect NA (missing) value sentinels and convert them to the appropriate NA representation. Examples of NA sentinels are NA, #N/A, or other bespoke sentinels like -999. Practically speaking, this means keeping a hash set of strings considered NA and checking whether each parsed token is in the set (and you can have different NA sets for each column, too!). If the number of sentinel values is small, you could use an array of C strings instead of a hash set.
  • Tolerating “bad” rows: Can aberrant rows be gracefully ignored with your consent? Is the error message informative?
  • Type inference / conversion: Converting the tokens in the file to the right C types (string, date, floating point, integer, boolean).
  • Skipping rows: Ignore certain rows in file or at end of file.
  • Date parsing / value conversion: Convert one or more columns into timestamps. In some cases concatenate date/time information spread across multiple columns.
  • Handling of “index” columns: Handle row names appropriately, yielding a DataFrame with the expected row index.
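The NA filtering step, for instance, can be sketched in pure Python (a hypothetical helper, nothing like the actual C implementation, just the hash-set idea):

```python
import math

# Default NA sentinels, plus per-call (or per-column) extras.
DEFAULT_NA = {"", "NA", "N/A", "#N/A", "NULL"}

def filter_na(tokens, extra_na=(), na_rep=math.nan):
    """Replace any token found in the NA sentinel set with a missing value."""
    na_set = DEFAULT_NA | set(extra_na)
    return [na_rep if tok in na_set else tok for tok in tokens]

out = filter_na(["1.5", "NA", "-999", "2.0"], extra_na={"-999"})
print(out)  # ['1.5', nan, nan, '2.0']
```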

None of this is that hard; it’s made much more time consuming due to the proliferation of fine-grained options (and the resulting “parameter hell”). Anyway, I finally mustered the energy to hack it out over a few intense days in late August and September. I’m hoping to ship it in a quick pandas 0.10 release (“version point-ten”) toward the end of October if possible. It would be nice to push this code upstream into NumPy to improve loadtxt and genfromtxt’s performance as well.

Benchmarks against R, NumPy, Continuum’s IOPro

Outside of parser features (i.e. “can the tool read my file correctly”), there are two performance areas of interest:

  • CPU speed: how long does it take to parse the file?
  • Memory utilization: what’s the maximum amount of RAM used while the file is being parsed (including the final returned table)? There’s really nothing worse than your computer starting to swap when you try to parse a large file.

I’ll compare the new pandas parser engine against a group of several tools that you can use to do the same job, including R’s parser functions:

  • R’s venerable read.csv and read.table functions
  • numpy.loadtxt: this is a pure Python parser, to be clear.
  • The new pandas engine, via pandas.read_csv and read_table
  • A new commercial library, IOPro, from my good friends at Continuum Analytics.

To do the performance analysis, I’ll look at 5 representative data sets:

  • A 100,000 x 50 CSV matrix of randomly generated 0’s and 1’s.
  • A 1,000,000 x 10 CSV matrix of randomly generated, normally distributed data.
  • The Federal Election Commission (FEC) data set as a CSV file, one of my favorite example data sets for talking about pandas.
  • Wikipedia page count data used for benchmarks in this blog post. It’s delimited by single spaces and has no column header.
  • A large numerical astronomy data set used for benchmarks in this blog post.

Here’s a link to an archive of all the datasets (warning: about 500 megabytes): Table datasets

I don’t have time to compare features (which vary greatly across the tools).

Oh, and my rig:

  • Core i7 950 @ 3.07 GHz
  • 24 GB of RAM (so we won’t get close to swapping)
  • OCZ Vertex 3 SATA 3 SSD

Because I have an SSD, I would expect the benchmarks for spinning rust to differ roughly by a constant amount based on the read times for slurping the bytes off the disk. In my case, the disk reads aren’t a major factor. In corporate environments with NFS servers under heavy load, you would expect similar reads to take a bit longer.

CPU performance benchmarks

numpy.loadtxt is really slow, so I’m excluding it from the benchmarks. On the smallest and simplest file in these benchmarks, it’s more than 10 times slower than the new pandas parser:
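For a rough small-scale rerun of that claim (sizes chosen arbitrarily; absolute timings will vary by machine):

```python
import io
import time

import numpy as np
import pandas as pd

# The same numeric CSV parsed by numpy.loadtxt (pure Python tokenization)
# and pandas.read_csv (C tokenizer).
rows = np.random.randn(20_000, 10)
csv_text = "\n".join(",".join("%.6f" % x for x in row) for row in rows)

t0 = time.time()
arr = np.loadtxt(io.StringIO(csv_text), delimiter=",")
t_loadtxt = time.time() - t0

t0 = time.time()
df = pd.read_csv(io.StringIO(csv_text), header=None)
t_read_csv = time.time() - t0

print(arr.shape, df.shape)     # both (20000, 10)
print(t_loadtxt > t_read_csv)  # loadtxt is typically several times slower
```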

Here are the results for everybody else (see the code at the end of the post):

Here are the raw numbers in seconds:

    In [30]: results
                       iopro    pandas       R
    astro          17.646228  6.955254  37.030
    double-matrix   3.377430  1.279502   6.920
    fec             3.685799  2.306570  18.121
    wikipedia      11.752624  4.369659  42.250
    zero-matrix     0.673885  0.268830   0.616

IOPro vs. the new pandas parser: a closer look

But hey, wait a second. If you are intimately familiar with IOPro and pandas, you will already be saying that I am not making an apples-to-apples comparison. True. Why not?

  • IOPro does not check for and substitute common NA sentinel values (I believe you can give it a list of values to check for; the documentation was a bit hard to work out in this regard).
  • IOPro returns NumPy arrays with a structured dtype. The pandas DataFrame has a slightly different internal format, and strings are boxed as Python objects rather than stored in NumPy string-dtype arrays.

To level the playing field, I’ll disable the NA filtering logic in pandas (passing na_filter=False) and instruct the parser to return a structured array instead of a DataFrame (as_recarray=True). Secondly, let’s look only at the numerical datasets (excluding wikipedia and fec for now) to remove the impact of string datatype handling. Here is the resulting graph (with relative timings):
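In today’s pandas the same leveling can be approximated like this (note: as_recarray was a read_csv option of that era and has since been removed; converting with to_records afterward is the modern stand-in):

```python
import io

import pandas as pd

data = "x,y\n1.0,2.0\n3.0,4.0\n"

# na_filter=False skips the NA sentinel lookup for every parsed token.
df = pd.read_csv(io.StringIO(data), na_filter=False)

# A structured (record) array comparable to what IOPro returns.
rec = df.to_records(index=False)
print(rec.dtype.names)  # ('x', 'y')
```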

It looks like the savings from not passing all the tokens through the NA filter is balanced by the cost of transferring the column arrays into the structured array (which is a raw array of bytes interpreted as a table by NumPy). This could very likely be made faster (more cache-efficient) than it currently is with some effort.

Memory usage benchmarks

Profiling peak memory usage is a tedious process. The canonical tool for the job is Massif from the Valgrind suite. I’m not yet done obsessing over memory allocation and data management inside the parser system, but here’s what the numbers look like compared with R and IOPro. I’m using the following valgrind commands (plus ms_print) to get this output (if this is not correct, please someone tell me):

I’ll use the largest file in this post, the astro numerical dataset.

First, IOPro advertises a very low memory footprint. It does not, however, avoid having 2 copies of the data set in memory (I don’t either; it’s actually very difficult, and costly, to avoid this). Here is the final output of ms_print showing peak memory usage at the very end, when the structured array is created and returned:

Let’s look at R. Peak memory allocation comes in slightly under IOPro, at 903MM bytes vs. 912MM:

In the new pandas parser, I’ll look at 2 things: memory allocation by the parser engine before creation of the final DataFrame (which causes data-doubling as with IOPro), and the user-facing read_csv. First, the profile of using read_csv (which also creates a simple integer Index for the DataFrame) uses 1014MM bytes, about 10% more than either of the above:

Considering only the parser engine (which returns a dict of arrays, i.e. no data doubling) uses only 570MM bytes:

Memory usage with non-numerical data depends on a lot of issues surrounding the handling of string data. Let’s consider the FEC data set, where pandas does pretty well out of the box, using only 415MM bytes at peak (I realized why it was so high while writing this article…will reduce soon):

IOPro out of the box uses 3 times more. This would obviously be completely undesirable:

What about R? It may not be fast, but it uses the least memory again:

You might be wondering why IOPro uses so much memory. The problem is fixed-width string types:

Oof. A dtype of S38 or S76 means that field uses 38 or 76 bytes for every entry. This is not good, so let’s set a bunch of these fields to use Python objects like pandas does:

    adap = iopro.text_adapter('P00000001-ALL.csv')
    adap.set_field_types({2: object, 3: object, 4: object,
                          7: object, 8: object, 13: object})
    arr = adap[:]

Here’s the Massif peak usage, which is reasonably in line with pandas:

I’m very happy to see this project to completion, finally. Python users have been suffering for years from parsers that 1) have few features, 2) are slow, and 3) use a lot of memory. In pandas I focused first on features, then on speed, and now on both speed and memory. I’m very pleased with how it turned out. I’m excited to see the code hopefully pushed upstream into NumPy when I can get some help with the integration and plumbing (and parameter hell).

It will be a month or so before this code appears in a new release of pandas (we are about to release version 0.9.0), as I want to let folks on the bleeding edge find any bugs before releasing it to the masses.

Future work and extensions

Several things could (should) be added to the parser without too much effort, comparatively:

  • Integrate a regular expression engine to tokenize lines with multi-character delimiters or regular expressions.
  • Code up the fixed-width-field version of the tokenizer.
  • Add on-the-fly decompression of GZIP’d files.

Code used for performance and memory benchmarks

R code (just copy-pasted the output I got for each command). Version 2.14.0

Requirements for EuroSciPy 2012 pandas tutorial

Much belatedly, here are the requirements for the EuroSciPy tutorial on pandas in Brussels tomorrow. They are the same as for the SciPy 2012 (Austin) tutorial:

  • NumPy 1.6.1 or higher
  • matplotlib 1.0 or higher
  • IPython 0.13 or higher, plus the HTML notebook dependencies
  • pandas 0.8.1 (or better, the GitHub master revision) and its dependencies (dateutil, pytz)

One of the easiest ways to get started from scratch is with EPDFree, then installing pandas 0.8.1 or higher. If you’re using 0.8.1, we may run into a few minor bugs that I have fixed since the last release.

© 2015 Wes McKinney
