As you may or may not know, my day job is working for an online advertising company. The specific project I work on involves consuming XML-formatted search feeds. We currently parse around 8,000 of these XML feeds per SECOND.
We spent months investigating the fastest way to parse XML. We eventually settled on a C library that walks the XML like a tree; it doesn't load the XML DOM into memory like most conventional XML parsers do.
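For a sense of what that streaming, no-DOM style of parsing looks like (as opposed to DOM parsing), here is a rough Ruby sketch using Nokogiri's SAX mode. This is purely an illustration, not our production code, and the handler and sample feed are made up:

```ruby
require 'nokogiri'

# SAX-style parsing: the parser streams over the document and fires
# callbacks, so a full DOM tree is never built in memory.
class TitleCollector < Nokogiri::XML::SAX::Document
  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  def start_element(name, attrs = [])
    @in_title = (name == 'title')
  end

  def characters(text)
    @titles << text if @in_title
  end

  def end_element(name)
    @in_title = false
  end
end

handler = TitleCollector.new
Nokogiri::XML::SAX::Parser.new(handler).parse('<feed><title>Hello</title></feed>')
handler.titles # => ["Hello"]
```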
That all being said, XML is slowly dying across the internet. Most APIs are focusing on JSON interfaces instead of XML, but for some reason our industry just isn't willing to make the switch.
JSON is a much better structure than XML. It natively supports different data types like integers, strings and booleans. For example, in Ruby:

```ruby
# JSON booleans parse straight into native Ruby booleans
JSON.parse('{"foo": true}')["foo"] == true
```

That returns `true`, whereas something like `"true" == true` does not, and the string `"true"` is exactly what you get from an XML document, since everything in XML is a string.
However, the big difference is that JSON is MUCH easier to parse, since it's much better defined. I have created a quick benchmark to show what I mean:
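Condensed, the shape of it is something like this. This is only a sketch, with made-up sample data standing in for the real feed files; the full script is in the repo linked at the end of this post:

```ruby
require 'benchmark'
require 'nokogiri'
require 'json'

# Made-up sample documents; the real benchmark reads XML and JSON files
# with 2 and 20 entries each.
xml  = '<results><result><title>Hello</title><url>http://example.com</url></result></results>'
json = '{"results":[{"title":"Hello","url":"http://example.com"}]}'

ITERATIONS = 100_000

Benchmark.bm(15) do |bm|
  # 1. XPath parsing with nokogiri
  bm.report('nokogiri xpath') do
    ITERATIONS.times do
      doc = Nokogiri::XML(xml)
      doc.xpath('//result').map do |node|
        { 'title' => node.xpath('title').text, 'url' => node.xpath('url').text }
      end
    end
  end

  # 2. Quick-and-dirty regex/split parsing
  bm.report('regex/split') do
    ITERATIONS.times do
      xml.scan(%r{<result>(.*?)</result>}m).map do |(body)|
        { 'title' => body[%r{<title>(.*?)</title>}m, 1],
          'url'   => body[%r{<url>(.*?)</url>}m, 1] }
      end
    end
  end

  # 3. Native JSON parsing
  bm.report('json') do
    ITERATIONS.times { JSON.parse(json)['results'] }
  end
end
```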
Now this is a pretty quick and dirty benchmark, but I tried to show two things. One, I parse the XML using nokogiri, one of the most popular XML parsers for Ruby. Two, I also parse it using regexes with splits.
I could probably optimize the regex/splitting code quite a bit, but really, it's not even worth it. Even if I optimized the crap out of it, I don't think I would get performance much better than the JSON parser's, and on top of that the code is complex, confusing and ugly.
Then I simply use the native JSON parser to parse a JSON file.
I also test with two files: one with 20 entries and one with 2 entries.
Here are my benchmark results for 100,000 parses.
*(raw benchmark output not reproduced here; run the benchmark from the repo linked at the end to see the numbers)*
Now, my conclusions are these:
- XML XPath processing with nokogiri is slow.
- I have to walk over the XML and pull out the bits I want, which is extra work.
- Parsing XML with regexes and/or splits is ugly, and still slower.
- XML files are larger in general than JSON files, simply because of the syntax.
- Parsing JSON immediately gets me a data structure I can deal with natively in Ruby (and most other languages), as the snippet below shows.
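To make that last point concrete, here is a minimal comparison with made-up sample data: JSON.parse hands back native hashes, integers and booleans right away, while nokogiri hands back nodes whose text still has to be cast by hand.

```ruby
require 'json'
require 'nokogiri'

# JSON.parse returns native Ruby types straight away...
data = JSON.parse('{"clicks": 42, "sponsored": true}')
data['clicks'] + 1  # => 43, already an Integer
data['sponsored']   # => true, a real boolean

# ...while XML hands back text that still needs walking and casting.
doc = Nokogiri::XML('<result><clicks>42</clicks><sponsored>true</sponsored></result>')
doc.xpath('//clicks').text.to_i  # => 42, only after an explicit cast
doc.xpath('//sponsored').text    # => "true", just a String
```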
In my case, the XML files are 20% larger, which means a 20% increase in bandwidth cost. XML parsing takes 200x longer, which is a lot of CPU resources used up.
100,000 parses is roughly 12.5 seconds of our current traffic. This benchmark shows that it would take 200+ seconds to handle that many parses in Ruby using XPaths. Using regex/splits, I would almost be able to keep up, but still not quite. JSON.parse would be able to handle it, with some time to spare. Obviously our production environment runs on faster servers, and more of them (and also runs under Erlang with a C library, not Ruby), but we still have significant overhead caused simply by parsing XML.
Moral of the story: even if you have much less API load, save yourself some CPU cycles and use JSON instead!
You can download the entire benchmark from https://github.com/tecnobrat/xml-vs-json