Why Are We Still Using XML?

As you may or may not know, my day job is at an online advertising company. The specific project I work on involves consuming XML-formatted search feeds. We currently parse around 8,000 of these XML feeds per SECOND.

We spent months investigating the fastest way to parse XML, and eventually settled on a C library that walks the XML like a tree rather than loading the whole DOM into memory the way most conventional XML parsers do.

That all being said, XML is slowly dying across the internet. Most APIs now focus on JSON interfaces instead of XML, but for some reason our industry is just not willing to make the switch.

JSON is a much better structure than XML. It natively supports different datatypes like integers, strings, and booleans. For example:

JSON.parse('{"is_true": true}')["is_true"] == true

That returns true, whereas "true" == true does not, and the string "true" is all you can get from an XML document, since everything in XML is a string.
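To make that concrete, here is a small sketch of the same value round-tripped through both formats. It uses Ruby's stdlib REXML rather than Nokogiri just to keep the example dependency-free:

```ruby
require 'json'
require 'rexml/document' # stdlib XML parser, so no gems are needed

json_value = JSON.parse('{"is_true": true}')["is_true"]

doc = REXML::Document.new('<root><is_true>true</is_true></root>')
xml_value = doc.root.elements['is_true'].text

json_value == true  # => true  (JSON preserved the boolean)
xml_value == true   # => false (XML handed back the string "true")
```

With XML you have to layer your own type-coercion conventions on top; with JSON the types arrive already parsed.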

However, the big difference is that JSON is MUCH easier to parse, since it's much better defined. I have put together a quick benchmark to show what I mean:

require 'nokogiri'
require 'json'
require 'benchmark'

small_xml_file = open("datafiles/small.xml").read.gsub(/\n/, "").squeeze(" ")
large_xml_file = open("datafiles/large.xml").read.gsub(/\n/, "").squeeze(" ")

small_json_file = open("datafiles/small.json").read.gsub(/\n/, "").squeeze(" ")
large_json_file = open("datafiles/large.json").read.gsub(/\n/, "").squeeze(" ")

def regex_xml(file)
  doc = /\<results\>(.*)\<\/results\>/.match(file)[1]
  doc = doc.split(/\<result\>(.*?)\<\/result\>/)
  doc.map do |result|
    next if result.strip == ""
    result = result.split(/\<.*?\>/)
    {
      :title => result[1],
      :description => result[3],
      :url => result[5]
    }
  end.compact
end

def xpath_xml(file)
  doc = Nokogiri::XML::Document.parse file

  doc.xpath('//results/result').map do |node|
    {
      :title => node.xpath('title').text,
      :description => node.xpath('description').text,
      :url => node.xpath('url').text
    }
  end
end

def parse_json(file)
  JSON.parse(file)
end

n = 100000

Benchmark.bmbm do |x|
  x.report("large json") { n.times { parse_json(large_json_file) } }
  x.report("small json") { n.times { parse_json(small_json_file) } }
  x.report("large xml xpath") { n.times { xpath_xml(large_xml_file) } }
  x.report("small xml xpath") { n.times { xpath_xml(small_xml_file) } }
  x.report("large xml regex") { n.times { regex_xml(large_xml_file) } }
  x.report("small xml regex") { n.times { regex_xml(small_xml_file) } }
end

puts
puts
puts "JSON Large File Size: #{large_json_file.size}"
puts "JSON Small File Size: #{small_json_file.size}"
puts "XML Large File Size: #{large_xml_file.size}"
puts "XML Small File Size: #{small_xml_file.size}"

Now this is a pretty quick and dirty benchmark, but I tried to show two things. One, I parse the XML using Nokogiri, one of the most popular XML parsers for Ruby. Two, I also parse it using regexes with splits.

I could probably optimize the regex/splitting code quite a bit, but really, it's not even worth it. Even if I optimized the crap out of it, I don't think I would get performance much better than the JSON parser, and on top of that it is complex, confusing, and ugly.

Then I simply use the native JSON parser to parse a JSON file.

I also test with two files of each format: one with 20 entries, the other with 2 entries.
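The data files themselves aren't shown here, but from the fields the parsers pull out (title, description, url) they presumably look something like this hypothetical one-entry pair:

```ruby
require 'json'

# Hypothetical reconstruction of the data files; the real datafiles/*.xml
# and *.json from the benchmark repo aren't reproduced in this post.
xml = '<results><result>' \
      '<title>Example</title>' \
      '<description>An example result</description>' \
      '<url>http://example.com/</url>' \
      '</result></results>'

json = '[{"title":"Example","description":"An example result","url":"http://example.com/"}]'

# JSON.parse hands back a ready-to-use Array of Hashes, no tree-walking needed:
JSON.parse(json).first["title"]  # => "Example"
```

Note that the XML version carries every field name twice (open and close tags), which is where most of its size overhead comes from.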

Here are my benchmark results for 100,000 parses.

Rehearsal ---------------------------------------------------
large json       12.100000   0.000000  12.100000 ( 12.182863)
small json        1.640000   0.000000   1.640000 (  1.649213)
large xml xpath 225.450000   0.170000 225.620000 (225.957241)
small xml xpath  28.250000   0.020000  28.270000 ( 28.307895)
large xml regex  37.880000   0.000000  37.880000 ( 37.936387)
small xml regex   4.230000   0.000000   4.230000 (  4.235283)
---------------------------------------- total: 309.740000sec

                      user     system      total        real
large json       12.160000   0.000000  12.160000 ( 12.176090)
small json        1.710000   0.000000   1.710000 (  1.714853)
large xml xpath 223.850000   0.170000 224.020000 (224.371979)
small xml xpath  28.300000   0.040000  28.340000 ( 28.382887)
large xml regex  37.880000   0.000000  37.880000 ( 37.924334)
small xml regex   4.160000   0.000000   4.160000 (  4.163718)


JSON Large File Size: 1956
JSON Small File Size: 210
XML Large File Size: 2597
XML Small File Size: 311

Now my conclusions are these:

  • XML XPath processing with Nokogiri is slow.
  • I have to walk over the XML and pull out the bits I want, which is extra work.
  • Parsing XML with regexes and/or splits is ugly, and still slower.
  • XML files are generally larger than JSON files, simply because of the syntax.
  • Parsing JSON immediately gets me a data structure I can deal with natively in Ruby (and most other languages).

In my case, the XML files are roughly a third larger (2597 vs 1956 bytes), and that's a direct increase in bandwidth cost. XML parsing takes nearly 20x longer (about 224 seconds vs 12), and that's a lot of CPU resources used up.

100,000 parses is roughly 12.5 seconds of our current traffic. This benchmark shows that it would take 200+ seconds to handle that many parses in Ruby using XPath. Using regex/splits, I would almost be able to keep up, but still not quite. JSON.parse would be able to handle it, with a little time to spare. Obviously our production environment runs on faster servers, and more of them (and also runs under Erlang with a C library, not Ruby), but we still have significant overhead caused simply by parsing XML.
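As a sanity check on those claims, here is the arithmetic using the "real" column from the large-file rows of the benchmark above:

```ruby
parses    = 100_000
feed_rate = 8_000.0   # production feeds parsed per second (from the intro)

json_time  = 12.18    # real seconds for 100,000 large-file JSON parses
xpath_time = 224.37   # real seconds for 100,000 large-file XPath parses
regex_time = 37.92    # real seconds for 100,000 large-file regex parses

puts parses / feed_rate          # 12.5 -> seconds of production traffic
puts (parses / json_time).round  # 8210 -> JSON keeps up, barely
puts (parses / regex_time).round # 2637 -> regex falls well short of 8,000/sec
puts (parses / xpath_time).round # 446  -> XPath is nowhere close
```

On this hardware only JSON.parse clears the 8,000-parses-per-second bar, and only just.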

Moral of the story: even if you have much less API load, save yourself some CPU cycles and use JSON instead!

You can download the entire benchmark from https://github.com/tecnobrat/xml-vs-json
