YAML and PHP

30 Jul 2009 19:03

There are 3 main YAML implementations for PHP:

  • Syck (native C library bindings to PHP)
  • Symfony YAML (pure PHP)
  • Spyc (pure PHP again)

This is a comparison of them.

What the hell is YAML

Have you heard about XML or JSON? Like them, YAML is a way to store (read and write) structured data such as arrays (a.k.a. lists), dictionaries (a.k.a. hash maps) and atomic values like strings and numbers. The structures can be nested to describe near-real-life objects, for example:

---
Piotr Gabryjeluk:
  company: Wikidot Inc.
  university: Nicolaus Copernicus University, Toruń, Poland
  lives_in: Toruń, Poland
  hobbies:
  - basketball
  - playing the guitar

Which translates to PHP:

<?php
$data = array('Piotr Gabryjeluk' => array(
  'company' => 'Wikidot Inc.',
  'university' => 'Nicolaus Copernicus University, Toruń, Poland',
  'lives_in' => 'Toruń, Poland',
  'hobbies' => array('basketball', 'playing the guitar')
));

So you see YAML is quite nice even when you need to write it yourself.

YAML has a specification (see http://yaml.org), so once we have a standard YAML parser and a standard YAML dumper, we can send arrays from one machine to another and the result should be the same array that was sent.

PHP

So let's see what the choices are if you want to play with YAML in PHP.

Syck

This is the fastest and the most complete YAML dumper and loader library available. It is a binding to a C library and is available in PECL. It is also available as a regular package in the Ubuntu repository, so you can install it simply with:

aptitude install php5-syck

In some shared hosting environments installing an extension could be a problem, so there you need a pure PHP solution.

Spyc

This was the first PHP YAML implementation I saw. It is both a dumper and a loader and it seemed to work fine, but then I found some bugs that stopped me from using it as the base and only YAML loader and dumper for Wikidot.

It has one really nice property, which helps when you want your users to enter YAML to define things (like we do for forms): it is quite forgiving when it comes to syntax, ignores things that don't fit and still parses the rest.

Unfortunately, as I stated before, the Spyc dumper is buggy, so when you first dump an array and then load it with Spyc, you get something different (for example, multiple new-lines are treated as one). Not good. Also, as a loader it does not understand the full YAML specification (which is quite huge, BTW).

Symfony YAML

This one is pure PHP as well, so you don't need special rights to use it on a PHP-enabled machine.

Its loader does not understand the full YAML specification, so for example you can't load documents dumped by Syck. The dumper is good.

Summary

                                         Syck           Spyc                                   Symfony YAML
type of library                          PHP extension  pure PHP library                       pure PHP library
speed                                    fast           slow                                   slow
loader: YAML support                     full           bad                                    not bad
loader: if YAML is corrupted             exception      tries to do its best to load the rest  exception
dumper: YAML human-readable              more-or-less   yes                                    more-or-less if set properly
dumper: YAML conforms to spec            yes            no                                     yes
loads Syck's dumper output correctly     yes            no                                     no
loads Symfony's dumper output correctly  yes            no                                     yes

Verdict: loader

Syck is the winner in loading YAML. If you cannot use Syck, use Symfony YAML. If you need to parse user input (which should be human readable/writable and similar to YAML), use Spyc.

Actually, this is a nice combination for loading:

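<?php
// load_yaml() is only an illustrative wrapper name; the loader calls are as in the original
function load_yaml($string) {
    try {
        // if syck is available use it
        if (extension_loaded('syck')) {
            return syck_load($string);
        }
        // if not, use the symfony YAML parser
        $yaml = new sfYamlParser();
        return $yaml->parse($string);
    } catch (Exception $e) {
        // if the YAML document is not correct, fall back to the forgiving Spyc loader
        return Spyc::YAMLLoadString($string);
    }
}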

This way, the fastest library is used if possible, then the best pure-PHP one, and if loading fails because the document was badly written (by a human being, for example), you fall back to Spyc.

Verdict: dumper

In my opinion the Symfony YAML dumper is the best of the three in terms of usability, portability and interoperability, because its output can be read by both itself and Spyc.

However, if you dump YAML often, use the (hell of a lot faster) Syck for both loading and dumping. The generated YAML won't be readable by Symfony YAML or Spyc, but this is because they don't follow the full specification (so it is not Syck's problem, in fact).

Also note that the output of any valid JSON dumper is readable by standard YAML 1.2 loaders, because JSON is a subset of YAML 1.2. So if you use it for data exchange (and not for talking to humans), any fast JSON dumper can be used.


July News

28 Jul 2009 15:41

Some of you may be used to me posting more often than I have lately.
Some of you may wonder why I stopped blogging.

Brussels

Last month was full of adventures. It started on the 1st of July with me going to Brussels to meet our friend and talk about a Wikipedia-like site about art. We're going to help this man build the most complete site about art using Wikidot software!

BTW, this was the first flight of my life. Quite a strange feeling, but generally fine.

Forms

I was working on some nice technical and UI improvements to Wikidot that are crucial for the art site (but really, really nice for Wikidot as well), like forms for editing, entering and viewing structured data on wiki pages.

Search issues

That week was also spent on some massive Wikidot.com search engine tweaks. A stupid one-line bug, an indexing script that did not export the proper LC_ALL environment variable, caused many sites using Asian or East-European languages not to be indexed (most notably the great ИСТОРИЈСКА БИБЛИОТЕКА). At first we thought that we could re-index the broken sites, but our re-indexing mechanism was way too slow (it would have taken weeks for all the broken sites).

Pieter then challenged me. He said he could index the whole of Wikidot in 6 hours. I thought it was not even possible, but then I started to work on it and managed to index the whole of Wikidot in less than 2 hours, without indexing tags at first. Then, with tags, it took 2 hours and 10 minutes or so. That was damn fast!

Inspired by this and by an incident with a disk-full error on the /var partition of our webserver (but this is why we keep user-uploaded files and other important things on separate disks), I also rewrote the incremental indexer to work in a similar way to the whole-Wikidot re-indexer.

search-api reindex

If you care about some technical details:

  • all search operations are issued with use of search-api, a separate program that can:
    • re-index whole Wikidot
    • queue indexing page/thread
    • queue deleting page/thread
    • queue re-indexing site
    • flush queue
  • search-api is written in Python
  • search-api uses PyLucene - a native Java Lucene library bound to CPython objects with JCC. Compiled with the GNU Java Compiler to native code (like C programs), this binding performs better than using Lucene with Sun's Java.
  • before being rewritten in pure Python, search-api was written in Bash and was a wrapper around:
    • java -jar searchApiHelper.jar search "phrase-to-search"
    • php search-api-helper.php flush
  • search-api also takes care of file locking to ensure that:
    • only one process tries to modify the index
    • items are added to queue one-after-another
    • when doing some big index modification (read: full re-index) queue is not flushed (so that after the re-index all changes are applied to new index)
    • when flushing queue takes more time, and cron tries to run more flushing processes, they simply end (so only one process flushes the queue at a time)
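The "only one process flushes the queue at a time" rule can be sketched with a non-blocking file lock. This is only a minimal illustration, not the actual search-api code: the function name flush_queue_once, the lock file path and the flush callback are all made up for the example. A full re-index could hold the same lock for its whole run, so that cron-started flushes simply end until it is done.

```python
import fcntl
import os
import tempfile


def flush_queue_once(lock_path, flush):
    """Run flush() only if nobody else holds the lock; return True if it ran.

    lock_path and flush are hypothetical stand-ins for the real queue code.
    """
    with open(lock_path, "a") as lock_file:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting,
            # so a second flusher started by cron just gives up.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        try:
            flush()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


if __name__ == "__main__":
    demo_lock = os.path.join(tempfile.mkdtemp(), "queue.lock")
    print(flush_queue_once(demo_lock, lambda: print("flushing queue")))
```

The lock is released in a finally block, so a flush that raises does not leave the queue blocked for the next cron run.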

Union of Rock Festival

Just after a week spent in Brussels in a nice hotel, I went to Węgorzewo, Mazury (Poland's biggest lake district) to have fun at a rock music festival. Unfortunately, the music level was not very impressive, so I mainly enjoyed the atmosphere of the camping area.

The weather was not great. It was wet everywhere, the ground was covered in 20 centimeters of mud and it was hard to walk around without getting dirty. But during my first day there, I learned to do that.

Improved workflow at Wikidot

Some of you have noticed that we recently started to work more efficiently, but this is not quite true. In fact we work as efficiently as before, but we are better organized and have better priorities on tasks. We also keep track of what we do, so we can then tell what we've done. For us this is a little more work spent on "documenting" our work (so maybe we work even less efficiently than before?), but for the outside world we make more noise (in a positive sense) around it. So basically, people know what we do, what we are going to do and when they can expect changes, and most importantly, they understand why some feature request is being postponed. This is (and was) because we have more important things to do, but before, they couldn't tell.

Squark turned into a professional project manager who manages our time. pieterh decided to talk to the Community and listen to their complaints (he reads or at least skims every post on the Community forums). He tells Łukasz what needs to be done, and Łukasz knows when we will have time to do it. This way communication inside Wikidot improved. Also, we (michal frackowiak and me) no longer look at the Community forums (which some of you may regret), but this allows us to concentrate on our work.

The work continues

As I mentioned before, we want to introduce a great feature to Wikidot, which is forms. But the implementation now concentrates on the open source version of Wikidot software (once it's ready, working and tested we'll copy the feature to the Wikidot.com service).

aptitude install wikidot

As forms are a huge change, I started to prepare good ground for them: I closed the most important bugs in the Wikidot open source version and I'm about to start making Ubuntu packages for it, to allow even simpler installation on Debian-based systems. Right now the installation involves only 6 childishly easy steps and in fact can be done by copying and pasting a few commands.

Yesterday's party

Yesterday I went to meet some friends from my old school days in the heart of the city. It was meant to be a meeting for "a beer or two" but evolved into beer and dancing till morning. That was the first time I took a morning bus (not even the first one) home straight after partying.

It was great fun, and I met great folks.

Summary

I hope that with this long blog post (but divided into friendly sections ;) ) I have made up for the long period of not posting anything here.


Lucene Replaces Local Searches

23 Jun 2009 18:25

As some of you know, some time ago I worked on a new Wikidot search. It is used for the Search all sites page.

This work was done because searching the whole Wikidot database was far from fast. At first we used Google Custom Search Engine to solve the problem, but we wanted to be independent of it. We also wanted to include search results that are accessible to the person searching but not to search engines (like private sites).

This worked really well. The only problem was the long time it took to display the results. It was about 20 seconds, which was far better than with the previous search, but much worse than Google. This was strange, because from previous tests we had calculated that the average search time should be about a second or two.

It became clear when we noticed that only 20-30 searches per day were performed. The index is quite big, and it needs to be cached in RAM to perform well; with so few searches it rarely stayed cached.

Today I did more tests under heavy load and it seems Lucene can handle a big number of queries. When users search often, the index is partially or even fully cached by the filesystem and searches are really quick!

But our main problem to solve today was the slow local search, a.k.a. "search this site". Moreover, many concurrent queries were degrading database performance (not only for searching), so we decided to enable Lucene for local searches as well.

I must say it works really nicely, it is fast, and it has a nice set of syntax tricks, for example you can search for pages with something in tags. Just search for youtube tags:embed. This searches for pages matching youtube (in tags, title or content) and with embed in tags. If no such pages are found, partial matches are also returned, like pages matching youtube (but with no embed in tags) or pages with embed in tags (but not matching youtube).

To sum up, the new search is faster, gives more accurate results, spares the database (which was the main goal) and allows nicer syntax than the old one.
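The matching rule described above can be illustrated with a toy model. This is not Lucene and not the Wikidot code, just a sketch of the behaviour as described: the page dictionaries, their field names and the functions parse_query, clause_matches and search are all invented for the example, and real scoring in Lucene is far more involved.

```python
def parse_query(query):
    """Split 'youtube tags:embed' into (field, word) clauses; field None means any field."""
    clauses = []
    for token in query.lower().split():
        field, sep, word = token.partition(":")
        clauses.append((field, word) if sep else (None, token))
    return clauses


def clause_matches(page, field, word):
    """Check one clause against one page (whole-word match, case-insensitive)."""
    fields = [field] if field else ["title", "content", "tags"]
    return any(word in page.get(name, "").lower().split() for name in fields)


def search(pages, query):
    """Return names of pages matching every clause, else names of partial matches."""
    clauses = parse_query(query)
    scored = [(sum(clause_matches(p, f, w) for f, w in clauses), p) for p in pages]
    full = [p["name"] for score, p in scored if score == len(clauses)]
    if full:
        return full
    return [p["name"] for score, p in scored if score > 0]


pages = [
    {"name": "a", "title": "YouTube videos", "content": "how to embed", "tags": "embed video"},
    {"name": "b", "title": "YouTube tips", "content": "", "tags": "video"},
    {"name": "c", "title": "Maps", "content": "", "tags": "embed"},
]
print(search(pages, "youtube tags:embed"))
print(search(pages[1:], "youtube tags:embed"))
```

With all three pages only "a" satisfies both clauses; once it is left out, the partial matches "b" and "c" are returned instead.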


PHP as FastCGI backend and Lighttpd

15 Jun 2009 20:59

Wikidot + Lighttpd + PHP5

At Wikidot we use PHP5 as a FastCGI backend to the light-and-fast Lighttpd webserver. It works like this:

  • there are a few hundred php5-cgi processes (the name says cgi, but they also support FastCGI mode) running and waiting to be used
  • a lighttpd process (only one needed!) manages the network connections to all the clients and, once a request is ready, serves a static file or forwards the request to one of the PHP backend processes.

We used to use the internal Lighttpd FastCGI process manager, meaning the lighttpd process itself used to start the PHPs.

Problems

We encountered a known problem of 500 (server side) errors appearing after some random time, especially under high traffic. The typical message appearing in Lighttpd's error.log was:

<some date>: (mod_fastcgi.c.2494) unexpected end-of-file (perhaps the fastcgi process died): pid: ...

There are plenty of reports on this in both Lighttpd's and PHP's forums, bug trackers and even some blogs.

Workarounds

We managed to write some hacky scripts that detected the situation and restarted the backends when needed. The reaction was so quick that almost no one noticed the error, but damn, this is not how WE solve problems.

A blind try

We decided to give spawn-fcgi a shot. What is it? It is a program that spawns FastCGI backends (independently of the Lighttpd server). Why try it? I've read somewhere that it works more reliably than the internal Lighttpd spawner. What's interesting is that this program comes from the lighttpd package, so we stay in the family anyway. It's mainly intended for running the FastCGI backends as a different user than the webserver user, or on different machine(s) than the webserver machine. Naturally, this can be used for some smart load-balancing.

The only problem we encountered with this solution was an internal limit on the number of processes a single process could spawn, which was 256 (hardcoded, fixed in later versions). But at the same time we decided to build a few FastCGI bridges (each spawning ~200 PHPs) anyway, so that was no longer a problem for us.

What was quite surprising (but honestly, I deeply believed in this): our problems with 500 server errors and PHP disappeared. This configuration has been working for about 2 weeks now with absolutely no hacky scripts involved and no restarting needed. Cool.
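Splitting the backends into several bridges under that limit is simple arithmetic, sketched below. The helper plan_bridges only builds command lines and does not run anything. The total of 500 backends, the port numbers, the loopback address and the php5-cgi path are made-up values, and the sketch assumes spawn-fcgi's -a (address), -p (port), -C (number of PHP children) and -f (binary) options; check them against the spawn-fcgi version you have installed.

```python
import math

SPAWN_LIMIT = 256  # hardcoded per-spawner limit mentioned above


def plan_bridges(total_backends, per_bridge=200, first_port=9000,
                 php_binary="/usr/bin/php5-cgi"):
    """Build one spawn-fcgi command line per bridge (all values are illustrative)."""
    if per_bridge > SPAWN_LIMIT:
        raise ValueError("a single spawner cannot start that many backends")
    commands = []
    remaining = total_backends
    for index in range(math.ceil(total_backends / per_bridge)):
        children = min(per_bridge, remaining)
        remaining -= children
        commands.append([
            "spawn-fcgi",
            "-a", "127.0.0.1",             # listen address (assumed)
            "-p", str(first_port + index),  # one port per bridge (assumed)
            "-C", str(children),            # PHP children for this bridge
            "-f", php_binary,               # hypothetical path to php5-cgi
        ])
    return commands


for command in plan_bridges(500):
    print(" ".join(command))
```

Lighttpd would then need one FastCGI backend entry per port; that part of the configuration is not shown here.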

Why I wrote this

I wrote this short note just for the record and to let other people know that using spawn-fcgi instead of Lighttpd's internal FastCGI spawner might solve their problems with PHP (FastCGI) and 500 internal server errors.

Hope this helps someone.


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License