Piotr Gabryjeluk blog

Wikidot API

22 January 2009

A few days ago I started working on the Wikidot API. The API will be a standardized way to access the Wikidot.com service programmatically (i.e. without a browser) to retrieve, create and update information stored on Wikidot, including site browsing, page editing and commenting.

In simple terms, this will allow people to write applications that connect to Wikidot.com and perform actions on behalf of the user running the application.

Technically, the Wikidot.com API is an XML-RPC service exporting methods from a few specially designed classes.

To connect to an XML-RPC service, you must know its endpoint, which is a regular URL (http:// or https://). We decided to use HTTPS to secure the channel from the very start.

The operations we are going to support are:

Browse

  • site.categories
  • site.pages
  • page.get

The methods above are already implemented. Using these API calls you can retrieve almost all the data you have stored on Wikidot.com sites!

Modify

  • page.save

This will be the basic method for updating the content of your site. We plan several other methods, but this one is the most important.

Comments

  • page.comments
  • page.comment

They will be used to get and post comments on a given page. The reply_to parameter makes it possible to reply to a particular comment.

Forum

  • forum.groups
  • forum.categories
  • forum.threads
  • forum.post

These methods are going to give you full access to the forums you have started on Wikidot.

How to use the API

We haven't yet enabled API access on the main Wikidot.com server, but testing the API with the Python XML-RPC library is as easy as this:

>>> from xmlrpclib import ServerProxy
>>> s = ServerProxy('SOME-URL')
>>> s.system.listMethods()
['system.listMethods', 'system.methodHelp', 'system.methodSignature', 'system.multicall', 'site.pages', 'site.categories', 'page.get']
>>> print s.system.methodHelp('site.pages')
Get pages from a site
 
Argument array keys:
 site: site to get pages from
 category: category to get pages from (optional)
>>> s.site.categories({'site': 'gamemaker'})
['project', 'rpg', '_default', 'action', 'admin', 'badge', 'beginner', 'contests', 'error', 'event', 'example', 'forum', 'gamemaker', 'gml', 'gmlcode', 'gmupload', 'help', 'ide', 'include', 'mamber', 'member', 'nav', 'portal', 'resource', 'search', 'system', 'talk', 'template', 'tutorial', 'video-tutorial', 'wiki', 'challenge', 'howto', 'recent-changes', 'scratch-pad', 'helpdesk', 'helprequest', 'default', 'testimonial', 'testimonials']
>>> [p['name'] for p in s.site.pages({'site': 'gamemaker', 'category': 'badge'})]
['cool', 'flux', 'c-team', 'gml', 'member', 'madman', 'not-a-noob', 'f-madman', 'start', 'break-it']
>>> print s.page.get({'site': 'gamemaker', 'page': 'badge:cool'})['source']
[[table style="width:98%;margin-right:auto;margin-left:auto;margin-bottom:1%;"]][[row]][[cell style="width:360px;"]]
 
[[div class="error-block"]]
Included page "include:cool-badge" does not exist ([/include:cool-badge/edit/true create it now])
[[/div]]
 
The Cool Badge is given to people who make something really cool.
 
[[/cell]]
[[cell style="vertical-align:top;border:1px solid #ddd;padding:1%;"]]
+++ Display Code
 
[[table class="code"]][[row]][[cell]]
@@[[include include:cool-badge member=member name]]@@
[[/cell]][[/row]][[/table]]
 
+++ Tag
 
{{cool-badge}}
[[/cell]][[/row]][[/table]]
 
[[table style="border:1px solid #ddd;padding:1%;margin-right:auto;margin-left:auto;margin-bottom:1%;width:98%;"]][[row]][[cell]]
 
++ Earn It
 
* Program something really cool using gml
* Make a really cool game
 
+++ Tips
 
* Make sure you post your examples and games on the [[[forum:start|forum]]]. Otherwise no one can see it and nominate you for the cool badge.
[[/cell]][[/row]][[/table]]
 
[[table style="border:1px solid #ddd;padding:1%;margin-right:auto;margin-left:auto; width:98%;"]][[row]][[cell]]
++ Members Who Have Earned the Cool Badge
 
[[module ListPages category="member" order="titleAsc" tag="cool-badge" perPage="100" separate="false"]]
* %%linked_title%%
[[/module]]
 
[[/cell]][[/row]][[/table]]
>>>

A few words of explanation:

  • first we import the ServerProxy class from the XML-RPC library,
  • then we construct the ServerProxy object s, supplying the endpoint URL (SOME-URL in this case, as we haven't yet decided what the URL is going to be)
  • we can see a list of methods by calling system.listMethods on the ServerProxy object
  • we get a help message for a method by calling system.methodHelp
  • then we get categories of site gamemaker (yeah, it's a part of the wikicomplete.info)
  • then we call site.pages method (specifying site and category parameters), but instead of displaying the whole list of structures that describe pages, we only display their names
  • calling page.get returns an array with the information about a page, including:
    • wiki source, array key: source
    • generated HTML, array key: html
    • array with various meta-data, array key: meta
  • we call page.get, passing as the argument an array that specifies the site and page name, and get the page object, but display only what's stored under the source array key

As you can see, playing with this is really easy, as is browsing the available methods and using them.

Why XML-RPC

We've chosen this protocol because it makes it easy to develop both server and client in almost any programming language. It also gives some flexibility in the arguments passed and the values returned.

We use the XML-RPC struct type for both arguments and return values; client (and server) libraries map it to an associative array or dictionary. Each API method takes a set of required and optional parameters, which are simply values stored in the struct passed to the method.

For example, site.pages takes a struct with the following keys:

  • site (site name to get pages from) — required
  • category (category to get pages from) — optional

This means you have to create an associative array (when using PHP) or a dictionary (when using Python) and pass it as the method argument:

# PHP
$pages = $server->site->pages(array("site" => "my-site", "category" => "my-category"));
# Python
pages = server.site.pages({"site": "my-site", "category": "my-category"})

In other programming languages you'll end up with something similar. You can almost always create the array/dictionary in place, so following this convention is not a big deal.
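Here is a small sketch of how a client could build such a struct when some keys are optional. The helper name build_struct is my own invention for illustration and not part of the Wikidot API. It leaves out optional keys whose value is None, because Python's XML-RPC library refuses to marshal None unless allow_none is enabled.

```python
def build_struct(required, **optional):
    """Merge required keys with optional ones, skipping optional values left as None."""
    struct = dict(required)
    struct.update({key: value for key, value in optional.items() if value is not None})
    return struct


# With a real ServerProxy object this struct would be passed as: server.site.pages(args)
args_all = build_struct({"site": "my-site"}, category="my-category")
args_min = build_struct({"site": "my-site"}, category=None)

print(args_all)
print(args_min)
```

The same struct works for every method, so adding a new optional parameter on the server side does not break existing clients.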

Applications

I'm working on filesystem-based access to a Wikidot site (using FUSE and Python).

We plan to build a Wikidot application for the iPhone.

A save-it-directly-on-wikidot plugin would be a nice thing for various text editors (and probably other applications).

And there are probably billions of other ways to use this API that we're not even aware of. If you can think of any, feel free to leave a comment.


Bridging Python And PHP

11 January 2009

Imagine you have a PHP-based application (like Wikidot) and you want to extend it using Python. Of all the ways to do it, I'll show you how to achieve this using the XML-RPC protocol.

Background

XML-RPC is a client-server protocol for remote procedure calls.

On the server, this works by taking a set of functions from your application and exporting them over HTTP.

On the client, this works by connecting to an XML-RPC server, finding out which functions it offers and constructing a so-called server proxy: an object with a method for every function exported by the XML-RPC server.

Calling a method of the server proxy connects to the server over HTTP, passes the arguments and transports the result back to the client. So basically this works as if you had a remotely located object available locally.

The data encoding between client and server is defined in the XML-RPC specification and is a language based on XML (but you never actually touch it; the libraries convert the XML to objects for you).
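If you are curious what the libraries hide from you, the sketch below encodes a call by hand and decodes it again. It uses Python 3's xmlrpc.client (the successor of the xmlrpclib module used elsewhere in this post). The method name myclass.repeat and the argument are just example values.

```python
from xmlrpc.client import dumps, loads

# Encode a call to the example method "myclass.repeat" with one string argument.
request_xml = dumps(("Hello RPC service",), methodname="myclass.repeat")
print(request_xml)

# Decode it again, the way a server library would.
params, method = loads(request_xml)
print(method, params)
```

The printed document is what actually travels between client and server; everything else is transport.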

Overview

We want to run an XML-RPC server exposing a class in PHP and an XML-RPC client in Python to communicate with the XML-RPC server.

Traditionally we would need an HTTP server for the PHP XML-RPC server, because HTTP is used as the XML-RPC transport. But if you dig a bit into the specification, you'll discover that no HTTP-specific parts of the protocol are actually used. HTTP is just a line to transport the XML data.

So you may wonder whether it's possible to use XML-RPC with a transport other than HTTP. In short, yes. But you may need to hack around the XML-RPC libraries (because they usually assume you'll want to use HTTP).

PHP XML-RPC server

First, you need some class that you want to expose with PHP XML-RPC:

<?php
 
class MyClass {
    /**
     * @param string $input
     * @return string
     */
    public function repeat($input) {
        return $input;
    }
}

Notice I've set the parameter and return types in phpdoc. The Zend XML-RPC server reads these docblocks to work out the method signature.

Now let's expose this class with the Zend Framework XML-RPC implementation.

You need to download Zend Framework first, let's say to the /path/to/zf directory.

<?php
 
class MyClass {
    /**
     * @param string $input
     * @return string
     */
    public function repeat($input) {
        return $input;
    }
}
 
set_include_path(get_include_path() . PATH_SEPARATOR . '/path/to/zf/library');
require_once "Zend/XmlRpc/Server.php";
 
$server = new Zend_XmlRpc_Server();
$server->setClass('MyClass', 'myclass');
echo $server->handle();

The set_include_path line adds the /path/to/zf/library directory to the PHP include path, so you can load the Zend_XmlRpc_Server class (located in the /path/to/zf/library/Zend/XmlRpc/Server.php file).

Then an instance of Zend_XmlRpc_Server is created, and MyClass is attached to it as the class for the myclass XML-RPC namespace. This means the repeat method can be called via XML-RPC as myclass.repeat.

Suppose you place the file on your server so that it is reachable under some URL, for example:

http://your-server.com/myclass.php

This URL is then a fully valid XML-RPC server endpoint for XML-RPC clients.

Python client

With the XML-RPC server running, we can connect to it from any XML-RPC-enabled library in any programming language.

In Python, to call the remote procedure myclass.repeat on the XML-RPC endpoint http://your-server.com/myclass.php, you would do the following:

from xmlrpclib import ServerProxy
 
server = ServerProxy('http://your-server.com/myclass.php')
print server.myclass.repeat('Hello RPC service')

Running this code:

# python xmlrpc-test.py

gives you:

Hello RPC service

Under the hood:

  • Python script makes a connection to http://your-server.com/myclass.php
    • your webserver runs the myclass.php script
      • the $server->handle() line processes the data received
        • chooses a class and a method to run (this would be MyClass and repeat)
        • passes the arguments (a string 'Hello RPC service') to the method
        • gets the return value
      • passes it back to the client wrapped in XML-RPC protocol
  • Python gets XML reply and converts it back to simple string ('Hello RPC service')
  • and prints it on the console
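The steps above can be replayed in a single Python 3 process, with no web server and no PHP. This is only a sketch of the flow, not the Zend implementation: Python's SimpleXMLRPCDispatcher stands in for $server->handle(), and I call its internal _marshaled_dispatch helper directly (the standard library's own request handlers do the same).

```python
from xmlrpc.client import dumps, loads
from xmlrpc.server import SimpleXMLRPCDispatcher


def repeat(text):
    """Stand-in for MyClass::repeat on the PHP side."""
    return text


# Server side: attach the function under the name clients will call.
dispatcher = SimpleXMLRPCDispatcher()
dispatcher.register_function(repeat, "myclass.repeat")

# Client side: wrap the call in XML, as the server proxy would do.
request_xml = dumps(("Hello RPC service",), methodname="myclass.repeat")

# Server side: pick the method, pass the arguments, wrap the return value in XML.
response_xml = dispatcher._marshaled_dispatch(request_xml.encode("utf-8"))

# Client side: convert the XML reply back to a simple string.
(result,), _ = loads(response_xml)
print(result)
```

Everything between dumps and loads is what normally happens on the other end of the HTTP connection.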

Omitting the HTTP protocol

You will probably run both the Python and the PHP scripts on the same machine, so the HTTP part is rather useless and just an additional point of failure.

As I already stated, HTTP is only a transport and you can replace it (at some cost) with another one.

I came up with the idea of using stdin/stdout as the transport: Python executes a PHP script (through the command-line interface) and passes the XML-RPC request to the script's stdin. PHP then has to read the XML-RPC request from stdin instead of from an HTTP request, and write the response to stdout.

This means modifying both the server and the client code.

First the server:

<?php
 
class MyClass {
    /**
     * @param string $input
     * @return string
     */
    public function repeat($input) {
        return $input;
    }
}
 
set_include_path(get_include_path() . PATH_SEPARATOR . '/path/to/zf/library');
require_once "Zend/XmlRpc/Server.php";
require_once "Zend/XmlRpc/Request/Stdin.php";
 
$server = new Zend_XmlRpc_Server();
$server->setClass('MyClass', 'myclass');
echo $server->handle(new Zend_XmlRpc_Request_Stdin());

The only change is passing an instance of Zend_XmlRpc_Request_Stdin to $server->handle(). That is all that's needed; the Zend Framework developers already anticipated this use.

Then, the client part.

Xmlrpclib allows passing a custom transport in case you want to implement proxies or something similar. We'll write a transport that, instead of making an HTTP connection, runs a PHP script, passes the request to its stdin and reads the response from its stdout:

from xmlrpclib import Transport, Server
from subprocess import Popen, PIPE
 
class LocalFileTransport(Transport):
    class Connection:
        def setCmd(self, cmd):
            self.cmd = Popen(['php', cmd], stdin=PIPE, stdout=PIPE)
 
        def send(self, content):
            self.cmd.stdin.write(content)
            self.cmd.stdin.close()
 
        def getreply(self):
            return 200, '', []
 
        def getfile(self):
            return self.cmd.stdout
 
    def make_connection(self, host):
        return self.Connection()
 
    def send_request(self, connection, handler, request_body):
        connection.setCmd(handler)
 
    def send_content(self, connection, request_body):
        connection.send(request_body)
 
    def send_host(self, connection, host):
        pass
 
    def send_user_agent(self, connection):
        pass
 
server = Server('http://host.com/path/to/the/php/script/myclass.php', transport = LocalFileTransport())
print server.myclass.repeat('Hello XML-RPC with no HTTP service')

Notes:

  • host.com in the URL is completely ignored, use whatever value you want
  • /path/to/the/php/script/myclass.php in URL is passed as the PHP script to run
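The client above targets the Python 2 xmlrpclib of the time. Below is a rough sketch of the same idea for Python 3's xmlrpc.client, where overriding Transport.request is enough. To keep the example runnable without PHP, a tiny Python script stands in for myclass.php. The class name StdioTransport, the stand-in script and the "ignored" host name are all my own assumptions, and the URL trick assumes POSIX-style paths.

```python
import os
import subprocess
import sys
import tempfile
from xmlrpc.client import ServerProxy, Transport

# Stand-in for the PHP script: reads one XML-RPC request from stdin
# and writes the XML-RPC response to stdout.
SERVER_SOURCE = '''
import sys
from xmlrpc.server import SimpleXMLRPCDispatcher

dispatcher = SimpleXMLRPCDispatcher()
dispatcher.register_function(lambda text: text, "myclass.repeat")
sys.stdout.buffer.write(dispatcher._marshaled_dispatch(sys.stdin.buffer.read()))
'''


class StdioTransport(Transport):
    """Run a local script for each call instead of opening an HTTP connection."""

    def __init__(self, interpreter):
        super().__init__()
        self.interpreter = interpreter  # e.g. "php" for a real PHP server script

    def request(self, host, handler, request_body, verbose=False):
        # The host is ignored; the handler (the URL path) names the script to run.
        completed = subprocess.run(
            [self.interpreter, handler],
            input=request_body,
            stdout=subprocess.PIPE,
            check=True,
        )
        parser, unmarshaller = self.getparser()
        parser.feed(completed.stdout)
        parser.close()
        return unmarshaller.close()


with tempfile.TemporaryDirectory() as workdir:
    script = os.path.join(workdir, "myclass_server.py")
    with open(script, "w") as handle:
        handle.write(SERVER_SOURCE)
    server = ServerProxy("http://ignored" + script,
                         transport=StdioTransport(sys.executable))
    reply = server.myclass.repeat("Hello XML-RPC with no HTTP service")

print(reply)
```

With the PHP server from this post you would construct the transport as StdioTransport("php") and put the path to myclass.php in the URL, exactly as in the notes above.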

What to do next?

With this simple skeleton in place, you can now extend MyClass (but give it a more meaningful name first!). You can also attach more classes to the XML-RPC server using different namespaces:

$server->setClass('SomeClass', 'some');
$server->setClass('MyClass', 'my');
$server->setClass('YourClass', 'your');

Only public methods are exposed to the XML-RPC clients, so you can hide some logic inside of private or protected methods and only expose what you need from given classes.

This solution is a quick way to reuse some of your well-working PHP code in your fancy new and elegant Python application. It can help if, for example, you want to build a filesystem with Python-FUSE but want the data to come from a PHP application.

Did it help you?

I hope this helps someone. Feel free to comment.


Wikidot Search Ready To Test

18 December 2008

Lucene-based search, a brand new feature, has been making its way into the Wikidot code for quite a long time.

After all that time spent developing and dealing with performance problems (searching the whole of Wikidot in 3 seconds is too long!) it's time to test this thing!

Introduction

The whole thing is about the Search All Sites module for Wikidot Open Source. This entry is mainly for those of you who run your own Wikidot services. More info on getting the Wikidot software running can be found on Ed's site.

Fresh install

After installing the current version of Wikidot Open Source, you can instantly test the new Search All Sites module. Just navigate to the /search:all page of your main wiki. Example: if your wiki farm runs on the domain mydomain.com and your main wiki is www.mydomain.com, navigate to http://www.mydomain.com/search:all.

Existing install

Updating an existing installation to the new search engine is a bit tricky, because you need to pre-populate the search index (the actual file that the engine searches when looking for the term the user entered).

You can try the following commands:

  • obtain root privileges
sudo su
  • navigate to your Wikidot directory (it is /var/www/wikidot by default), update the code and run the lucene_bootstrap.php script as your lighttpd user
cd /var/www/wikidot
svn update
cd tests
sudo -u www-data php lucene_bootstrap.php

This command adds every page to the index (normally located at /var/www/wikidot/tmp/lucene_index). Once indexed, a page can be searched (if the site containing the page is public or you're a member of the site). The command prints a dot for every 10 indexed items (an item is a page or a forum/comments thread).

If this runs smoothly (i.e. with no errors; a Segmentation fault at the end is OK, but memory exhausted is not), you have all your sites indexed and ready to search through.

If it fails, you can increase the memory_limit setting in the corresponding php.ini file and re-run the command. There is no harm in running this command more than once, as indexing a page always deletes the page from the index before adding it again.

Just go to /search:all location at your main wiki and search for some content.

Also you need to update your crontab file. Add:

* * * * *      www-data /var/www/wikidot/bin/job.sh UpdateLuceneIndexJob

to your /etc/crontab (assuming you have Wikidot in /var/www/wikidot/). This adds a job that runs every minute and indexes the pages and threads that were queued for indexing when they were saved or when a site's public/private state changed.

Features

  • First of all, the new search applies only to Search All Sites, i.e. Search This Site works the old way.
  • Search uses titles and tags intelligently
    • pages with the exact search phrase in the title are placed higher in the result list
    • pages with tags matching search phrase are quite high in the result list
    • pages with title matching search phrase are quite high in the result list
    • pages with content matching search phrase are somewhere low in result list
    • pages with parts of search phrase matching titles and tags can be higher in the result list than the pages having content matching even the exact phrase
    • this all means: tags and titles are more important than content for the search engine
  • You can narrow your search to only selected wikis
    • append site:site1,site2,site3 (no spaces between them) to your search query. Example: search for "gabrys site:www,community" searches for gabrys in titles, tags and contents of pages and threads inside of sites www.yourdomain.com and community.yourdomain.com (supposing your Wikidot installation runs on yourdomain.com)
  • The search includes public sites plus sites you are a member of. Also the results from your sites are generally more relevant to the search engine (i.e. they appear higher than the results from other sites)
  • The search results for a given phrase and a given user are cached (if memcached is used) for a few minutes. This makes the search even smoother (no need to search the index again when the user only switches the result page from 2 to 3, for example)
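To make the site: syntax from the list above concrete, here is a minimal sketch of how a query could be split into the search phrase and the list of sites. This is not the code Wikidot uses; the function name split_site_filter is made up for the example.

```python
def split_site_filter(query):
    """Return (search phrase, list of site names) for a raw query string."""
    terms, sites = [], []
    for token in query.split():
        if token.startswith("site:"):
            # Site names are comma-separated, with no spaces between them.
            sites.extend(name for name in token[len("site:"):].split(",") if name)
        else:
            terms.append(token)
    return " ".join(terms), sites


phrase, sites = split_site_filter("gabrys site:www,community")
print(phrase, sites)
```

An empty site list would then mean "all public sites plus the sites you are a member of".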

Test it!

If you don't run and don't want to run your own Wikidot installation, you can try the new features on the following site:

http://www.wikicomplete.info/search:all

Extras

Highlighting

Do you like the way Google highlights the searched words in its cached versions of result pages? Now you can add this feature to your Wikidot installation as well.

Open the conf/wikidot.ini file and append these lines:

[search]
; enables highlighting of search phrase in the resulting documents
highlight = true

This will highlight the words the user searched for using:

  • Google Search
  • Yahoo Search
  • Your Wikidot installation search:all page

Need more performance or memory limit exhausted

We experienced rather low performance when searching through the 2 million pages and threads of Wikidot.com: the search results were generated in about 3 seconds. This was not fast enough for us, so we managed to speed things up by using the native Java Lucene implementation for searching the index. This works because the PHP Lucene implementation we use is compatible with the Java one, which means we can index pages with PHP and search with Java. And we do! If you want to do this too (because you are experiencing low search performance or getting memory exhausted error messages), just add the following lines to your conf/wikidot.ini file:

[search]
; enables the use of Java for searching
use_java = true

Notes

  • if you already have a [search] section in the conf/wikidot.ini file, just add the use_java = true line to that section
  • enabling Java for searching requires a java executable installed on your system. You should know how to do this (try sudo aptitude install openjdk-6-jre).
  • you don't need any Java libraries, as we already bundled everything needed in the .jar file. The Java source and the Ant build script are normally located in the /var/www/wikidot/java directory (assuming you installed Wikidot in /var/www/wikidot).

Summary

Once we are sure the search is stable and gives relevant results, we'll introduce it to the Wikidot.com service. I calculated that indexing all the sites would take about 3 days! But searching takes less than 1 second (using the Java program).

I'm looking forward to your comments on the features, especially if you've tried them yourself!


New Search For Wikidot

3 December 2008

As some of you already noticed, we now use Google Custom Search for searching the whole Wikidot. The reason for this was quite obvious.

We now host 2 million pages and the full text search engine we had been using was not fast enough to satisfy a regular user. The average search time (in all wikis) was about 30 seconds.

Google searches the whole of Wikidot in less than 1 second. The downsides of using the Google engine are:

  • using an external service (a matter of prestige and dependence)
  • displaying ads on search results — for those who don't use AdBlock
  • pages get indexed after some significant time
  • only public wikis can be indexed

One important thing is that Google indexes all the content of a site. This includes the Wikidot.com footer, menus, the wiki header, the real content and tags. All of this is treated in a way unknown to us, so we have no real influence on how Google weighs different portions of pages.

This leads to the conclusion that we need our own search engine.

We would like:

  • to treat tags as more important than the regular content
  • not to search Wikidot.com static elements (like the footer on every page)
  • to allow searching all wikis available to a given user
    • all public wikis
    • all private wikis that the user is a member of

Coming to the technical details, one could say we just need a generic full text search engine. We can use the one available in our storage system or one of the dedicated search-only engines.

Tsearch

Tsearch, the full text search engine for PostgreSQL (the storage used for Wikidot data), is currently used when searching a single wiki. It is quite nicely integrated and works well. But when there are over 20,000 wikis to search (I mean only non-spam public ones), it is not efficient enough.

Lucene

Lucene, one of the most popular search engines for Java, is one of the possible choices. It works as follows:

  • the application adds documents to the search index
    • a document is a webpage in our situation
    • the index is a datastore to be used when searching
  • the user queries the index with a query
    • a set of documents is returned in order of relevance
    • the documents returned are more or less the same as the documents added to the index before

This mechanism requires the application to populate the index:

  • updating the index every now and then
  • updating a document on some change, like a page edit

and requires us to define some functions that deal with search results:

  • the original webpage is usually not stored in the index, but only tokenized, to allow finding it when searching for any word in the document
    • this makes the index smaller and faster
    • this means we need to store an additional ID of the webpage, to be able to retrieve the full result from the database based on the stored ID
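The mechanism described above can be illustrated with a toy inverted index. This is a deliberately naive sketch, not how Lucene is implemented: the class name TinyIndex is made up, tokenizing is just lowercasing and splitting on whitespace, and there is no relevance ordering. It only shows that the index keeps tokens and page IDs, while the full pages stay in the database (here a plain dictionary).

```python
from collections import defaultdict


class TinyIndex:
    """Maps each token to the IDs of the pages containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def remove(self, page_id):
        for ids in self.postings.values():
            ids.discard(page_id)

    def add(self, page_id, text):
        # Re-indexing a page first drops its old entry, e.g. after a page edit.
        self.remove(page_id)
        for token in text.lower().split():
            self.postings[token].add(page_id)

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))


# The "database" holds the full pages; the index only holds tokens and IDs.
pages = {1: "The Cool Badge is given to people", 2: "Make a really cool game"}
index = TinyIndex()
for page_id, text in pages.items():
    index.add(page_id, text)

hits = index.search("cool")
print([pages[page_id] for page_id in hits])
```

Looking the hits up in pages is the step where the stored ID is used to retrieve the full result from the database.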

Nutch

Nutch would be a different, more Google-like approach to the search issue. Nutch indexes mainly HTML files (given their URLs) and crawls through the links. This has both advantages and disadvantages.

The main advantage of using Nutch is that as a search result we get a formatted HTML document

  • with links to the items found
  • with the context of the search phrase quoted
  • with the search phrase words highlighted in some way

This is very similar to what we get searching for some phrase with Google.

What I don't like about Nutch is the quite big overhead of populating the index. A page must be compiled by the server and HTML must be produced. Then the same HTML must be parsed by the search engine to extract the important data. A lot of information is generated by the server and then discarded by the search engine.

Nutch (like Lucene) is a Java project and requires a Java environment. This may or may not be a problem, but it is a point we must consider when looking for the optimal solution.

There is the OpenSearch project, which aims to make the Nutch results more interchangeable (exporting them as RSS feeds). Using it, a PHP application can simply ask an HTTP service for results and get an RSS feed to parse and present to the user.

Zend_Search_Lucene

There is also a quite nice thing around: Zend_Search_Lucene. It is a search engine written entirely in PHP being a part of Zend Framework for PHP. The internal format of the search index file is compatible with Lucene and this is where the name of the package comes from. Also the query language is the same (or very similar).

One would expect a PHP implementation to be really slow when searching really big sets of data, but in our tests we got the search results for almost any query in about 1 second, searching almost the whole of Wikidot.

I think this is a really nice solution, because it can integrate well with the existing PHP code of Wikidot. The searching can also be easily parallelized across many machines. For example, you can have 4 search machines, each handling 1/4 of the search queries. This does not reduce the time of a single search, but it avoids running many queries against one index at once.

There are some options to consider when distributing the search queries among machines. We can select the machine to perform the search at random, in turn (round-robin) or by search hash. The search hash approach computes an MD5 sum (or another hash) of the query, treats it as an integer, takes the remainder of dividing it by the number of machines and assigns the search to the machine with that number. This means the same query will always be performed on the same machine (where it can then be better cached or optimized).

The Zend implementation of Lucene is also really easy to understand and use, so it seems a good start to me. Testing it on the whole public part of Wikidot, I got an index of about 500 MB. Adding a single page to an index of this size takes about 2 seconds, and searching takes about 1 second.
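The search hash option is short enough to sketch directly. The function name machine_for_query and the machine count of 4 are example values only.

```python
import hashlib


def machine_for_query(query, machine_count):
    """Pick a machine number (0-based) from the MD5 sum of the query."""
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    # Treat the hash as an integer and take the remainder of the division.
    return int(digest, 16) % machine_count


first = machine_for_query("gabrys site:www,community", 4)
second = machine_for_query("gabrys site:www,community", 4)
print(first, second)
```

Both calls print the same number, which is the whole point: a repeated query always lands on the machine that may already have it cached. The downside is that changing the number of machines reshuffles almost all queries.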

Sphinx

When I asked a friend which full text search engines he would recommend, he pointed out Sphinx, a standalone application for this purpose. It is not very popular software (it hasn't found its way into the Ubuntu repository, for example), but it seems very interesting.

Sphinx can be fed with XML streams of data from any application, can fetch data from PostgreSQL or MySQL databases, or can be talked to via its API and libraries for many languages.

It seems somewhat similar to Lucene, but implemented in a traditional language, not Java.

The choice

There are probably some other solutions worth trying, but I think the most appropriate one for now is the Zend implementation, as it's the easiest to adapt. We can optionally add some caching mechanisms and query distribution. We also need more testing of situations that may appear (like a need to perform 100 queries simultaneously).


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License