
Available in the new 0.9 release of the WebMynd Firefox add-on are two new experimental search interfaces, Manhattan and Osaka:

These new interfaces were designed based on some of the feedback we received on our first experimental interface, Phoenix. In contrast to Phoenix, which minimizes visual clutter by only presenting the top three results from each of your selected sources on the main screen, Manhattan and Osaka display all your results within scrollable frames in a magazine-style layout. The idea is to let you browse the most possible information with the least possible interaction.

We’re just getting started with these features, and there’s a lot of work yet to be done. But we hope that by offering new kinds of search experiences we can help improve one of the most important (and often most frustrating) aspects of using the web.

Install WebMynd 0.9 to try our new interfaces on your own searches.


We’ve been working on some exciting new features over the last few weeks. One of them is Phoenix, an experimental new search interface that offers the functionality of Google and the WebMynd sidebar with a clean and engaging look and feel.

Phoenix is the first of several new interfaces we’re working on. Our hope is that each will offer a unique experience, so that you can choose the one that’s best suited to the way you search. We think Google’s interface has grown stale from a lack of innovation and there’s a wide range of design possibilities that have yet to be explored. Our lead designer, Imran Zaidi, has written an article outlining the thinking behind Phoenix and its underlying design framework.

Phoenix is included in the new 0.8.3 release of the WebMynd extension for Firefox.


Relational databases, and the object-relational mapping layers which abstract them, are not particularly well suited to storing large blobs of data: images, videos, compressed files and so on.

Far better than streaming megabytes of binary to the database is to instead keep a reference into a separate store, better suited to the task of saving and serving files.

At WebMynd, we use SQLAlchemy as our ORM and Amazon’s Simple Storage Service (S3) to store our files. We’ve used Boto to create a convenient, transparent way to store a file in SQLAlchemy, with the data of the file actually residing in S3. These files can then be served directly from S3, decreasing database size and I/O load, and potentially reducing bandwidth costs.

Transparent changes to file content

Suppose the objects we wish to be backed by S3 have a content attribute, which is the file body itself. What we’re aiming for is to be able to do something like:

file = session.query(File).get(file_id)
file.content="new content"
session.save_or_update(file)
session.flush()

This can be achieved by creating a property on the SQLAlchemy model class:

    def _set_content(self, cont):
        s3     = boto.connect_s3(aws_id, aws_key)
        bucket = s3.get_bucket(s3_bucket)
        key    = bucket.get_key(self.key)
        if not key:
            key = Key(bucket=bucket, name=self.key)
        key.set_contents_from_string(cont)
        # if you want to serve files directly from S3:
        key.make_public()
    def _get_content(self):
        s3     = boto.connect_s3(aws_id, aws_key)
        bucket = s3.get_bucket(s3_bucket)
        key    = bucket.get_key(self.key)
        if not key:
            pass # complain
        else:
            return key.get_contents_as_string()
    content = property(_get_content, _set_content)
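To see the property pattern on its own, here is a minimal sketch in which a plain dictionary stands in for the S3 bucket. The names FAKE_BUCKET and StoredFile are hypothetical and are not part of Boto or SQLAlchemy; a real model would make the key calls shown above instead of touching a dictionary.

```python
# Hypothetical stand-in for an S3 bucket: maps key names to file bodies.
FAKE_BUCKET = {}


class StoredFile(object):
    """Model-like class whose content lives outside the object itself."""

    def __init__(self, key):
        # Only the reference (the key name) is kept on the object.
        self.key = key

    def _set_content(self, cont):
        # Write-through: assigning to .content stores the body externally.
        FAKE_BUCKET[self.key] = cont

    def _get_content(self):
        # Complain loudly rather than silently returning None.
        if self.key not in FAKE_BUCKET:
            raise KeyError("no stored content for key %r" % self.key)
        return FAKE_BUCKET[self.key]

    content = property(_get_content, _set_content)


f = StoredFile("reports/example.txt")
f.content = "new content"
```

Callers only ever see an ordinary attribute, so the storage backend can change without touching the code that reads and writes content.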

Cleaning up S3 artifacts

The task of keeping S3 synchronised with the database state seems like it would be awkward, perhaps involving database triggers and queues of reconciliation tasks. I was pleasantly surprised to find that SQLAlchemy has an excellent MapperExtension class, which gives you a bunch of hooks to hang custom code off. For example, to delete an S3 key when a SQLAlchemy File object is deleted, you would do something like:

import boto
from sqlalchemy import orm
from sqlalchemy.orm import MapperExtension, mapper

class CleanupS3(MapperExtension):
    def after_delete(self, mapper, conn, inst):
        s3     = boto.connect_s3(aws_id, aws_key)
        bucket = s3.get_bucket(s3_bucket)
        key    = bucket.get_key(inst.key)
        if key:
            key.delete()
        else:
            pass # complain
        return orm.EXT_CONTINUE

mapper(File, file_table, extension=CleanupS3())

A script with a working example can be found here. It requires Boto, SQLAlchemy and some AWS configuration. In real-world usage, you’d want some more error-checking, handling of mime types and you may choose to stream in the file content with Boto’s set_contents_from_file method. You’ll also note that we connect to S3 for every method invocation; if you have frequent changes to file content, using a connection pool for Boto might help improve performance.
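Short of a full connection pool, one simple option is to open the connection once and reuse it. The sketch below shows the idea using a hypothetical connect_to_store function as a stand-in for boto.connect_s3; it is written for Python 3, and whether a shared Boto connection is safe across threads in your setup is something you would need to check for yourself.

```python
import functools

CONNECT_CALLS = []  # records each time a "connection" is really opened


def connect_to_store(access_id, secret_key):
    """Hypothetical stand-in for an expensive call such as boto.connect_s3."""
    CONNECT_CALLS.append(access_id)
    return {"owner": access_id}  # placeholder connection object


@functools.lru_cache(maxsize=None)
def get_connection(access_id, secret_key):
    # The first call opens the connection; later calls with the same
    # credentials get the cached object back.
    return connect_to_store(access_id, secret_key)


first = get_connection("my-id", "my-key")
second = get_connection("my-id", "my-key")
```

Here first and second are the same object, and connect_to_store has only been called once.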


Most software projects start with a nice, clean, compartmentalised architecture, whether real or imagined. As implementation progresses, the lines between components tend to blur as unforeseen dependencies emerge and edge cases are dealt with.

However, by the time it comes to deployment, you’ll probably still have a number of separate packages, with some (hopefully acyclic) dependency graph binding them together.

At WebMynd, the web tier runs on Turbogears, a Python web framework. Turbogears is by its nature very modular, with various options for “plugging in” alternative tools and extensions, which has led us to be quite modular with our own code.

Dependencies between these packages are managed via the install_requires setuptools parameter, e.g.:

    install_requires=[
        "TurboGears",
        "SQLAlchemy",
        "MiniMock >= 1.2.2",
        "Boto >= 1.5",
        "Sphinx",
        "WMQueueLib",
        "WMModel",
    ],

Here, the “WM…” packages are internal, and we don’t really want to share them on PyPI. So how best to get them installed onto the machines where they’re required?

One option is to grab the source code directly, then build and install it into place. Even if your code is in a DVCS, this process can be complex, and you’re going to have to store somewhere the URLs and/or levels that each package depends on from the others. But this information is already encoded in a much more concise and flexible way: the install_requires declarations!

We’ve found it convenient to take advantage of this version-controlled dependency graph by hosting our own little package index internally. It’s nice and easy: all that’s required is some easy_install configuration like this:

[easy_install]
find_links = http://internal_server.webmynd.com/packages/

Our internal_server is only accessible from a restricted set of IPs, but you could use other security measures – I’ve just tried basic HTTP authentication and it works: just prepend username:password@ to the domain.
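If you generate that configuration from a script, the credentials can be spliced into the URL programmatically. This is a small sketch for Python 3; the with_basic_auth helper and the example username and password are made up for illustration, and credentials containing characters such as “@” or “:” would need percent-encoding, which is not handled here.

```python
from urllib.parse import urlsplit, urlunsplit


def with_basic_auth(url, username, password):
    """Return url with username:password@ prepended to the host."""
    parts = urlsplit(url)
    netloc = "%s:%s@%s" % (username, password, parts.netloc)
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Hypothetical credentials, for illustration only.
find_links = with_basic_auth(
    "http://internal_server.webmynd.com/packages/", "deploy", "s3cret")
print(find_links)
# http://deploy:s3cret@internal_server.webmynd.com/packages/
```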

There are a few places you can put this configuration, but we include it in setup.cfg in all our packages, so that install dependencies just take care of themselves, with no hassle and no changes required on the machines. Installing a package is as simple as:

easy_install WMWebTier

Rather than making sure that the right source is pulled down on the right machine at the right time, now you can safely push all your good builds up onto your internal package index and trust that the client selects the right one. You’ve already encoded dependencies in your package metadata – relax and let easy_install do the hard work for you!


Last week we removed the ‘Ask Twitter’ feature from WebMynd’s interface on search results pages. The idea of the feature was that, when you search, as well as being shown results from your favourite sources on the right-hand side of the results page, you could also ask your Twitter followers for help with your search.

It seemed to be an exciting idea when we first thought of it. By putting the ability to post to Twitter right into your search workflow, we hoped to improve Twitter’s utility for you and make your search more social. Early beta testing by the WebMynd team seemed to validate this – I found a great fish and chip shop in Covent Garden, London by asking my Twitter network after I had searched unsuccessfully. It would never have occurred to me to ask had the interface not been right there on the search page.

We removed this feature last week and I’ll cover the reasons. It also got me thinking in more general terms about the circumstances under which you would consider removing a feature, since this isn’t the first time we’ve done it: early versions of WebMynd last summer had a feature allowing offline access to your web history, and a feature which let you publish parts of your web history for the world to see as you browsed.

The data looked bad

We looked at how many unique users were posting to Twitter from our interface. For a start, this number was at most 10% of the total number of new signups that day. And while we did not break our retention stats down to that level of granularity, the numbers were low enough that we could see from the individual posts that users very rarely posted more than once from the interface. The low proportion of new users trying it could be explained by the feature, as it stood, only being applicable to the overlap between Twitter users and WebMynd users. But the terrible retention was more worrying.

The fact that the data looked bad after a couple of months of tracking it was a great big warning light. When thinking about the causes of the poor data, I asked myself the following questions:

Did the feature satisfy the intended use case?

While it allowed users to post a question to Twitter about their search query, we had no built-in way to collate the answers. We assumed that the user, if already a Twitter user, would have their own way to keep alerted on @replies or posts from people they followed, for example by using TwitterFox. So we didn’t want to replicate those features in WebMynd, and we had no automatic way to correlate replies to the question with the question itself.

Also, we observed that many of the users who did try out the feature appeared to be brand new Twitter users, judging by the fact that their post from WebMynd was their first post. These users did not have many, if any, followers, and probably had not yet figured out what set of habits and applications allowed them to keep track of replies.

So it seemed quite likely that the feature as it stood was not a complete enough solution.

Did we promote the benefits sufficiently?

We thought that putting the ‘Ask Twitter’ box onto our interface on Google and other major search engines would make it sufficiently high profile for a lot of our users to try out – our users saw that interface many times per day on average. But the data showed this wasn’t the case. We did not make a big splash on our website about the feature since we did not want to distract from our main use case with an experiment. We hoped that people would click on the ‘Ask Twitter’ link to open up the posting interface and so discover the feature and its usefulness.

It became clear to me that in order to really do a good test of the feature and use case, we would have to work much harder on promoting the benefits as well as making our implementation much more complete.

Could a different feature satisfy the use case better?

But that changed when we launched the WebMynd ‘Dock’ earlier this month. This allowed users to share links and post to Twitter as well as several other tools (Facebook, Delicious, Digg, Reddit, Hacker News) from a sidebar in the browser.

Within days of launching that feature, we had thousands of people actively using it, on average, 4 times per day. And while we had launched it for the purpose of sharing links as you browsed, it seemed so easy to also be able to make general posts to Twitter, including asking questions right from that interface.

For me that made the decision – we had another feature which satisfied the use case and more, where the data was showing great up-take by users. There seemed no sense in investing the effort to make the ‘Ask Twitter’ feature more complete and to promote that when the alternative feature was taking off so well.

Did the intended use case exist?

Ultimately it is possible that the ‘Ask Twitter’ feature was not well used simply because the use case didn’t exist. Maybe people don’t find the ability to easily ask their Twitter network questions related to their search useful. I don’t think the experiment we ran was sufficient to conclude that, but it obviously was not working in the form we had originally tried.

What now?

It’s been about a week since we removed the feature. We have yet to receive a comment or complaint, and there have been no repercussions in terms of the usage of other features. So I think it is fair to conclude that it was the right choice. We should of course ask the 5 whys in our next product roadmap iteration to see what we can learn from the feature and how we might improve our decision making process.

We now have two interfaces into Twitter: the search widget that WebMynd puts on the right-hand side of Google and other search engines when the user selects it, and the Dock, which allows users to post to Twitter and share links easily. I think those two features are just asking to be combined in new and interesting ways, and I very much doubt that the ‘Ask Twitter’ feature will be our last experiment in this area.

In general, I don’t think we should be afraid of removing features, because if we cannot do that then we will become more reluctant to try out new ideas in the future and run the risk of them not working out. WebMynd did not start out embedding search results on the right-hand side of search engines. But when, almost on a whim, we just stuck some web history results up there to see what would happen, we discovered the most popular feature in our product. It changed our direction as a company.


WebMynd helps you to find, and keep track of, information from sources you value most by personalizing the right-hand side of Google, Yahoo! and Live Search results pages. We’re launching a number of new features today – you can see a demo and download the latest version from webmynd.com.

Embed the sources you most value onto Google and other search engines

WebMynd gets you to the information you need faster by letting you search multiple sources at once, and without needing to change your usual search engine. It also helps you filter the mass of information that is presented to you on a search engine, by grouping the search results by source – you usually know which sources you trust the most to give you the information you want for a particular type of search.

WebMynd now handles over 350K searches per day, personalizing the right-hand side of search engines by aggregating results from sources such as YouTube, Twitter, Wikipedia and Flickr. Since November this feature has been available on the Google results page for users of our Firefox browser extension. It is now available on Yahoo! and Live Search as well, on both Firefox and Internet Explorer.

You can try it out before you install on our demo page.

Keep track of what you’ve found and share it with friends

Once you have searched for and found the information you wanted, the last thing you want to do is search for it again when you next need it. Or look for it amongst hundreds of open tabs. Or have to copy and paste a link into an email in order to share it with your friends and colleagues.

WebMynd has always allowed you to full-text search your history right on Google and browse your history as a film reel – like a DVR for the Web. And now WebMynd is launching the Dock – a sidebar that shows your recent history as a list and enables you to share the webpage you are on through email, Twitter, Facebook and other tools. This is the first of several social search and browsing features WebMynd will launch in the coming months.

Publishers, other startups and Mozilla recommend WebMynd as a top search tool

WebMynd is recommended by Mozilla and has built up a following of over 150K monthly users. Other startups and publishers value the opportunity to include their search results as a WebMynd source. A number of them, including Fluther, OneRiot and Hacker News, are distributing customized versions of WebMynd which show their results on the right-hand side by default as well as allowing all WebMynd users to add their content. WebMynd has custom versions of its search personalization available for major publishers such as Forbes, LA Times, CNN, Daylife and TechCrunch.

If you’re a publisher and think your users would value being able to access your branded content whenever they search, then please get in touch.


I don’t think Google want to use BigTable. I think Google have to use BigTable because of the absurd scale that they’re working at.

Unstructured databases (like Amazon’s SimpleDB and Google’s Data Store – built on BigTable) are great in that they are easy to scale, have an uncomplicated model of how data is stored in them and a simple approach to how that data is queried. These shortcuts and simplifications on the data storage and retrieval are key to the scalability of the databases. The databases have basically been reduced down to being huge distributed hash tables.

Unstructured databases: the downside

Unfortunately, the same simplifications that enable enormous scalability are very punitive in the restrictions they place on us as programmers. When we think about modeling our data, we think that a user HAS-MANY, a manager IS-A employee, a blog comment BELONGS-TO a blog post, and so on. Those relationships are just not representable in unstructured databases; you have to synthesise them yourself in software.
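To make “synthesise them yourself” concrete, here is a small sketch in which a plain dictionary stands in for a key-value store. The key naming scheme, the post_key field and the sample records are all invented for illustration; this is not how SimpleDB or Data Store are actually queried.

```python
# A plain dict stands in for a key-value store; keys and values are made up.
store = {
    "post:1": {"title": "Removing features"},
    "comment:10": {"post_key": "post:1", "body": "Good call."},
    "comment:11": {"post_key": "post:1", "body": "I'll miss it."},
    "comment:12": {"post_key": "post:2", "body": "Orphaned."},
}


def comments_for(store, post_key):
    """Emulate 'a comment BELONGS-TO a post' by filtering in application code."""
    return sorted(
        key for key, value in store.items()
        if key.startswith("comment:") and value.get("post_key") == post_key
    )


print(comments_for(store, "post:1"))
# ['comment:10', 'comment:11']
```

Notice that comment:12 points at a post that does not exist, and nothing in the store objects to it. A relational database would reject that row with a foreign key constraint; here, that check is also yours to write.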

These connections between objects are not merely the result of us all having been conditioned to “think in SQL” over the last 35 years. Rather, these are the real relationships between the actual objects we’re modelling; it was SQL that was designed so that it matched reality, not the other way round.

Your average small to medium startup company does not need to store the entire internet in a database, so unstructured databases burden us with unneeded, inconvenient over-simplifications. As Einstein said:

Things should be as simple as possible, but not simpler.

Unstructured databases: the upside

And yet there is a proto-trend towards using these unstructured databases. The reason is that they are much, much easier to offer “as a service”, and having someone else manage your database makes a lot of sense to startups. Database maintenance is a huge time sink, and small companies should strive to spend as much time as possible on differentiating features and as little time as possible on mundane admin tasks and overhead.

That SimpleDB and Data Store liberate you from the need to be a DBA is a big enough draw for a lot of people to live with the downsides that the unstructured product brings. You don’t need to configure backups, implement master / slave replication, tweak performance parameters, set up sharding – the list goes on and on. The surfeit of “how do I set up replication” questions on MySQL forums is a testament to how hard that is for the inexperienced.

Surely, then, the ideal situation would be a pay-as-you-go, managed database provider with true relational capabilities? WebMynd have been lucky enough to be using just such a database over the last year: FathomDB, which launches in private beta today.

FathomDB: relational and managed

To us, FathomDB just looks like a normal MySQL database. There’s no time wasted figuring out how to convert your data model into a denormalised form, and existing databases can be easily converted to run on FathomDB.

However, that normal-looking MySQL database is fully managed, so that we don’t have to worry about backups, monitoring, replication: the very same things that SimpleDB and Data Store relieve you from worrying about.

The time savings have been huge. On the backend, we’ve been able to spend our time working on new search technology without having to worry about database admin tasks at the same time. We’ve been able to focus on valuable features, relevant to our business and company, and been liberated from spending time playing the DBA.

Scaling

As we’ve grown over the last few months, FathomDB have used us as a proving tool for their scalability, and we are now inserting 5 million new rows every day; that’s roughly 60 rows per second on average, although we spike up to around 100 per second. For comparison, there are 2.8 million documents, total, in the English Wikipedia: being able to handle this scale should be more than sufficient for the vast majority of startups.
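The per-second figure is just the daily total divided by the number of seconds in a day. As a quick check of the arithmetic, using only the numbers quoted above:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

rows_per_day = 5000000
average_rate = rows_per_day / float(SECONDS_PER_DAY)

# What a full day at the quoted peak rate of 100 rows per second would add up to.
rows_at_peak_rate = 100 * SECONDS_PER_DAY

print(round(average_rate, 1))  # 57.9
print(rows_at_peak_rate)       # 8640000
```

So the average is a little under 58 rows per second, and the peak rate sustained for a whole day would come to about 8.6 million rows.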

FathomDB bills itself as “databases as a service”. The difference with them compared to databases like SimpleDB and Data Store is that it really is a database. All the features of RDBMSes that you know and want are available, with the added benefit of a pay-as-you-go pricing model and a fully managed service.


The seamless integration of doctest, Nose, Sphinx and MiniMock means that taking a little more time to write your Python doc strings can give you testable documentation, full of examples, in HTML or LaTeX markup, and main-line unit test coverage “for free”.

The bon mariage between these agile tools has worked so well for us that when it came to extending test coverage up to 100% using a full Nose test suite, we were really pining for the painless mock objects that MiniMock gives you.

MiniMock works by printing out your code’s actual usage of mock objects so that it can be compared with the expected usage you specify in the doc string. For example, this function reads a URL and writes it to a file-like object:

import urllib
def write_url(url, out_file):
    """
    Example::

        >>> from minimock import mock, Mock
        >>> mock('urllib.urlopen', returns=Mock('urlopen_result'))
        >>> write_url('http://webmynd.com', Mock('out_file'))      #doctest: +ELLIPSIS
        Called urllib.urlopen('http://webmynd.com')
        Called urlopen_result.read()
        Called out_file.write(None)
        <Mock ... out_file>
    """
    page_content = urllib.urlopen(url)
    out_file.write(page_content.read())
    return out_file

The supplied doctest shows a couple of different mocking methods, and also doctest’s invaluable ELLIPSIS option, which allows for fuzzy matching of the expected output.
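The effect of ELLIPSIS can be seen in isolation by calling doctest’s OutputChecker directly. In this small sketch the “got” string is an invented example of what a mock’s repr might look like; only the doctest calls themselves are real.

```python
import doctest

checker = doctest.OutputChecker()

want = "<Mock ... out_file>\n"       # expected output, with a wildcard
got = "<Mock 0x7f3a2c out_file>\n"   # made-up actual output

# Without ELLIPSIS the comparison is literal, so the strings differ.
exact_match = checker.check_output(want, got, 0)

# With ELLIPSIS, "..." in the expected text matches any substring.
fuzzy_match = checker.check_output(want, got, doctest.ELLIPSIS)

print(exact_match, fuzzy_match)  # False True
```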

When writing unit tests for this function, rather than a single simple doctest, there are two problems:

  1. there’s no convenient way to track the usage of MiniMock-ed objects
  2. the fuzzy matching tools in doctest aren’t particularly conveniently exposed for unit test usage

Tracking MiniMock usage

To track the usage of mocked objects, we subclass minimock.Printer to store the console output in a StringIO object, rather than printing it to sys.stdout:

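import doctest
from StringIO import StringIO  # Python 2; on Python 3 use io.StringIO
from minimock import Printer

class TraceTracker(Printer):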
    def __init__(self, *args, **kw):
        self.out = StringIO()
        super(TraceTracker, self).__init__(self.out, *args, **kw)
        self.checker = doctest.OutputChecker()
        self.options =  doctest.ELLIPSIS
        self.options |= doctest.NORMALIZE_WHITESPACE
        self.options |= doctest.REPORT_UDIFF

    def check(self, want):
        return self.checker.check_output(want, self.dump(),
            optionflags=self.options)

    def diff(self, want):
        return self.checker.output_difference(doctest.Example("", want),
            self.dump(), optionflags=self.options)

    def dump(self):
        return self.out.getvalue()

The check() method uses doctest’s OutputChecker to compare the observed and expected mock usage, while diff() returns a human-readable comparison of the observed and expected mock usage.

The basic idea is to store up the messages MiniMock would have printed in a convenient container, and provide some utilities to interrogate those messages.

Matching MiniMock usage

The TraceTracker class shown above already gives us all the functionality we need – all that is required is a convenient utility function:

def assert_same_trace(tracker, want):
    assert tracker.check(want), tracker.diff(want)

This function allows us to check the mock objects are being used as we expected, and prints out a human-readable diff of the expected and observed usage if applicable.
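To show what a failure looks like without needing MiniMock installed, here is a sketch for Python 3 that uses a hypothetical RecordedTrace class in place of TraceTracker. It records “Called ...” lines by hand rather than receiving them from mock objects, but the checking and diffing go through the same doctest calls.

```python
import doctest
from io import StringIO

OPTIONS = doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE | doctest.REPORT_UDIFF


class RecordedTrace(object):
    """Hypothetical stand-in for TraceTracker that does not need MiniMock."""

    def __init__(self):
        self.out = StringIO()
        self.checker = doctest.OutputChecker()

    def call(self, description):
        # Record one line in the same "Called ..." style MiniMock prints.
        self.out.write("Called %s\n" % description)

    def check(self, want):
        return self.checker.check_output(want, self.out.getvalue(), OPTIONS)

    def diff(self, want):
        return self.checker.output_difference(
            doctest.Example("", want), self.out.getvalue(), OPTIONS)


def assert_same_trace(tracker, want):
    assert tracker.check(want), tracker.diff(want)


trace = RecordedTrace()
trace.call("out_file.write('hello')")

try:
    # The expected trace deliberately disagrees with what was recorded.
    assert_same_trace(trace, "Called out_file.close()")
    failure_message = None
except AssertionError as error:
    failure_message = str(error)
```

The failure_message holds both the expected and the observed trace, which is what a test runner would display when the assertion fails.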

Usage Example

As a concrete example, I’ll convert the doctest for the write_url function to a Nose-style unit test:

from minimock import mock, Mock

def test_write_url():
    tt = TraceTracker()
    mock('urllib.urlopen', returns=Mock('urlopen_result', tracker=tt), tracker=tt)
    write_url('http://webmynd.com', Mock('out_file', tracker=tt))

    expected_output = """Called urllib.urlopen('http://webmynd.com')
Called urlopen_result.read()
Called out_file.write(None)"""
    assert_same_trace(tt, expected_output)

The definition of the expected MiniMock usage (called expected_output here) can feel a little clunky, but in our experience, these definitions are quite often common between test cases, so can be defined once and shared.

MiniMock is great for quickly faking out fairly complex external dependencies, with little, if any, compromise on the rigour of your tests. By adapting its usage for unit tests, as described here, you can have all that convenience and power in your more exhaustive test suites.

The code given above is available as MiniMockUnit on PyPI.


To add to the customized WebMynd that we created for Fluther we now have versions available for OneRiot and Hacker News as well. This means there are more ways for users to discover WebMynd and more examples of how WebMynd can help publishers make their content more useful for their users by giving their search results a persistent presence on Google and other search engines.

If you are a WebMynd user and have suggestions on what other search sources you would like to be included, or if you are a publisher interested in a customized version of WebMynd, then get in touch!


Starting with Sphinx version 0.5, you can now control and launch your documentation builds from within the warm fuzzy world of setuptools!

Run:

python setup.py --help-commands

inside your setuptools project, and if you see a build_sphinx target in the “Extra commands” section, you’re in luck.

The Sphinx build can be configured from your setup.cfg in the same directory. Here are the available options (taken from here):

fresh-env: Discard saved environment
all-files: Build all files
source-dir: Source directory
build-dir: Build directory
builder: The builder to use. Defaults to “html”

For reference, here’s the relevant part of setup.cfg from one of our projects:

[build_sphinx]
source-dir = docs/source
build-dir  = docs/build
all_files  = 1

Note the lack of quotes around the directories – I found that including quotes confused the command.
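A plausible explanation for the quoting problem is that ini-style parsers keep quote characters as part of the value. The sketch below demonstrates this with Python 3’s standard configparser; that the build_sphinx command reads its options in the same way is my assumption, not something I’ve verified in the Sphinx source.

```python
import configparser

# A setup.cfg fragment with one unquoted and one quoted directory.
SETUP_CFG = """
[build_sphinx]
source-dir = docs/source
build-dir  = "docs/build"
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)

unquoted = parser.get("build_sphinx", "source-dir")
quoted = parser.get("build_sphinx", "build-dir")

print(repr(unquoted))  # 'docs/source'
print(repr(quoted))    # '"docs/build"'
```

The quoted value comes back with its quote characters intact, so anything treating it as a path would be looking for a directory whose name literally starts with a double quote.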

For large bodies of code, configuration can become fragmented and messy extremely quickly unless you’re very careful; little features like this can really help centralise your configuration, and keep you sane. Pre-requisites, source/binary distributions, unit-tests, documentation and distribution to PyPI all configured through one tool? Yes please!