Archive Page 3
Take control of your searching with the WebMynd browser extension for Firefox and Internet Explorer
WebMynd helps you to find, and keep track of, information from sources you value most by personalizing the right-hand side of Google, Yahoo! and Live Search results pages. We’re launching a number of new features today – you can see a demo and download the latest version from webmynd.com.
Embed the sources you most value onto Google and other search engines
WebMynd gets you to the information you need faster by letting you search multiple sources at once, and without needing to change your usual search engine. It also helps you filter the mass of information that is presented to you on a search engine, by grouping the search results by source – you usually know which sources you trust the most to give you the information you want for a particular type of search.
WebMynd now handles over 350K searches per day, personalizing the right-hand side of search engines by aggregating results from sources such as YouTube, Twitter, Wikipedia and Flickr. Since November this feature has been available on the Google results page for users of our Firefox browser extension. It is now available on Yahoo! and Live Search as well, on both Firefox and Internet Explorer.
You can try it out before you install on our demo page.
Keep track of what you’ve found and share it with friends
Once you have searched for and found the information you wanted, the last thing you want to do is search for it again when you next need it. Or look for it amongst hundreds of open tabs. Or have to copy and paste a link into an email in order to share it with your friends and colleagues.
WebMynd has always allowed you to full-text search your history right on Google and browse your history as a film reel – like a DVR for the Web. And now WebMynd is launching the Dock – a sidebar that shows your recent history as a list and enables you to share the webpage you are on through email, Twitter, Facebook and other tools. This is the first of several social search and browsing features WebMynd will launch in the coming months.
Publishers, other startups and Mozilla recommend WebMynd as a top search tool
WebMynd is recommended by Mozilla and has built up a following of over 150K monthly users. Other startups and publishers value the opportunity to include their search results as a WebMynd source. A number of them, including Fluther, OneRiot and Hacker News, are distributing customized versions of WebMynd which show their results on the right-hand side by default as well as allowing all WebMynd users to add their content. WebMynd has custom versions of its search personalization available for major publishers such as Forbes, LA Times, CNN, Daylife and TechCrunch.
If you’re a publisher and think your users would value being able to access your branded content whenever they search, then please get in touch.
Filed under: Uncategorized | 7 Comments
Databases as a service: FathomDB
I don’t think Google want to use BigTable. I think Google have to use BigTable because of the absurd scale that they’re working at.
Unstructured databases (like Amazon’s SimpleDB and Google’s Data Store – built on BigTable) are great in that they are easy to scale, have an uncomplicated model of how data is stored in them and a simple approach to how that data is queried. These shortcuts and simplifications on the data storage and retrieval are key to the scalability of the databases. The databases have basically been reduced down to being huge distributed hash tables.
Unstructured databases: the downside
Unfortunately, the same simplifications that enable enormous scalability are very punitive in the restrictions they place on us as programmers. When we think about modeling our data, we think that a user HAS-MANY, a manager IS-A employee, a blog comment BELONGS-TO a blog post, and so on. Those relationships are just not representable in unstructured databases; you have to synthesise them yourself in software.
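To make that concrete, here is a minimal sketch that treats an unstructured store as nothing more than a Python dict keyed by ID. This is a stand-in for illustration, not the SimpleDB or Data Store API, and the record layout and the comments_for helper are made up. The BELONGS-TO link between a comment and a post is just a field the application has to interpret, and the lookup that SQL would express as a join becomes a hand-written scan.

```python
# A toy key-value "database": every record is an opaque dict stored under a key.
# The store itself knows nothing about how records relate to each other.
store = {
    "post:1": {"title": "Databases as a service"},
    "post:2": {"title": "Scaling on EC2"},
    "comment:1": {"post_key": "post:1", "body": "Nice write-up"},
    "comment:2": {"post_key": "post:2", "body": "What about SQS?"},
    "comment:3": {"post_key": "post:1", "body": "Which MySQL version?"},
}


def comments_for(post_key):
    """Synthesise 'post HAS-MANY comments' by scanning and filtering in software."""
    return [
        record["body"]
        for key, record in sorted(store.items())
        if key.startswith("comment:") and record.get("post_key") == post_key
    ]


bodies = comments_for("post:1")
print(bodies)
```

In a relational database the same question is a single query with a WHERE clause or a join, and the database can enforce that every comment points at a post that exists.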
These connections between objects are not merely the result of us all having been conditioned to “think in SQL” over the last 35 years. Rather, these are the real relationships between the actual objects we’re modelling; it was SQL that was designed so that it matched reality, not the other way round.
Your average small to medium startup company does not need to store the entire internet in a database, so unstructured databases burden us with unneeded, inconvenient over-simplifications. As Einstein said:
Things should be as simple as possible, but not simpler.
Unstructured databases: the upside
And yet there is a proto-trend towards using these unstructured databases. The reason is that they are much, much easier to offer “as a service”, and having someone else manage your database makes a lot of sense to startups. Database maintenance is a huge time sink, and small companies should strive to spend as much time as possible on differentiating features and as little time as possible on mundane admin tasks and overhead.
That SimpleDB and Data Store liberate you from the need to be a DBA is a big enough draw for a lot of people to live with the downsides that the unstructured product brings. You don’t need to configure backups, implement master / slave replication, tweak performance parameters, set up sharding – the list goes on and on. The surfeit of “how do I set up replication” questions on MySQL forums is a testament to how hard that is for the inexperienced.
Surely, then, the ideal situation would be a pay-as-you-go, managed database provider with true relational capabilities? WebMynd have been lucky enough to be using just such a database over the last year: FathomDB, which launches in private beta today.
FathomDB: relational and managed
To us, FathomDB just looks like a normal MySQL database. There’s no time wasted figuring out how to convert your data model into a denormalised form, and existing databases can be easily converted to run on FathomDB.
However, that normal-looking MySQL database is fully managed, so that we don’t have to worry about backups, monitoring, replication: the very same things that SimpleDB and Data Store relieve you from worrying about.
The time savings have been huge. On the backend, we’ve been able to spend our time working on new search technology without having to worry about database admin tasks at the same time. We’ve been able to focus on valuable features, relevant to our business and company, and been liberated from spending time playing the DBA.
Scaling
As we’ve grown over the last few months, FathomDB have used us as a proving tool for their scalability, and we are now inserting 5 million new rows every day; that’s about 60 rows per second on average, although we spike up to around 100 per second. For comparison, there are 2.8 million documents, total, in the English Wikipedia: being able to handle this scale should be more than sufficient for the vast majority of startups.
FathomDB bills itself as “databases as a service”. The difference between them and databases like SimpleDB and Data Store is that it really is a database. All the features of RDBMSes that you know and want are available, with the added benefit of a pay-as-you-go pricing model and a fully managed service.
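The per-second figure follows from simple arithmetic on the daily total. A quick check, using only the 5 million rows per day quoted above, lands just under 58, which rounds to the figure of 60 given in the text:

```python
ROWS_PER_DAY = 5_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Average insert rate if the rows were spread evenly across the day.
average_rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY
print(round(average_rows_per_second, 1))  # prints 57.9
```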
Filed under: technical | 11 Comments
The seamless integration of doctest, Nose, Sphinx and MiniMock means that taking a little more time to write your Python doc strings can give you testable documentation, full of examples, in HTML or LaTeX markup, and main-line unit test coverage “for free”.
The bon mariage between these agile tools has worked so well for us that when it came to extending test coverage up to 100% using a full Nose test suite, we were really pining for the painless mock objects that MiniMock gives you.
MiniMock works by printing out your code’s actual usage of mock objects so that it can be compared with the expected usage you specify in the doc string. For example, this function reads a URL and writes it to a file-like object:
import urllib

def write_url(url, out_file):
    """
    Example::

        >>> from minimock import mock, Mock
        >>> mock('urllib.urlopen', returns=Mock('urlopen_result'))
        >>> write_url('http://webmynd.com', Mock('out_file')) #doctest: +ELLIPSIS
        Called urllib.urlopen('http://webmynd.com')
        Called urlopen_result.read()
        Called out_file.write(None)
        <Mock ... out_file>
    """
    page_content = urllib.urlopen(url)
    out_file.write(page_content.read())
    return out_file
The supplied doctest shows a couple of different mocking methods, and also doctest’s invaluable ELLIPSIS option, which allows for fuzzy matching of the expected output.
When writing unit tests for this method, rather than a single simple doctest, there are two problems:
- there’s no convenient way to track the usage of MiniMock-ed objects
- the fuzzy matching tools in doctest aren’t particularly conveniently exposed for unit test usage
Tracking MiniMock usage
To track the usage of mocked objects, we subclass minimock.Printer to store the console output in a StringIO object, rather than printing it to sys.stdout:
# Imports assumed by the class below (Python 2, like the rest of this post)
import doctest
from StringIO import StringIO
from minimock import Printer

class TraceTracker(Printer):
    def __init__(self, *args, **kw):
        # Collect MiniMock's messages in memory instead of printing them
        self.out = StringIO()
        super(TraceTracker, self).__init__(self.out, *args, **kw)
        self.checker = doctest.OutputChecker()
        self.options = doctest.ELLIPSIS
        self.options |= doctest.NORMALIZE_WHITESPACE
        self.options |= doctest.REPORT_UDIFF

    def check(self, want):
        return self.checker.check_output(want, self.dump(),
            optionflags=self.options)

    def diff(self, want):
        return self.checker.output_difference(doctest.Example("", want),
            self.dump(), optionflags=self.options)

    def dump(self):
        return self.out.getvalue()
The check() method uses doctest’s OutputChecker to compare the observed and expected mock usage, while diff() returns a human-readable comparison of the observed and expected mock usage.
The basic idea is to store up the messages MiniMock would have printed in a convenient container, and provide some utilities to interrogate those messages.
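The doctest half of that idea can be tried on its own, without MiniMock installed. The sketch below uses only the standard library's doctest.OutputChecker; the observed text is typed in by hand to stand in for what a tracker would have collected, and the variable names are made up for illustration.

```python
import doctest

checker = doctest.OutputChecker()
flags = doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE

# Text a mock might have recorded (written by hand for this sketch).
observed = (
    "Called urllib.urlopen('http://webmynd.com')\n"
    "Called urlopen_result.read()\n"
)

# With ELLIPSIS set, '...' in the expected text matches any substring.
loose_want = "Called urllib.urlopen(...)\nCalled urlopen_result.read()\n"
wrong_want = "Called urllib.urlopen(...)\nCalled urlopen_result.close()\n"

matches = checker.check_output(loose_want, observed, flags)
mismatch = checker.check_output(wrong_want, observed, flags)

# output_difference() takes a doctest.Example carrying the expected text
# and returns a human-readable report as a string.
report = checker.output_difference(doctest.Example("", wrong_want), observed, flags)

print(matches, mismatch)
print(report)
```

Because NORMALIZE_WHITESPACE is set, differences in indentation or line breaks between the expected and observed text do not cause a failure; only the sequence of calls matters.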
Matching MiniMock usage
The TraceTracker class shown above already gives us all the functionality we need – all that is required is a convenient utility function:
def assert_same_trace(tracker, want):
    assert tracker.check(want), tracker.diff(want)
This function allows us to check the mock objects are being used as we expected, and prints out a human-readable diff of the expected and observed usage if applicable.
Usage Example
As a concrete example, I’ll convert the doctest for the write_url function to a Nose-style unit test:
def test_write_url():
    tt = TraceTracker()
    mock('urllib.urlopen', returns=Mock('urlopen_result', tracker=tt), tracker=tt)
    write_url('http://webmynd.com', Mock('out_file', tracker=tt))
    expected_output = """Called urllib.urlopen('http://webmynd.com')
Called urlopen_result.read()
Called out_file.write(None)"""
    assert_same_trace(tt, expected_output)
The definition of the expected MiniMock usage (called expected_output here) can feel a little clunky, but in our experience, these definitions are quite often common between test cases, so can be defined once and shared.
MiniMock is great for quickly faking out fairly complex external dependencies, with little, if any, compromise on the rigour of your tests. By adapting its usage for unit tests, as described here, you can have all that convenience and power in your more exhaustive test suites.
The code given above is available as MiniMockUnit on PyPI.
Filed under: technical | Closed
Tags: agile, minimock, python, testing, unit test
To add to the customized WebMynd that we created for Fluther we now have versions available for OneRiot and Hacker News as well. This means there are more ways for users to discover WebMynd and more examples of how WebMynd can help publishers make their content more useful for their users by giving their search results a persistent presence on Google and other search engines.
If you are a WebMynd user and have suggestions on what other search sources you would like to be included, or if you are a publisher interested in a customized version of WebMynd, then get in touch!
Filed under: Uncategorized | Closed
Starting with Sphinx version 0.5, you can now control and launch your documentation builds from within the warm fuzzy world of setuptools!
Run:
python setup.py --help-commands
inside your setuptools project, and if you see a build_sphinx target in the “Extra commands” section, you’re in luck.
The Sphinx build can be configured from your setup.cfg in the same directory. Here are the available options (taken from here):
fresh-env: Discard saved environment
all-files: Build all files
source-dir: Source directory
build-dir: Build directory
builder: The builder to use. Defaults to “html”
For reference, here’s the relevant part of setup.cfg from one of our projects:
[build_sphinx]
source-dir = docs/source
build-dir = docs/build
all_files = 1
Note the lack of quotes around the directories – I found that including quotes confused the command.
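A likely explanation for the quoting problem, assuming setup.cfg is read with the standard library's INI parser (as distutils does), is that the parser does not strip quotes: they simply become part of the value. The sketch below uses Python 3's configparser on an in-memory copy of the file, with the second path quoted on purpose to show the difference.

```python
import configparser

# In-memory stand-in for setup.cfg; build-dir is deliberately quoted.
SETUP_CFG = """
[build_sphinx]
source-dir = docs/source
build-dir = "docs/build"
all_files = 1
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)
section = parser["build_sphinx"]

unquoted = section["source-dir"]
quoted = section["build-dir"]

print(repr(unquoted))  # the bare path
print(repr(quoted))    # the path with literal double quotes around it
```

A build directory whose name starts and ends with a double-quote character is almost certainly not what was intended, which is consistent with the command getting confused.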
For large bodies of code, configuration can become fragmented and messy extremely quickly unless you’re very careful; little features like this can really help centralise your configuration, and keep you sane. Pre-requisites, source/binary distributions, unit-tests, documentation and distribution to PyPI all configured through one tool? Yes please!
Filed under: technical | Closed
Tags: distutils, documentation, python, setuptools, sphinx
We use ConfigObj configuration files pretty extensively at WebMynd; it would be nice to use the ConfigParser module available in Python’s standard library, but the extra features ConfigObj has, such as lists, multi-line strings and nested sections, make it hard to say no to the richer library…
Unfortunately, TextMate doesn’t come with support for ConfigObj syntax, but the editor’s excellent Bundle Editor allowed me to fix that pretty easily.
Here is an example ConfigObj file as I see it in TextMate, with two different “Font & Color” schemes:
TextMate language definitions use regular expressions to categorise text in a file (into keywords, constants, variables and so on). The regexes I’ve put together for this ConfigObj bundle are somewhat fragile – if you try to break it you probably will.
However, it should be good enough for the majority of configuration in the majority of files. As an added bonus, ConfigObj syntax is a superset of INI syntax, so you get the full poly-chromatic experience in .cfg and .ini files alike!
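To give a flavour of the regex-based approach, here is a small Python sketch that sorts the lines of a config file into categories. The patterns are illustrative guesses written for this example, not the ones shipped in the bundle, and they are just as easy to break.

```python
import re

# Illustrative patterns only; checked in order, first match wins.
PATTERNS = [
    ("comment", re.compile(r"^\s*#.*$")),
    # One or more brackets, so ConfigObj's nested [[sections]] match too.
    ("section", re.compile(r"^\s*\[+\s*[^\]]+?\s*\]+\s*$")),
    ("assignment", re.compile(r"^\s*[^=\s][^=]*=.*$")),
]


def categorise(line):
    """Return the name of the first pattern that matches the line."""
    for name, pattern in PATTERNS:
        if pattern.match(line):
            return name
    return "other"


sample = [
    "# connection settings",
    "[server]",
    "port = 8080",
    "[[logging]]",
    "",
]
categories = [categorise(line) for line in sample]
print(categories)
```

An editor grammar works along the same lines, except that each category is mapped to a scope name which the colour scheme then styles.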
If you’re a TextMate user, download this file, unzip it and double-click on ConfigObj.tmbundle.
Filed under: technical | Closed
Tags: configobj, ini, textmate
Today we launched our first customized search enhancer in collaboration with Fluther. It offers all the usual WebMynd features but is especially applicable to Fluther users and to those who want to tap into their networks’ and others’ knowledge through their question and answer service. Fluther are distributing the extension on their own homepage and have written about it here.
This is the first example of others taking advantage of WebMynd’s personalized search interface on Google (and other search engines shortly) to better deliver their service to their users.
If you would like WebMynd to include your content or service, please get in touch.
Filed under: Firefox Extension, News | 1 Comment
WebMynd personalizes your search with the information sources you most value in the places that you expect. For the moment that means we embed search results from sources such as Twitter, Amazon, YouTube, Flickr, Wikipedia, your web history, your top sites and others on the right hand side of Google and let you configure it.
But we know there are many other sources of information that you may use that we do not yet include. We’d love to hear from you with suggestions on what other sources you would find useful, or from site-owners who have unique content that we should include for our users. Just email us anytime at founders@webmynd.com with your suggestions.
We’ve also just released an update of WebMynd with more configuration options, a better UI for changing record modes and much better performance. So if you already use WebMynd be sure to download this latest version.
Filed under: Uncategorized | 3 Comments
We have just released a version of WebMynd which takes us beyond visual web history, which we described as a ‘DVR for the web’ when we launched in January. You can download and try out the update now.
It includes a completely re-designed Google interface with aggregation of many search tools such as Flickr, Wikipedia, Twitter search, LinkedIn and many more. Many of the sources offer results that are not usually surfaced by Google. If you can’t find what you’re looking for by searching, WebMynd lets you post to Twitter to ask for help from your network right from the search results page. As well as aggregating different search tools, WebMynd uses your web history to improve your search by showing you results from ‘Your Top Sites’, namely the sites that you most frequently visit – this is powered by Yahoo! BOSS.
We’d love to hear what you think and get your suggestions on other search tools to include.
WebMynd currently supports Firefox 3 on Windows, MacOS and Linux.
Filed under: Uncategorized | 1 Comment
Tags: webmynd twitter release firefox extension google search
Scaling on EC2
Like any application developed for a platform, the success of a Firefox Add-on is closely tied to the popularity and distribution you get from the underlying delivery mechanism. So, when we honed down the WebMynd feature set, improving the product enough to get on Mozilla’s Recommended List, we were delighted by our increasing user numbers. A couple of weeks later, Firefox 3 was released, and we got a usage graph like this:
With a product like WebMynd, where part of the service we provide is to save and index a person’s web history, this sort of explosive expansion brings with it some growing pains. Performance was a constant battle with us, even with the relatively low user numbers of the first few months. This was due mainly to some poor technology choices; thankfully, the underlying architecture we chose from the start has proven to be sound.
I would not say that we have completely solved the difficult problem in front of us – we are still not content with the responsiveness of our service, and we’re open about the brown-outs we still sometimes experience – but we have made huge progress and learned some invaluable lessons over the last few months.
What follows is a high level overview of some of the conclusions we’ve arrived at today, best practices that work for us and some things to avoid. In later weeks, I plan to follow up with deeper dives into certain parts of our infrastructure as and when I get a chance!
Scaling is all about removing bottlenecks
This sounds obvious, but should strongly influence all your technology and architecture decisions.
Being able to remove bottlenecks means you need to be able to swap out discrete parts which aren’t performing well enough, and swap in bigger, faster, better parts which will perform as required. This will move the bottleneck somewhere else, at which point you need to swap out discrete parts which aren’t performing well enough, and swap in bigger, faster, better parts… well you get the idea. This cycle can be repeated ad infinitum until you’ve optimised the heck out of everything and you’re just throwing machines at the problem.
At WebMynd, for our search backend, we’ve done this four or five times already in the five months we’ve been alive, and I think I still have some iterations left in me. Importantly, I wouldn’t say that any of these iterations were a mistake. In a parallel to the Y Combinator ethos of launching a product early, scaling should be an iterative process with as tight a feedback loop as possible. Premature optimisation of any part of the service is a waste of time and is often harmful.
Scaling relies on having discrete pieces with clean interfaces, which can be iteratively improved.
Horizontal is better than vertical
One of the reasons Google triumphed in the search engine wars was that their core technology was designed from the ground up to scale horizontally across cheap hardware. Compare this with their competitors’ approach, which was in general to scale vertically – using larger and larger monolithic machines glued together organically. Other search engines relied on improving hardware to cope with demand, but when the growth of the internet outstripped available hardware, they had nowhere to go. Google was using inferior pieces of hardware, but had an architecture and infrastructure allowing for cheap and virtually limitless scaling.
Google’s key breakthroughs were the Google File System and MapReduce, which together allow them to horizontally partition the problem of indexing the web. If you can architect your product in such a way as to allow for similar partitioning, scaling will be all the easier. It’s interesting to note that some of the current trends of Web2.0 products are extremely hard to horizontally partition, due to the hyper-connectedness of the user graph (witness Twitter).
The problem WebMynd is tackling is embarrassingly partitionable. Users have their individual slice of web history, and these slices can be moved around the available hardware at will. New users equals new servers.
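As a rough sketch of what that kind of partitioning can look like, the class below keeps a directory that maps each user to the server holding their slice. Everything here is hypothetical (the class name, the server naming scheme and the fixed per-server capacity are assumptions for illustration, not a description of WebMynd's actual code), but it shows why independent slices are easy to place and to move.

```python
class ShardDirectory:
    """Maps each user to the server holding their slice of web history."""

    def __init__(self, capacity_per_server):
        self.capacity = capacity_per_server
        self.servers = []     # server names, in creation order
        self.assignment = {}  # user_id -> server name

    def _load(self, server):
        return sum(1 for name in self.assignment.values() if name == server)

    def add_user(self, user_id):
        # Reuse the first server with spare room; otherwise "launch" a new one.
        for server in self.servers:
            if self._load(server) < self.capacity:
                break
        else:
            server = "search-%d" % (len(self.servers) + 1)
            self.servers.append(server)
        self.assignment[user_id] = server
        return server

    def move_user(self, user_id, server):
        # Slices are independent, so relocating one touches no other user.
        if server not in self.servers:
            raise ValueError("unknown server: %s" % server)
        self.assignment[user_id] = server


directory = ShardDirectory(capacity_per_server=2)
placements = [directory.add_user(user) for user in ["alice", "bob", "carol"]]
print(placements)
```

With a capacity of two, the third user does not fit on the first server, so a second one is created for them. A product built around a densely connected user graph has no such clean cut to make.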
Hardware is the lowest common denominator
By running your application on virtual machines using EC2, you are viewing the hardware you’re running on as a commodity which can be swapped in and out at the click of a button. This is a useful mental model to have, where the actual machine images you’re running on are just another component in your architecture which can be scaled up or down as demand requires. Obviously, if you’re planning on scaling horizontally, you need to be building on a substrate which has low marginal cost for creating and destroying hardware – marginal cost in terms of time, effort and capex.
A real example
To put the above assertions into context, I’ll use WebMynd’s current architecture:
The rectangles represent EC2 instances. Their colour represents their function. The red arrow in the top right represents incoming traffic. Other arrows represent connectedness and flows of information.
This is a simplified example, but here’s what the pieces do in general terms:
- All traffic is currently load balanced by a single HAProxy instance
- All static content is served from a single nginx instance (with a hot failover ready)
- Sessions are distributed fairly across lots of TurboGears application servers, on several machines
- The database is a remote MySQL instance
- Search engine updates are handled asynchronously through a queue
- Search engine queries are handled synchronously over a direct TurboGears / Solr connection (not shown)
One shouldn’t be timid in trying new things to find the best solution; almost all of these parts have been iterated on like crazy. For example, we’ve used Apache with mod_python, Apache with mod_proxy, Apache with mod_wsgi. We’ve used TurboLucene, looked very hard at Xapian, and tried various configurations of Solr.
For the queue, I’ve written my own queuing middleware, I’ve used ActiveMQ running on an EC2 instance and I’m now in the process of moving to Amazon’s SQS. We chose SQS because, although ActiveMQ is free as in beer and speech, it has an ongoing operations cost in terms of time, which is one thing you’re always short of during hyper-growth.
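Swapping queue implementations like this is only painless if the rest of the system talks to the queue through a narrow interface. The sketch below is a hypothetical illustration of that idea, not our production code: the class and function names are invented, and the in-memory backend stands in for a real broker. The web tier and the indexer only ever call send() and receive(), so a backend wrapping ActiveMQ or SQS could be dropped in without touching either of them.

```python
from collections import deque


class InMemoryQueue:
    """Stand-in backend; a broker-backed class would expose the same two methods."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None when the queue is empty instead of blocking.
        return self._messages.popleft() if self._messages else None


def record_page_visit(update_queue, user_id, url):
    """Web tier: hand the work to the queue and return immediately."""
    update_queue.send((user_id, url))


def run_indexer(update_queue, index):
    """Search tier: drain the queue and apply each update to the index."""
    processed = 0
    while True:
        message = update_queue.receive()
        if message is None:
            return processed
        user_id, url = message
        index.setdefault(user_id, []).append(url)
        processed += 1


update_queue = InMemoryQueue()
record_page_visit(update_queue, "alice", "http://webmynd.com")
record_page_visit(update_queue, "alice", "http://example.com")

index = {}
count = run_indexer(update_queue, index)
print(count, index)
```

The dict used as the index here is a placeholder for the search engine; the point is only that updates arrive asynchronously, in order, through an interface small enough to reimplement.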
The two parts which are growing the fastest are the web tier (the TurboGears servers) and the search tier (the Solr servers). However, as we can iterate on our implementations and rapidly horizontally scale on both of those parts, that growth has been containable, if not completely pain free.
Amazon’s Web Services give growing companies the ideal building blocks to scale and keep up with demand. By iteratively improving the independent components in our architecture, we have grown to meet the substantial challenge of providing the WebMynd service to our users.
Filed under: Website, technical | 23 Comments
Tags: ec2, haproxy, nginx, performance, python, scaling, solr, turbogears, webmynd

