Interview with Brave Search

May 6, 2022
#search #interview

This is an interview with Josep M. Pujol, the Chief of Search at Brave.

DKB: What’s the origin story of Brave Search?

Brave was always interested in search. After all, browser and search go together – one is the car, and the other is the steering wheel. The vast majority of people conflate the two.

Search is extremely complex and costly to develop, so Brave did not attack the problem at first. The option, chosen by many other search engines, of using Google or Bing APIs and applying their own branding with some ranking alterations was not appealing to Brave, because it does not solve the problem of search; it just sweeps the dirt under the carpet.

In parallel, there was a German company called Cliqz, which was building search within a browser. They had been building a search engine from scratch since 2014. Brave and Cliqz, although competitors, always had a good relationship because of a shared vision: to protect privacy and offer real alternatives to Big Tech.

When Cliqz’s funding ran out in 2020, the search part was spun off as Tailcat, and shortly afterward it joined Brave, in March 2021.

DKB: What makes Brave Search different from Google and other search engines?

Brave Search is independent and privacy-preserving. There is no tracking or profiling of users. The Brave Search index relies heavily on anonymous contributions from users, who can opt into the Web Discovery Project (WDP) in the Brave browser.

The WDP helps with ranking, but the biggest contribution to ranking is the index itself, which is intentionally smaller than that of Google or Bing. We do not aim to index the whole Web, only the part of the Web that is worth indexing.

The biggest problem for search engines is noise reduction. As with other complex machine-learning systems, they suffer from the garbage-in, garbage-out problem.

Google puts all its effort into trying to minimize the garbage-out by using very sophisticated models trained with data. That’s a very good approach, but it’s something we cannot replicate, not only because privacy-sensitive data is a no-go for Brave, but also because it requires a lot of resources.

Brave’s approach to reducing the garbage-out is to be careful about what is ingested, which makes the algorithmic side of recall and ranking less resource-intensive.

Of course, this has a caveat: Brave Search is not yet as good as Google at answering long-tail queries. At some point we will be, but as of today, that is the trade-off we took.

The difference from other search engines such as DuckDuckGo, You, Neeva, Kagi, Startpage, Presearch, etc. is pretty straightforward: Brave Search can operate standalone; the rest cannot, as they rely on Google or Bing.

Most search engines are not independent search engines, and while they may provide some value, they are qualitatively different from what Brave Search is doing. Independence is not something directly actionable, but it’s a fundamental property.

Independence means that Brave Search would continue to work even if Google and Microsoft opposed it. Independence means choice and diversity: if results are drawn from the same provider, they are inherently limited by that provider.

Independence is freedom to do as we see fit and to own our mistakes. If we were to censor Russia Today or CNBC, which we wouldn’t, it would be our choice, not our provider’s decision.

DKB: Many people seem to believe that Google’s results are deteriorating in quality and filled with SEO-optimized content. What are you doing differently when it comes to quality and ranking to avoid this?

We assess quality in two ways: one is the set of anonymous (not anonymized, an important difference) signals provided by people who opted into the Web Discovery Project. The second is human assessment, by a small team that is growing as we speak.

Both, at different scales, evaluate results (to tune ranking) and pages (to decide what gets indexed). The goal is not to rate individual pages or results, but to annotate data that is later used by ML. There is no silver bullet here.

Regarding the problem of deteriorating quality due to spam content, we try to keep it at bay by making sure we index only the relevant subset of the Web. Our index is much smaller than the Web; as a matter of fact, it is much smaller than what we could crawl. We index less than 10% of the URLs we are aware of.

We make a conscious decision to index pages only where we have some evidence of usefulness. A small index helps to reduce the garbage-in/garbage-out problem, and it’s also more economical to run (though still massively expensive).

One way to see it is that we are sacrificing the number of candidate pages to be ranked in return for less noisy results. This is a sensible trade-off, because content also follows the typical Pareto distribution, with 80% of the signal in 20% of the pages.
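To make the idea concrete, here is a toy sketch of an ingestion-time quality gate. Everything in it is hypothetical: the signal names, thresholds, and helper types are illustrative assumptions, not Brave’s actual pipeline.

```python
# Toy sketch of an ingestion-time quality gate: admit a page to the index
# only when there is some evidence of usefulness. All names and thresholds
# here are hypothetical, invented for illustration.
from dataclasses import dataclass


@dataclass
class CandidatePage:
    url: str
    wdp_visits: int      # anonymous visit signals from opted-in users (assumed)
    inbound_links: int   # links from pages already in the index (assumed)


def should_index(page: CandidatePage, min_visits: int = 1, min_links: int = 2) -> bool:
    """Index a page only if there is some evidence of usefulness."""
    return page.wdp_visits >= min_visits or page.inbound_links >= min_links


pages = [
    CandidatePage("https://example.com/useful-guide", wdp_visits=12, inbound_links=5),
    CandidatePage("https://example.com/auto-generated-page", wdp_visits=0, inbound_links=0),
]
print([p.url for p in pages if should_index(p)])
# -> ['https://example.com/useful-guide']
```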

In any case, the problem of spam, SEO, etc. — let’s call it adversarial content — is indeed a big issue. We might be suffering less than Google by design, but we are not free of it.

Search today is more difficult than it was 6 years ago; infrastructure-wise, we are all better off, but the number of pages keeps increasing and the majority of them are of questionable quality.

In addition to that, not all content is accessible, and to make things worse, automated content creation is rampant.

I’m not saying “it’s all spam” like the cartoon in your article (pretty good one, by the way!), but Google has a fair point: the level of "spam" is staggering. That said, they brought that upon themselves and upon us.

The barrage of crap we see on the Web is a direct consequence of the current economic model, where Google sucks all the money from the space, leaving only spoils to content creators. If there is no good money to be made on a single article, people will be forced to make up for it with 100 worse ones, exacerbating the problem.

This is true of sponsored content, but sadly also of journalistic content. By acting as a toll booth, Google (and other Big Tech) is killing the Web by pauperizing content providers and co-opting users.

A good example of this is instant answers: “weather tomorrow,” “how to fold a shirt,” “u2 bono height”. All those queries should send traffic to the sites, which could then be monetized. By providing an instant answer, Google deprives them of that traffic.

As a matter of fact, Brave also has those instant answers; users demand them! What to do? The only option is to change the monetization model of the Web, as the current one is clearly unfair and favors only opportunistic and predatory behaviors. This change is not going to come from the incumbents.

DKB: Does Brave Search use its own index, or is it relying on Google/Bing for some queries?

Yes, Brave Search has its own index, of more than 10 billion pages as of today.

Despite having our own index, we rely on Bing and Google in cases where we detect that our results are not good or complete. On the server side, we rely on the Bing API, and on the client side (opt-in, Brave browser only), we rely on Google. The final mix of results, which we call the “independence score,” is 92 as of today, which means that 92% of results come from our index and 8% from 3rd parties (more about our independence score here).

In June 2021, we started at 87, dropped to 85 the following month due to traffic growth and a large fraction of non-English-speaking users, but crawled back to 90 after a couple of months. Today we are at 92, and it’s sensible to think we will reach 95 within the next 18 months. We are aiming for 100 eventually.

Note that the “independence score” is a knob that we can tune; it does not mean that Brave Search can only answer 92% of queries. It can answer 99.9% of them, but we (really, our ML models) decide that results could be better, so we pull from 3rd parties. This mixing is needed; otherwise, users could face consistently bad results, which would lead to churn. Search is a utility, and it has to work all the time, for all types of queries.
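As a rough illustration of the mixing described here, consider the following sketch. The confidence model, threshold, and blending strategy are assumptions made for the example, not Brave’s implementation.

```python
# Minimal sketch of 3rd-party mixing: serve results from the own index, and
# blend in external results only when a model predicts they could be better.
# The confidence score, threshold, and blending below are all hypothetical.

def query_own_index(query: str) -> tuple[list[str], float]:
    """Results from the independent index plus an estimated confidence."""
    return ([f"own result for {query!r}"], 0.6)  # stubbed for the example


def query_third_party(query: str) -> list[str]:
    """Fallback results from a 3rd-party API (Bing server-side, per the text)."""
    return [f"3rd-party result for {query!r}"]  # stubbed for the example


def search(query: str, threshold: float = 0.7) -> list[str]:
    results, confidence = query_own_index(query)
    if confidence < threshold:
        # The threshold acts as the "knob": raising it pulls in more
        # 3rd-party results; lowering it raises the independence score.
        results += query_third_party(query)
    return results


print(search("obscure long-tail query"))
```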

That’s why search engines like Mojeek and Alexandria, which have their own index and do not rely on 3rd parties whatsoever, have, and probably will continue to have, very little traffic. It’s a chicken-and-egg problem that can be solved only by 3rd-party mixing, which helps bootstrap a user base and, in time, improves the owned-and-operated search engine toward 100% independence.

As far as we know, Brave Search is the only search engine doing this sort of mixing. Some others claim to do so (DuckDuckGo, Presearch, Startpage, Neeva, Swisscows, You, Kagi; the list is very long and getting longer every week), but they are not very transparent about how they operate.

It does not take much effort to see that their reliance on 3rd parties is absolute, despite claims to the contrary. The real test is simple: if Bing or Google were to cancel your API access, would you be able to continue operating? If any of the aforementioned say yes, they are lying; they would not be able to.

Brave Search is not yet 100% independent either, but it can be, and it would continue to work even if all other search engines disappeared. Results would suffer a bit for that 8% mentioned above, but people would still be able to search.

That’s the independence we aim for, and as far as we know, only the big players (Google, Bing), regional champions (Yandex, Baidu, Seznam), and self-sufficient search engines (Mojeek) have it.

DKB: Do you think ads as a business model is ruining the web?

First and foremost, online advertising as implemented today is a privacy nightmare. This is a fact that should require no further discussion. It should disappear sooner rather than later; that is what the Brave browser is trying to do with its privacy-preserving ad system, which also rewards users who opt in.

The model, however, is still valid. If privacy is preserved, ads just become a way of sponsoring: one either pays for the service with their attention (ads) or with money (subscription). Those two models are not mutually exclusive in any way.

The whole argument of “if you are not paying for the product, then you are the product” becomes problematic when you pay with your privacy, because privacy is a dangerous currency. It might be cheap now but can become very expensive in the future.

Attention, however, is time-bounded and well understood by whoever is paying with it. Ads, of course, will always be a nuisance, and there is a tendency to show more and more of them. The solution to that is competition.

The fact that Google increased the amount of ads beyond reasonable limits under Sridhar’s watch (now CEO of Neeva, previously head of Google’s ads business) is caused by Google’s greed, of course, but also by the lack of choices (real choices, not theoretical ones; Google is not one click away).

DKB: What’s your take on the whole “appending reddit to search queries” situation and what it means?

Appending Reddit to every query would not work for the majority of searches; however, it works very well for certain types of queries. There is a tendency, myself included, to underestimate the large number of use cases a search engine covers, and it has to cover them all.

People expect search engines to be a one-stop shop for all their needs – the steering wheel of the car, following the previous analogy. So, no, Reddit would not be a good search engine based solely on its own content.

It would cover certain types of queries, probably better than Google, as you pointed out, but search is not just exploratory; in fact, the majority of queries have no exploratory intent. Reddit could be a vertical search engine, but a general-purpose search engine is not an aggregation of N verticals.

DKB: You recently launched Discussions, which allows users to get results from discussion forums like Reddit, Stack Exchange, and even smaller niche sites.

Can you talk more about that? Are you indexing every forum or only high quality / “trustworthy” forums?

We do not have, nor want to have, a taxonomy of good/bad forums. That is one step away from editorializing, and not a path we want to walk down.

The Brave Search crawler and the subsequent indexing are governed mostly by the Web Discovery Project (WDP), so we are basically discovering and indexing based on popularity. Forums that no WDP user ever visits will not be part of our index.

We do not crawl, or to be more precise, we do not crawl aggressively. As previously mentioned, we want to minimize the garbage-in/garbage-out problem, so indexing the whole Web is not the goal; we prefer to limit it to the “relevant” Web.

Also, please consider that Discussions just launched; there are still many issues that can affect results. To give one example, at launch no forum powered by Discourse was classified as a discussion, so they never showed up. That was just a bug, which we have already fixed: Discussions can now detect and surface forums powered by Discourse.

DKB: There’s also an issue where a lot of discussions are happening in walled gardens / private communities like Facebook groups and Discord chats. Are you thinking about tackling these?

We cannot tackle those without the permission and collaboration of such walled gardens. Of course, we would be happy to explore the possibilities should they arise.

Even if we had access, there would be some complications; for instance, surfacing Facebook group results would be great, but would logging into Facebook be required to see the content? It’s a predicament.

It’s not an entirely new problem; it already exists with sites like Pinterest, which requires a login to see the actual content, or The New York Times, which requires a subscription. Showing results (URLs) that require further actions down the road, such as paying, subscribing, or logging in, is sub-optimal.

DKB: There was some criticism on Hacker News where people said you can’t trust random people on Reddit or any of these random forums, and they’d prefer to get expert opinions.

Of course, whether people trust a given forum is subjective, but have you thought about this idea of making it easy to get “expert opinions”?

That’s a great but difficult question. As you said, it’s quite a subjective matter: there would be no consensus on what an expert opinion is, or on whether expertise equates with impartiality. That’s an extremely difficult problem, for which there is no technical solution. We have always had this problem in society; the only way to deal with it is by having freedom and choice.

We aim to be as “neutral” as possible, relying mostly on popularity and recency, which of course have biases of their own, but they are the best amongst the bad. For particular needs, such as the one you describe, we believe that our Goggles project is the answer. Different communities of interest can create customized rankings, which could also affect Discussions. Time will tell if our belief is right or if we have to explore different solutions to cater to more particular interests.

One last comment about Discussions: It seems that Google might be testing a similar feature, and that they are rushing to do this kind of thing because real competition is entering the market.

Neeva, DDG, or Startpage cannot be real competition, even though they may have nice features worth copying. Google and Microsoft will never be scared of them, because they can be strangled at any time simply by limiting or throttling their API access.

A tenant cannot really challenge their landlord, and that’s why at Brave we took the long route of building an independent index.

DKB: Can you talk more about Goggles? It sounds like a promising concept.

Also is this something we can expect soon, or is it more of a long-term idea?

Goggles will allow multiple rankings to fit everyone’s needs. Think of personalization, but without losing privacy or suffering other problems that tracking-based personalization causes.

Why is it needed? Because wherever there is choice, there is bias. Any ranking or selection has biases embedded in it: which features are built, what data is used, time, popularity, all of these carry biases.

There is no neutral search — however, not neutral does not mean editorialized. Biases due to the nature of the data or technical choices are acceptable (as unavoidable), explicit biases based on editorials or personal beliefs are not (and should be avoided).

If the majority of people favor a particular school of thought, search ranking will reflect that. To counter this, we believe it’s wiser to let groups of interest define their own “re-rankers” so that they can benefit from our index.

We judge that group-level personalization is better than individual-level personalization because it’s explicit. The user must choose it, so the user can at least reason outside their echo chamber by removing the Goggle. Goggles will be publicly inspectable and collaborative, so they will earn a reputation as trustworthy or not, independent of our personal opinion.
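To illustrate the idea of explicit, inspectable group-level re-ranking, here is a toy sketch in the spirit of Goggles. The rule format and multipliers are invented for this example; real Goggles use their own rule syntax applied on top of Brave’s index.

```python
# Toy group-level re-ranker: a transparent set of URL-pattern multipliers
# chosen by a community, applied on top of base index scores. The format
# and numbers are invented for this sketch, not the actual Goggles syntax.
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    score: float  # base relevance score from the shared index


# A "goggle" as an explicit, human-readable set of boost/downrank rules.
tech_blogs_goggle = {
    "blog.": 2.0,          # up-rank personal/technical blogs
    "github.com": 1.5,     # up-rank code repositories
    "pinterest.com": 0.2,  # down-rank login-walled content
}


def apply_goggle(results: list[Result], goggle: dict[str, float]) -> list[Result]:
    def adjusted(r: Result) -> float:
        factor = 1.0
        for pattern, multiplier in goggle.items():
            if pattern in r.url:
                factor *= multiplier
        return r.score * factor

    return sorted(results, key=adjusted, reverse=True)


results = [
    Result("https://pinterest.com/pin/123", 0.9),
    Result("https://blog.example.org/deep-dive", 0.6),
    Result("https://github.com/foo/bar", 0.5),
]
for r in apply_goggle(results, tech_blogs_goggle):
    print(r.url)
# blog first, then github, then pinterest
```

Removing the goggle restores the base ranking, which is what makes this kind of personalization explicit and reversible.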

Goggles, we hope, will enable use cases that we can’t even think of, or that, if we did, we could not offer to the entire user base. In an ideal world, those people would go and create their own search engine, but that is an extremely expensive proposition. Goggles bridges that gap by giving access to a large set of pages to which complex re-ranking functions can be applied.

It will not be perfect, but we believe it’s going to be a good step forward from where we are. Everybody looking at the world through the same window makes us all poorer.

The aim is to release the first version of it around our first anniversary; it will not be feature-complete, but the basic components should be there on time (no promises though).

DKB: Is there anything else you want to mention about why people should use Brave Search?

Do you care about:

  (a) Privacy
  (b) The pernicious effects that Google and Big Tech have on the Web
  (c) Answering your query quickly

If you care about all three, Brave Search is the only option there is. As of today, no one else is in this position, though we hope one day there will be more.

If you are willing to sacrifice (c), then you could also use Mojeek.

If you are willing to sacrifice (b) there are plenty of options that act as private proxies to big search engines, such as Startpage (Google), DuckDuckGo (Bing), etc. It’s difficult to be a threat when your system depends on the incumbent you want to topple.

If you are willing to sacrifice (a) and (b) use Google. As of today, despite major issues, it’s still the one that does (c) better than anyone else on aggregate.