Monday, October 24, 2005

Google Censorship - How It Works

An anticensorware investigation by Seth Finkelstein

Abstract: This report describes the system by which results in the Google search engine are suppressed.

Google Exclusion, introduction

Google is arguably the world's most popular search engine. However, contrary perhaps to a naive impression, in some cases the results of a search are affected by various government-related factors. That is, search results which would otherwise be shown are deliberately excluded. The suppression may be local to a country, or global to all Google results.

This removal of results was first documented in a report, "Localized Google search result exclusions", by Benjamin Edelman and Jonathan Zittrain, which investigated certain web material banned in various countries. Later, this author (Seth Finkelstein) discussed a global removal arising from intimidation generated from the United Kingdom town of Chester, in "Chester's Guide to Molesting Google".

My discussion here is not meant to criticize Google's behavior in any way. Much of it is in reaction to government law or government-backed pressure, where accommodation is an understandable reaction if nothing else. Rather, documenting and explaining what happens can inform public understanding and lead to more informed resistance against the distortion of search results created by censorship campaigns.

How it works

A Google search is not simply a raw dump of a database query to the user's screen. The retrieval of the data is just one step. There is much post-processing afterwards, in terms of presentation and customization.

When Google "removes" material, often it is still in the Google index itself. But the post-processing has removed it from any results shown to the user. This system can be applied, for quality reasons, to remove sites which "spam" the search engine. And that is, by volume, certainly the overwhelming application of the mechanism. But it can also be directed against sites which have been prohibited for government-based reasons.
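The two-stage model described above can be sketched as follows. This is only an illustration under stated assumptions: the tiny index, the blocked host, and the function names are all hypothetical, and nothing here is drawn from Google's actual implementation.

```python
from urllib.parse import urlsplit

# Hypothetical three-page index; real indexes are vastly larger.
INDEX = {
    "http://example.com/a": "cats and dogs",
    "http://example.org/b": "dogs only",
    "http://blocked.example/c": "dogs again",
}

# Assumed blacklist of hosts, for illustration only.
BLOCKED_HOSTS = {"blocked.example"}


def retrieve(query):
    """Stage 1: raw index lookup. Blocked pages are still matched here."""
    return sorted(url for url, text in INDEX.items() if query in text.split())


def postprocess(urls):
    """Stage 2: drop blocked hosts before anything is shown to the user."""
    return [u for u in urls if urlsplit(u).hostname not in BLOCKED_HOSTS]


def search(query):
    """Return the visible results plus the count of everything matched."""
    hits = retrieve(query)
    return postprocess(hits), len(hits)
```

In this sketch, search("dogs") matches three pages internally but shows only two, so the reported total and the visible list disagree. That kind of mismatch is what the tests later in this report look for.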

Sometimes the fact that the "removed" material is still in the index can be inferred.

Global censorship

For the case of Chester, which concerned a single "removed" page, the internal indexing of the target page could be established by comparison with a search for the same material on another search engine.

Consider a Google search for the word "lesbian" on the site torkyarkisto.marhost.com. It returns a page titled "The Kurt Cobain Quiz", with a count of:

Results 1 - 1 of about 2

The "about" qualifier there represents many factors, but sometimes encompasses blacklisted pages. This can be seen here by comparing to an AltaVista search for the word "lesbian" on the site torkyarkisto.marhost.com.

There are two pages visible in that case, the "Quiz" page, and the "Chester" page which caused all the trouble in the first place.

Since we know the "Chester" page was once in the Google index, it must be the other page referred to in "about 2". QED.
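The inference above amounts to a set difference between what two engines display, checked against the reported total. A minimal sketch, using placeholder labels rather than the real page URLs (which are not reproduced in this report), and a hypothetical function name:

```python
def infer_hidden(shown_here, reported_total, shown_elsewhere):
    """Compare one engine's visible results with another engine's.

    Returns (number of results counted but not shown,
             pages the other engine shows that this one does not).
    The reported total is only an estimate, so this is suggestive
    evidence rather than proof.
    """
    hidden_count = reported_total - len(shown_here)
    candidates = set(shown_elsewhere) - set(shown_here)
    return hidden_count, candidates


# Placeholder labels standing in for the two pages discussed above.
google_shown = ["quiz-page"]
altavista_shown = ["quiz-page", "chester-page"]
hidden_count, candidates = infer_hidden(google_shown, 2, altavista_shown)
```

When the hidden count is one and exactly one candidate remains, as here, the candidate is the natural explanation for the gap.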

Local censorship

In this situation, comparing results from the different country-specific Google searches is often revealing. The tests are often best done using the "allinurl:" syntax of Google, which searches for URLs which have the given components (note the separate components can appear anywhere in the URL, so "allinurl:stormfront.org" is "stormfront" and "org" in the URL, not just the string "stormfront.org" as might be naively thought). Stormfront.org is a notorious racist site, often banned in various contexts.
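To make the "allinurl:" behavior concrete, here is a rough sketch of component matching. It assumes URLs and queries are split into components on any non-alphanumeric character; that tokenization rule is my assumption for illustration, not documented Google behavior, and the function names are hypothetical.

```python
import re


def url_tokens(text):
    """Split a URL or query into lowercase alphanumeric components."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}


def allinurl_match(query, url):
    """True when every component of the query appears somewhere in the URL."""
    return url_tokens(query) <= url_tokens(url)
```

Under this model, "stormfront.org" matches http://www.stormfront.org/ as expected, but it would also match a URL such as http://example.org/stormfront/page.html, since both components are present even though the string "stormfront.org" is not.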

Consider the following US search:
http://www.google.com/search?num=100&hl=en&q=allinurl%3Astormfront.org
This returned: Results 1 - 27 of about 50,700.

Now compare with the German counterpart (Google.DE):
http://www.google.de/search?num=100&hl=en&q=allinurl%3Astormfront.org
This returned: Results 1 - 9 of about 50,700.

Immediate observation: The rightmost (total) number is identical. This suggests the same results are in the Google database, and the German site is simply not displaying some of them. How is it determining which results from the domain to display?

Note the hosts of which "stormfront.org" URLs are visible on the German page:

irc.stormfront.org:8000/
www4.stormfront.org:81/
lists.stormfront.org:81/

What do these all have in common?
They all have a port number after the host name.
The exclusion pattern evidently isn't matching the ":number" part of the URL.
It's matching a pattern of "*.stormfront.org/" in the host, as in the following, which are displayed in the US search but not in the German search:

www.stormfront.org/
kids.stormfront.org/
women.stormfront.org/
nna.stormfront.org/
www4.stormfront.org/

Even more interesting, the German page has a broken URL listed at the bottom: http/www.stormfront.org/quotes.htm. That's not a valid URL, so it seems to escape the host check.

Thus, the suppression again appears to be implemented as a post-processing step using very simple patterns of prohibited results.
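The observations above are consistent with a filter along the following lines. The regular expression is my own guess at a pattern that reproduces the visible and hidden URLs listed in this section; the actual rule Google applies is not known, and the names here are hypothetical.

```python
import re

# Assumed pattern: optional scheme, any subdomains, then the host
# followed IMMEDIATELY by "/". A ":port" before the slash breaks the match.
BLOCKED = re.compile(r"^(?:http://)?(?:[a-z0-9-]+\.)*stormfront\.org/", re.IGNORECASE)


def is_suppressed(url):
    """True when the URL would be dropped by the assumed pattern."""
    return BLOCKED.match(url) is not None


# URLs as listed in this section.
seen_in_german_results = [
    "irc.stormfront.org:8000/",
    "www4.stormfront.org:81/",
    "lists.stormfront.org:81/",
    "http/www.stormfront.org/quotes.htm",  # the broken URL
]
missing_from_german_results = [
    "www.stormfront.org/",
    "kids.stormfront.org/",
    "women.stormfront.org/",
    "nna.stormfront.org/",
    "www4.stormfront.org/",
]
```

A host-aware check (parsing the URL and comparing the host name alone) would have caught the port-number cases. That they slip through is what points to simple string patterns rather than real URL parsing.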

The same behavior is observed in a German "stormfront.org" images search.
This returned: Results 1 - 6 of about 1,410.
Versus a US "stormfront.org" images search.
This returned: Results 1 - 18 of about 1,410.
(Note the identical right-hand numbers; hosts matching the "*.stormfront.org/" pattern are suppressed in the German results.)

And also in a German "stormfront.org" directory search.
This returned: Results 1 - 8 of about 15.
Versus a US "stormfront.org" directory search.
This returned: Results 1 - 10 of about 15.
(Note again the identical right-hand numbers; hosts matching the "*.stormfront.org/" pattern are suppressed in the German results.)
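The count comparisons in this section can be checked mechanically. The sketch below parses a "Results ..." line into the number of results displayed and the estimated total, then flags the telltale case of equal totals with fewer results displayed. The function names are hypothetical, and the regular expression is only an assumption about the format of the count lines quoted above.

```python
import re

# Matches lines such as "Results 1 - 27 of about 50,700."
# The "of" is treated as optional to tolerate transcription slips.
COUNT_RE = re.compile(r"Results\s+(\d+)\s*-\s*(\d+)\s+(?:of\s+)?about\s+([\d,]+)")


def parse_counts(line):
    """Return (results displayed, estimated total) from a count line."""
    m = COUNT_RE.search(line)
    if m is None:
        raise ValueError("unrecognized count line: %r" % line)
    first, last = int(m.group(1)), int(m.group(2))
    total = int(m.group(3).replace(",", ""))
    return last - first + 1, total


def looks_filtered(reference_line, local_line):
    """Same estimated total, but fewer results displayed locally."""
    ref_shown, ref_total = parse_counts(reference_line)
    loc_shown, loc_total = parse_counts(local_line)
    return ref_total == loc_total and loc_shown < ref_shown
```

Applied to the US and German web search counts quoted earlier (27 versus 9 displayed, both of about 50,700), looks_filtered returns True.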

Conclusion

Contrary to earlier utopian theories of the Internet, it takes very little effort for governments to cause certain information simply to vanish for a huge number of people.


Version 1.0 Mar 10 2003

Support

This work was not funded by anyone, and has no connection to any organization. In fact, if anyone is providing financial support for such projects, the author would like to know.

Note: Some of this material appeared earlier in the author's Infothought blog


Mail comments to: Seth Finkelstein

For future information: subscribe to Seth Finkelstein's Infothought list or read the Infothought blog

(if you subscribed a few months ago, please resubscribe due to a crash)

See more of Seth Finkelstein's Censorware Investigations
