github.com/mwmbl/mwmbl @main sqlite

1,185 symbols 5,001 edges 198 files 336 documented · 28%
README

Mwmbl - the Open Source and Non-Profit Web Search Engine

No ads, no tracking, no profit

Mwmbl is a non-profit, open source search engine where the community determines the rankings. We aim to be a replacement for commercial search engines such as Google and Bing.

We have our own index powered by our community - while the index itself is centralised, the crawling is distributed across servers run by our volunteers.

Our index is currently much smaller than those of commercial search engines, but you can help change that by joining us!

Community

Our main community is on Matrix, but we also have a Discord server for non-development-related discussion.

The community is responsible for crawling the web (see below) and curating search results. We are friendly and welcoming. Join us!

Documentation

All documentation is at https://book.mwmbl.org.

Crawling

If you have spare computer power and bandwidth, for now the best way you can help is by running our command line crawler.

If you have Firefox, you can help out by installing our extension. This retrieves search results from commercial search engines, which we use to evaluate our own ranking. It does not use or access any of your personal data.

Why a non-profit search engine?

The motives of ad-funded search engines are at odds with providing an optimal user experience. These sites are optimised for ad revenue, with user experience taking second place. This means that pages are loaded with ads which are often not clearly distinguished from search results. Also, eitland on Hacker News comments:

Thinking about it it seems logical that for a search engine that practically speaking has monopoly both on users and as mattgb points out - [to some] degree also on indexing - serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five times extra that means one, two or five times more ad impressions.

But what about...?

The space of alternative search engines has expanded rapidly in recent years. Here's a very incomplete list of some that have interested me:

Of these, YaCy is the closest in spirit to the idea of a non-profit search engine. The index is distributed across a peer-to-peer network. Unfortunately this design decision slows the fetching of search results.

Marginalia Search is fantastic, but our goals are different: we aim to be a replacement for commercial search engines whereas Marginalia aims to provide a different type of search.

All other search engines that I've come across are for-profit. Please let me know if I've missed one!

Designing for non-profit

To be a good search engine, we need to store many items, but the cost of running the engine is at least proportional to the number of items stored. Our main consideration is thus to reduce the cost per item stored.

The design is founded on the observation that most items rank for a small set of terms. In the extreme version of this, where each item ranks for a single term, the usual inverted index design is grossly inefficient, since we have to store each term at least twice: once in the index and once in the item data itself.

Our design is a giant hash map. We have a single store consisting of a fixed number N of pages. Each page is of a fixed size (currently 4096 bytes to match a page of memory), and consists of a compressed list of items. Given a term for which we want an item to rank, we compute a hash of the term, a value between 0 and N - 1. The item is then stored in the corresponding page.

To retrieve results, we compute the hash of each term in the user query, load the corresponding pages, filter the items to those containing the term, and rank the remaining items. Since each page is small, this can be done very quickly.

Because we compress the list of items, we can rank for more than a single term and maintain an index smaller than the inverted index design. At least, that's the theory. This idea has yet to be tested on a large scale.
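
The design described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual implementation: the class `TinyIndex`, the function `page_index`, the item fields (`url`, `title`, `extract`), the JSON-plus-zlib page encoding, and the match-count ranking are all hypothetical stand-ins. The hash uses `hashlib.blake2b` so the sketch needs only the standard library; the hash function the real index uses is not stated here.

```python
import hashlib
import json
import zlib

PAGE_SIZE = 4096  # bytes per page, as stated in the text
NUM_PAGES = 1024  # N; a small illustrative value, not the real index size


def page_index(term: str, num_pages: int = NUM_PAGES) -> int:
    """Map a term to a page number between 0 and num_pages - 1."""
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_pages


class TinyIndex:
    """Hypothetical in-memory store of fixed-size, compressed pages."""

    def __init__(self, num_pages: int = NUM_PAGES, page_size: int = PAGE_SIZE):
        self.num_pages = num_pages
        self.page_size = page_size
        self.pages = [b""] * num_pages  # each page holds a compressed item list

    def _load(self, i: int) -> list:
        data = self.pages[i]
        return json.loads(zlib.decompress(data)) if data else []

    def add(self, term: str, item: dict) -> bool:
        """Store item in the page for term; return False if it does not fit."""
        i = page_index(term, self.num_pages)
        items = self._load(i)
        if item not in items:
            items.append(item)
        data = zlib.compress(json.dumps(items).encode("utf-8"))
        if len(data) > self.page_size:
            return False  # page full: a real index needs an eviction policy
        self.pages[i] = data
        return True

    def search(self, query: str) -> list:
        """Load one page per query term, filter by term, rank by match count."""
        matches = {}
        for term in query.lower().split():
            for item in self._load(page_index(term, self.num_pages)):
                words = (item["title"] + " " + item["extract"]).lower().split()
                if term in words:  # drop items that merely share the page
                    entry = matches.setdefault(item["url"], [item, 0])
                    entry[1] += 1
        ranked = sorted(matches.values(), key=lambda e: -e[1])
        return [item for item, _ in ranked]
```

Because several terms can hash to the same page, the filter step in `search` is what separates genuine matches from items that only happen to share a page. An item is stored once per term it should rank for, which is why the compressed page size, rather than the number of terms, bounds the cost.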

How to contribute

There are multiple ways to help:

- Help us crawl the web
- Donate some money towards hosting costs and supporting our volunteers
- Give feedback/suggestions
- Assist in development of the engine itself

If you would like to help in any of these or other ways, thank you! Please join our Matrix chat server or email the main author (email address is in the git commit history).

Development

Local Testing

To try out the service locally, see the relevant section in the Mwmbl book.

Using Dokku

Note: this method is not recommended as it is more involved, and your index will not include any data unless you set up a crawler that sends its results to your server. You will need to set up your own Backblaze or S3-equivalent storage, or have access to the production keys, which we probably won't give you.

Follow the deployment instructions

Frequently Asked Question

How do you pronounce "mwmbl"?

Like "mumble". I live in Mumbles, which is spelt "Mwmbwls" in Welsh. But the intended meaning is "to mumble", as in "don't search, just mwmbl!"

Acknowledgements

Many thanks to Mohamed Alm for security consultations and recommendations.

Core symbols most depended-on inside this repo

get · called by 278 · mwmbl/indexer/fsqueue.py
append · called by 94 · mwmbl/justext/core.py
format · called by 56 · mwmbl/templatetags/humanbytes.py
create · called by 54 · mwmbl/tinysearchengine/indexer.py
tokenize · called by 24 · mwmbl/tokenizer.py
_super_search_monthly_key · called by 21 · mwmbl/quota.py
check_email_verified · called by 20 · mwmbl/platform/api.py
predict · called by 19 · mwmbl/tinysearchengine/ltr.py

Shape

Function 693
Method 236
Class 208
Route 48

Languages

Python 98%
TypeScript 2%

Modules by API surface

mwmbl/platform/api.py · 59 symbols
test/test_search_api_key.py · 50 symbols
mwmbl/crawler/app.py · 39 symbols
mwmbl/views.py · 37 symbols
test/test_super_search.py · 35 symbols
mwmbl/tinysearchengine/super_search.py · 33 symbols
mwmbl/tinysearchengine/rank.py · 28 symbols
mwmbl/justext/core.py · 28 symbols
test/test_rust_pipeline.py · 26 symbols
mwmbl/indexer/fsqueue.py · 26 symbols
mwmbl/tinysearchengine/indexer.py · 24 symbols
mwmbl/platform/schemas.py · 24 symbols

Dependencies from manifests, versioned

@vitejs/plugin-legacy 2.3.1 · 1×
chart.js 4.4.0 · 1×
htmx.org 1.9.9 · 1×
sortablejs 1.15.0 · 1×
terser 5.16.1 · 1×
vite 3.2.3 · 1×
boto3 1.35.99 · 1×
django 4.2.4 · 1×
django-ninja 1.3.0 · 1×
gunicorn 21.0.0 · 1×
mmh3 3.0.0 · 1×
psycopg2-binary 2.9.3 · 1×

Datastores touched

db · Database · 1 repos
dbname · Database · 1 repos
mwmbl_test · Database · 1 repos

For agents

$ claude mcp add mwmbl \
  -- python -m otcore.mcp_server <graph>