Building a Search Engine with Python

# So... why build a search engine?

Okay hear me out. I know "build a search engine" sounds like one of those insane side project ideas you'd laugh at. But here's the thing: you don't need to outdo Google. Even a basic search engine teaches you a ton about how the web actually works under the hood.

We're talking tokenization, TF-IDF scoring, cosine similarity, stuff that sounds intimidating but is honestly pretty fun once you get your hands dirty. So let's just build the thing.

# Who's this for?

This is for you if you know a bit of Python and you're the type who learns best by building. No PhD required, I promise.

# Before you start, make sure you have:

- Python 3 installed (any reasonably recent version)
- pip, so you can grab the three libraries below
- A terminal you're comfortable running scripts from

# What we'll cover:

- How search engines work (indexing, searching, ranking)
- Tokenization with NLTK
- TF-IDF scoring with scikit-learn
- Ranking results with cosine similarity
- A complete mini search engine in one file

About 10 minutes to read through. Maybe 30 if you're coding along (which you should be).

***

# How search engines actually work

At the core, every search engine does three things:

Indexing → Searching → Ranking

That's it. It takes a bunch of documents, figures out what's in them, and when you search for something, it ranks them by relevance. We'll build each of these steps ourselves.
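To make the indexing step a little more concrete: the classic data structure behind it is an inverted index, a map from each word to the set of documents that contain it. We'll use TF-IDF vectors instead later in this post, but a toy version is worth seeing:

```python
# a toy inverted index: term -> set of document ids that contain it
from collections import defaultdict

documents = [
    "python search engine",
    "python web framework",
    "ranking search results",
]

index = defaultdict(set)
for doc_id, doc in enumerate(documents):
    for term in doc.split():
        index[term].add(doc_id)

print(index["search"])  # {0, 2}
print(index["python"])  # {0, 1}
```

That lookup is why search engines can answer queries without rescanning every document: "which docs mention X?" becomes a single dictionary access.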

"The best way to learn how a search engine works is to build one yourself."

Someone on the internet, probably

That's the vibe. Let's go.

***

# The libraries we need

Short list, nothing crazy:

- nltk for tokenization and stopword lists
- scikit-learn for TF-IDF and cosine similarity
- flask (optional for now) for when you want to serve results over HTTP later

Install them like this:

```bash
pip install nltk scikit-learn flask
```
***

# Step 1: Tokenization (breaking text apart)

Before we can search anything, we need to clean the text and split it into individual words (tokens). Raw text is full of noise: punctuation, capitalization, and filler words like "the" and "is" that we need to strip out.

Here's a simple tokenizer:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases need this one too
nltk.download('stopwords')

def tokenize(text):
    # lowercase everything, then split into words
    tokens = word_tokenize(text.lower())

    # ditch punctuation and common words like "the", "is", "a"
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t.isalnum() and t not in stop_words]

    return tokens

# quick test
print(tokenize("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

Notice how "the", "over" and punctuation got dropped? That's the goal, keep only the words that actually carry meaning.

***

# Step 2: TF-IDF (what words actually matter?)

This is the clever bit. TF-IDF stands for Term Frequency–Inverse Document Frequency. Sounds fancy, but the idea is simple:

A word that appears a lot in one doc but rarely elsewhere? Super relevant. A word that's everywhere (like "the")? Not so useful.
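If you want to see the math before handing it to sklearn, here's the textbook formula, tf * log(N / df), on a made-up three-document corpus. (sklearn's TfidfVectorizer uses a smoothed variant plus normalization, so its exact numbers will differ, but the intuition is the same.)

```python
# textbook TF-IDF: term frequency * log(num docs / docs containing the term)
import math

corpus = [
    ["python", "python", "language"],
    ["search", "engine", "language"],
    ["search", "ranking"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                 # how often in this doc
    df = sum(1 for d in corpus if term in d)        # how many docs contain it
    idf = math.log(len(corpus) / df)                # rare across docs -> big idf
    return tf * idf

# "python" is frequent in doc 0 and rare elsewhere -> high score
print(tf_idf("python", corpus[0], corpus))
# "language" appears in two of three docs -> lower score
print(tf_idf("language", corpus[0], corpus))
```

Run it and you'll see "python" score well above "language" for the first document, which is exactly the behavior described above.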

sklearn handles all the math for us:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Python is a popular programming language",
    "Search engines use TF-IDF to rank documents",
    "Natural language processing is a field of AI",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(tfidf_matrix.shape)
# (3, 18): 3 documents, 18 unique terms (with sklearn's default tokenizer)
```

Each document is now a vector of numbers. Pretty wild, right?

***

# Step 3: Ranking with cosine similarity

Now for the actual search. When someone types a query, we turn that into a vector too, then measure how similar it is to each document. The closer the angle between two vectors, the more similar the content.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Python is a programming language",
    "Search engines use TF-IDF",
    "Machine learning is a subset of AI",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def search(query, top_n=2):
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors).flatten()

    # sort by score, highest first
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

    print(f"\nResults for: '{query}'")
    for i, (idx, score) in enumerate(ranked[:top_n]):
        print(f"  {i+1}. [{score:.2f}] {documents[idx]}")

search("Python programming")
search("AI and machine learning")
```

Run that and you'll see it actually returns relevant results. Genuinely satisfying the first time it works.
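If the "angle between vectors" idea feels abstract, here's cosine similarity computed by hand with plain NumPy, on made-up vectors: the dot product divided by the product of the vector lengths.

```python
# cosine similarity by hand: dot(a, b) / (|a| * |b|)
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 2.0])
b = np.array([2.0, 0.0, 4.0])   # same direction as a -> similarity 1.0
c = np.array([0.0, 3.0, 0.0])   # perpendicular to a -> similarity 0.0

print(cosine(a, b))  # 1.0
print(cosine(a, c))  # 0.0
```

Note it's direction, not length, that matters: `b` is twice as long as `a` but points the same way, so they score a perfect 1.0. That's handy for search, because a long document shouldn't beat a short one just for repeating itself.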

***

# Who works on this stuff?

Just to give you a sense of the roles involved in real search/NLP projects:

| Name    | Role         | Experience |
|---------|--------------|------------|
| Alice   | NLP Engineer | 3 years    |
| Bob     | Data Analyst | 2 years    |
| Charlie | ML Engineer  | 4 years    |

In practice, a search engine project usually needs all three: someone who understands the language side, someone who can crunch the data, and someone who can train and tune the models.

***

# Putting it all together

Here's a minimal but working search engine in one file:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# your "database" of documents
documents = [
    "Python is a popular programming language used in data science",
    "Search engines index and rank web pages for users",
    "TF-IDF is a technique to measure word importance in documents",
    "Flask is a lightweight web framework for Python",
    "Cosine similarity measures the angle between two vectors",
]

# build the index
vectorizer = TfidfVectorizer(stop_words='english')
index = vectorizer.fit_transform(documents)

def search(query):
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, index).flatten()
    ranked_ids = scores.argsort()[::-1]

    print(f"\nResults for '{query}':")
    for rank, doc_id in enumerate(ranked_ids[:3], 1):
        print(f"  {rank}. {documents[doc_id]}  (score: {scores[doc_id]:.3f})")

# try it out
search("Python web framework")
search("how search engines rank pages")
```

You could also use BM25 here (we'll save that for a follow-up post), but TF-IDF is the perfect starting point.

***

# Want to go further?

Some resources worth bookmarking:

- The scikit-learn user guide on text feature extraction (TfidfVectorizer and friends)
- The NLTK book, free online, for a deeper dive into tokenization
- Flask's quickstart guide, for when you want to serve your engine over HTTP

***

# Wrapping up

Honestly, search is one of those topics that sounds way scarier than it is. Once you break it down into tokenization → vectorization → ranking, it starts to click fast.

You've now got a working search engine. It's tiny, but it's yours, and the same core ideas power real production systems.

From here you could add a Flask API, hook it up to a real document store, or try swapping TF-IDF out for embeddings (that's where things get really interesting).
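For example, putting Flask in front of the engine only takes a few lines. This is a rough sketch; the `/search` route and the `q` parameter are just my choices, not anything Flask requires:

```python
# a minimal Flask wrapper around the TF-IDF engine from this post
from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Python is a popular programming language used in data science",
    "Search engines index and rank web pages for users",
    "Flask is a lightweight web framework for Python",
]

# build the index once at startup
vectorizer = TfidfVectorizer(stop_words='english')
index = vectorizer.fit_transform(documents)

app = Flask(__name__)

@app.route("/search")
def search_endpoint():
    query = request.args.get("q", "")
    scores = cosine_similarity(vectorizer.transform([query]), index).flatten()
    results = [
        {"doc": documents[i], "score": round(float(scores[i]), 3)}
        for i in scores.argsort()[::-1]
        if scores[i] > 0                  # skip documents with no overlap
    ]
    return jsonify(query=query, results=results)

# run with: flask --app yourfile run
# then try: http://127.0.0.1:5000/search?q=python+web+framework
```

From there, swapping the hard-coded list for a real document store is mostly a matter of rebuilding the index when documents change.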