Today, we will discuss our implementation of the Project Partner Matcher. This is one of the major components of the system, and one we considered particularly challenging to develop due to the amount of consideration and planning it required.
What is the plan?
The idea of the Partner Matcher is to match students in the same project or event queue for a module by evaluating the details provided in their profiles, and to offer partner recommendations. This feature is meant to assist students who are unsure of who to work with in deciding on a project partner, giving them the chance to make a more informed decision.
Initially, we planned to simply evaluate the bio similarity of each pair of profiles and base a decision on that as a proof-of-concept. However, as the project grew in scope and the demand for more technical challenges came up in our meetings with the supervisor, it became clear that we had to give the Matcher more depth.
Our aim thus shifted to investigating existing state-of-the-art algorithms for more accurate matching.
An Initial Setup
Firstly, the profile data we had at that point offered only engagement and bios as viable points of comparison between users, so more fields were added for evaluation: Skills, Interests and Preferred Project Roles. The backend was set up so that API requests from the frontend could be passed via Axios into the appropriate backend functions. We set up a model, serializer and view set for a MatcherQueue and for each of the new profile fields. A frontend layout for queue entries was designed, and some basic testing was done to confirm that everything was responding as expected.
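For illustration, the wiring for the MatcherQueue followed the standard Django REST Framework model/serializer/viewset pattern, roughly as sketched below (the import path and field listing are assumptions, not our exact schema):

from rest_framework import serializers, viewsets
from .models import MatcherQueue  # hypothetical import path

class MatcherQueueSerializer(serializers.ModelSerializer):
    class Meta:
        model = MatcherQueue
        fields = "__all__"

class MatcherQueueViewSet(viewsets.ModelViewSet):
    queryset = MatcherQueue.objects.all()
    serializer_class = MatcherQueueSerializer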
Designing the First Algorithm
The basic principle was to go through each pair of profiles in the same queue and compare them. When two users share skills and interests, their score goes up. When their preferred project roles differ, their score goes down. When the words in their bios are similar, their score also goes up. In particular, the bio comparison tokenised each bio into a set of words to remove duplicates and counted the words the two bios had in common. At this point, we had not yet considered weights for these factors. We were able to associate users with queues and allowed for joining and leaving on the backend. We also added a “Test” button that runs the algorithm on the backend.
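In code, that first bio comparison amounted to something like the following sketch (names are illustrative, not our exact implementation):

words1 = set(profile1.bio.lower().split())  # tokenise each bio into a set of words, dropping duplicates
words2 = set(profile2.bio.lower().split())
shared_words = words1 & words2              # each shared word raises the pair's score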
To evaluate similarity, we looked into different methods, such as simply looping through the entries and incrementing a counter for each match. Eventually, we came across the Jaccard index, which is essentially the ratio between the intersection and union of two sets. Our similarity thus evolved into |A ∩ B| / |A ∪ B|, where A and B are the sets of a given field for a profile pair in a matcher queue. We felt this was a more sophisticated and efficient solution. Soon, however, we noticed a problem with our Jaccard implementation: the score for a field maxed out at 0.5 rather than 1.0. This was because our denominator concatenated both fields, counting the shared entries twice, so two identical sets scored |A| / (|A| + |B|) = 0.5. We changed the denominator to max(|A|, |B|).
We also wanted to discourage people from just selecting every interest to match better with anyone, so we capped the number of interests a user can have at 5, and made the matching score drop as the difference between the number of entries in a field grows, multiplying it by min(|A|, |B|) / max(|A|, |B|). For example, if A has 5 interests and B has 1, the final score for the interests field is multiplied by 1/5. If either profile has no entries for a field, the function returns 0.
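As a concrete worked example of the adjusted score for the interests field (the interest names are made up for illustration):

A = {"ai", "web", "games", "security", "databases"}  # 5 interests
B = {"ai"}                                           # 1 interest
base = len(A & B) / max(len(A), len(B))              # 1 / 5 = 0.2
penalty = min(len(A), len(B)) / max(len(A), len(B))  # 1 / 5 = 0.2
score = base * penalty                               # 0.2 * 0.2 = 0.04

The two helper functions implementing this are: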
def compute_ratio_similarity(profile1_field, profile2_field):
    '''
    Computes a simple ratio of the numeric fields between two profiles.
    min(x,y) / max(x,y)
    Note: greater values reduce the significance of minor dissimilarity.
    '''
    if not profile1_field or not profile2_field:
        return 0
    return min(profile1_field, profile2_field) / max(profile1_field, profile2_field)

def compute_similarity(profile1_field, profile2_field, inverse=False):
    """
    Find the similarity of two fields based on the ratio of the intersection and the longest field.
    Essentially, Intersection(A,B) / Max(A,B), a modified version of Jaccard similarity.
    Original source: https://www.geeksforgeeks.org/how-to-calculate-jaccard-similarity-in-python/
    The match score is multiplied by the ratio of the field lengths (to promote matching of
    users with a similar number of interests, etc.)
    The inverse parameter, if true, checks for dissimilarity (1 - similarity_score).
    """
    if not profile1_field or not profile2_field:
        return 0
    set1, set2 = set(profile1_field), set(profile2_field)  # ensure set semantics for the intersection
    match_score = len(set1 & set2) / max(len(set1), len(set2))
    match_score *= compute_ratio_similarity(len(set1), len(set2))
    if inverse:
        return 1 - match_score
    return match_score
Improving the Algorithm
Afterwards, we also decided to add an engagement metric that took the similarity of users' levels from the experience system, designed to preferentially match people who share an interest in the platform itself. Lastly, we decided to work out a more advanced bio similarity evaluation, as initially planned. First, we asked ChatGPT for ideas on bio matching algorithms. It suggested a TF-IDF Vectoriser whose output is passed into a cosine similarity function. However, after some of our own research, we came across a more state-of-the-art approach, Sentence-BERT, which uses a pre-trained language model to find similarities between sentences via a Cross Encoder. This was a relatively simple yet powerful amendment that made the bio evaluation far more accurate thanks to machine learning: both bios are passed through the model together, and its attention mechanism compares each word/token against the tokens of the other bio, yielding an accurate similarity score.
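A minimal sketch of the Cross Encoder approach, assuming the sentence_transformers package and one of its pre-trained semantic textual similarity models (the exact model we used may have differed):

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-base")  # example pre-trained STS model
score = model.predict([
    ("I love backend development and Django.", "Building web APIs is my passion."),
])[0]  # a single similarity score for the bio pair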
Why not use Groq?
We tried using the Chatbot's model, which would have saved the space required to store a local model. However, given the limitations of a cloud-based model and its request limit, it was not feasible to rely on it.
Final Touches
The Cross Encoder approach ended up being rather expensive time-wise, so we switched to a lighter, faster approach using a Sentence Transformer trained on a lighter dataset. We also encoded all bios into a single matrix up front, so every pair could be compared in one batch. This brought incredible speed-ups to the algorithm.
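In outline, the batched Sentence Transformer comparison looks like this (the model name here is an assumption; any compact pre-trained model would do):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")    # small, fast pre-trained model (illustrative choice)
embeddings = model.encode(bios)                    # encode every bio in one batch
similarity_matrix = cosine_similarity(embeddings)  # all pairwise scores at once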
However, the size of the backend package for this ended up being too large for our liking. Hence, we decided to look into using just the transformers package, without the additional features of the sentence_transformers package we had used previously. This also meant we had to write a simple embedding function to mimic what SentenceTransformer does:
def get_embedding(text):
    """Generate an embedding for a given text using the Transformer model."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)  # max 256 tokens (a bio is normally at most 256 anyway)
    with torch.no_grad():  # gradients are not needed for inference, saving resources
        outputs = model(**inputs)  # unpack our token dict into keyword arguments & pass into the model
    return outputs.last_hidden_state.mean(dim=1).numpy()  # mean-pool the token embeddings into one vector and convert the tensor to a NumPy array
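For context, the tokenizer and model above are loaded once from the transformers package, along these lines (the model name is an assumption):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")  # model name assumed
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")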
During this process, we also discovered that the cosine similarity for our previous transformer model had a range of -1 to 1 (as a standard cosine does in mathematics), rather than 0 to 1, so we had to normalise it by adding 1 to the result and dividing by 2.
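In other words, the normalisation linearly maps the cosine range onto [0, 1]:

normalised = (raw_cosine + 1) / 2  # -1 -> 0.0, 0 -> 0.5, 1 -> 1.0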
We also introduced a final weight distribution to the algorithm, as follows:
total_score = \
    0.3 * preferred_roles_similarity + \
    0.25 * skills_similarity + \
    0.2 * interests_similarity + \
    0.15 * engagement_similarity + \
    0.1 * bio_similarity
We decided that the most important factors should be avoiding matching people who want the same project role, and ensuring people share similar skills and interests so that the workload can be divided efficiently. Engagement should not be an overwhelming factor, as some people might simply not find the platform as useful as others. Finally, we felt that bios were not of the greatest importance, so we gave them a weight of only 10%.
The final logic for the partner matcher is as follows:
def match_profiles_logic(queue_id):
    '''
    The core logic of the partner matcher.
    Compares the weighted similarities & differences of various factors for each profile pair
    in a queue, providing a final matching score for each.
    Weights (ordered by significance, totalling to 1):
    - Preferred project role dissimilarity: 0.3
    - Skills similarity: 0.25
    - Interest similarity: 0.2
    - Platform Engagement similarity: 0.15
    - Bio similarity: 0.1
    '''
    matches = []
    queue = get_object_or_404(MatcherQueue, id=queue_id)
    user_ids = queue.users.all()
    profiles = Profile.objects.filter(user__in=user_ids)
    bios = [profile.bio for profile in profiles]
    # Embedding process simulated based on the SentenceTransformer previously used
    embeddings = np.vstack([get_embedding(bio) for bio in bios])  # vstack recommended by ChatGPT (https://numpy.org/doc/stable/reference/generated/numpy.vstack.html)
    bio_similarity_matrix = (cosine_similarity(embeddings) + 1) / 2  # matrix of pairwise scores, normalised to the range [0,1]
    # Go through all profile pairs in the queue
    for i, profile1 in enumerate(profiles):
        for j in range(i + 1, len(profiles)):
            profile2 = profiles[j]
            preferred_roles_similarity = compute_similarity(
                profile1.preferred_roles.values_list('title', flat=True),
                profile2.preferred_roles.values_list('title', flat=True),
                inverse=True
            )
            skills_similarity = compute_similarity(
                profile1.skills.values_list('title', flat=True),
                profile2.skills.values_list('title', flat=True)
            )
            interests_similarity = compute_similarity(
                profile1.interests.values_list('title', flat=True),
                profile2.interests.values_list('title', flat=True)
            )
            engagement_similarity = compute_ratio_similarity(profile1.level, profile2.level)
            bio_similarity = float(bio_similarity_matrix[i, j])  # look up this pair's similarity in the matrix
            total_score = \
                0.3 * preferred_roles_similarity + \
                0.25 * skills_similarity + \
                0.2 * interests_similarity + \
                0.15 * engagement_similarity + \
                0.1 * bio_similarity
            if total_score > 0.05:  # ignore low matching scores
                matches.append({
                    "user1": profile1.user.id,
                    "user2": profile2.user.id,
                    "roles_score": preferred_roles_similarity,
                    "skills_score": skills_similarity,
                    "interests_score": interests_similarity,
                    "engagement_score": engagement_similarity,
                    "bio_score": bio_similarity,
                    "final_score": total_score,
                })
    return sorted(matches, key=lambda x: x["final_score"], reverse=True)
Emailing Users
As we had already developed the email framework for email verification, we re-used some of its logic to create an automatic email service for when a queue reaches its matching deadline. The email includes each user's top 3 matches, along with those users' email addresses so they can get in touch. A Celery scheduler task queries all queues every minute, archiving any queue whose deadline has passed and emailing its matched users. Setting up this scheduler was no trivial task and has a dedicated blog entry.
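In spirit, the scheduled task looks something like the sketch below (the queue field names and task wiring are assumptions here; the real setup is described in the dedicated scheduler entry):

from celery import shared_task
from django.utils import timezone

@shared_task
def check_matcher_deadlines():
    """Runs every minute via the scheduler; archives due queues and emails their matches."""
    # 'archived' and 'deadline' are hypothetical field names
    for queue in MatcherQueue.objects.filter(archived=False, deadline__lte=timezone.now()):
        send_match_notification(None, queue.id)  # email each user their top matches
        queue.archived = True
        queue.save()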
To ensure each user receives only their top three matches, a dictionary mapping each user ID to a list of that user's matches is built, then each list is sorted and sliced, as in the following function:
def send_match_notification(request, queue_id):
    """Send match notifications to users in the specified matcher queue."""
    try:
        queue = get_object_or_404(MatcherQueue, id=queue_id)
        matches = match_profiles_logic(queue_id)
        user_matches = {}
        # Organise matches by user and select the top 3 for each
        for match in matches:
            user1, user2 = match["user1"], match["user2"]
            user_matches.setdefault(user1, []).append(match)  # get the user's current list (or start with []) & append this match
            user_matches.setdefault(user2, []).append(match)
        for user_id, user_match_list in user_matches.items():
            user_match_list.sort(key=lambda x: x["final_score"], reverse=True)  # sort matches by score
            top_matches = user_match_list[:3]  # get the top 3
            user = User.objects.get(id=user_id)
            partner_lines = []
            for m in top_matches:
                # The partner is whichever side of the pair is not the current user
                partner_id = m["user2"] if m["user1"] == user_id else m["user1"]
                partner = User.objects.get(id=partner_id)
                partner_lines.append(f"- {partner.first_name} {partner.last_name} (Email: {partner.email})")
            match_details = "\n".join(partner_lines)
            # Email the user their matches
            if match_details:
                send_mail(
                    subject=f"Your Top Partner Matches - {queue.title}",
                    message=f"Here are your top matches:\n{match_details}",
                    from_email=HOST_EMAIL,
                    recipient_list=[user.email],
                )
    except Exception as e:
        print(f"Error sending match notifications: {e}")