Entity Matching: Solving the Problem of Duplicate Entities in Knowledge Graphs

In this post, I’ll walk through what entity matching is, why it matters in knowledge graphs, and how I designed an efficient solution, which is available on my GitHub repo.

Note: the repo still has serious limitations (it cannot handle a large number of entities), so treat the ideas below as a reference only.

Background

When working with knowledge graphs (KGs), one of the biggest challenges is that the same entity can appear under different names.

For example, consider the following two triplets:

a = ['澳門', '有', '賭場']
b = ['鏡海', '有', '賭場']

Here, both 澳門 (Macao) and 鏡海 (Mirror Sea, an old poetic name for Macao) refer to the same place. However, an algorithm that checks whether a matches b (e.g., prediction vs. ground truth) by comparing strings will treat them as different entities and report a mismatch.

To solve this, we need a standardized representation of each entity—so that no matter which synonym is used, the system recognizes them as the same concept.

Mechanism

The most straightforward solution would be to build a dictionary of synonyms:

{
    "澳門": ["鏡海", "濠鏡澳", "濠江", "海鏡", "龍涯門"],
    "香港": ["HONG KONG", "HK"],
    ...
}

This works, but it has a performance drawback: to find the group an entity belongs to, we have to scan the synonym lists one by one, which takes $\mathcal{O}(N)$ time, where $N$ is the total number of stored names.
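To make the cost concrete, here is a minimal sketch of that dictionary lookup. The helper names canonical_by_scan and same_entity are made up for illustration and are not part of the library; the data is the dictionary shown above, minus the elided entries.

```python
# Synonym dictionary: canonical name -> list of aliases.
synonyms = {
    "澳門": ["鏡海", "濠鏡澳", "濠江", "海鏡", "龍涯門"],
    "香港": ["HONG KONG", "HK"],
}


def canonical_by_scan(name, table):
    """Return the canonical key for `name`, scanning every group in turn."""
    for canonical, aliases in table.items():
        # `name in aliases` is itself a linear scan over the list.
        if name == canonical or name in aliases:
            return canonical
    return None  # unknown entity


def same_entity(x, y, table):
    """Two names match when both resolve to the same canonical key."""
    cx = canonical_by_scan(x, table)
    return cx is not None and cx == canonical_by_scan(y, table)


print(same_entity("鏡海", "濠江", synonyms))  # prints: True
print(same_entity("鏡海", "HK", synonyms))    # prints: False
```

In the worst case every lookup touches every stored name, which is where the linear cost comes from.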

A more efficient approach is to use a disjoint-set (union-find) structure. With this structure, each entity is assigned to a group, and all synonyms in that group map to the same canonical index.

For example:

index   :    0     1      2      3      4     5         6       7
entity  :  '澳門' '鏡海' '濠鏡澳' '濠江' '香港' '海鏡' 'HONG KONG' 'HK'
refer to:    0      0      0      0     4     0         4       4

Now, looking up an entity’s standardized representation takes constant time, $\mathcal{O}(1)$, as long as every entry points directly at its group’s representative. This makes entity matching much more scalable.
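The table above can be reproduced with a small union-find. This is a generic textbook sketch and not the code from my repo: the class name DisjointSet and the rule "the smaller index stays the representative" are assumptions, chosen only so that the output matches the table.

```python
class DisjointSet:
    """Minimal union-find over string entities (illustrative only)."""

    def __init__(self):
        self.index = {}   # entity name -> position
        self.names = []   # position -> entity name
        self.parent = []  # position -> parent position

    def add(self, name):
        """Register `name` as its own group if it is new."""
        if name not in self.index:
            self.index[name] = len(self.names)
            self.names.append(name)
            self.parent.append(self.index[name])

    def find(self, name):
        """Return the representative index of `name`, compressing the path."""
        i = self.index[name]
        root = i
        while self.parent[root] != root:
            root = self.parent[root]
        # Point every node on the walked path directly at the root.
        while self.parent[i] != root:
            nxt = self.parent[i]
            self.parent[i] = root
            i = nxt
        return root

    def union(self, x, y):
        """Merge the groups of x and y; the smaller index stays representative."""
        self.add(x)
        self.add(y)
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[max(rx, ry)] = min(rx, ry)


groups = DisjointSet()
for name in ["澳門", "鏡海", "濠鏡澳", "濠江", "香港", "海鏡", "HONG KONG", "HK"]:
    groups.add(name)
for alias in ["鏡海", "濠鏡澳", "濠江", "海鏡"]:
    groups.union("澳門", alias)
for alias in ["HONG KONG", "HK"]:
    groups.union("香港", alias)

refer = [groups.find(name) for name in groups.names]
print(refer)  # prints: [0, 0, 0, 0, 4, 0, 4, 4]
```

In general, find has to walk up to the root, so a single call is not strictly constant time. Path compression flattens the structure as it goes, and once every entry points straight at its representative, each lookup is one hop.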

Example Usage

Installation

First, install the library by following the installation guide.

Initialization

All functionality is encapsulated in the Stemer class:

import entity_matching as podstem

# Initialize
ds = podstem.Stemer()

Signature of the constructor:

def __init__(self, word_definition_db_path: Path | None=None, verbose=False, model=None) -> None

word_definition_db_path: path to a cached database of entity–definition mappings. Provide a file path if you want the data to be persisted to a file.
verbose: whether to log extra information
model: specify the LLM backend (see API Key)

Example using Qwen as the LLM backend:

ds = podstem.Stemer(model='qwen-plus')

Adding Data

Entities are added as key–definition pairs. The definition is what really matters, not the key.

data_dict = {
    '澳門': 'Refers to Macao, the Special Administrative Region of China and former Portuguese colony.',
    '濠鏡澳': 'An ancient Chinese name for Macao, meaning "Oyster Mirror Bay."',
    '香港': 'Refers to Hong Kong, the Special Administrative Region of China.',
    '鏡海': 'An ancient poetic name for the waters around Macao, literally "Mirror Sea."'
}

ds.add_dict(data_dict)

Alternatively, you can add them one by one:

ds.add('澳門', 'Refers to Macao, the Special Administrative Region of China.')
ds.add('濠鏡澳', 'An ancient Chinese name for Macao, literally "Oyster Mirror Bay."')

Building the Model

Once data is added, call build() to finalize the structure:

ds.build()

⚠️ There is no going back after you call build(): you cannot add any more data afterwards. If you add extra entries and call build() a second time, the algorithm breaks and the behaviour is undefined. (A dynamic building implementation is still on the to-do list.)
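One conservative way to live with this restriction is to keep your own copy of every pair and build a brand-new object whenever the data changes. The sketch below uses only the add_dict() and build() calls shown in this post. The helper rebuild_from_scratch and the FakeStemer stand-in are hypothetical and exist only so the snippet runs without the library. Whether a rebuild reuses cached definitions or repeats LLM calls is not something this sketch can tell you.

```python
def rebuild_from_scratch(make_stemer, all_pairs):
    """Build a fresh object from the full data set instead of mutating a built one."""
    ds = make_stemer()            # e.g. podstem.Stemer in real use
    ds.add_dict(dict(all_pairs))  # pass a copy so later edits do not leak in
    ds.build()                    # called exactly once per object
    return ds


class FakeStemer:
    """Stand-in with the same two methods, used only for this illustration."""

    def __init__(self):
        self.data = {}
        self.build_calls = 0

    def add_dict(self, pairs):
        for key, value in pairs.items():
            self.data.setdefault(key, value)

    def build(self):
        self.build_calls += 1


pairs = {"澳門": "Refers to Macao."}
first = rebuild_from_scratch(FakeStemer, pairs)

# New data arrives: extend the master copy and build a new object.
pairs["香港"] = "Refers to Hong Kong."
second = rebuild_from_scratch(FakeStemer, pairs)

print(first.build_calls, second.build_calls)  # prints: 1 1
```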

Querying

You can now resolve entities to their canonical representation:

for entity in data_dict.keys():
    print(f'{entity} -> {ds.stem(entity)}')

print('Canonical representations: ' + str(ds.to_dict()))
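Coming back to the motivating example, a triplet comparison can be written on top of any stem-like function. This is a sketch under assumptions: triplets_match is a hypothetical helper, and fake_stem stands in for ds.stem so the snippet runs without the library or an API key. With the real object, the head and tail entities would have to be added before build().

```python
from typing import Callable, Sequence


def triplets_match(a: Sequence[str], b: Sequence[str],
                   stem: Callable[[str], object]) -> bool:
    """Compare two (head, relation, tail) triplets after canonicalizing head and tail."""
    return (
        len(a) == len(b) == 3
        and stem(a[0]) == stem(b[0])
        and a[1] == b[1]
        and stem(a[2]) == stem(b[2])
    )


# Hand-written canonical mapping used only for this illustration.
fake_canonical = {"澳門": "澳門", "鏡海": "澳門", "賭場": "賭場"}


def fake_stem(name):
    """Stand-in for ds.stem: unknown names map to themselves."""
    return fake_canonical.get(name, name)


a = ["澳門", "有", "賭場"]
b = ["鏡海", "有", "賭場"]

print(a == b)                           # prints: False
print(triplets_match(a, b, fake_stem))  # prints: True
```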

Asynchronous Usage

The package also supports asyncio. The only difference is the use of async/await and the abuild() method:

import entity_matching as podstem
import asyncio

async def main():
    ds = podstem.Stemer()

    data_dict = {
        '澳門': 'Refers to Macao, the Special Administrative Region of China and former Portuguese colony. A beautiful place',
        '濠鏡澳': 'An ancient Chinese name for Macao, literally meaning "Oyster Mirror Bay," referring to the area\'s geographic features before it became known as Macao.',
        '香港': 'Refers to Hong Kong, the Special Administrative Region of China and former British colony.',
        '鏡海': 'An ancient poetic name for the waters around Macao, literally meaning "Mirror Sea."'
    }

    ds.add_dict(data_dict)

    # Asynchronous entity matching
    await ds.abuild()

    for entity in data_dict.keys():
        print(f'{entity} -> {ds.stem(entity)}')

    print('Canonical representations: ' + str(ds.to_dict()))

asyncio.run(main())

Although the asynchronous version runs faster, it may use slightly more API calls, because some optimizations cannot be applied when requests run concurrently.

API Key

Since the system leverages LLMs for semantic matching, you’ll need to configure an API key. Supported models:

OpenAI ChatGPT (any version, e.g. gpt-3.5-turbo, gpt-4.1, gpt-5)
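DeepSeek (deepseek-chat; note that in DeepSeek's API this name serves the V3-series chat model, while DeepSeek-R1 is exposed as deepseek-reasoner)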
Qwen (qwen-plus)

Keys are stored in a .env file at the project root:

{
    "OPENAI_API_KEY": "...",
    "DEEPSEEK_API_KEY": "...",
    "DASHSCOPE_API_KEY": "..."
}

Caution

Once an entity is stored, the existing definition takes precedence over new data. In other words, if a key already exists in the object, adding it again will not overwrite the previous value.

For example, suppose the object already contains:

{'澳門': 'Refers to Macao, the Special Administrative Region of China and former Portuguese colony. A beautiful place'}

If you try to add a new definition such as:

{'澳門': 'Refers to the Special Administrative Region of China and former Portuguese colony.'}

the update will be ignored, and the original definition remains unchanged. This ensures consistency: once an entity–definition pair is committed to the database, it cannot be replaced.
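The same first-write-wins rule can be mimicked with a plain dictionary. This is only an illustration of the semantics, not the library's storage code, and the helper add_first_wins is hypothetical.

```python
def add_first_wins(store, key, definition):
    """Insert the pair only if `key` is new; return the definition now stored."""
    return store.setdefault(key, definition)


store = {}
original = ("Refers to Macao, the Special Administrative Region of China "
            "and former Portuguese colony. A beautiful place")
replacement = ("Refers to the Special Administrative Region of China "
               "and former Portuguese colony.")

add_first_wins(store, "澳門", original)
kept = add_first_wins(store, "澳門", replacement)  # ignored: key already exists

print(kept == original)  # prints: True
```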

To-Do

Implement dynamic building (add entities after initialization without breaking the structure)
