Entity Matching: Solving the Problem of Duplicate Entities in Knowledge Graphs
In this blog, I’ll walk through what entity matching is, why it matters in knowledge graphs, and how I designed an efficient solution—available on my GitHub repo.
Note: the repo currently has major issues (it cannot handle a large number of entities), so treat the following ideas as reference only.
Background
When working with knowledge graphs (KGs), one of the biggest challenges is that the same entity can appear under different names.
For example, consider the following two triplets:
a = ['澳門', '有', '賭場']
b = ['鏡海', '有', '賭場']
Here, both 澳門 (Macao) and 鏡海 (Mirror Sea, an old poetic name for Macao) actually refer to the same place. However, for an algorithm evaluating whether a matches b (e.g., prediction vs. ground truth), this creates a problem: a plain string comparison treats the two triplets as different.
To solve this, we need a standardized representation of each entity—so that no matter which synonym is used, the system recognizes them as the same concept.
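To make the problem concrete, here is a minimal sketch. The `CANONICAL` mapping and the `normalize` helper are hypothetical names written by hand for this illustration; they are not part of the library.

```python
# Hypothetical, hand-written mapping from each known name to one canonical name.
CANONICAL = {"澳門": "澳門", "鏡海": "澳門"}

def normalize(triplet, canonical=CANONICAL):
    """Replace subject and object with their canonical names; keep the predicate."""
    subj, pred, obj = triplet
    # Names missing from the mapping are left unchanged.
    return [canonical.get(subj, subj), pred, canonical.get(obj, obj)]

a = ['澳門', '有', '賭場']
b = ['鏡海', '有', '賭場']

print(a == b)                        # False: the raw strings differ
print(normalize(a) == normalize(b))  # True: both subjects map to '澳門'
```

The rest of the post is about how to obtain and query such a mapping efficiently.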
Mechanism
The most straightforward solution would be to build a dictionary of synonyms:
{
"澳門": ["鏡海", "濠鏡澳", "濠江", "海鏡", "龍涯門"],
"香港": ["HONG KONG", "HK"],
...
}
This works, but it has a performance drawback: to check whether two entities match, we would need to scan the synonym lists linearly, which is $\mathcal{O}(N)$ in the number of stored names.
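As a rough illustration of that linear scan, the sketch below looks up a name in the synonym dictionary. The helper names `canonical_name` and `same_entity` are made up for this example.

```python
SYNONYMS = {
    "澳門": ["鏡海", "濠鏡澳", "濠江", "海鏡", "龍涯門"],
    "香港": ["HONG KONG", "HK"],
}

def canonical_name(entity, synonyms=SYNONYMS):
    """Return the key whose synonym list contains the entity, or None."""
    if entity in synonyms:          # the entity is itself a canonical key
        return entity
    for key, names in synonyms.items():
        if entity in names:         # list membership is a linear scan
            return key
    return None

def same_entity(x, y, synonyms=SYNONYMS):
    """Two names match when both resolve to the same known key."""
    cx = canonical_name(x, synonyms)
    cy = canonical_name(y, synonyms)
    return cx is not None and cx == cy
```

In the worst case every stored name is visited before the answer is known, which is where the linear cost comes from.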
A more efficient approach is to use a disjoint-set (union-find) structure. With this structure, each entity is assigned to a group, and all synonyms in that group map to the same canonical index.
For example:
index  entity       refers to
0      '澳門'       0
1      '鏡海'       0
2      '濠鏡澳'     0
3      '濠江'       0
4      '香港'       4
5      '海鏡'       0
6      'HONG KONG'  4
7      'HK'         4
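Now, looking up an entity's standardized representation takes constant time, $\mathcal{O}(1)$, because every entry in the table points directly at its group's representative. (In a general union-find, where chains of parents can form, a lookup with path compression and union by rank is amortized near-constant.) This makes entity matching much more scalable.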
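The table above can be reproduced with a small disjoint-set sketch. This is a generic textbook-style structure written for illustration, not the code from the repo; the class and method names are assumptions. For simplicity it merges groups by keeping the smaller index as representative instead of using union by rank.

```python
class DisjointSet:
    """Minimal union-find over entity names (illustrative only)."""

    def __init__(self):
        self.index = {}      # entity name -> position
        self.entities = []   # position -> entity name
        self.parent = []     # position -> parent position

    def add(self, entity):
        """Register an entity as its own group if it is new."""
        if entity not in self.index:
            self.index[entity] = len(self.entities)
            self.entities.append(entity)
            self.parent.append(self.index[entity])
        return self.index[entity]

    def find(self, entity):
        """Return the representative index of the entity's group."""
        i = self.index[entity]
        root = i
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[i] != root:   # path compression
            nxt = self.parent[i]
            self.parent[i] = root
            i = nxt
        return root

    def union(self, x, y):
        """Merge the groups of x and y, keeping the smaller index on top."""
        self.add(x)
        self.add(y)
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[max(rx, ry)] = min(rx, ry)

ds = DisjointSet()
names = ['澳門', '鏡海', '濠鏡澳', '濠江', '香港', '海鏡', 'HONG KONG', 'HK']
for name in names:
    ds.add(name)
for synonym in ['鏡海', '濠鏡澳', '濠江', '海鏡']:
    ds.union('澳門', synonym)
for synonym in ['HONG KONG', 'HK']:
    ds.union('香港', synonym)

print([ds.find(name) for name in names])  # [0, 0, 0, 0, 4, 0, 4, 4]
```

Two names refer to the same entity exactly when `find` returns the same index for both.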
Example Usage
Installation
First, install the library by following the installation guide.
Initialization
All functionality is encapsulated in the Stemer class:
import entity_matching as podstem
# Initialize
ds = podstem.Stemer()
Signature of the constructor:
def __init__(self, word_definition_db_path: Path | None=None, verbose=False, model=None) -> None
- word_definition_db_path: path to a cached database of entity–definition mappings. If you want to record the data in a file, you have to provide the file path.
- verbose: whether to log extra information
- model: specify the LLM backend (see API Key)
Example using the Qwen LLM:
ds = podstem.Stemer(model='qwen-plus')
Adding Data
Entities are added as key–definition pairs. The definition is what really matters, not the key.
data_dict = {
'澳門': 'Refers to Macao, the Special Administrative Region of China and former Portuguese colony.',
'濠鏡澳': 'An ancient Chinese name for Macao, meaning "Oyster Mirror Bay."',
'香港': 'Refers to Hong Kong, the Special Administrative Region of China.',
'鏡海': 'An ancient poetic name for the waters around Macao, literally "Mirror Sea."'
}
ds.add_dict(data_dict)
Alternatively, you can add them one by one:
ds.add('澳門', 'Refers to Macao, the Special Administrative Region of China.')
ds.add('濠鏡澳', 'An ancient Chinese name for Macao, literally "Oyster Mirror Bay."')
Building the Model
Once data is added, call build() to finalize the structure:
ds.build()
⚠️ There is no going back after you call build(): you cannot add any more data. If you add extra entries and call build() a second time, the algorithm breaks and the behaviour is undefined! (A dynamic building implementation is still on the to-do list :)
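Until dynamic building exists, one defensive option is to wrap the object so that late additions fail loudly instead of silently corrupting the structure. The `BuildOnce` class below is a hypothetical helper, not part of the library; it only assumes the wrapped object exposes add(key, definition), add_dict(data) and build() as used in this post.

```python
class BuildOnce:
    """Hypothetical guard: forbid add()/add_dict()/build() after the first build()."""

    def __init__(self, inner):
        self._inner = inner
        self._built = False

    def _check_open(self, action):
        if self._built:
            raise RuntimeError(f"{action} is not allowed after build()")

    def add(self, key, definition):
        self._check_open("add()")
        self._inner.add(key, definition)

    def add_dict(self, data):
        self._check_open("add_dict()")
        self._inner.add_dict(data)

    def build(self):
        self._check_open("a second build()")
        self._inner.build()
        self._built = True

    def __getattr__(self, name):
        # Forward everything else (for example stem or to_dict) to the wrapped object.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._inner, name)
```

With this in place, a wrapped Stemer instance would raise a RuntimeError on a late add() or a repeated build() rather than entering the undefined state described above.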
Querying
You can now resolve entities to their canonical representation:
for entity in data_dict.keys():
    print(f'{entity} -> {ds.stem(entity)}')
print('Canonical representations: ' + str(ds.to_dict()))
Asynchronous Usage
The package also supports asyncio. The only difference is the use of async/await and the abuild() method:
import entity_matching as podstem
import asyncio

async def main():
    ds = podstem.Stemer()
    data_dict = {
        '澳門': 'Refers to Macao, the Special Administrative Region of China and former Portuguese colony. A beautiful place',
        '濠鏡澳': 'An ancient Chinese name for Macao, literally meaning "Oyster Mirror Bay," referring to the area\'s geographic features before it became known as Macao.',
        '香港': 'Refers to Hong Kong, the Special Administrative Region of China and former British colony.',
        '鏡海': 'An ancient poetic name for the waters around Macao, literally meaning "Mirror Sea."'
    }
    ds.add_dict(data_dict)
    # Asynchronous entity matching
    await ds.abuild()
    for entity in data_dict.keys():
        print(f'{entity} -> {ds.stem(entity)}')
    print('Canonical representations: ' + str(ds.to_dict()))

asyncio.run(main())
Although the asynchronous build is faster, it may use slightly more API calls, because the calls cannot be optimized as well when they run concurrently.
API Key
Since the system leverages LLMs for semantic matching, you’ll need to configure an API key. Supported models:
- OpenAI ChatGPT (any version: gpt-3.5-turbo, gpt-4.1, gpt-5)
- DeepSeek (deepseek-chat; in DeepSeek's API this name refers to the V3-series chat model, while DeepSeek-R1 is served as deepseek-reasoner)
- Qwen (qwen-plus)
Keys are stored in a .env file at the project root:
{
"OPENAI_API_KEY": "...",
"DEEPSEEK_API_KEY": "...",
"DASHSCOPE_API_KEY": "..."
}
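The key file shown above uses JSON syntax, so it can be sanity-checked with the standard library before running the package. The post does not show how the library itself reads this file, so the `load_api_keys` helper below is only an assumption-laden sketch for your own scripts, not the library's loader.

```python
import json
import os
from pathlib import Path

def load_api_keys(env_path=".env"):
    """Parse a JSON-formatted key file and export non-empty keys to os.environ.

    Returns the names that were exported. Variables that are already set in the
    environment are left untouched.
    """
    keys = json.loads(Path(env_path).read_text(encoding="utf-8"))
    loaded = []
    for name, value in keys.items():
        if value and name not in os.environ:
            os.environ[name] = value
            loaded.append(name)
    return loaded
```

If the file is malformed, json.loads raises a json.JSONDecodeError, which surfaces typos such as a trailing comma before any API call is made.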
Caution
Once an entity is stored, the existing definition takes precedence over new data. In other words, if a key already exists in the object, adding it again will not overwrite the previous value.
For example, suppose the object already contains:
{'澳門': 'Refers to Macao, the Special Administrative Region of China and former Portuguese colony. A beautiful place'}
If you try to add a new definition such as:
{'澳門': 'Refers to the Special Administrative Region of China and former Portuguese colony.'}
the update will be ignored, and the original definition remains unchanged. This ensures consistency: once an entity–definition pair is committed to the database, it cannot be replaced.
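This first-write-wins behaviour can be mimicked with a plain dictionary. The sketch below is only an analogy using dict.setdefault; the `add_definition` helper is hypothetical and says nothing about how the library stores its data internally.

```python
def add_definition(store, key, definition):
    """Insert the definition only if the key is absent; return what is stored."""
    return store.setdefault(key, definition)

store = {}
add_definition(store, '澳門', 'Refers to Macao, the Special Administrative Region of China and former Portuguese colony. A beautiful place')
kept = add_definition(store, '澳門', 'Refers to the Special Administrative Region of China and former Portuguese colony.')

print(kept.endswith('A beautiful place'))  # True: the first definition is kept
```

If a definition really needs to change, the safe route under this rule is to start from a fresh object (and a fresh database file, if one is used) rather than adding the key again.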
To-Do
- Implement dynamic building (add entities after initialization without breaking the structure)