Applicant disambiguation
The process of applicant harmonization is complex and subject to a
trade-off between accuracy and completeness. One of the main issues in
the analysis of patent applicants derives from the different ways their
names are registered in patent documents. A firm may apply for a patent
through different branches depending on its geographical location, legal
status, corporate policy, or other reasons. Name differentiation may
even be a deliberate choice by companies, meant to obstruct the business
intelligence operations of competitors. Some patent authorities have
compiled databases with the purpose of partially overcoming this issue.
Datasets compiled by trustworthy authorities are particularly useful for
research purposes. For this thesis I used the HAN
database, developed by the OECD (2019), which associates the patent
identifier appln_id with a large number of harmonized applicant names.
PATSTAT provides this feature in the table tls206_pers. Unfortunately,
the HAN database contains many false negatives, since it prioritizes
accuracy over completeness. For the purpose of this research, a higher
degree of completeness was required, so the applicant sample was
subjected to additional harmonization steps.
Another issue of applicant name harmonization derives from the fact that
patents are fully transferable property and that the PATSTAT database
does not keep track of changes of ownership after publication, so a
patent cannot be followed through its life-span. However, since we are
mostly interested in determining the innovator that developed the
original invention, this limitation can be disregarded.
First, patent applicants for the AI sample were retrieved from the original database, which returned \(96729\) unique patent applicants. A disambiguation procedure was then applied, which reduced the number of unique applicant names to \(93646\). As the first step of this procedure, the harmonized applicant names coming from the HAN database were passed to the harmonize function contained in the harmonizer package (Vlasov 2020), which performs parsing operations in the following order:
1. Cleaning spaces
2. Removing HTML codes
3. Translating non-ASCII to ASCII
4. Upper casing
5. Standardizing organizational names
6. Removing brackets
7. Cleaning spaces
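The generic cleaning steps in the list above can be illustrated with a minimal Python sketch. This is not the code of the harmonizer package, which is written in R. The function names are hypothetical, and the choice to drop bracketed text together with the brackets is an assumption. The fifth step (standardizing organizational names) is left out here.

```python
import html
import re
import unicodedata


def clean_spaces(name: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", name).strip()


def remove_html(name: str) -> str:
    """Drop HTML tags, then decode entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", name))


def to_ascii(name: str) -> str:
    """Decompose accented characters and drop anything without an ASCII form."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def remove_brackets(name: str) -> str:
    """Assumption: bracketed fragments are removed together with their content."""
    return re.sub(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}", " ", name)


def parse_name(name: str) -> str:
    """Apply the cleaning steps in the order given in the list."""
    name = clean_spaces(name)
    name = remove_html(name)
    name = to_ascii(name)
    name = name.upper()
    # Step 5 (standardizing organizational names) is omitted in this sketch.
    name = remove_brackets(name)
    return clean_spaces(name)
```

For instance, under these assumptions `parse_name("Müller   GmbH")` yields `MULLER GMBH`.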
In particular, during the fifth step, the harmonize function looks for standard company information such as ‘corporation’, ‘company’, and ‘limited’, or information related to the geographical location of the applicants such as ‘America’ or ‘Europe’, and transforms it into standardized tokens such as CORP, IN, LT, USA, or EU, which are then appended to the end of the applicant name.
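The standardization step can be sketched in Python as follows. This is only an illustration of the idea, not the harmonizer implementation: the pairing of words and tokens in `STANDARD_TOKENS` is assumed, and the function name is hypothetical.

```python
# Hypothetical pairing of source words with standardized tokens;
# the actual replacement tables of the harmonizer package may differ.
STANDARD_TOKENS = {
    "CORPORATION": "CORP",
    "INCORPORATED": "IN",
    "LIMITED": "LT",
    "AMERICA": "USA",
    "EUROPE": "EU",
}


def standardize_org_tokens(name: str) -> str:
    """Replace known words with tokens and move the tokens to the end of the name."""
    kept, tokens = [], []
    for word in name.split():
        # Ignore trailing punctuation when looking up a word.
        token = STANDARD_TOKENS.get(word.strip(".,"))
        if token is None:
            kept.append(word)
        else:
            tokens.append(token)
    return " ".join(kept + tokens)
```

With this mapping, an upper-cased name such as `ACME EUROPE LIMITED` becomes `ACME EU LT`.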
Next, these tokens were removed to increase the degree of harmonization.
The complete list of the stop-words used in the procedure can be found
in the data/applicants/stopwords.csv file in the GitHub repository
(Nardin 2021). This made it possible to gather under the same applicant
patents that were previously associated with different applicant names,
such as ‘GOOGLE LL’, ‘GOOGLE LT’ and ‘GOOGLE USA’.
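A minimal Python sketch of the stop-word removal and the resulting grouping is given below. The `STOPWORDS` set is a hypothetical subset used only for illustration (the full list is in data/applicants/stopwords.csv), the function names are assumptions, and only trailing tokens are stripped, on the assumption that the standardized tokens sit at the end of the name.

```python
from collections import defaultdict

# Hypothetical subset of the stop-words, for illustration only.
STOPWORDS = {"CORP", "IN", "LT", "LL", "USA", "EU"}


def strip_trailing_stopwords(name: str) -> str:
    """Remove stop-word tokens from the end of a name, keeping at least one word."""
    words = name.split()
    while len(words) > 1 and words[-1] in STOPWORDS:
        words.pop()
    return " ".join(words)


def group_by_applicant(names):
    """Map each stripped name to the set of original names that collapse onto it."""
    groups = defaultdict(set)
    for name in names:
        groups[strip_trailing_stopwords(name)].add(name)
    return dict(groups)
```

Applied to the three names in the example above, `group_by_applicant` returns a single group keyed by `GOOGLE`.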
Further harmonization would require grouping companies with different names that are part of the same conglomerate. However, this would require keeping track of mergers and acquisitions among companies, and I had no access to a database of this kind, since such databases are mostly licensed on a commercial basis.
References
Nardin, Alessio. 2021. “AI-as-GPT.” GitHub Repository. https://github.com/AlessioNar/AI-as-GPT; GitHub.
OECD. 2019. “OECD HAN Database.” http://www.oecd.org/science/inno/intellectual-property-statistics-and-analysis.htm#ip-data.
Vlasov, Stanislav. 2020. “Harmonizer - an R Package to Harmonize Organizational Names.” GitHub Repository. https://github.com/stasvlasov/harmonizer; GitHub.