Applicant disambiguation

The process of applicant harmonization is complex and subject to a trade-off between accuracy and completeness. One of the main issues in the analysis of patent applicants derives from the different ways in which their names are registered in patent documents. A firm may apply for a patent through different branches depending on geographical location, legal status, corporate policy, or other reasons. Name differentiation may even be a deliberate choice by companies, intended to obstruct the business intelligence operations of competitors. Some patent authorities have compiled databases to partially overcome this issue, and datasets compiled by trustworthy authorities are particularly useful for research purposes. For this thesis I used the HAN database, developed by OECD (2019), which associates a large number of harmonized applicant names with the patent identifier appln_id. PATSTAT provides this feature in the table tls206_pers. Unfortunately, the HAN database contains many false negatives, since it prioritizes accuracy over completeness. This research required a higher degree of completeness, so the applicant sample data was subjected to additional harmonization steps.

Another issue in applicant name harmonization derives from the fact that patents are fully transferable property, while the PATSTAT database does not track changes of ownership after publication and therefore does not follow a patent through its life-span. However, since we are mostly interested in identifying the innovator that developed the original invention, this limitation can be disregarded.

First, patent applicants for the AI sample were retrieved from the original database, which returned \(96729\) unique patent applicants. A disambiguation procedure was then applied, which reduced the number of unique applicant names to \(93646\). The harmonized applicant names from the HAN database were first passed to the harmonize function of the harmonizer package (Vlasov 2020), which performs the following parsing operations in order:

1. Cleaning spaces
2. Removing HTML codes
3. Translating non-ASCII to ASCII
4. Upper casing
5. Standardizing organizational names
6. Removing brackets
7. Cleaning spaces

In particular, during the fifth step, the harmonize function looks for standard company designations such as ‘corporation’, ‘company’, and ‘limited’, and for geographical references such as ‘America’ or ‘Europe’. It converts them to standardized tokens such as CORP, IN, LT, USA, or EU, which are then appended to the end of the applicant name.
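The seven steps above can be sketched in Python as follows. This is only an illustration of the sequence of operations, not the code of the harmonizer package, which is written in R. The function names, the small `ORG_TOKENS` table, and the choice to drop bracketed text entirely are assumptions made for the example; the package uses its own, much larger, set of patterns.

```python
import html
import re
import unicodedata

# Hypothetical mapping for illustration only; the real package has its own patterns.
ORG_TOKENS = [
    (r"\bCORPORATION\b", "CORP"),
    (r"\bLIMITED\b", "LT"),
    (r"\bAMERICA\b", "USA"),
    (r"\bEUROPE\b", "EU"),
]


def clean_spaces(name):
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", name).strip()


def remove_html(name):
    """Replace tags such as <br> with a space, then decode entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", name))


def to_ascii(name):
    """Decompose accented characters and drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def standardize_org(name):
    """Replace known designations with short tokens appended to the end of the name."""
    suffixes = []
    for pattern, token in ORG_TOKENS:
        name, count = re.subn(pattern, " ", name)
        if count:
            suffixes.append(token)
    return " ".join([name] + suffixes)


def remove_brackets(name):
    """Assumption: bracketed text is dropped together with the brackets."""
    return re.sub(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}", " ", name)


def harmonize(name):
    """Apply the seven steps in the order listed in the text."""
    name = clean_spaces(name)
    name = remove_html(name)
    name = to_ascii(name)
    name = name.upper()
    name = standardize_org(name)
    name = remove_brackets(name)
    return clean_spaces(name)
```

With this sketch, `harmonize("Google  America")` returns `"GOOGLE USA"`, and `harmonize("Acme &amp; Söhne Limited (Berlin)")` returns `"ACME & SOHNE LT"`.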

Next, these standardized tokens were removed to increase the degree of harmonization. The complete list of stop-words used in the procedure can be found in the data/applicants/stopwords.csv file in the GitHub repository (Nardin 2021). This made it possible to gather under a single applicant the patents that were previously associated with different applicant names, such as ‘GOOGLE LL’, ‘GOOGLE LT’ and ‘GOOGLE USA’.
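A minimal sketch of this stop-word step in Python is given below. The `STOPWORDS` set is a hypothetical stand-in built from the tokens mentioned in the text, not the content of stopwords.csv. The sketch also assumes that only trailing tokens are stripped, which is a conservative choice made for the example and may differ from the procedure actually used.

```python
from collections import defaultdict

# Hypothetical stand-in for the list stored in data/applicants/stopwords.csv.
STOPWORDS = {"CORP", "IN", "LL", "LT", "USA", "EU"}


def strip_stopwords(name, stopwords=STOPWORDS):
    """Drop trailing stop-word tokens, always keeping at least one token."""
    tokens = name.split()
    while len(tokens) > 1 and tokens[-1] in stopwords:
        tokens.pop()
    return " ".join(tokens)


def group_applicants(names, stopwords=STOPWORDS):
    """Map each stripped name to the list of original names that reduce to it."""
    groups = defaultdict(list)
    for name in names:
        groups[strip_stopwords(name, stopwords)].append(name)
    return dict(groups)
```

Applied to `["GOOGLE LL", "GOOGLE LT", "GOOGLE USA"]`, `group_applicants` returns a single group keyed by `"GOOGLE"` that contains all three original names.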

Further harmonization would require grouping companies with different names that belong to the same conglomerate. However, this would require tracking mergers and acquisitions between companies, and I had no access to such a database, since they are mostly licensed on a commercial basis.

References

Nardin, Alessio. 2021. “AI-as-GPT.” GitHub Repository. https://github.com/AlessioNar/AI-as-GPT.

OECD. 2019. “OECD HAN Database.” http://www.oecd.org/science/inno/intellectual-property-statistics-and-analysis.htm#ip-data.

Vlasov, Stanislav. 2020. “Harmonizer - an R Package to Harmonize Organizational Names.” GitHub Repository. https://github.com/stasvlasov/harmonizer.