Data collection

When dealing with patent data, a fundamental task is identifying which patents are associated with a specific technology. This generally involves two steps:

  • Defining the boundaries of a product concerning a technology;

  • Determining the ways in which those boundaries can be identified in patent data.

When this procedure is applied to AI technologies, the first step is particularly problematic. As mentioned in Chapter 1, the definition of AI has changed many times in the past, which makes it difficult for the researcher to codify that definition in a structured query. For the purposes of this dissertation, I applied the broadest possible definition of AI, following the model provided in Benzell et al. (2019).

The second step (determining the ways in which such boundaries can be identified in patent data) involves a different set of decisions by the researcher. In the literature on the methodology of patent analysis, two main methods are used to identify and locate a specific technology field. The first analyzes text sections of variable length and scope in patent documents (such as the abstract, claims, or description) using text mining techniques, which may be heuristic, machine learning, or deep learning techniques (in most cases, mixed methodologies are used). The second relies on metadata and bibliographic information, such as technological classes. Both approaches have advantages and disadvantages, and it is common practice to combine them. For the purposes of this thesis, I used the approach adopted by WIPO (Benzell et al. 2019), in which three different queries were used to identify AI technologies: a patent is considered to involve AI technologies if it is returned by at least one of the three queries.

The first query targeted patent applications assigned to the Cooperative Patent Classification (CPC) technological classes that experts in the field consider specific to AI technologies. It retrieved \(54816\) unique patent applications. The second query was based on the hypothesis that a relatively large number of AI-related patents may have been classified under codes that are not specific to AI and can only be captured using keywords. This query, which searched titles and abstracts for a selected list of keywords, returned \(6651\) unique patent applications.

The third query combined symbol-based and keyword-based search. It retrieved unique patents that were assigned either one of a set of International Patent Classification (IPC) technological classes or one of a second group of CPC technological classes, and that contained in their titles and abstracts at least one of the keywords in a second selected list. It retrieved \(34269\) unique applications. Unlike the WIPO query strategy, this third query was limited to CPC and IPC codes, since the sample was restricted to PCT patents, which have no technological classes assigned under the Japanese scheme.
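The logic of the three queries can be sketched in Python as predicates over a single application record. This is only an illustration under stated assumptions: the record layout, the helper names, and the code and keyword lists below are placeholders and not the lists used in the thesis (those are in query_codes.csv and keywords.csv). Matching classes by prefix and keywords by case-insensitive substring is also an assumption.

```python
from dataclasses import dataclass, field

# Placeholder lists for illustration only; NOT the lists used in the thesis.
AI_CPC_CODES = {"G06N3/08"}            # query 1: AI-specific CPC classes
AI_KEYWORDS = {"neural network"}       # query 2: first keyword list
BROAD_CODES = {"G06T"}                 # query 3: broader IPC/CPC classes
BROAD_KEYWORDS = {"machine learning"}  # query 3: second keyword list


@dataclass
class Application:
    """Hypothetical minimal record for one patent application."""
    appln_id: int
    title: str
    abstract: str
    classes: set = field(default_factory=set)


def has_class(app, prefixes):
    # Assumption: a class matches if it starts with one of the listed codes.
    return any(c.startswith(p) for c in app.classes for p in prefixes)


def has_keyword(app, keywords):
    # Assumption: case-insensitive substring match on title plus abstract.
    text = f"{app.title} {app.abstract}".lower()
    return any(k in text for k in keywords)


def is_ai(app):
    q1 = has_class(app, AI_CPC_CODES)                                       # classes only
    q2 = has_keyword(app, AI_KEYWORDS)                                      # keywords only
    q3 = has_class(app, BROAD_CODES) and has_keyword(app, BROAD_KEYWORDS)   # both
    return q1 or q2 or q3


apps = [
    Application(1, "Training method", "", {"G06N3/084"}),
    Application(2, "Image filter", "Uses machine learning.", {"G06T5/00"}),
    Application(3, "Image filter", "Uses a median kernel.", {"G06T5/00"}),
    Application(4, "A neural network chip", "", {"H01L21/00"}),
]
ai_ids = [a.appln_id for a in apps if is_ai(a)]
```

In this toy data, application 3 carries a broad class but no keyword, so it is not selected, which is the behaviour the third query is designed to produce.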

After dropping duplicate patents, the total subset retrieved contained \(91797\) unique applications.
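As a check on these figures, the three result sets sum to more hits than the final sample, and the difference is the number of redundant hits removed by deduplication. The sketch below reproduces that arithmetic from the reported counts and shows deduplication as a set union; the toy application IDs and the function name are hypothetical.

```python
# Reported sizes of the three result sets and of the final sample.
n_q1, n_q2, n_q3 = 54816, 6651, 34269
n_unique = 91797

total_hits = n_q1 + n_q2 + n_q3          # hits before deduplication
redundant_hits = total_hits - n_unique   # hits dropped as duplicates


def unique_applications(*result_sets):
    """Union of per-query ID sets; a set keeps each ID only once."""
    return set().union(*result_sets)


# Hypothetical toy IDs: 2 and 3 are each returned by two queries.
q1, q2, q3 = {1, 2, 3}, {3, 4}, {2, 5}
combined = unique_applications(q1, q2, q3)
```

Note that the redundant-hit count is not the number of patents returned by more than one query: a patent returned by all three queries contributes two redundant hits.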

The classification codes and keywords used for the queries are contained in the CSV files query_codes.csv and keywords.csv, available in the folder data/data_gathering of the GitHub repository of this dissertation (Nardin 2021).

4.6 Control sample

To assess the generality index of AI patents, I built a control sample to verify whether AI-related patents had, on average, a higher generality. The control sample was built on three criteria: each control patent had to be filed at the same patent authority as the AI patent (in this case, WIPO), in the same year, and had to have the closest number of forward citations. The matching strategy was 1:1, with no other weights assigned. This choice was motivated by the fact that a matching strategy based on technological classes would have increased the risk of including false negatives in the control sample, that is, AI-related patents not captured by the three queries.
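The matching rule can be sketched in Python as follows. This is a minimal illustration, not the procedure used for the thesis: the field names are hypothetical, the matching is greedy (it depends on the order of the AI patents), ties are broken by candidate order, and matching without replacement is an assumption, since the text only states that the matching was 1:1.

```python
from collections import defaultdict


def build_control_sample(treated, candidates):
    """Match each AI patent to one non-AI patent from the same authority and
    filing year with the closest forward-citation count.

    Both arguments are lists of dicts with the (hypothetical) keys
    'id', 'authority', 'year', and 'fwd_cits'.
    Returns a dict mapping each AI patent id to its control id.
    """
    # Group candidates by the exact-match criteria.
    pool = defaultdict(list)
    for c in candidates:
        pool[(c["authority"], c["year"])].append(c)

    matches = {}
    for t in treated:
        stratum = pool[(t["authority"], t["year"])]
        if not stratum:
            continue  # no candidate left in this authority-year cell
        best = min(stratum, key=lambda c: abs(c["fwd_cits"] - t["fwd_cits"]))
        matches[t["id"]] = best["id"]
        stratum.remove(best)  # assumption: each control is used at most once
    return matches


treated = [
    {"id": "A1", "authority": "WO", "year": 2015, "fwd_cits": 10},
    {"id": "A2", "authority": "WO", "year": 2015, "fwd_cits": 2},
]
candidates = [
    {"id": "C1", "authority": "WO", "year": 2015, "fwd_cits": 9},
    {"id": "C2", "authority": "WO", "year": 2015, "fwd_cits": 3},
    {"id": "C3", "authority": "WO", "year": 2016, "fwd_cits": 10},
]
matches = build_control_sample(treated, candidates)
```

Here C3 has the same citation count as A1 but a different filing year, so it is never considered as a control for A1.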

4.7 An alternative source of data: Open Patent Services

While writing this thesis, before I gained access to PATSTAT 2018b, I used the Open Patent Services (OPS) API to search for patents related to AI. To facilitate the authentication process and perform queries directly from R, I wrote several functions and bundled them in an R package, Rops, available on GitHub (Nardin 2020).
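To illustrate the authentication step, the sketch below builds (without sending) an access-token request in Python. It does not reproduce the Rops code. It assumes that OPS uses the OAuth 2.0 client-credentials flow with HTTP Basic authentication, and that the token endpoint is the URL shown; both should be checked against the current OPS documentation. The function name and the credentials are placeholders.

```python
import base64
import urllib.parse
import urllib.request

# Assumed token endpoint; verify against the current OPS documentation.
OPS_TOKEN_URL = "https://ops.epo.org/3.2/auth/accesstoken"


def build_token_request(consumer_key, consumer_secret):
    """Build a client-credentials token request (the request is not sent)."""
    # HTTP Basic authentication: base64 of "key:secret".
    credentials = f"{consumer_key}:{consumer_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")
    body = urllib.parse.urlencode({"grant_type": "client_credentials"})
    return urllib.request.Request(
        OPS_TOKEN_URL,
        data=body.encode("ascii"),
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Placeholder credentials; real ones are issued on registration with OPS.
request = build_token_request("key", "secret")
```

Sending the request with urllib.request.urlopen(request) would, under these assumptions, return a JSON body containing the access token to be passed as a Bearer token in subsequent queries.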

References

Benzell, Seth, Nick Bostrom, Erik Brynjolfsson, Yoon Chae, Frank Chen, Myriam Côté, Boi Faltings, Kay Firth-Butterfield, John Flaim, and Dario Floreano. 2019. “Technology Trends 2019: Artificial Intelligence.” WIPO.

Nardin, Alessio. 2020. “Rops: An R Package to Access OPS API by EPO.” GitHub Repository. https://github.com/AlessioNar/Rops; GitHub.

Nardin, Alessio. 2021. “AI-as-GPT.” GitHub Repository. https://github.com/AlessioNar/AI-as-GPT; GitHub.