Recommend Keywords#

To automatically recommend Wikidata keywords to users, we need to accomplish two things.

First, we tokenize the input string and find out which tokens can serve as keywords for the sentence. For this we use ckip-tagger, an NLP tool developed by the CKIP Lab of Academia Sinica. With its help, we can run named entity recognition (NER) on the input sentence and obtain candidate words that are potentially Wikidata keywords.

Second, after getting the list of candidate words, we check whether each one is a Wikidata keyword. Here we send a request to the Wikidata API to search for the word and see whether it is recorded in Wikidata.

After these two steps, we obtain a list of keywords that appear in the input string and are also Wikidata keywords. That list is what we recommend to users.
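The two steps above can be sketched as a small filtering pipeline. This is only an illustration, not the notebook's actual code: `recommend_keywords`, `extract_entities`, and `has_wikidata_match` are hypothetical names, and the two callables stand in for the NER step and the Wikidata lookup described later.

```python
from typing import Callable, Iterable, List


def recommend_keywords(
    text: str,
    extract_entities: Callable[[str], Iterable[str]],
    has_wikidata_match: Callable[[str], bool],
) -> List[str]:
    """Return entities found in `text` that also have a Wikidata match."""
    recommended: List[str] = []
    for word in extract_entities(text):
        # Skip duplicates so each keyword is recommended only once.
        if word not in recommended and has_wikidata_match(word):
            recommended.append(word)
    return recommended
```

Passing the two steps in as functions keeps the sketch independent of any particular NER model or HTTP client.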

Tip

You can choose a particular dataset by its index and explore the keyword recommendations provided. To proceed, follow these steps:

Click on the “rocket icon” located in the top-right corner.
Select the option labeled Live Code from the menu.
Once the environment is launched, you’ll be able to manually execute each code cell.

For any hidden code cell, click Show code cell source and then click run in that cell.

Load Data#

Before trying this feature, we load the input data from Depositar.

Previously, we downloaded metadata of datasets from Depositar through its API, randomly selected 10 datasets, and stored them in a file named example_depositar_data.json in the assets/ directory. Since we’ll only use this as an example input, there’s no need to update this file, and the code for calling the API is not included in this notebook.
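Loading the saved file might look like the sketch below. The path comes from the text above; the helper name `load_datasets` is hypothetical, and the assumption that the file holds a JSON list of dataset records is not confirmed by the source.

```python
import json
from pathlib import Path


def load_datasets(path="assets/example_depositar_data.json"):
    """Read the saved metadata file and return its parsed JSON content."""
    # UTF-8 is specified explicitly because the metadata contains Chinese text.
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

If the file really is a list, a single record could then be picked with `load_datasets()[dataset_idx]`.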

Obtain metadata from datasets#

We can choose one dataset by its index:

dataset_idx = 6

✨ You can change the index (from 0 to 99) to see the results for a different dataset.

Show dataset content#

The information for the selected dataset is displayed below:

Title : 林邊排水水質自然淨化工程-林邊排水水質自然淨化處理場域規劃設計
Notes : 
Resource Names : ['']
Resource Descriptions : ['']
Organization Title : 「全國水環境改善計畫」108-109年度屏東縣政府水環境改善輔導顧問團
Organization Description : 
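The fields listed above could be gathered from one record roughly as follows. This sketch assumes the record uses CKAN-style keys (`title`, `notes`, `resources`, `organization`); those key names and the helper `collect_text_fields` are assumptions, since the notebook's hidden cell is not shown here.

```python
def collect_text_fields(dataset: dict) -> dict:
    """Collect the text fields of one dataset record (CKAN-style keys assumed)."""
    # `or` guards against keys that are present but set to null in the JSON.
    resources = dataset.get("resources") or []
    organization = dataset.get("organization") or {}
    return {
        "Title": dataset.get("title") or "",
        "Notes": dataset.get("notes") or "",
        "Resource Names": [r.get("name") or "" for r in resources],
        "Resource Descriptions": [r.get("description") or "" for r in resources],
        "Organization Title": organization.get("title") or "",
        "Organization Description": organization.get("description") or "",
    }
```

Missing or empty values become empty strings, which matches the blank entries shown for this dataset.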

Step 1: NER task#

Import Models#

Here we import the transformer models and run NER on our input data.

NER task#

Output#

Below is the list of entities returned by NER, each with its entity type:

林邊 : GPE
屏東縣政府 : ORG
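A minimal sketch of this NER step is shown below, assuming the transformer models come from the `ckip-transformers` package and its `CkipNerChunker` driver (the source names ckip-tagger, so the exact package used by the notebook is an assumption). The helper `extract_entities` is hypothetical; the driver can be injected so the function can be tried without downloading a model.

```python
def extract_entities(texts, ner_driver=None):
    """Run NER over a list of strings and return unique (word, label) pairs."""
    if ner_driver is None:
        # Imported lazily: creating the driver downloads model weights on first use.
        from ckip_transformers.nlp import CkipNerChunker
        ner_driver = CkipNerChunker()

    # Empty strings carry no entities, so drop them before calling the model.
    texts = [t for t in texts if t and t.strip()]
    if not texts:
        return []

    entities = []
    # The driver returns one list of tokens per input string.
    for sentence_entities in ner_driver(texts):
        for token in sentence_entities:
            pair = (token.word, token.ner)
            if pair not in entities:
                entities.append(pair)
    return entities
```

Called on the title and organization title of the selected dataset, a function like this would yield pairs in the same `word : label` form as the output above.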

Step 2: Searching through Wikidata API#

With the candidate words obtained in the previous step, we now check whether each word is a Wikidata keyword.

Request Wikidata API#

Output#

Here are the search results:

林邊

QID: Q708239, Label: Linbian, Description: rural township of Taiwan
QID: Q7564094, Label: Woodside, Description: neighborhood in Queens, New York City, United States
QID: Q2679556, Label: Seclusion Near a Forest, Description: 1976 film by Jiří Menzel
QID: Q6550302, Label: Linbian River, Description: river in Taiwan
QID: Q1033565, Label: Linbian Station, Description: railway station
QID: Q17026596, Label: Linbian Interchange, Description: No description available
QID: Q11107072, Label: Woodside, Description: Residential building in Hong Kong
-------------------------------------------
屏東縣政府

QID: Q11042707, Label: Pingtung County Government, Description: executive branch of Pingtung County, Taiwan
QID: Q107359613, Label: Department of Public Works, Pingtung County Government, Description: No description available
QID: Q115624611, Label: 屏東縣政府消防局, Description: No description available
QID: Q115619758, Label: Pingtung County Police Bureau, Description: No description available
-------------------------------------------
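The lookup above can be approximated with the Wikidata `wbsearchentities` action. This is a sketch under assumptions, not the notebook's hidden cell: the helper names, the `zh` search language, the `en` display language, the result limit, and the User-Agent string are all illustrative choices.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(keyword, search_lang="zh", display_lang="en", limit=7):
    """Build a wbsearchentities request URL for one keyword."""
    params = {
        "action": "wbsearchentities",
        "search": keyword,
        "language": search_lang,   # language the keyword is matched in
        "uselang": display_lang,   # language of returned labels and descriptions
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def format_results(response):
    """Turn a parsed API response into printable result lines."""
    lines = []
    for item in response.get("search", []):
        label = item.get("label", "No label available")
        description = item.get("description", "No description available")
        lines.append(f"QID: {item['id']}, Label: {label}, Description: {description}")
    return lines


def search_wikidata(keyword):
    """Query the live API (needs network access) and return formatted lines."""
    request = urllib.request.Request(
        build_search_url(keyword),
        headers={"User-Agent": "keyword-recommendation-example/0.1"},  # assumed value
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return format_results(json.load(resp))
```

A word with an empty result list would simply not be recommended; a word with several hits, such as 林邊, still needs a person to pick the intended entity.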
