Recommend GeoInfo#

To automatically recommend locations along with their geographic information to users, we need to accomplish two key objectives.

First, we extract candidate keywords from the input text. To do this, we use ckiplab/bert-base-chinese-ner, a Natural Language Processing (NLP) model developed by the CKIP Lab at Academia Sinica. The model performs Named Entity Recognition (NER) on each input sentence, which yields keywords that may refer to real-world places.

Subsequently, once we have compiled a list of potential words, our next step involves verifying whether these words correspond to actual locations across the globe. For this purpose, we will make use of the Nominatim API to retrieve search results for the list of potential keywords.

Once both steps are complete, we have a list of candidate locations together with their geographic information. To give users an interactive OpenStreetMap (OSM) preview, we use folium: users pick a location from the list, paste its GeoJSON into the notebook, and see it drawn on the map.

Tip

You can choose a particular dataset by its index and explore the geographic information recommendations provided. To proceed, follow these steps:

  1. Click on the “rocket icon” located in the top-right corner.

  2. Select the option labeled Live Code from the menu.

  3. Once the environment is launched, you’ll be able to manually execute each code cell.

For any hidden code cells, simply click on Show code cell source and subsequently click run within each respective cell section.

Load Data#

Before we start trying this feature, we’ll load the input data from Depositar.

Previously, we downloaded metadata of datasets from Depositar through its API, randomly selected 10 datasets, and stored them in a file named example_depositar_data.json in the assets/ directory. Since we’ll only use this as an example input, there’s no need to update this file, and the code for calling the API is not included in this notebook.

Obtain metadata from datasets#

import json
import pandas as pd
import warnings

# function definition
def get_metadata(data_path, data_index):
    """Return a one-row DataFrame holding the text fields of one dataset."""
    with warnings.catch_warnings():
        # suppress FutureWarning messages raised while loading the data
        warnings.simplefilter(action='ignore', category=FutureWarning)
        data = pd.read_json(data_path)

        title = data.loc[data_index, 'title']
        notes = data.loc[data_index, 'notes']

        # collect the name and description of every named resource
        resources_names = []
        resources_desps = []
        for item in data.loc[data_index, 'resources']:
            if 'name' in item:
                resources_names.append(item['name'])
                # a resource may lack a description; fall back to an empty string
                resources_desps.append(item.get('description', ''))

        organization_title = data.loc[data_index, 'organization']['title']
        organization_desp = data.loc[data_index, 'organization']['description']

        df = pd.DataFrame({
            'Title': [title],
            'Notes': [notes],
            'Resource Names': [resources_names],
            'Resource Descriptions': [resources_desps],
            'Organization Title': [organization_title],
            'Organization Description': [organization_desp]
        })

        return df

We can choose one dataset by its index:

dataset_idx = 6

✨ You can change the index (from 0 to 99) to see the results for a different dataset.

Show dataset content#

The information of the selected dataset is displayed below:

# valid indices run from 0 to 99, inclusive
if 0 <= dataset_idx < 100:
    data_path = 'https://mere-cat.github.io/Metadata-Generator-for-Depositar/assets/example_depositar_data.json'
    df = get_metadata(data_path, dataset_idx)
    input_list = []
    # print every field and collect its value as input for the NER step
    for entity in df:
        print(entity, ':', df[entity][0])
        input_list.append(df[entity][0])
else:
    print('Please enter an index between 0 and 99.')
Title : 林邊排水水質自然淨化工程-林邊排水水質自然淨化處理場域規劃設計
Notes : 
Resource Names : ['']
Resource Descriptions : ['']
Organization Title : 「全國水環境改善計畫」108-109年度屏東縣政府水環境改善輔導顧問團
Organization Description : 
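
Note that `input_list` mixes plain strings with lists (the resource names and descriptions) and may contain empty values. As a minimal sketch, assuming the NER step is meant to receive one non-empty string per text field, the values could be flattened before they are passed on. The helper name `flatten_texts` is hypothetical and is not part of the original notebook.

```python
def flatten_texts(values):
    """Flatten a mix of strings and lists of strings, skipping empty entries."""
    texts = []
    for value in values:
        if isinstance(value, str):
            items = [value]          # a bare string becomes a one-element list
        elif isinstance(value, (list, tuple)):
            items = list(value)
        else:
            items = []               # ignore missing values such as None
        for item in items:
            if isinstance(item, str) and item.strip():
                texts.append(item.strip())
    return texts


# Sample values shaped like the dataset printed above (hypothetical, shortened).
sample = ["林邊排水水質自然淨化工程", "", [""], [""], "屏東縣政府水環境改善輔導顧問團", ""]
print(flatten_texts(sample))
```

Dropping empty strings keeps the NER input short, at the cost of losing the one-to-one mapping between list positions and metadata fields.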

Step 1: NER task#

Import Models#

Here we import the transformer model and run NER on our input data.

# NLP task model
from ckip_transformers.nlp import CkipNerChunker
ner_driver = CkipNerChunker(model="bert-base")

NER task#

# NER task
ner = ner_driver(input_list)

Output#

Below is the output list of the NER result:

# Show results
avoid_class = ['QUANTITY', 'CARDINAL', 'DATE', 'ORDINAL']
keyword_map = {}
for sentence_ner in ner:
    for entity in sentence_ner:
        # entity[0] is the recognized word, entity[1] is its NER class
        if entity[1] in avoid_class:
            continue
        keyword_map[entity[0]] = entity[1]

for key, value in keyword_map.items():
    print(key, ':', value)
林邊 : GPE
屏東縣政府 : ORG
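
The loop above drops a few numeric and date classes and keeps everything else. Since Step 2 only looks up places, an alternative is an allow-list of classes that can plausibly name a location. The sketch below is illustrative only: the set `PLACE_CLASSES` is an assumption, and `Token` is a stand-in namedtuple so the example runs without loading the model.

```python
from collections import namedtuple

# Stand-in for the (word, class, span) entries produced by the NER step.
Token = namedtuple("Token", ["word", "ner", "idx"])

# Assumed set of classes that can plausibly name a place.
PLACE_CLASSES = {"GPE", "LOC", "FAC", "ORG"}


def collect_place_keywords(ner_result, allowed=PLACE_CLASSES):
    """Map each distinct entity word to its class, keeping only allowed classes."""
    keywords = {}
    for sentence in ner_result:
        for token in sentence:
            if token.ner in allowed:
                keywords[token.word] = token.ner
    return keywords


# Hypothetical NER output for two input strings.
sample = [
    [Token("林邊", "GPE", (0, 2))],
    [Token("108", "DATE", (10, 13)), Token("屏東縣政府", "ORG", (18, 23))],
]
print(collect_place_keywords(sample))
```

An allow-list sends fewer irrelevant queries to the search API, but it can miss places that the model labels with an unexpected class.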

Step 2: Searching through OSM Nominatim API#

Having obtained the candidate keywords in the previous step, we now check whether each one matches a real-world location by searching for it with the Nominatim API.

Request Nominatim API#

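import requests

def search_osm_place(query):
    """Search Nominatim for `query`; return the parsed JSON list, or None on failure."""
    base_url = "https://nominatim.openstreetmap.org/search"
    params = {
        "q": query,
        "format": "json",
        "polygon_geojson": "1",  # Request GeoJSON polygons
        "limit": 7
    }
    # Nominatim's usage policy asks clients to identify themselves;
    # the value below is a placeholder and should be replaced with your own.
    headers = {"User-Agent": "depositar-geoinfo-example"}

    # The timeout keeps the notebook from hanging if the server does not answer.
    response = requests.get(base_url, params=params, headers=headers, timeout=10)

    if response.status_code == 200:
        return response.json()
    else:
        return None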
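
A short place name can match places in several countries. Nominatim's search endpoint accepts a `countrycodes` parameter that restricts results to the given ISO 3166-1 alpha-2 codes. The sketch below only builds the request URL, so it runs without network access. The helper name `build_search_url` is hypothetical, and restricting the search to Taiwan (`tw`) is an assumption about the datasets, not something the original notebook does.

```python
from urllib.parse import urlencode

BASE_URL = "https://nominatim.openstreetmap.org/search"


def build_search_url(query, country_codes=None, limit=7):
    """Build a Nominatim search URL, optionally restricted to some countries."""
    params = {
        "q": query,
        "format": "json",
        "polygon_geojson": "1",
        "limit": limit,
    }
    if country_codes:
        # Several codes are passed as one comma-separated value.
        params["countrycodes"] = ",".join(country_codes)
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("林邊", country_codes=["tw"])
print(url)
```

The same `params` dictionary could be passed to `requests.get` in place of the one used in `search_osm_place`.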

OSM Search Result#

Now, we’ll obtain a list of possible locations for each named entity.

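import json
import time

for item in keyword_map:
    result = search_osm_place(item)
    if result:
        print(f"OSM result for {item} is:")
        for place in result:
            print("📍", place["display_name"])
            # Serialize as valid JSON so the geometry can be copied and pasted as-is
            print(json.dumps(place["geojson"]))
        print('-------------------------------------------')
    else:
        print(f"No geoInfo provided for {item}.")
    time.sleep(1)  # the public Nominatim server allows at most one request per second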
OSM result for 林邊 is:
📍 林邊, 仁愛路, 光林村, 林邊鄉, 屏東縣, 927, 臺灣
{"type": "Point", "coordinates": [120.5149753, 22.4314199]}
📍 林边, 礼县, 陇南市, 甘肃省, 中国
{"type": "Point", "coordinates": [105.247, 34.1612]}
📍 林邊, 黃埔村, 烈嶼鄉, 金門縣, 894, 臺灣
{"type": "Point", "coordinates": [118.2514395, 24.4465422]}
📍 林边, 华坪县, 丽江市, 云南省, 中国
{"type": "Point", "coordinates": [101.2901069, 26.7634965]}
📍 林邊, 仁和路, 仁和村, 林邊鄉, 屏東縣, 927, 臺灣
{"type": "Point", "coordinates": [120.515038, 22.4315225]}
📍 林邊, 仁愛路, 光林村, 林邊鄉, 屏東縣, 927, 臺灣
{"type": "Point", "coordinates": [120.514982, 22.4314002]}
📍 林邊, 成功路, 中莊, 崎峯村, 林邊鄉, 屏東縣, 927, 臺灣
{"type": "Point", "coordinates": [120.4901309, 22.4302016]}
-------------------------------------------
OSM result for 屏東縣政府 is:
📍 屏東縣政府, 527, 自由路, 勝利里, 屏東市, 屏東縣, 90001, 臺灣
{"type": "Polygon", "coordinates": [[[120.4874268, 22.6826975], [120.4874494, 22.6822285], [120.4877855, 22.6822422], [120.4877771, 22.6824174], [120.4878862, 22.6824219], [120.4878886, 22.6823723], [120.4881782, 22.6823842], [120.4881759, 22.6824321], [120.4883277, 22.6824383], [120.4883346, 22.6822964], [120.4886458, 22.6823091], [120.4886243, 22.6827568], [120.4884804, 22.682751], [120.4884744, 22.6828768], [120.4885186, 22.6828786], [120.4885102, 22.6830528], [120.487475, 22.6830104], [120.4874842, 22.6828181], [120.487576, 22.6828219], [120.4875816, 22.6827038], [120.4874268, 22.6826975]], [[120.487765, 22.6827672], [120.4882821, 22.682786], [120.4882933, 22.6825238], [120.4877762, 22.682505], [120.487765, 22.6827672]]]}
📍 屏東縣政府, 527, 自由路, 勝利里, 屏東市, 屏東縣, 90001, 臺灣
{"type": "Point", "coordinates": [120.4879096, 22.6829927]}
-------------------------------------------
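
The result list can contain entries that share a display name and differ only slightly in their coordinates. As a minimal sketch, duplicates could be removed before the list is shown to the user. The helper name `dedupe_places` is hypothetical. Keying on the pair of display name and geometry type is a design assumption: it keeps a Polygon and a Point for the same address as separate entries.

```python
def dedupe_places(places):
    """Drop results that repeat an earlier (display_name, geometry type) pair."""
    seen = set()
    unique = []
    for place in places:
        key = (place["display_name"], place.get("geojson", {}).get("type"))
        if key not in seen:
            seen.add(key)
            unique.append(place)  # the first occurrence wins, so the order is kept
    return unique


# Three entries taken from the output above; the first and third share a name.
sample = [
    {"display_name": "林邊, 仁愛路, 光林村, 林邊鄉, 屏東縣, 927, 臺灣",
     "geojson": {"type": "Point", "coordinates": [120.5149753, 22.4314199]}},
    {"display_name": "林邊, 仁和路, 仁和村, 林邊鄉, 屏東縣, 927, 臺灣",
     "geojson": {"type": "Point", "coordinates": [120.515038, 22.4315225]}},
    {"display_name": "林邊, 仁愛路, 光林村, 林邊鄉, 屏東縣, 927, 臺灣",
     "geojson": {"type": "Point", "coordinates": [120.514982, 22.4314002]}},
]
for place in dedupe_places(sample):
    print(place["display_name"])
```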

Input geoJSON#

Select one of the locations above, copy its GeoJSON, and paste it into the cell below:

geoInfo = ''
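
Because the value is pasted by hand, it may help to check it before drawing the map. The sketch below parses the string and verifies that it looks like a GeoJSON geometry object. The helper name `parse_geo_info` is hypothetical, and the example string is copied from the search output above.

```python
import json

# Geometry types that carry a "coordinates" member in GeoJSON.
GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString",
    "MultiLineString", "Polygon", "MultiPolygon",
}


def parse_geo_info(text):
    """Parse a pasted GeoJSON geometry string; raise ValueError if it is unusable."""
    try:
        geometry = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(geometry, dict) or geometry.get("type") not in GEOMETRY_TYPES:
        raise ValueError("expected a GeoJSON geometry object")
    if "coordinates" not in geometry:
        raise ValueError("geometry has no 'coordinates' member")
    return geometry


example = '{"type": "Point", "coordinates": [120.5149753, 22.4314199]}'
print(parse_geo_info(example)["type"])
```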

Preview the Location in OSM#

The location you selected is displayed on the map below:

import json
import folium

center_coords = [25.041415686746607, 121.61472689731077]  # Academia Sinica
m = folium.Map(location=center_coords, zoom_start=12)

if not geoInfo:
    print("please paste the geoJSON in the 'geoInfo' string.")
else:
    # Parse the pasted text as JSON rather than evaluating it as Python code
    geojson = json.loads(geoInfo)
    folium.GeoJson(geojson).add_to(m)
    m.fit_bounds(m.get_bounds())  # zoom the map to the pasted geometry
    display(m)
please paste the geoJSON in the 'geoInfo' string.
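
The map above starts at fixed coordinates and is then fitted to the geometry. As an alternative sketch, the initial centre could be computed from the geometry itself. GeoJSON stores positions as [longitude, latitude], whereas folium expects [latitude, longitude], so the helper swaps the order. The names `iter_positions` and `geometry_center` are hypothetical.

```python
def iter_positions(coords):
    """Yield every [lon, lat] pair from arbitrarily nested GeoJSON coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords
    else:
        for part in coords:
            yield from iter_positions(part)


def geometry_center(geometry):
    """Return the bounding-box centre of a GeoJSON geometry as [lat, lon]."""
    positions = list(iter_positions(geometry["coordinates"]))
    lons = [p[0] for p in positions]
    lats = [p[1] for p in positions]
    return [(min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2]


point = {"type": "Point", "coordinates": [120.5149753, 22.4314199]}
print(geometry_center(point))
```

The returned value could be passed as `location` to `folium.Map`. The bounding-box centre is a simple approximation and is not the true centroid of a polygon.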