Understanding versions with GraphQL and Python

The taxonomy is published in multiple versions, which helps track changes and maintain backward compatibility. We begin by retrieving all the version ids. Just as in the introduction, we use the Requests library. First, some imports and definitions:

import json
import requests
import pprint
pp = pprint.PrettyPrinter(width=41, compact=True)
host = "https://taxonomy.api.jobtechdev.se"

def graphql_get(query, variables):
    response = requests.get(host + "/v1/taxonomy/graphql",
                            params={"query": query, "variables": json.dumps(variables)})
    return response.json()

The function graphql_get is a helper function to keep the rest of this code concise.
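
The helper assumes that every request succeeds. As a minimal sketch, assuming the server follows the usual GraphQL convention of reporting failures in a top-level "errors" list, a small function could check the decoded payload before its data is used. The name unwrap_graphql is hypothetical and not part of the taxonomy API:

```python
def unwrap_graphql(payload):
    """Return payload["data"], or raise if the response reports errors.

    Hypothetical helper: assumes errors arrive as a list of dicts
    under the top-level "errors" key.
    """
    errors = payload.get("errors")
    if errors:
        messages = "; ".join(str(e.get("message", e)) for e in errors)
        raise RuntimeError(f"GraphQL request failed: {messages}")
    return payload["data"]


# Example with a hand-written payload (no network access needed).
ok = unwrap_graphql({"data": {"versions": [{"id": 1}, {"id": 2}]}})
print(ok["versions"])
```

Such a check could be applied to the result of graphql_get, possibly together with response.raise_for_status() from Requests, so that HTTP and query errors surface early instead of as a KeyError later on.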

We can use it to get all versions:

version_data = graphql_get("""query MyQuery {
  versions {
    id
  }
}""", {})

versions = [x["id"] for x in version_data["data"]["versions"]]
versions.sort()

pp.pprint(versions)

Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

Let's query the number of concepts per version:

query = """query MyQuery($version: VersionRef) {
  concepts(include_deprecated: true, version: $version) {
    id
    deprecated
    preferred_label
    replaces {
      id
    }
    replaced_by {
      id
    }
  }
}"""

version_concepts = {}

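for v in versions:
    # Fetch every concept (including deprecated ones) as of version v.
    cdata = graphql_get(query, {"version": v})
    concepts = cdata["data"]["concepts"]
    # Index the concepts of this version by their id.
    version_concepts[v] = {c["id"]: c for c in concepts}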

print(f"Loaded {len(version_concepts)} versions.")
print("Number of concepts per version: ")
pp.pprint([len(version_concepts[v]) for v in versions])

Output:

Loaded 22 versions.
Number of concepts per version: 
[14715, 18227, 18361, 18406, 18418,
 21365, 24291, 26352, 26352, 26546,
 28560, 41130, 41179, 41436, 41816,
 41960, 41960, 41998, 42227, 42584,
 44942, 45647]

We see that the number of concepts never decreases from one version to the next if we count both deprecated and non-deprecated concepts (versions 8 and 9, and versions 16 and 17, have the same count). We would therefore expect the set of concepts in the last version to be the set of all concepts ever seen:

all_ids = {k for concepts in version_concepts.values()
           for k in concepts.keys()}
last_ids = set(version_concepts[versions[-1]])

print("Equal" if last_ids == all_ids else "Not equal")

Output:

Equal
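
A stricter check than comparing the last version with the union is to verify that no id ever disappears between two consecutive versions. The following is a sketch on toy data with the same shape as version_concepts. The function name ids_only_grow is our own, and we make no claim here about what it returns for the real data:

```python
def ids_only_grow(version_concepts):
    """True if every version's id set is contained in the next version's."""
    ordered = sorted(version_concepts)
    return all(set(version_concepts[a]) <= set(version_concepts[b])
               for a, b in zip(ordered, ordered[1:]))


# Toy data: version -> {concept id: concept}.
toy = {
    1: {"a": {}, "b": {}},
    2: {"a": {}, "b": {}, "c": {}},
    3: {"a": {}, "b": {}, "c": {}},
}
print(ids_only_grow(toy))  # True

toy[3].pop("a")            # simulate an id that vanishes in version 3
print(ids_only_grow(toy))  # False
```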

It would be interesting to study the history of every concept. We will create a map from every concept id to its history, that is, the concept's value in each version where it exists:

concept_histories = {concept_id:{v:cs[concept_id]
                                 for v,cs in version_concepts.items()
                                 if concept_id in cs}
                     for concept_id in all_ids}
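
As one illustration of what such a history can be used for, the sketch below lists the versions in which a concept's preferred_label differs from the label in the previous version. The function label_changes and the toy labels are invented for this example:

```python
def label_changes(history):
    """Return (version, label) pairs: the first label, followed by every
    version where the label differs from the previous existing version."""
    changes = []
    previous = None
    for version in sorted(history):
        label = history[version]["preferred_label"]
        if label != previous:
            changes.append((version, label))
            previous = label
    return changes


# Toy history with the same shape as one entry of concept_histories.
toy_history = {
    1: {"preferred_label": "Old label"},
    2: {"preferred_label": "Old label"},
    3: {"preferred_label": "New label"},
}
print(label_changes(toy_history))
# [(1, 'Old label'), (3, 'New label')]
```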

Some concepts have certainly been deprecated. We will start by looking at deprecated concepts that have not been replaced by any other concept. To filter the concepts, we write two predicates, has_deprecated and has_replaced_by, on top of a small helper, map_has_value.


def map_has_value(m, pred):
    """Return True if pred holds for at least one value of the mapping m."""
    return any(pred(v) for v in m.values())


def has_deprecated(history):
    return map_has_value(history, lambda concept: concept["deprecated"])

def has_replaced_by(history):
    return map_has_value(history, lambda concept: len(concept["replaced_by"]) > 0)

histories_of_interest = [(concept_id, history)
                         for concept_id, history in concept_histories.items()
                         if has_deprecated(history) and not has_replaced_by(history)]

(concept_id, history) = histories_of_interest[0]

exist_versions = [v for v in versions if v in history]
deprecated_in_versions = [v for v in exist_versions if history[v]["deprecated"]]

label = history[exist_versions[0]]["preferred_label"]

print(f"Have a look at concept {concept_id} with first label {label}")
print(f"It exists in versions {exist_versions}")
print(f"It is marked as deprecated in {deprecated_in_versions}")

Output:

Have a look at concept Tgsx_o7B_Tpo with first label Musik
It exists in versions [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
It is marked as deprecated in [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
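
The first version in which a concept is deprecated is often the most interesting one. A minimal sketch, using a toy history that mimics the output above (the helper name first_deprecated_version is our own):

```python
def first_deprecated_version(history):
    """Return the earliest version where the concept is deprecated, or None."""
    deprecated = [v for v, concept in history.items() if concept["deprecated"]]
    return min(deprecated, default=None)


# Toy history: exists in versions 1..22, deprecated from version 11 onwards.
toy_history = {v: {"deprecated": v >= 11} for v in range(1, 23)}
print(first_deprecated_version(toy_history))  # 11
```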

The concept Tgsx_o7B_Tpo does not have any replacements (its replaced_by value is empty in every version). Let's look at an example where replaced_by has a value:

histories_of_interest = [(concept_id, history)
                         for concept_id, history in concept_histories.items()
                         if has_deprecated(history) and has_replaced_by(history)]

(concept_id, history) = histories_of_interest[0]

exist_versions = [v for v in versions if v in history]
deprecated_in_versions = [v for v in exist_versions if history[v]["deprecated"]]

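import itertools

# itertools.groupby only merges adjacent items with equal keys,
# so the pairs are sorted by replacement id (then version) first.
replaced_by_id_version_pairs = sorted((r["id"], v)
                                      for v, concept in history.items()
                                      for r in concept["replaced_by"])
groups = itertools.groupby(replaced_by_id_version_pairs, lambda kv: kv[0])
replaced_by_id_version_map = {replacement_id: [p[1] for p in pairs]
                              for replacement_id, pairs in groups}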

label = history[exist_versions[0]]["preferred_label"]

print(f"Have a look at concept {concept_id} with first label {label}")
print(f"It exists in versions {exist_versions}")
print(f"It is marked as deprecated in {deprecated_in_versions}")
print("It has been replaced by")
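# Use fresh names here: reusing "versions" would overwrite the global version list.
for replacement_id, in_versions in replaced_by_id_version_map.items():
    print(f" * Concept {replacement_id} in versions {in_versions}")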

Output:

Have a look at concept AWt8_idw_DYJ with first label Odlare av trädgårds- och jordbruksväxter, frukt och bär
It exists in versions [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
It is marked as deprecated in [13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
It has been replaced by
 * Concept Wv97_wRL_3LS in versions [13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
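
Note that itertools.groupby only groups consecutive items with equal keys, so its result depends on the order of its input. As an alternative sketch that does not depend on ordering, the same map can be built with collections.defaultdict. The function name group_versions_by_replacement and the ids "x" and "y" are made up for this example:

```python
from collections import defaultdict


def group_versions_by_replacement(history):
    """Map each replacement id to the sorted versions where it is listed."""
    grouped = defaultdict(list)
    for version in sorted(history):
        for replacement in history[version]["replaced_by"]:
            grouped[replacement["id"]].append(version)
    return dict(grouped)


# Toy history where a concept is replaced by two concepts at once.
toy_history = {
    1: {"replaced_by": []},
    2: {"replaced_by": [{"id": "x"}, {"id": "y"}]},
    3: {"replaced_by": [{"id": "x"}, {"id": "y"}]},
}
print(group_versions_by_replacement(toy_history))
# {'x': [2, 3], 'y': [2, 3]}
```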

Let's take a look at the replacement concept, as it appears in the first version where it is listed as a replacement:

(rid, rversions) = next(iter(replaced_by_id_version_map.items()))
rconcept = concept_histories[rid][rversions[0]]
pp.pprint(rconcept)

Output:

{'deprecated': False,
 'id': 'Wv97_wRL_3LS',
 'preferred_label': 'Odlare av '
                    'trädgårds- och '
                    'jordbruksväxter '
                    'samt djuruppfödare '
                    'blandad drift',
 'replaced_by': [],
 'replaces': [{'id': 'AWt8_idw_DYJ'}]}

Let's also see if it does indeed replace the previous concept:


replaces_id_set = {c["id"] for c in rconcept["replaces"]}
print(f"Is it true that {rid} replaces {concept_id}? {concept_id in replaces_id_set}")

Output:

Is it true that Wv97_wRL_3LS replaces AWt8_idw_DYJ? True
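
A replacement could in principle be deprecated and replaced in turn. Whether such chains occur in the real data is not something we have checked here, but a sketch of following replaced_by links to their end could look like this. The function resolve_replacements and the toy ids are hypothetical:

```python
def resolve_replacements(concept_id, concepts):
    """Follow replaced_by links and return the ids that have no replacement."""
    resolved = set()
    seen = set()
    stack = [concept_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the links
        seen.add(current)
        successors = [r["id"] for r in concepts[current]["replaced_by"]]
        if successors:
            stack.extend(successors)
        else:
            resolved.add(current)
    return resolved


# Toy data with the same shape as one version of version_concepts.
toy = {
    "a": {"replaced_by": [{"id": "b"}]},
    "b": {"replaced_by": [{"id": "c"}, {"id": "d"}]},
    "c": {"replaced_by": []},
    "d": {"replaced_by": []},
}
print(sorted(resolve_replacements("a", toy)))  # ['c', 'd']
```

With the real data, concepts would be the map for a single version, for example version_concepts[22].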

Hopefully, you now understand how versions and deprecation work.