ASF Pelican data model

In 2019 Infra created ASF-Pelican as a structure and template for projects to use to build their websites, and for the ASF's own website.

In 2024, Infra moved from ASF-Pelican to the ASF Infrastructure Pelican Action GitHub Action to perform the same functions without being closely tied to BuildBot. The repository for this GHA is github.com/apache/infrastructure-actions/tree/main/pelican.

The material below is relevant to the older Pelican build system and the Pelican GitHub Action.


## ASF Data

If your site includes the asfdata.py plugin, the Pelican site generator reads instructions from it during initialization and creates shared metadata that is available for all pages. It is particularly critical for ezmd pages that contain directives.

The pelicanconf.yaml file contains the following:

setup:
  data: asfdata.yaml
  • data is a .yaml file of metadata instructions.

Within the plugin there are three kinds of data transformations:

  1. Constant key-value pairs.
  2. Specific sequences that are custom code specific to the datasource:
    • Twitter feed uses the Twitter Recent Tweet API
    • Blogs reads a Roller Atom feed in XML
    • ECCN reads export notifications from a .yaml file
  3. Multiple data models derived from a single .yaml or json file:
    • Committee info, which has Board, Officer, Committee, and Project information
    • Podling info, which has Incubator podling information

Key value metadata

These are provided in the ASF_DATA['data'] file:

# key-value pairs
code_lines: 227M
code_changed: 4.2B
code_commits: 4.1M
asf_members: 820
# For use as nnn+ or 'more than nnn'
asf_members_rounded: 800
asf_committers: 8,100
asf_contributors: 40,000
asf_people: 488,000
com_initiatives: 350
com_projects: 300
com_podlings: 37
com_downloads: ~2 Petabytes
com_emails: 24M
com_mailinglists: 1,400
com_pageviews: 35M

Recent tweets

This sequence uses specific code:

# used on index.ezmd
twitter:
  # load, transform, and create a sequence of tweets
  handle: 'TheASF'
  count: 1

The key method for reading recent tweets from the API.

# retrieve the last count recent tweets from the handle.
def process_twitter(handle, count):
    print(f'-----\ntwitter feed: {handle}')
    bearer_token = twitter_auth()
    query = f'from:{handle}'
    tweet_fields = 'tweet.fields=author_id'
    url = f'https://api.twitter.com/2/tweets/search/recent?query={query}&{tweet_fields}'
    headers = {'Authorization': f'Bearer {bearer_token}'}
    load = connect_to_endpoint(url, headers)
    reference = sequence_list('twitter', load['data'])
    if load['meta']['result_count'] < count:
        v = reference
    else:
        v = reference[:count]
    return v

Recent blog posts

The main Apache site uses three different blog feeds. Here is how the site calls one, as an example:

# used on index.ezmd
foundation:
  # load, transform, and create a sequence of foundation blogs
  blog: https://blogs.apache.org/foundation/feed/entries/atom
  count: 1

The site is only interested in the most recent post's title and id/url.

# retrieve blog posts from an Atom feed.
def process_blog(feed, count, debug):
    print(f'blog feed: {feed}')
    content = requests.get(feed).text
    dom = xml.dom.minidom.parseString(content)
    # dive into the dom to get 'entry' elements
    entries = dom.getElementsByTagName('entry')
    # we only want count many from the beginning
    entries = entries[:count]
    v = [ ]
    for entry in entries:
        if debug:
            print(entry.tagName)
        # we only want the title and href
        v.append(
            {
                'id': get_element_text(entry, 'id'),
                'title': get_element_text(entry, 'title'),
            }
        )
    if debug:
        for s in v:
            print(s)

    return [ Blog(href=s['id'],
                  title=s['title'])
             for s in v ]

Note the use of the Blog class. Its definition is in the code for the next example.

ECCN Data Sequences

The ECCN data matrix is records a project's bisnotice emails regarding encryption code. It has four layers: projects, products, versions, and controlled sources. This information is primarily of interest to the main Apache site, but may be useful to those working on project websites in showing how the system secures and processes data.

Here are the ASF_DATA directives

# used on licenses/exports/index.ezmd
eccn:
  # load, transform, and create a four tiered structure of sequence objects
  # projects, products, versions, and sources
  file: data/eccn/eccnmatrix.yaml

Here is a sample of the data for the first project.

eccnmatrix:
  - href: 'http://accumulo.apache.org/'
    name: Apache Accumulo Project
    contact: John Vines
    product:
      - name: Apache Accumulo Project
        versions:
          - version: development
            eccn: 5D002
            source:
              - href: 'https://git-wip-us.apache.org/repos/asf/accumulo.git'
                manufacturer: ASF
                why: Designed for use with built in Java encryption libraries
              - href: 'http://www.bouncycastle.org/download/bcmail-jdk15-137.tar.gz'
                manufacturer: Bouncy Castle
                why: General-purpose encryption library for Java 1.5
          - version: 1.6.0 and on
            eccn: 5D002
            source:
              - href: 'https://git-wip-us.apache.org/repos/asf/accumulo.git'
                manufacturer: ASF
                why: Designed for use with built in Java encryption libraries
              - href: 'http://www.bouncycastle.org/download/bcmail-jdk15-137.tar.gz'
                manufacturer: Bouncy Castle
                why: General-purpose encryption library for Java 1.5
          - version: 1.5.x
            eccn: 5D002
            source:
              - href: 'https://git-wip-us.apache.org/repos/asf/accumulo.git'
                manufacturer: ASF
                why: Designed for use with built in Java encryption libraries
  ...

Here is the custom processing for the ECCN data matrix. Note the wrappers.

# create sequence of sequences of ASF ECCN data.
def process_eccn(fname):
    print('-----\nECCN:', fname)
    j = yaml.safe_load(open(fname))

    # versions have zero or more controlled sources
    def make_sources(sources):
        return [ Source(href=s['href'],
                        manufacturer=s['manufacturer'],
                        why=s['why'])
                 for s in sources ]

    # products have one or more versions
    def make_versions(vsns):
        return [ Version(version=v['version'],
                         eccn=v['eccn'],
                         source=make_sources(v.get('source', [ ])),
                         )
                 for v in sorted(vsns,
                                 key=operator.itemgetter('version')) ]

    # projects have one or more products
    def make_products(prods):
        return [ Product(name=p['name'],
                         versions=make_versions(p['versions']),
                         )
                 for p in sorted(prods,
                                 key=operator.itemgetter('name')) ]

    # eccn matrix has one or more projects
    return [ Project(name=proj['name'],
                     href=proj['href'],
                     contact=proj['contact'],
                     product=make_products(proj['product']))
             for proj in sorted(j['eccnmatrix'],
                                key=operator.itemgetter('name')) ]


# object wrappers
class wrapper:
    def __init__(self, **kw):
        vars(self).update(kw)

# Improve the names when failures occur.
class Source(wrapper): pass
class Version(wrapper): pass
class Product(wrapper): pass
class Project(wrapper): pass
class Blog(wrapper): pass

Committee Info

The committee info data contains three data structures: - officers, committees, and the board of directors. Again, this is primarily of interest to the main Apache site.

From these we derive:

  • Board of Directors sequence
  • Officers sequence
  • Committees / PMCs sequence
  • CI - Officers including PMC VPs dictionary
  • Projects sequence
  • Featured projects - a random sample of three projects.
  • Project directory columns

Board of Directors

The Board of Directors sequence is derived first.

  ...
  "board": {
    "roster": {
      "bdelacretaz": {
        "name": "Bertrand Delacretaz"
      },
      "fielding": {
        "name": "Roy T. Fielding"
      },
      "sharan": {
        "name": "Sharan Foga"
      },
      "jmclean": {
        "name": "Justin Mclean"
      },
      "rubys": {
        "name": "Sam Ruby"
      },
      "clr": {
        "name": "Craig L Russell"
      },
      "rvs": {
        "name": "Roman Shaposhnik"
      },
      "striker": {
        "name": "Sander Striker"
      },
      "wusheng": {
        "name": "Sheng Wu"
      }
    }
  }
  ...

Here are the directives

ci:
  # load, transform, and create data sequences from committee info 
  url: https://whimsy.apache.org/public/committee-info.json
  board:
    # used on /foundation/ and /foundation/board/
    description: 'Board of Directors sequence'
    # select ci['board']['roster'] for the sequence
    path: board.roster

Here is the Python code used to select the board roster from committee info.

    # select sub dictionary
    if 'path' in sequence:
        print(f'path: {sequence["path"]}')
        parts = sequence['path'].split('.')
        for part in parts:
            reference = reference[part]

The following procedure converts a dictionary into a sequence of objects with attributes.

# convert a dictionary into a sequence (list)
def sequence_dict(seq, reference):
    sequence = [ ]
    for refs in reference:
        # converting dicts into objects with attributes. Ignore non-dict content.
        if isinstance(reference[refs], dict):
            # put the key of the dict  into the dictionary
            reference[refs]['key_id'] = refs
            for item in reference[refs]:
                if isinstance(reference[refs][item], bool):
                    # fix up any Boolean values to be ezt.boolean - essentially True -> "yes"
                    reference[refs][item] = ezt.boolean(reference[refs][item])
            # convert the dict into an object with attributes and append to the sequence
            sequence.append(type(seq, (), reference[refs]))
    return sequence

Officers

How the system assembles the list of Foundation officers.

    "boardchair": {
      "display_name": "Board Chair",
      "paragraph": "Executive Officers",
      "roster": {
        "striker": {
          "name": "Sander Striker"
        }
      }
    },

Committees / PMC Chairs. Roster and Reporting is omitted.

    ...
    "zookeeper": {
      "display_name": "ZooKeeper",
      "site": "http://zookeeper.apache.org/",
      "description": "Centralized service for maintaining configuration information",
      "mail_list": "zookeeper",
      "established": "11/2010",
      "chair": {
        "fpj": {
          "name": "Flavio Junqueira"
        }
      },
      "pmc": true
    },
    "legal": {
      "display_name": "Legal Affairs",
      "site": null,
      "description": null,
      "mail_list": "legal",
      "established": "03/2007",
      "chair": {
        "rvs": {
          "name": "Roman Shaposhnik"
        }
      },
      "pmc": false,
      "paragraph": "Board Committees"
    },
    ...

Here are the directives that create metadata models from the above data.

  officers:
    description: 'Foundation Officers sequence'
    # select ci['officers'] for the sequence
    path: officers
    # convert ci['officers']['roster']
    asfid: roster
  committees:
    description: 'Foundation Committees sequence'
    # ci['committees']
    path: committees
    # remove all report and roster dictionaries from committees
    trim: report,roster
    # convert ci['committees']['chair']
    asfid: chair
  ci:
    # used on /foundation/
    description: 'Dictionary of officers and committees'
    # save a merged dictionary version of these sequences.
    dictionary: officers,committees

We've already seen the code for path above. Here is the code that invokes trim and asfid:

    # remove irrelevant keys
    if 'trim' in sequence:
        print(f'trim: {sequence["trim"]}')
        parts = sequence['trim'].split(',')
        for part in parts:
            remove_part(reference, part)

    # transform roster and chair patterns
    if 'asfid' in sequence:
        print(f'asfid: {sequence["asfid"]}')
        asfid_part(reference, sequence['asfid'])

Here is the code that trims a key from a dictionary:

# remove parts of a data source we don't want ro access
def remove_part(reference, part):
    for refs in reference:
        if refs == part:
            del reference[part]
            return
        elif isinstance(reference[refs], dict):
            remove_part(reference[refs], part)

Here is the code that rearranges the chair or officer (roster) so that the dictionary is flattened before sequencing.

# rotate a roster list singleton into an name and availid 
def asfid_part(reference, part):
    for refs in reference:
        fix = reference[refs][part]
        for k in fix:
            availid = k
            name = fix[k]['name']
        reference[refs][part] = name
        reference[refs]['availid'] = availid

The ci data model is a dictionary we need to improve the display of officers on the ASF main site.

    # this dictionary is derived from sub-dictionaries
    if 'dictionary' in sequence:
        print(f'dictionary: {sequence["dictionary"]}')
        reference = { }
        paths = sequence['dictionary'].split(',')
        # create a dictionary from the keys in one or more sub-dictionaries
        for path in paths:
            for key in load[path]:
                reference[key] = load[path][key]
        # dictionary result, do not sequence
        is_dictionary = True

Projects / PMCs

For sequences about projects we first derive a project list from the committee list. We supplement it with each project's initial letter to provide an alphabetical project index.

  projects:
    description: 'Current Projects'
    # ci['committees']
    path: committees
    # select only where 'pmc' is true.
    where: pmc
    # sort by project name
    alpha: display_name

Here's the code that trims non-pmcs from the committee list.

# trim out parts of a data source that don't match part = True
def where_parts(reference, part):
    # currently only works on True parts
    # if we trim as we go we invalidate the iterator. Instead create a deletion list.
    filtered = [ ]
    # first find the list that needs to be trimmed.
    for refs in reference:
        if not reference[refs][part]:
            filtered.append(refs)
    # remove the parts to be trimmed.
    for refs in filtered:
        del reference[refs]

This code provides an alphabetical index for the product index derived below.

# perform alphabetation. HTTP Server is special and is put before 'A'
def alpha_part(reference, part):
    for refs in reference:
        name = reference[refs][part]
        if name == 'HTTP Server':
            # when sorting by letter HTTPD Server is wanted first
            letter = ' '
        else:
            letter = name[0].upper()
        reference[refs]['letter'] = letter

Featured projects

On the front page on the main ASF site we feature a random sample of projects. We also want to display a project's logo.

  featured_projs:
    # used on /
    description: 'Featured Projects'
    # base on projects sequence
    sequence: projects
    # take a random sample of 3
    random: 3
    # logo path - use apache powered by if missing
    logo: /logos/res/{}/default.png,/foundation/press/kit/poweredBy/Apache_PoweredBy.svg

Here is the code to copy a sequence, take a random sample, add the logo (or the ASF feather if there is no product logo), and, for featured podlings, adjust the name:

    # this sequence is derived from another sequence
    if 'sequence' in sequence:
        print(f'sequence: {sequence["sequence"]}')
        reference = metadata[sequence['sequence']]
        # sequences derived from prior sequences do not need to be converted to a sequence
        is_sequence = True

    # this sequence is a random sample of another sequence
    if 'random' in sequence:
        print(f'random: {sequence["random"]}')
        if is_sequence:
            reference = random.sample(reference, sequence['random'])
        else:
            print(f'{seq} - random requires an existing sequence to sample')

    # for a project or podling see if the logo exists w/HEAD and set the relative path.
    if 'logo' in sequence:
        print(f'logo: {sequence["logo"]}')
        if is_sequence:
            # determine the project or podling logo
            reference = add_logo(reference, sequence['logo'])
            if seq == 'featured_pods':
                # for podlings strip "Apache" from the beginning and "(incubating)" from the end.
                # this is Sally's request
                for item in reference:
                    setattr(item, 'name', ' '.join(item.name.split(' ')[1:-1]))
        else:
            print(f'{seq} - logo requires an existing sequence')

Here is the detailed check to see if a project or podling logo is available.

# add logo attribute with HEAD check for existence. If nonexistent use default.
def add_logo(reference, part):
    # split between logo pattern and default.
    parts = part.split(',')
    for item in reference:
        # the logo pattern includes a place to insert the project/podling key
        logo = (parts[0].format(item.key_id))
        # HEAD request
        response = requests.head('https://www.apache.org/' + logo)
        if response.status_code != 200:
            # logo not found - use the default logo
            logo = parts[1]
        # save the logo path as an attribute
        setattr(item, 'logo', logo)
    return reference

Project index

At the bottom of the main page of the ASF site we display a project index that includes headings for each letter of the alphabet:

  pl:
    # used on /
    description: 'Project List Columns'
    # base on projects sequence
    sequence: projects
    # split into 6 column sequence adding letters of the alphabet and putting httpd first
    split: 6

This code derives the sequences for the six columns in the display. The output metadata is pl_0, pl_1, pl_2, pl_3, pl_4, and pl_5:

# split a list into equal sized columns. Adds letter breaks in the alphabetical sequence.
def split_list(metadata, seq, reference, split):
    # copy sequence
    sequence = list(reference)
    # sort the copy
    sequence.sort(key=lambda x: (x.letter, x.display_name))
    # size of list
    size = len(sequence)
    # size of columns
    percol = int((size+26+split-1)/split)
    # positions
    start = nseq = nrow = 0
    letter = ' '
    # create each column
    for column in range(split):
        subsequence = [ ]
        end = min(size+26, start+percol)
        while nrow < end:
            if letter < sequence[nseq].letter:
                # new letter - add a letter break into the column. If a letter has no content it is skipped
                letter = sequence[nseq].letter
                subsequence.append(type(seq, (), { 'letter': letter, 'display_name': letter }))
            else:
                # add the project into the sequence
                subsequence.append(sequence[nseq])
                nseq = nseq+1
            nrow = nrow+1
        # save the column sequence in the metadata
        metadata[f'{seq}_{column}'] = subsequence
        start = end
    if nseq < size:
        print(f'WARNING: {seq} not all of sequence consumed: short {size-nseq} projects')

Adding a data source

Before you code to add a data source, evaluate which of the above patterns it fits.

  • Is it a custom pattern as for Twitter, blogs, and ECCN?
  • Can it follow the sequence of directives used for committee iInfo?
  • If it is a sequence of directives, does it need a new one?

Adding a custom data source

    # Lift data from ASF_DATA['data'] into METADATA
    if 'data' in asf_data:
        print(f'Processing {asf_data["data"]}')
        config_data = read_config(asf_data['data'])
        for key in config_data:
            # first check for data that is a singleton with special handling
            if key == 'eccn':
                # process eccn data
                fname = config_data[key]['file']
                metadata[key] = v = process_eccn(fname)
                if debug:
                    print('ECCN V:', v)
                continue

            if key == 'twitter':
                # process twitter data
                # if we decide to have multiple twitter feeds available then move next to blog below
                handle = config_data[key]['handle']
                count = config_data[key]['count']
                metadata[key] = v = process_twitter(handle, count)
                if debug:
                    print('TWITTER V:', v)
                continue

For a custom singletons add your call to your new process_X code here, following the pattern for ECCN or Twitter.

       	    value = config_data[key]
            if isinstance(value, dict):
                # dictionaries may have multiple data structures that are processed with a sequence of actions
                # into multiple sequences and dictionaries.
                print(f'-----\n{key} creates one or more sequences')
                if debug:
                    print(value)
                # special cases that are multiple are processed first
                if 'blog' in value:
                    # process blog feed
                    feed = config_data[key]['blog']
                    count = config_data[key]['count']
                    metadata[key] = v = process_blog(feed, count, debug)
                    if debug:
                        print('BLOG V:', v)
                    continue

For custom non-singletons add your call to your new process_X code here, following the pattern for a blog feed.

Adding a directive to the sequence process

If you are adding a directive, add it to process_sequence in the place you need it (process order can matter):

# process sequencing transformations to the data source
def process_sequence(metadata, seq, sequence, load, debug):
    reference = load
    # has been converted to a sequence
    is_sequence = False
    # has been converted to a dictionary - won't be made into a sequence
    is_dictionary = False
    # save metadata at the end
    save_metadata = True

    # description
    if 'description' in sequence:
        print(f'{seq}: {sequence["description"]}')

    # select sub dictionary
    if 'path' in sequence:
        print(f'path: {sequence["path"]}')
        parts = sequence['path'].split('.')
        for part in parts:
            reference = reference[part]

    # filter dictionary by attribute value. if filter is false discard
    if 'where' in sequence:
        print(f'where: {sequence["where"]}')
        where_parts(reference, sequence['where'])

    # remove irrelevant keys
    if 'trim' in sequence:
        print(f'trim: {sequence["trim"]}')
        parts = sequence['trim'].split(',')
        for part in parts:
            remove_part(reference, part)

    # transform roster and chair patterns
    if 'asfid' in sequence:
        print(f'asfid: {sequence["asfid"]}')
        asfid_part(reference, sequence['asfid'])

    # add first letter of alphabetic categories
    if 'alpha' in sequence:
        print(f'alpha: {sequence["alpha"]}')
        alpha_part(reference, sequence['alpha'])

    # this dictionary is derived from sub-dictionaries
    if 'dictionary' in sequence:
        print(f'dictionary: {sequence["dictionary"]}')
        reference = { }
        paths = sequence['dictionary'].split(',')
        # create a dictionary from the keys in one or more sub-dictionaries
        for path in paths:
            for key in load[path]:
                reference[key] = load[path][key]
        # dictionary result, do not sequence
        is_dictionary = True

    # this sequence is derived from another sequence
    if 'sequence' in sequence:
        print(f'sequence: {sequence["sequence"]}')
        reference = metadata[sequence['sequence']]
        # sequences derived from prior sequences do not need to be converted to a sequence
        is_sequence = True

    # this sequence is a random sample of another sequence
    if 'random' in sequence:
        print(f'random: {sequence["random"]}')
        if is_sequence:
            reference = random.sample(reference, sequence['random'])
        else:
            print(f'{seq} - random requires an existing sequence to sample')

    # for a project or podling see if the logo exists w/HEAD and set the relative path.
    if 'logo' in sequence:
        print(f'logo: {sequence["logo"]}')
        if is_sequence:
            # determine the project or podling logo
            reference = add_logo(reference, sequence['logo'])
            if seq == 'featured_pods':
                # for podlings strip "Apache" from the beginning and "(incubating)" from the end.
                # this is Sally's request
                for item in reference:
                    setattr(item, 'name', ' '.join(item.name.split(' ')[1:-1]))
        else:
            print(f'{seq} - logo requires an existing sequence')

    # this sequence is a sorted list divided into multiple columns
    if 'split' in sequence:
        print(f'split: {sequence["split"]}')
        if is_sequence:
            # create a sequence for each column
            split_list(metadata, seq, reference, sequence['split'])
            # created column sequences are already saved to metadata so do not do so later
            save_metadata = False
        else:
            print(f'{seq} - split requires an existing sequence to split')

Copyright 2024, The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache® and the Apache feather logo are trademarks of The Apache Software Foundation.