Hosted on SourceForge
and licensed under
Creative Commons GNU GPL.
This documentation will be expanded during Q1 2008. In the meantime, the notes below tackle some of the main design aspects of SportsDB. And the schematic database diagram and generated table- and field-level documentation found on the Specification Page are also helpful.
As can be seen on the Schematic Diagram, tables that are closely related to one another are boxed together as groups. Those groups that are applicable to any and all sorts of sports run down the left side of the diagram, and those that are specific to a particular sport are presented down the right side. This distinction between "core" and "specific" properties is also made in the companion SportsML standard. Also as with SportsML, even the specific-sport tables and fields are constructed in a consistent manner, so that once software is written for one sport, it is readily adaptable to work for other sports.
In many architectures, a SportsDB gets populated by a piece of middleware that processes a syndicated sports data file, preferably in XML, and even more preferably in SportsML. Of course, this is not the only way that a SportsDB could get filled with information. Custom "CRUD" systems (CReate/Update/Delete) can be built to load data directly into various tables. In such cases, tables in the documents group would likely remain untouched.
But for those cases in which SportsML messages come through and get parsed and loaded into SportsDB, then tables within the documents group hold all the info about each file, including filename, file path, and metadata.
| Selected Tables from Documents Group | |
|---|---|
| documents | One row per document coming down the feed. |
| document_contents | Stores contents of the abstract (if any), and Path to the full SportsML |
| teams_documents, persons_documents, affiliations_documents, document_fixtures | Stores metadata about each document |
| events_documents | Stores the event-key (unique game ID) for any event listed in each document |
| latest_revisions | Stores the latest doc-id for any given revision-id (allowing you to track the latest version of a document) |
| Selected Fields in the Documents Table | |
|---|---|
| doc_id | The internal string from the document that purports to be its global, unique key |
| date_time | The datetime that this document was published |
| title | A displayable title for search results |
| language | The language that prose within the document was published in. |
| publisher_id | The publisher of the document |
| document_fixture_id | Very important. Used for document type. |
| db_loading_date_time | When the document was loaded into SportsDB |
| Selected Tables from Events Group | |
|---|---|
| events | One row per event. |
| participants_events | One row per participant -- could be a team, as with basketball games, or a player, as with a typical golf match. |
| periods | One row per participant per period/inning/quarter. Includes the subscore for that participant (e.g., the number of points scored by the New York Knicks in the 2nd Quarter of a basketball game) |
| sub_periods | One row for each sub-period in the period. For example, the number of points scored by Pete Sampras in the 4th game of the 5th set of a tennis match) |
| affiliations_events | The league(s) this event is a part of. |
| events_sub_seasons | The sub_season (and by extension, season) that this event is a part of. |
| Selected Fields from Events Table | |
|---|---|
| event_key | Unique for any given publisher-key. All event_keys for the entire season are generally mapped-out ahead of time. |
| site_id | The location the event is played at. |
| start_date_time | Time the game is/was scheduled to start |
| site_alignment | Whether the site is the home-team's site, or is a neutral site |
| event_status | Could be pre-event, mid-event, post-event, postponed, cancelled, etc., |
| last_update | The date_time of the most recent document used to update the properties, scores, and/or stats of this event. |
(info to come.)
The id field
SportsDB allows for id fields to be assigned to most tables. These id's are intended to be database-assigned sequence numbers, as opposed to commonly shared "keys". Hence, if SportsDB user Bill loads a SportsML file into his database with stats on Mike Piazza (player-key piazza01), then his system may generate an id for that player of 903112 (because that's the "next free id value" on his system. However, if Jane loads in the same file into her DB, her system may give the same player an id of 350884.
While this system of "Local id's" may seem to complicate the process of transferring whole database from one computer to another, it's accepted and common DB design practice to use such system-dependent id's to act as foreign keys linking one table to another.
This frees the software from having to worry about generating complex unique codes, or combining separate fields in order to cross-link tables.
In practice, migrating tables to another database is simple, either by keeping the id's as they are, or by generating them anew in the new database tables. They are meant to be local to a specific database instance, and should not be used in public-facing identifiers such as URLs.
The *_key fields
The commonly shared version of player identifiers (e.g., "piazza01"), known in SportsML parlance as the player-key, also gets stored in SportsDB, but should not be used for cross-linking tables.
It's a common headache in sports data publishing that the world hasn't yet agreed upon a single universal and unique identification system for people (preferably assigned at birth, and barcoded across a person's chest :-).
SportsDB includes a Universal Alias System for person-keys, team-keys, and other keys to identify when two publishers use different keys to refer to the same player, team, etc. As an example:
Observe these two entries in the "persons" table, offering two different player-keys for George Brett, one from one publisher (Stats, LLC, publisher_id = 1), and one from another (Retrosheet, publisher_id = 2)
| id | publisher_id | person_key |
|---|---|---|
| 9812 | 1 (id for stats.com) | 147 |
| 10391 | 2 (id for retrosheet.org) | gbrett01 |
A "key_aliases" table would then have two rows corresponding to these two rows in the "persons" table:
| id | key_id | key_root_id |
|---|---|---|
| 5801 | 9812 | 450 |
| 5802 | 10391 | 450 |
Note that these two rows share a common key_root_id value. That's what tells you that these two rows are providing aliases for the exact same entity. Which type of entity is it? (e.g., is it a person? is it a team? a league?) Look in the "key_roots" table to find out:
| id | key_type |
|---|---|
| 450 | persons |
So here we have an example of cross-table polymorphism. Usually, with SportsDB, the *_type and *_id fields are companions in the very same table. The *_type field tells you which table to join with, and the *_id field tells you which row in that table. But in the spirit of reducing internal database redundancy (and reducing an opportunity for error), the key_type field is placed in the key_roots table.
How could SportsDB get populated with key aliases information? People could publish Equivalency Tables that look like this:
| key_type | stats.com | retrosheet.org | sportsnetwork.com | ... etc etc |
|---|---|---|---|---|
| player | 147 | gbrett01 | l.mlb.com-p.4125 |
Note that this isn't a database table... it's just a tab-delimited file that people could maintain using spreadsheets (or they could export it from some other data storage system).
The SportsCodes.org initiative, a companion to SportsDB, aims to be a wiki-esque Player ID Equivalency maintainer of sorts, one that would export updates in this format.
A SportsDB Loading Program could know how to read these Equivalency Tables, and know when to do an UPDATE of rows in "key_aliases" and "key_roots", verses when to do an INSERT of a new row. As far as DELETES go, a SportsDB user would likely just need a "clear things out" function that deletes ALL rows for a particular publisher_key (optionally narrowed down by key_type).