Considerations on GDGs on Linux
1. Introduction. The mainframe knows the notion of GDGs, i.e., Generation Data Groups. These are groups of files that are handled almost as a single entity.
Some notions. A GDG file consists of three parts in its file name:
<GDG-Base>.G<generation>V<version>
The limit is the maximal number of generations we want to keep.
Nr. | Notion | Meaning |
---|---|---|
1. | GDG base | common prefix |
2. | generation | number in range 0001 to 9999, default is 0001; also called absolute generation |
3. | relative generation | number in range -9998 to +9998; added to absolute generation |
4. | version | number in range 00 to 99, default is 00 |
5. | limit | maximal number of files for a GDG, e.g., 5 |
Here is an example of a GDG with base DONKNUTH and limit 3:
DONKNUTH.G0001V00
DONKNUTH.G0002V00
DONKNUTH.G0003V00
On MVS, a generation is accessed using the parenthesis notation with relative generations.
- So DONKNUTH(0) is DONKNUTH.G0003V00. This is the "current" generation.
- Similarly, DONKNUTH(-1) is DONKNUTH.G0002V00. This is the "previous" generation.
- And DONKNUTH(-2) is DONKNUTH.G0001V00. This is the generation before the previous one.
- A new generation is created by specifying DONKNUTH(+1). That would be DONKNUTH.G0004V00.
Here is another example of a GDG with base DRitchie and limit 8:
DRitchie.g9996v00
DRitchie.g9997v00
DRitchie.g9998v00
DRitchie.g9999v00
DRitchie.g0001v00
DRitchie.g0002v00
DRitchie.g0003v00
DRitchie.g0004v00
The generation number 0000 is not used.
In the above example, DRitchie.g0000v00 will never exist.
I.e., there is a jump from 9999 to 0001 in absolute generations.
The version of a GDG is of no relevance in the following discussion.
Its purpose is to mark that a specific generation has been changed, i.e., that its file has been altered.
For example, you could change DRitchie.g0003v00 to DRitchie.g0003v01.
In that case DRitchie.g0003v00 will be gone.
Only DRitchie.g0003v01 remains, along with the other generations.
See Absolute generation and version numbers.
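The jump from 9999 to 0001 is easy to get wrong, so here is a minimal C sketch of the arithmetic. The names gdg_absolute, gdg_filename, and GDG_MAX_GEN are hypothetical and chosen for illustration only. The lowercase suffix follows the DRitchie example above.

```c
#include <assert.h>
#include <stdio.h>

#define GDG_MAX_GEN 9999   /* absolute generations run from 0001 to 9999 */

/* Add a relative generation to an absolute one. The result wraps within
   1..9999, so generation 0000 is never produced. */
int gdg_absolute(int current, int relative)
{
    assert(current >= 1 && current <= GDG_MAX_GEN);
    assert(relative >= -(GDG_MAX_GEN - 1) && relative <= GDG_MAX_GEN - 1);

    int zero_based = (current - 1 + relative) % GDG_MAX_GEN;
    if (zero_based < 0)
        zero_based += GDG_MAX_GEN;   /* C's % keeps the sign of the dividend */
    return zero_based + 1;
}

/* Write "<base>.g<generation>v<version>" into buf.
   Returns the value of snprintf(), i.e., the length of the full name. */
int gdg_filename(char *buf, size_t size, const char *base, int gen, int version)
{
    return snprintf(buf, size, "%s.g%04dv%02d", base, gen, version);
}
```

With the DRitchie example, the current generation is 0004, so gdg_absolute(4, -7) yields 9996, the oldest of the eight files, and gdg_absolute(9999, 1) yields 1.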
2. Solution sketch. This short memo outlines how GDGs can be implemented on Linux with the help of SQLite. SQLite is particularly apt for this task as it provides:
- Concurrency control, i.e., protection against parallel tasks accidentally overwriting each other's changes
- Keeping all information in a single file
The solution is now to have two sets of files:
- A number of data files, which actually hold the data, named xxx.g0001v00, xxx.g0002v00, xxx.g0003v00, etc.
- An SQLite file xxx.db for managing the generations, i.e., for recording which generation number is the current one
The version part is kept constant at V00.
We stick to the lowercase file suffixes.
Mass lowercasing of file names describes an approach to lowercasing a huge number of files.
In addition there will be a single binary gdg, which controls the access to the above sets of files.
Program gdg is entirely controlled by command line arguments.
gdg is written in C, accesses the SQLite database file, and returns a string, which is the file name of the generation in question.
As gdg is used in many shell scripts, its startup time is important, and other languages are too slow; see Performance Comparison C vs. Lua vs. LuaJIT vs. Java.
gdg takes two input strings via the command line:
- The name of the SQLite database for the GDG, e.g., DONKNUTH.db
- The requested relative generation number, e.g., +2, or simply 2
For the example with DONKNUTH.db and +2, it will return
DONKNUTH.G0005V00
Likewise:
echo $(gdg DONKNUTH.db -2)
will print DONKNUTH.G0001V00.
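Since the relative generation arrives as a command line string, gdg has to validate it before doing any arithmetic. The following is a minimal sketch of such a check, assuming strtol() is used for the conversion. The function name gdg_parse_relative is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a relative generation such as "+2", "2", "0" or "-2".
   Returns 0 and stores the value in *out on success.
   Returns -1 if the text is empty, has trailing characters, or lies
   outside the range -9998..+9998.
   Note: strtol() also tolerates leading whitespace. */
int gdg_parse_relative(const char *text, int *out)
{
    assert(text != NULL && out != NULL);

    char *end;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0')
        return -1;
    if (value < -9998 || value > 9998)
        return -1;
    *out = (int)value;
    return 0;
}
```

Both "+2" and "2" yield 2, while input such as "2x" or "9999" is rejected.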
To create an entirely new GDG, use the command line flag -c and specify the GDG base and the limit.
E.g., creating the above example DRitchie with limit 8:
gdg -c DRitchie.db 8
If the limit is omitted, it is assumed to be 1.
Similarly, the limit of an existing GDG is changed by specifying the command line argument -c.
If the new limit is smaller than the previous one, multiple files might get deleted by gdg.
gdg does not use the parenthesis notation from MVS, as parentheses have a special meaning in the shell, and calling gdg from a shell is the most common use case.
It should be noted that gdg just reads strings and produces strings.
It does not produce any files.
However, gdg deletes files!
I.e., according to the specified limit, surplus files are deleted.
Forever.
Whenever a positive relative generation is specified, all files which do not match the limit criterion are deleted by gdg.
Deleting the surplus files is done using scandir() and unlink().
gdg can take an optional third argument: a program name.
This is just a string stored in the genhist table, see below.
3. Data model. The SQLite database will store the table below, called genmgt ("generation management").
Nr. | Column | type | nullable | Example or meaning |
---|---|---|---|---|
1 | base | text | not null | GDG base: file name excluding directory, e.g., DONKNUTH |
2 | generation | int | not null | current absolute generation, e.g., 73; starts with 1 |
3 | limit | int | not null | total number of generations allowed, e.g., 5; default is 1 |
As only the file name without path of the GDG is stored, the GDG can be moved around the filesystem.
We might be interested in the history of accesses, i.e., who accessed the GDG with what program or job.
For this we can use the table genhist ("generation history").
This table is entirely optional.
Nr. | Column | type | nullable | Example or meaning |
---|---|---|---|---|
1 | generation | int | not null | historic absolute generation number, e.g., 71 |
2 | uid | int | not null | user-id from getuid() |
3 | gid | int | not null | group-id from getgid() |
4 | pgmname | text | null | name of program or job, which accessed the i-th generation |
5 | atime | date | not null | access time |
The following SQL statements will be used.
As limit is an SQL keyword, the column name has to be quoted as "limit" in the statements.
- Create an entirely new GDG base:
insert into genmgt (...)
- Specify relative generation, e.g., (+1):
select generation, "limit" from genmgt
update genmgt set generation = ...
- Change limit:
update genmgt set "limit" = ...
4. Outlook. Each GDG element might be a so-called partitioned dataset.
In UNIX jargon this would be a directory.
So ideally gdg can also handle a GDG of directories.
5. Effort estimation. Program gdg can be written in less than three man-days.