# Input data

This page explains how to provide input data and how to specify it in the configuration.
There are three general source directories for input data:

- the ICON (-ART) repository
- a separate (experiment-specific) input directory
- pool directories on the clusters or the web.

In the working directory, links to the respective files are created. *auto-icon* reads the `DIRECTORIES.LINK_FILES` section in the configuration (`/experiment.yml`, respectively `conf/art/experiments/.yml`) to determine which files to link.

## General linked files

The following subsections are present in the `DIRECTORIES.LINK_FILES` section:

| subsection | purpose |
|------------|---------|
| FILES | General input files and directories |
| ICON_DATA | Parametrizations below `` |
| ART | Data in `/externals/art` |

## Domain specific files

Information that is specific to the grid (e.g. grid and initial condition files) can be provided conveniently with a separate section in the experiment config file. Standard grids, i.e. grids that are officially distributed by DWD/MPIM and listed in the [grid file server](http://icon-downloads.mpimet.mpg.de/), can be specified and used directly. These grids are also available on several HPC systems.

This information is contained in the global `GRID` section. Each domain gets its own subsection, numbered consecutively `DOM01`, `DOM02`, etc. For each domain, the grid type (e.g. `G` for global), the `R` and `B` values as well as the _official grid number_ need to be provided. Further, information on external parameter files and files for initial conditions can be supplied. An example looks as follows:

```yaml
GRID:
  #-- The FILELIST section provides a list of all input files that shall be used
  #-- for the current run.
  FILELIST:
    - GRID
    - EXTPAR
    - DWDFG
    - DWDANA
    - BCF  #-- ART file
    - IAE  #-- ART file
    - STY  #-- ART file
  #-- Turn on, if you want to use a radiation grid (the grid has to be provided on the server as well).
  RADGRID: False
  DOM1:
    #-- Type (G: global, R: radiation grid, O: ocean, Nxx: nested grid numbered xx, L: LAM grid, L*: radgrid or nest for LAM, ...)
    #-- The type letter(s) have to correspond to the suffix letters of the grids in the list (s. above).
    TYPE: G
    #-- R and B values of the grid. Leading zeroes are ignored.
    R: 2
    B: 4
    #-- Grid number of the official grid to use (s. above).
    GRID_NUMBER: 12
    #-- Date of the corresponding extpar dataset. The dataset has to be provided by the user or be present in the public repository.
    EXTPAR_DATE: 20131001
```

A detailed overview of all possible options is provided in the template file `conf/art/experiments/template.yml`. Specifying different file names for input files is also described there.

:::{tip}
With multiple domains, several values can be inferred, e.g. if `DOM01` is an R2B4 one with grid number 12, `DOM02` is expected to be R2B5 with grid number 13. Such values can be left out unless they deviate from the default.
:::

### Domains

Each domain corresponds to one ICON domain and thus one grid. The radiation grid is treated in basically the same way, with the limitation that the only file it can have associated is a grid file. Further, *all* parameters can be inferred for a radiation grid when adhering to the official grids (or at least to their nomenclature).

:::{hint}
If inferring works, you can use a radiation grid by just setting `RADGRID: True` in the `GRID` section. If that is not sufficient, you can also create a full domain specification for the radiation grid.
:::

### File names

| File type | File tag | Default file name **(1)** |
| --------- | -------- | --------------------- |
| Grid | GRID | `icon_grid_ZZZZ_RxxByy[_T].nc` |
| Extpar | EXTPAR | `icon_extpar_ZZZZ_RxxByy[_T][_][_tiles].nc` |
| dwd first guess | DWDFG | ` dwdFG_RxByy_DOMii.nc` |
| dwd analysis | DWDANA | ` dwdana_RxByy_DOMii.nc` |
| ART input | TYP **(2)** | `ART_TYP_iconRxByy-grid_ZZZZ.nc ` |

**(1)** Placeholders used here are:

```yaml
ZZZZ: grid number padded to 4 digits (for ART file, set to ART_IO_SUFFIX if present)
x|xx, yy: R and B values padded to 1 or 2 digits each
ii: domain id
T: Grid type (for GRID and EXTPAR)
DATE: datestring (EXTPAR: YYYYMMDD; IFS,INC: YYYYMMDDHH (start date, see below))
_tiles: added if EXTPAR_TILES is set to true
```

**(2)** The tag is the 3-letter type as specified for ART, see the [ART User guide](https://www.icon-art.kit.edu/userguide/index.php?title=Input#Input_Data) for details.

The `FILENAMES` section overrides the default file names. In this section, a specific file name can be set for each file tag (relative or absolute, see also the [syntax for link specification](#syntax-for-link-specification)).

```yaml
FILENAMES:
  GRID: my_test_domain.nc
  EXTPAR: my_test_domain_extpar_data.nc
```

### Grid generation and remapping

With the `pre-create-grid` job, one can create a grid from scratch or from an existing grid. To use an existing grid, include it as a separate domain with the additional key `UNUSED: True`, which tells auto-icon that it is only a dummy domain. For such dummy domains, other domain numbers might be useful; the parent-child relationship can then be set explicitly with the `PARENT: ` section, e.g. `PARENT: 10`.
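
As an illustration of the dummy-domain setup described above, the following fragment is a minimal sketch. The domain numbers (`DOM01`, `DOM10`), the reuse of the R2B4 grid with grid number 12 from the earlier example, and the placement of `PARENT` in the domain to be created are assumptions made for illustration. They are not taken from the template, so check `conf/art/experiments/template.yml` for the authoritative layout.

```yaml
GRID:
  # Assumed: the domain that shall be created, with its parent set explicitly.
  DOM01:
    PARENT: 10
  # Assumed: the existing grid, included only as a dummy domain.
  DOM10:
    TYPE: G
    R: 2
    B: 4
    GRID_NUMBER: 12
    UNUSED: True
```
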
A section `GEN_PARAMS` then defines the grid to be created. Its individual keys are transferred to the namelist of the DWD ICON tool `icongridgen`, e.g.:

```yaml
GEN_PARAMS:
  region_type: 3
  hwidth_lon: 5.0
  hwidth_lat: 5.0
  center_lon: 18.0
  center_lat: 12.0
  # min_refin_c_ctrl: 1
  # max_refin_c_ctrl: 14
```

Remapping of (nearly arbitrary) input data can be done with CDO via the `pre-remap` job. An additional `SOURCE` section has to be provided, in which the source file and the source file grid are given as a list for each file tag to be remapped, e.g.:

```yaml
SOURCE:
  STY:
    - '/path/to/source/ART_STY_iconR2B09-grid_0015.nc'
    - '/path/to/source/icon_grid_0015_R02B09_G.nc'
```

:::{note}
If you have a LAM grid, the `pre-create-grid` job also creates the `lateral_boundary.grid.nc` for you and can remap ERA5 data to that grid.
:::

:::{hint}
Take a look at the template LAM_DUST for a full running example.
:::

## Syntax for link specification

A special syntax for specifying link names allows flexible location of input files. The simplest way of specifying an input name is to provide its file name (the TARGET of the link) as a plain string. Optionally, you can specify the LINK_NAME (the name of the symlink in the working directory) in the same string, separated by a vertical bar, i.e. `TARGET|LINK_NAME`. TARGET can be an absolute file name, a relative file name (which is looked up in several places, see [File location order](#file-location-order)) or even a URL. However, there are a few [special characters](#special-characters) which you should not use in your file names.

In addition to specifying a full file name as the TARGET (and optionally a LINK_NAME), you can supply [patterns](#patterns) to match multiple files or use [placeholders](#placeholders) for convenient substitution of configuration parameters.

### Patterns

The pattern syntax is used if TARGET starts with an exclamation mark (`!`), i.e. you specify `!PATTERN`.
The pattern is then matched against all files in the respective input directories (including pool directories). The following patterns are evaluated (see the [Python `fnmatch` documentation](https://docs.python.org/3/library/fnmatch.html#fnmatch.fnmatch) for details):

| __Pattern__ | __Meaning__ |
| ----------- | ------------ |
| `*` | matches everything |
| `?` | matches any single character |
| `[seq]` | matches any character in _seq_ |
| `[!seq]` | matches any character not in _seq_ |

The following may serve as an example:

```yaml
DIRECTORIES:
  LINK_FILES:
    ART:
      - '!FJX_scat-*.dat'
      # This pattern shall link the following files:
      # - 'FJX_scat-aer.dat'
      # - 'FJX_scat-cld.dat'
      # - 'FJX_scat-ssa.dat'
      # - 'FJX_scat-UMa.dat'
```

:::{caution}
If the pattern is quite general, many files might match in the pool directories!
:::

### Placeholders

[Autosubmit placeholders](Namelists.md#autosubmit-placeholders) can be used in all config options, and therefore also in the `TARGET|LINK_NAME` field. In addition, specific placeholders for the start date (and time) can be used to specify a file (e.g. a time-dependent parametrization) in a generic way. These additional placeholders are written in angle brackets (`<...>`). The following table provides an overview of the available replacements (the example starts on 2004-08-27 at 18:00).

| Placeholder | Value | Example |
|-------------|-------|---------|
| `` | YYYYMMDD | 20040827 |
| `` | YYYY | 2004 |
| `` | MM | 08 |
| `` | DD | 27 |
| `` | HH | 18 |
| `` | YYYYMMDDHH | 2004082718 |
| `` | YYYYMMDD | 20040827 |
| `` | YYYY | 2004 |
| `` | MM | 08 |
| `` | DD | 27 |
| `` | HH | 18 |

:::{caution}
Only the **start date of the member** is substituted, **not that of each chunk**. If you need e.g. monthly files and the simulation runs longer, you should use [patterns](#patterns).
:::

### Special characters

Several characters have a special meaning and usually cannot be escaped, so avoid them in file names:

- `%`: a pair of percent signs with characters in between is treated by Autosubmit as a placeholder and substituted. This can be escaped with an additional `%`, i.e. `%%` gives a literal `%`, but its use is highly discouraged.
- `|`: the vertical bar separates TARGET and LINK_NAME (see [here](#syntax-for-link-specification))
- `<>`: angle brackets introduce [placeholder substitution](#placeholders)
- `!`: _only at the beginning of TARGET_, this introduces lookup for [patterns](#patterns), which also makes all wildcards special characters.

## File location order

Depending on the file type (general or domain specific) and the specified name (link TARGET), there are multiple places to look for the file. The job `PRE_FIND_FILES` searches for all these files. If one of the files cannot be found and cannot be created either (as ERA5 input data could be), this job fails. The search order is as follows:

### 1. Absolute path

If the TARGET starts with a slash (`/`), it is treated as an absolute path and looked up directly. If this file is not present, file location fails.

### 2. URL

If the TARGET appears to be a URL, it is downloaded to `DIRECTORIES.INDIR` and subsequently linked. If downloading fails, file location fails.

### 3. Search input and pool directories

Next, several directories are searched. For each directory (`DIR`), it is first checked whether the file is present directly in `DIR` (`DIR/FILE`) and, if not, whether it exists somewhere in a subdirectory of `DIR`. First, the `INDIR` is searched and then the other pool directories (see below).
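
The three cases above correspond to three ways of writing a TARGET. The following fragment is a sketch only: the file names and the URL are hypothetical, and it assumes that the `FILES` subsection takes a list of strings in the same way as the `ART` subsection shown in the pattern example.

```yaml
DIRECTORIES:
  LINK_FILES:
    FILES:
      # 1. Absolute path (hypothetical): looked up directly, no search.
      - '/path/to/input/my_absolute_file.nc'
      # 2. URL (hypothetical): downloaded to DIRECTORIES.INDIR, then linked.
      - 'https://example.org/input/my_remote_file.nc'
      # 3. Relative name (hypothetical): searched in INDIR, then in the pool
      #    directories; linked under another name via TARGET|LINK_NAME.
      - 'my_relative_file.nc|linked_name.nc'
```
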
The type of directories searched depends on the type of file to look for. For pool directories that are URLs, a download of `POOLURL/TARGET` is attempted. If the download succeeds, the file is linked, otherwise file location fails.

## Pool directories

The following tables give an overview of all pool directories. If multiple are given, they are searched from top to bottom. The config keys are specified in `conf/common/platforms/.yml`. You can add further pool directories to the list.

| Pool | Default (Levante) | Default (HoreKa) |
| ------- | ------------------------------ | --------------------------------------- |
| GRID | `/pool/data/ICON/grids/public` | `/lsdf/kit/imk/projects/icon/INPUT` |
| GENERAL | `/pool/data/ICON` | `/lsdf/kit/imk/projects/icon` |
| EXP **(a)** | \--- | `/lsdf/kit/imk/projects/icon/TESTSUITE` |

**(a)**: These pools will get the experiment name appended, i.e. the pool directory on HoreKa that is actually searched is `/lsdf/kit/imk/projects/icon/TESTSUITE/`.

| Pool | Default (all platforms) |
| ---- | --------------------------------------------------------------------- |
| GRID | `http://icon-downloads.mpimet.mpg.de/grids/public/edzw` |
| | `http://icon-downloads.mpimet.mpg.de/grids/public/edzw` |
| | |
| ICON | `%ICON.INSTALLDIR%` |
| | |
| ART | `%ICON.INSTALLDIR%/externals/art/runctrl_examples/xml_ctrl/%EXPNAME%` |
| | `%ICON.INSTALLDIR%/externals/art/runctrl_examples/photo_ctrl` |
| | `%ICON.INSTALLDIR%/externals/art/runctrl_examples/init_ctrl` |
| | `%ICON.INSTALLDIR%/externals/art` |

## Grid creation

Grid files can be created automatically with *auto-icon*. To do so, the `pre-create-grid` job has to be activated, e.g. with the create grid option of the init script. Details on the grid to be created can be supplied in the `GRID` section. For details, please refer to the `conf/art/experiments/template.yml` file.

## ERA5 or IFS input data

If you use ERA5 or IFS data as initial conditions, the raw data can be remapped to the ICON grid if required. The IFS raw data file (if applicable, e.g. `ifs_r1279+O_.grb`) must be locatable in the pool directories listed above or already be present in the input directory. The grid file is located with the routines described above.

The ERA5 raw data (if applicable) is retrieved automatically on Levante. On other machines, this is currently not implemented (see issue [#132](https://gitlab.dkrz.de/auto-icon/auto-icon/-/issues/132)).

The remapping can be done with the DWD_ICON_TOOLS or CDO. The method to be used can be set in the init script. If you create and run your experiment again, or at least rerun the job, the remapping is skipped and the existing file is used. If you want to do the remapping anyway, you can set `MISC.REQUIRE_REMAPPING` to `True` in the file `conf/art/simulation.yml`.

## Online archive of input data

If (parts of) the input data is available as a downloadable archive, its URL can be specified in `DIRECTORIES.ARCHIVE` in the `/experiment.yml` file. The archive is then downloaded automatically and extracted into `INDIR`.
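
A minimal sketch of such an entry is shown below. The URL and the archive format are hypothetical, and a single string value for `ARCHIVE` is an assumption; the description above only states that the key is `DIRECTORIES.ARCHIVE`.

```yaml
DIRECTORIES:
  # Hypothetical URL; replace it with the location of your own input archive.
  ARCHIVE: 'https://example.org/my_experiment_input.tar.gz'
```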