Documentation for the administration of the genex system

Introduction

The system consists of three basic units which interact with each other to provide the end user with a view of the expression data. These parts are:

  1. A postgresql based database containing expression data, probe set annotation, user information, personal annotation and more. The expression data and probe set annotation are loaded from the database into the memory of the server process at startup, and the database is used to facilitate data lookups and the storage of user specific selections and annotation.
  2. A server process which maintains all of the expression data in memory, communicates with the database server to facilitate data lookups and storage, communicates with client applications and performs statistical queries on the expression data.
  3. A client application which displays data retrieved from the server process and provides an interface for making both statistical and data-based queries on the data.
This structure allows simultaneous access to the data by an arbitrary number of client applications. As the data is held in the memory of the server process at all times, access to the data is rapid, requiring no disk access, and statistical queries can be performed very quickly. It is theoretically possible to implement different client applications separately, as all that is required is that they follow the defined protocols (currently defined only in the code). For example, it should be quite easy to implement perl scripts that retrieve data from the server and perform specific analyses on it.

Currently, expression data is loaded into the server process only at startup. While it would be possible to implement ways of dynamically adding data to the server process at runtime, this is difficult to do without compromising analysis speed. Given that adding data to the system is not a frequently occurring event, and that implementing dynamic addition well is not straightforward, I haven't yet thought about it.

Additionally, the server process requires a set of files containing genomic sequence in order to be able to serve this to the client process. These files should contain only sequence, there should be one for each chromosome, and they should be complete and synchronised with the specific version of the genome that was used to compile the gene and probe set coordinates. Normal administrators/users should not modify these files or otherwise change components of the system.

Starting the server

The server process requires that the database server is running, and that the appropriate database is available to the server process once it starts up. Currently the name of the database and the access methods are hardcoded into the program, but this can be overridden with the -d command line switch. Don't use this, though, unless you know what you are doing (which, if you're reading this, you probably don't yet). To start the server process, do the following (a consolidated example of the commands follows the list):

  1. Either log on to the system as the database superuser (usually postgres), or use the 'su - postgres' command to substitute user.
  2. Issue the command 'pg_ctl start'.
  3. Exit from the superuser, either by logging off or by typing 'exit' to get back to your normal user.
  4. Log in as the system administrator (or su).
  5. Issue the command 'restart_server.pl'. This is a small script which tries to work out whether there is a currently running server, kills it if there is, and then restarts the server. It is a very simple script and might not work on all platforms. Note that it also will not work if you are running several versions of the server using different databases and ports; if this is the case you are on your own. The server can also be started manually by typing the command 'mtserver_x.y', where x and y correspond to the major and minor version numbers of the server. This command can take two switches: -p for the port number and -d for the name of the database to be used. On starting, the server process calls daemon() in order to background itself. However, it will continue to print text to the terminal from which it was started. This can be useful if there are any problems, but after making sure that the server has started you can simply close the terminal. To stop the server, use the appropriate combination of ps and kill (e.g. 'ps aux | grep mtserver' to get the process id, then kill to stop it).
  6. Log out, and wait for the server to start. This may take up to 30 minutes or so on slower hardware; unless you have a lot of data (>100 samples) or your hardware is slow it shouldn't be too bad.
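
For reference, on a default installation the whole start sequence might look something like the following (the user names and the version number are the ones used elsewhere in this document; adjust them for your site):

su - postgres
pg_ctl start
exit
su - chipdb                   # the user who runs the server (chipdb in this document)
restart_server.pl             # or start manually, e.g. 'mtserver_2.10'
exit                          # the server keeps running after you log out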

Steps 1-2 require that postgresql has been properly installed with postgres as the database superuser, and that the PATH environment variable has been set to include the appropriate directories (for a default installation of postgresql this will be '/usr/local/pgsql/bin'). They also require that the PGDATA environment variable has been set appropriately and that the dynamic linker knows the location of the pgsql libraries. Note that this is normally done during the installation of the postgresql server, but these locations may vary depending on how the installation has been carried out. I tend to follow the default instructions given by the postgresql team (everything in /usr/local), but installations performed by Linux distributions may vary.
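
As a sketch, for a default source installation in /usr/local the relevant environment settings for the postgres user might look like the following (the exact paths depend on how postgresql was installed on your machine):

# typically placed in the postgres user's shell profile
export PATH=$PATH:/usr/local/pgsql/bin
export PGDATA=/usr/local/pgsql/data
# and make sure the dynamic linker can find the pgsql libraries, for example by
# adding /usr/local/pgsql/lib to /etc/ld.so.conf and running ldconfig as root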

Steps 4-5 require that the script restart_server.pl is accessible and in the PATH of the user who starts the server. This is not normally set up by the installation, as there are many different ways of configuring the system. I recommend that administrators have a look at the script for some ideas about what needs to be done. As mentioned above, this script only works in the simplest of circumstances, so it is worthwhile to write something more specific for the local administration.
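
The logic is simple enough; a minimal shell sketch of the same idea (assuming a single server instance run as the chipdb user, with mtserver_2.10 in the PATH) would be:

pkill -u chipdb mtserver      # stop any running server instance
sleep 2                       # give it a moment to shut down
mtserver_2.10                 # start the new server; it backgrounds itself by calling daemon()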

Stopping the server

This shouldn't usually be necessary, as the server will stop when the machine is shut down. However, if you wish to stop the server, the easiest way is to issue the command:
ps aux | grep mtserver
If the server is running this should result in something like:

chipdb    3066  0.1 18.9 713244 696124 ?     S    Nov10   4:23 mtserver_2.10
chipdb    3089  0.0 18.9 713244 696124 ?     S    Nov10   0:01 mtserver_2.10
chipdb    3090  0.0 18.9 713244 696124 ?     S    Nov10   0:47 mtserver_2.10
martin    8775  0.0  0.0  1772  596 pts/6    S    14:09   0:00 grep mtserver

The last line just shows the grep command you ran; the other lines list the processes which were started with a command containing the text mtserver. Choose the earliest one (in this example the entry whose TIME column reads 0:01, i.e. PID 3089) and kill it by typing:

kill 3089

This has to be done either as the superuser (root) or as the user who started the server (in this case chipdb), so before you can kill the process you have to either log in or just su to root or chipdb. It is recommended that you su to chipdb, not to root, as in general it is better to avoid being root. This is essentially because root is all-powerful and can do anything to your system (including removing everything).

After you have done this, repeat the ps command above to see if all the mtserver processes have been killed. Note that you cannot start more than one instance of mtserver at the same time unless you use the -p switch to specify the port number. However, you probably don't want to start multiple server processes accessing the same database, as there are no mechanisms for these to deal with updates to the database from each other, and this could potentially result in some bad behaviour.

General notes:

As usual on unix/linux systems, if in doubt try the man command for more information, e.g. man su, man ps, man grep, man man, man -k port, and so on.

Adding data to the System

Currently the process for adding expression data to the system is a cumbersome and somewhat dangerous endeavour. This is primarily because the data is stored in a database, which provides for some integrity checking; but, due to the size of the data, it is not realistic for the database to provide full integrity checking. It is also cumbersome because it is currently carried out by a series of scripts without any graphical interface. These scripts provide some error checking, but they cannot entirely prevent the user from doing the wrong thing. Since it seems there will be more users of the programs in the future, this is something which will be improved on in the near future (?), though not before this release. So hang on in there.. for a while.

Currently the process of adding data to the system requires four steps (a sketch of the full sequence follows the list):

  1. Reorganise the data in the .CEL files and add information about the samples to the database. This can be done using either the addFileBatch.pl or the addFile.pl script. I recommend using the addFileBatch.pl script rather than the interactive addFile.pl script. The batch method is less error prone, and since it takes an input file, you also have a record of what you asked the script to do. Both these scripts use two other scripts also found in the Format_converters directory, as well as a number of files describing the structure of the .CEL files. If you have any problems, read the error messages and have a look at the scripts. They are not complicated.
  2. Convert the data in the newly added files into a format that can be copied into the database. This uses a separate script which queries the database to determine which files to open, and then converts the data in those files to a format that can be copied into the database.
  3. Copy the data in the file created in step 2 into the database.
  4. Restart the server as described above.
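
In outline, and with example paths only, the whole procedure looks something like this (how the batch file is passed to addFileBatch.pl is an assumption here; check the script itself for its actual invocation):

cd /data/cel_files                       # the permanent location of the new .CEL files (example path)
~/affy_expression/affymetrix/Format_converters/addFileBatch.pl new_samples.txt
cd /data/db_load                         # somewhere to keep the generated data_postgres file (example path)
~/affy_expression/affymetrix/Format_converters/int_norm_to_db_array_style.pl
psql expression                          # then, inside psql: \copy data from /data/db_load/data_postgres
restart_server.pl                        # finally, restart the server as described above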

Details:

addFile.pl and addFileBatch.pl

These scripts are located in the ~/affy_expression/affymetrix/Format_converters directory. They can only be run by database users with the appropriate privileges; this is basically to try to minimise the risk of corrupting the database. These scripts are not currently in the PATH of chipdb, and so have to be called with the full path (eg. '~/affy_expression/affymetrix/Format_converters/addFile.pl'). This can easily be changed (i.e. either alter the PATH variable to include this path, or create a symlink in ~/bin/).
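
For example, to make the scripts callable without the full path you could do one of the following as the chipdb user:

# either add the directory to the PATH (e.g. in ~/.bashrc) ...
export PATH=$PATH:~/affy_expression/affymetrix/Format_converters
# ... or create symlinks in ~/bin (assuming ~/bin is already in the PATH)
ln -s ~/affy_expression/affymetrix/Format_converters/addFile.pl ~/bin/
ln -s ~/affy_expression/affymetrix/Format_converters/addFileBatch.pl ~/bin/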

The scripts need to be run from the directory where the .CEL files are located. Note that these should not be located on a CDROM or other external media, but should be copied to the specific location where they will be stored; this can be anywhere on the hard disk. This is important because the current location of the file, including the path from where the script is called, will be entered into the database. The database cannot keep track of where the user puts the files, so these should not be moved after they have been entered into the system. It is generally a good idea to remove write access from the CEL files to prevent them from being overwritten. This is not important for the intermediate file types generated by addFile.pl, as these can be regenerated from the CEL files.
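
For example (the destination directory below is only an illustration):

cp /media/cdrom/*.CEL /data/cel_files/   # copy from external media to the permanent location
cd /data/cel_files
chmod a-w *.CEL                          # remove write access so the files cannot be overwritten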

Both scripts work by adding a number of entries to the database that describe the samples and files to be entered, and then calling additional scripts which convert the .CEL formatted data into something that can be copied into the database. The batch script reads these details from a specially formatted input file (described below), whereas the addFile.pl script prompts the user for the details in an interactive (but not very convenient) manner.

Note, if you realise that you've made some stupid mistake, you can stop the process before completion by pressing Ctrl-C. You may need to press Ctrl-C several times, as the script calls external programs.

The fields which need to be entered are as follows:

  1. The name and path of the cel file to be entered. The addFile.pl script assumes that the script was called from the directory containing the .CEL file and uses its current working directory for the path. For the batch script the directory needs to be explicitly entered for each file.
  2. The experiment number to be given to the experiment. This should be a unique floating point number, and should only be used once for a given sample and a given chip. Replicate samples should always have different experiment numbers (but if one sample is used with different chips, these should all have the same number). The experiment number should probably be called the sample number, but for historical reasons it ended up being called the experiment number. This number determines the default order in which the samples are displayed. Although the display order can be changed in recent versions, there are no mechanisms for storing user defined orders, so this number still carries a great deal of importance. It is, however, an intrinsically bad idea (a kludge), and the whole system for dealing with samples will be updated in the future (necessary as our sample numbers keep increasing). Choose an appropriate experiment number by checking the experiments table in the expression database and finding a number that lies in the position where you want your sample to lie. Do this carefully, as it is not completely trivial to change this number after it has been set. I suggest getting the descriptions with 'select index, experiment, expt_group, short_description from experiments order by experiment' in a psql session to work out what to do.
  3. A group number. This is just an arbitrarily chosen integer value. Give the same integer value to samples which you think belong together. This is a very basic way of providing some kind of default grouping. In future this will be expanded upon to provide both user defined groupings and orders. It is not particularly important, and is in fact ignored by the current system.
  4. A description and a short description of the sample entered. This is just some text with no particular requirements that I know of. There may be some things that postgresql might not like (don't stick backslashes and the like in there unless you know what you are doing) but the scripts try to handle most things. The short description is used in a number of places in the client program, and should not be too long (try to keep it to a few words) for aesthetic reasons.
  5. The chip id that was used in the experiment. This is a numerical id which can be obtained from the chips table in the expression database (see below).
    
    chicken_expression=# select * from chips;
     index |     id     |                      description
    -------+------------+--------------------------------------------------------
         1 | MG_U74Av2  | Mouse genome set A, version 2
         2 | MG_U74Bv2  | Mouse genome set B, version 2
         3 | MG_U74Cv2  | Mouse genome set C, version 2
         4 | MOE430A    | Mouse genome set 2 A
         5 | MOE430B    | Mouse genome set 2 B
         6 | Mouse430_2 | MOE430A and MOE430B on one chip. Uses same identifiers
         7 | HG-U133A   | Human genome set 133 A
         8 | HG-U133B   | Human genome set 133 B
         9 | Chicken    | Chicken genome set
    (9 rows)
    
    
    The chips are defined by the index number in this table. It is important that you use the correct number. Both scripts will try to open the .CEL file and determine the chip used, but in order to make sure that you know which data you are entering both scripts also force you to make this explicit. If your chip is not represented here, let me know and I'll see what I can do. The procedure for adding a new chip to the system is not entirely straightforward, but it is thankfully getting less and less painful.

The addFile.pl script steps through these entries interactively with the user in the manner described below.

  1. The user is first prompted for the name of the file. (Remember that the script should be run from the directory where the .CEL files are present).
  2. The script will then proceed to open the file to try and determine what chip was used in the experiment, and asks the user whether it is the correct one.
  3. The script then checks whether or not the experiment number has been used previously. If it has not, the user is prompted for three inputs.
    1. A description of the sample. This should be a reasonably short description consisting of one line of text. For convenience it can span several lines (i.e. you can press return and keep typing on the next line). To finish writing the text, press return followed by Ctrl-D. Note that once you've pressed return you won't be able to make any changes to the text. This should improve in the future.
    2. A short description of the sample. This will be displayed by the client program and used as a label for the different samples. This should only be a few words long, and will be entered into the system when enter is pressed.
    3. An experimental group identifier. The different samples are defined by their sample numbers, and a group number. The group number is an integer which serves to group the different samples into groups. Check the experiments table to work out a reasonable group number, (i.e. similar samples) or create a new group by using a different number. This number isn't currently used by the system, but it may be used in the future as a default grouping mechanism. However, it will also be necessary to provide dynamic means of grouping different samples, as well as to define sample replicates in the future.

Once the script has gone through the above stages, it will run some other scripts which should be present in the ~/affy_expression/affymetrix/Format_converters directory. If there are any problems this is the first place to look. The first of these scripts reads the .CEL file and, using the information present in the appropriate .CDF.index file, creates a file containing the expression data listed by probe set identity rather than by X-Y position on the chip. This file is then processed by a third script which creates a file in which these values have been normalised by the median intensity of signals on the chip. As the script progresses it also records the identities of the created files in the database, both as a record and to allow the processing of all files by other means.

The addFileBatch.pl script essentially performs the same functions as the addFile.pl script but reads its input from a specially formatted file. This format relies heavily on the presence of tabs and the file must adhere strictly to the format. (See example below and further explanation).

Comments to the file can be added anywhere except within areas where new or old
experiments are described. Lines containing comments should not begin with
the word Experiment (this line ok as the first word is not Experiment) or with
three dashes (---) as these are used to delimit the regions where samples are described.

Experiment	new	3.5	7	"Description of the experiment enclosed in quotation marks"	"short description"
	1	Name_of_cel_file_A.CEL	/path/to/cel_file
	2	Name_of_cel_file_B.CEL	/path/to/cel_file
	3	Name_of_cel_file_C.CEL	/path/to/cel_file
---

some more comments, about the next experiment. The next experiment (sample) was defined in the system
previously, but data was only added from chip no. 1. This time data from the chip no 2 is added as well
Note that the term Experiment is now followed by
the key word 'old' rather than 'new' as in the previous section.

Experiment	old	3.6	7	"Description of the file as above"	"short description"
	2	Name_of_cel_file_B.CEL	/path/to/cel_file
---

The batch file has only three keywords, 'Experiment', 'new' and 'old', in addition to using tabs and three dashes to signify the end of an entry region. Entry regions begin with the keyword 'Experiment', which should be followed by 'new', signifying that a new experiment description needs to be entered into the database, or 'old', signifying that we are just adding data for an already described experiment. This is then followed by a floating point number (the experiment identifier), which is used to determine the default order of the samples, and the group identifier (as described above). This is then followed by a description and a short description, each enclosed in double quotation marks. The different fields need to be separated by at least one whitespace character (using tabs works fine), but are otherwise fairly simple. All these terms need to be on the same line; you cannot put returns into descriptions. This is just for programming convenience, and if you wish to be able to do this I encourage you to write some scripts which allow it.

The lines following this tell the program where to find the .CEL files and which chips they represent. Each line should begin with a tab, followed by the id of the chip, the name of the cel file and the path to the cel file (the fields again separated by whitespace). Each line should represent one .CEL file pertaining to the sample described in the header line above (the one beginning with Experiment). Note that you cannot put several files representing the same chip and the same sample here; that is to say, you cannot specify replicates at this stage. Primarily this is because replicate status should not be hard-coded into the system, but should be something that can be addressed dynamically at later stages of the analysis. The system at the moment does not allow this, but it is my intention to allow the dynamic definition of replicate status at later stages (along with dynamic groupings and orderings). The last line for the given sample region should begin with three dashes, and signals to the program that there are no more cel files to add for the given sample.

Converting the data to a format that is acceptable to the database

The next script to be run is the 'int_norm_to_db_array_style.pl' script. This script does not need to be run once for each .CEL file that has been entered, as it processes several files at once; it should be run after the administrator has run addFile.pl or addFileBatch.pl for all of the new data to be incorporated. This script will create a file that can be copied into the database with a simple SQL command. Again, this script is present in the same directory as the addFile.pl script and needs to be called with the full path. Although it can be run from anywhere, I recommend not running it from the directory where you keep the CEL files, but rather from a directory where you can keep the files it produces. The script determines which new data needs to be added by inspecting the database to see for which files data has not yet been added to the data table. This should work completely automatically, but as with all automation there will probably be some problems at some stage. The script produces a file called 'data_postgres' which can be copied into the data table of the database (see next stage).
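
For example (the directory used is only an illustration):

cd /data/db_load                         # a directory for keeping the generated files,
                                         # not the directory holding the .CEL files
~/affy_expression/affymetrix/Format_converters/int_norm_to_db_array_style.pl
ls -l data_postgres                      # the file to be copied into the database in the next step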

Copying the data to the database

In order to copy the data to the expression database it is necessary to connect to the database and issue a couple of SQL commands. This could quite easily be automated, but doing it manually is more informative. Basically, connect to the database using the 'psql' programme as the database owner, then issue the command to copy the data from the file produced in the preceding step. Follow these steps:

First log in as the database owner, or do 'su - name_of_owner' to become the owner, and then issue the following commands:

psql expression
\copy data from /the/path/to/the/file/you/made/data_postgres
\q

Lines 2 and 3 of the above are actually typed into the psql client programme. Be careful with this, as psql will allow you to do all sorts of bad things to your database. Do not repeat the copy line (and yes, it and the third line should start with a backslash), as this will enter the data twice into the database. Things may still work, but doing this may cause trouble further down the line.
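
If you want a quick sanity check, you can count the rows in the data table before and after the copy; the count should grow by the number of newly added values (run as the database owner):

psql expression -c 'select count(*) from data;'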

And remember, don't be afraid of the command line or unix in general... it's not as bad as you think. Honestly..