Documentation for administration of the genex system

Introduction

The system comprises three basic units which interact with each other to provide the end user with a view of the expression data. These parts are:

  1. A PostgreSQL-based database containing expression data, probe set annotation, user information, personal annotation and more. The expression data and probe set annotation are loaded from the database into the memory of the server process at startup, and the database is used to facilitate data lookups and the storage of user-specific selections and annotation.
  2. A server process which maintains all of the expression data in memory, communicates with the database server to facilitate data lookups and storage, communicates with client applications and performs statistical queries on the expression data.
  3. A client application which displays data retrieved from the server process and provides an interface for making both statistical and data-based queries.
This structure allows simultaneous access to the data by an arbitrary number of client applications. As the data is held in the memory of the server process at all times, access to the data is rapid, requiring no disk access, and statistical queries can be performed very quickly. It is theoretically possible to implement different client applications separately, as all that is required is that they follow the defined protocols. For example, it should be quite easy to implement perl scripts that access the data from the server and perform specific analyses on it.

Currently, expression data is only loaded from the database at server startup. While it would be possible to implement ways of dynamically adding data to the server process at run time, this is difficult to do without compromising analysis speeds. Given that adding data to the system is not a frequently occurring event, and that implementing a dynamic adding function well is not straightforward, I haven't yet attempted it.

Additionally, the server process requires a set of files containing genomic sequence in order to be able to serve this to the client process. These files should contain only sequence; there should be one for each chromosome, they should be complete, and they should be synchronised with the specific version of the genome that has been used to compile the gene and probe set coordinates. Normal administrators/users should not modify these files or otherwise change components of the system.

Starting the server

The server process requires that the database server is running, and that the appropriate database is available to the server process once it starts up. Currently the name of the database and the access methods are hardcoded into the program, but this can be overridden with the -d command line switch. Don't use this though, unless you know what you are doing (which, if you're reading this, you probably don't yet). To start the server process do the following:

  1. Either log on to the system as the database superuser (usually postgres), or use the 'su - postgres' command to substitute user.
  2. Issue the command 'pg_ctl start'.
  3. Exit from the superuser account, either by logging off or by typing 'exit' to get back to your normal user.
  4. Log in as 'chipdb', or substitute user with 'su - chipdb' and follow the prompt.
  5. Issue the command 'restart_server.pl'. This is a small script which tries to work out whether there is a currently running server, kills it if there is, and then starts the server again.
  6. Log out, and wait for some time for the server to start. This may take up to 30 minutes or so. This is a long time, but it should not be necessary to restart the server very often (I usually only restart the server once a month or so).
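
As a concrete example, the full startup sequence at the command line looks something like this (a sketch only; user names and paths may differ on your installation):

su - postgres            # become the database superuser
pg_ctl start             # start the PostgreSQL server
exit                     # back to your normal user
su - chipdb              # become the chipdb user
restart_server.pl        # kills any running server, then starts a new one
exit                     # log out; the server carries on loading in the background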

Steps 1-2 require that PostgreSQL has been properly installed with postgres as the database superuser, and that the PATH environment variable has been set to include the appropriate directories (for a default installation of PostgreSQL this will be '/usr/local/pgsql/bin'). They also require that the PGDATA environment variable has been set appropriately and that the dynamic linker knows the location of the pgsql libraries. Note that this is normally done during the installation of the PostgreSQL server, but these locations may vary depending on how the installation has been carried out. I tend to always follow the default instructions given by the PostgreSQL team (everything in /usr/local), but installations which have been performed by Linux distributions may vary.
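
For a default /usr/local installation, the relevant environment settings for the postgres user would look something like the following (the data directory shown is the stock PostgreSQL default and is an assumption; check your own installation):

export PATH=$PATH:/usr/local/pgsql/bin     # so that pg_ctl can be found
export PGDATA=/usr/local/pgsql/data        # database cluster location (assumed default)
# The dynamic linker also needs to find the pgsql libraries, e.g. by adding
# /usr/local/pgsql/lib to /etc/ld.so.conf and running ldconfig as root.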

Steps 4-5 require that there is a user called chipdb, and that the script restart_server.pl is accessible and in the path of this user. This should have happened as part of the normal installation procedure, so administrators shouldn't have to worry about it. The script is normally kept in the '/home/chipdb/scripts' directory, with a link present in '/home/chipdb/bin' to place it in the path of chipdb. The script needs to be edited when newer versions of the database are started; this is fairly self-explanatory, but will be described elsewhere. These things should just work, but tweaks may be necessary depending on the specifics of your environment.

Stopping the server

This shouldn't usually be necessary, as the server will stop when the machine is shut down. However, if you wish to stop the server, the easiest way is to issue the command:
ps aux | grep mtserver
If the server is running, this should result in something like:

chipdb    3066  0.1 18.9 713244 696124 ?     S    Nov10   4:23 mtserver_2.10
chipdb    3089  0.0 18.9 713244 696124 ?     S    Nov10   0:01 mtserver_2.10
chipdb    3090  0.0 18.9 713244 696124 ?     S    Nov10   0:47 mtserver_2.10
martin    8775  0.0  0.0  1772  596 pts/6    S    14:09   0:00 grep mtserver

The last line just lists the grep command that you just ran; the other lines list the different processes which were started with a command including the text mtserver. Choose the earliest one (in this case the one started Nov 10 with CPU time 0:01, i.e. PID 3089) and kill it by typing:

kill 3089

This has to be done either as the superuser (root) or as the user who started the server (in this case chipdb), so before you can kill the process you have to either log in as, or su to, root or chipdb. It is recommended that you su to chipdb rather than root, as in general it is better to avoid being root. This is essentially because root is all-powerful and can do anything to your system (including removing everything).

After you have done this, repeat the ps command above to check that all the mtserver processes have been killed. Note that you cannot start more than one instance of mtserver at the same time, as the process listens on a specific port. (Actually this is possible through the use of the -p switch, but you probably don't want to do that.)
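
If several mtserver processes need to be killed, and your system has the pkill command, something like the following (run as chipdb) does the same job in one go:

pkill mtserver             # sends a TERM signal to all processes whose name matches mtserver
ps aux | grep mtserver     # check that nothing is left running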

General notes:

As usual on unix/linux systems, if in doubt try something like the man command for more information, e.g. man su, man ps, man grep, man man, man -k port, and so on.

Adding data to the system

Currently the process for adding expression data to the system is a cumbersome and somewhat dangerous endeavour. This is primarily because the data is stored in a database, which provides for some integrity checking, but, due to the size of the data, it is not realistic for the database to provide full integrity checking. The process is also cumbersome because it is currently carried out by a series of scripts without any graphical interface. These scripts provide for some error checking, but they cannot entirely prevent the user from doing the wrong thing. Since it seems that there will in the future be more users of the programs, this is something which will be improved on in the near future, though not before this release. So hang on in there... for a while.

Currently the process of adding data to the system requires three (well, four perhaps) steps:

  1. Run a script (addFile.pl) which prompts you for the file being added and some description of the sample. This script reorganises the data in the .CEL files and normalises the resulting data, which is then written to file. The identities of these files are added to the database as a record of which file represents what data.
  2. Convert the data in the newly added files into a format that can be copied into the database. This uses a separate script which inspects the database to work out which files to open, and then converts the data in those files to a format that can be copied into the database.
  3. Copy the data in the file created in step 2 into the database.
  4. Restart the server as described above.
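
In outline, and assuming the paths given in the details below, the whole procedure looks something like this (the working directory used for step 2 is made up; use whatever suits you):

cd /usr/genome/CEL                                       # where the .CEL files live
/home/chipdb/affymetrix/Format_converters/addFile.pl     # step 1, repeated for each new .CEL file
cd /home/chipdb/db_load                                  # hypothetical working directory
/home/chipdb/affymetrix/Format_converters/int_norm_to_db_array_style.pl    # step 2
psql expression                                          # step 3, then \copy as described below
# step 4: restart the server as described above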

Details:

addFile.pl

This script is located in the /home/chipdb/affymetrix/Format_converters directory. It can only be run by chipdb or other database users; this is basically to try and minimise the risk of corrupting the database. The script is not currently in the PATH of chipdb, and so has to be called with the full path (i.e. '/home/chipdb/affymetrix/Format_converters/addFile.pl'). This can easily be changed (i.e. either alter the PATH variable to include this path, or create a symlink in /home/chipdb/bin/).
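
For example, a symlink can be made as follows (this assumes '/home/chipdb/bin' is in chipdb's PATH, as it should be after a normal installation):

ln -s /home/chipdb/affymetrix/Format_converters/addFile.pl /home/chipdb/bin/addFile.pl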

This script needs to be run from the directory where the .CEL files are located. Note that these should not be located on a CDROM or other external media, but should be copied to the specific location where they will be stored. This can be anywhere on the hard disk (currently the CEL files are in /usr/genome/CEL). This is important, as the current location of the file, including the path from where the script is called, will be entered into the database. The database cannot keep track of where the user puts the files, so they should not be moved afterwards. It is generally a good idea to remove write access to the CEL files to prevent them from being overwritten. This is not important for the intermediate file types generated by addFile.pl, as these can be regenerated from the CEL files.
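
Removing write access is a one-liner in the directory holding the files:

cd /usr/genome/CEL
chmod a-w *.CEL            # make the .CEL files read-only for everyone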

This script will prompt the user for a number of inputs which will need to be filled in accordingly by the user, and will then proceed to reformat the data in the files and input the details into the database.

Note, if you realise that you've made some stupid mistake, then you can stop the process before completion by pressing Ctrl-C. You may need to press Ctrl-C several times, as the script calls external programs.

  1. The name of the cel file to be added. This file should be in the same directory as the one the script is called from. Type the name and press return.
  2. The experiment number to be given to the experiment. This should be a unique floating point number, and should only be used once for a given sample and a given chip. Replicate samples should always have different experiment numbers (but if one sample is used with different chips, then these should all have the same sample number). The experiment number should probably be called the sample number, but for historical reasons it has ended up being called the experiment number. This number is important, and is used to set the order in which the samples are displayed when looking at expression patterns. Choose an appropriate experiment number by checking the experiments table in the expression database (currently version 2, hence expression_2) and finding a number that lies in the position where you want your sample to lie. Do this carefully, as it is not completely trivial to change this number after it has been set. In the future this will change, such that arbitrary orders can be chosen, but for now you'll have to live with it (hopefully not that much longer). I suggest getting the descriptions with 'select index, experiment, expt_group, short_description from experiments order by experiment' in a psql session to work out what to do; see the example session after this list.
  3. The script will then proceed to open the file to try and determine which chip was used in the experiment. Currently the programs know about 5 different chips: the MGU74V2 (A, B and C) and the MOE430 (A, B and one merged chip). If you are using a different chip, then you'll have to do a whole load of stuff that is outside the scope of these instructions.
  4. The script then checks whether or not the experiment number has been used previously. If it has not, the user is then prompted for 3 inputs.
    1. A description of the sample. This should be a reasonably short description consisting of one line of text. For convenience this can take several lines (i.e. you can press return and keep typing on the next line). To finish writing the text, press return followed by Ctrl-D. Note that once you've pressed return you won't be able to make any changes to the text. This should improve in the future.
    2. A short description of the sample. This will be displayed by the client program and used as a label for the different samples. It should only be a few words long, and will be entered into the system when return is pressed.
    3. An experimental group identifier. The different samples are defined by their sample numbers and a group number. The group number is an integer which serves to group the different samples into groups. Check the experiments table to work out a reasonable group number (i.e. one used by similar samples), or create a new group by using a different number. This number isn't currently used by the system, but it may be used in the future as a default grouping mechanism. However, it will also be necessary to provide dynamic means of grouping different samples, as well as to define sample replicates, in the future.
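
As promised in point 2 above, an example psql session for choosing experiment and group numbers might look like this (the database name depends on the current version, expression_2 at the time of writing):

psql expression_2
select index, experiment, expt_group, short_description from experiments order by experiment;
\q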

Once the script has gone through the above stages, it will run some other scripts, which should be present in the /home/chipdb/affymetrix/Format_converters directory; if there are any problems, this is the first place to look. The first of these scripts will read the .CEL file and, using the information present in the appropriate .CDF.index file, create a file containing the expression data listed by the probe set identities rather than by the X-Y position on the chip. This file will then be processed by a third script which creates a file in which these values have been normalised by the median intensity of signals on the chip. As the script progresses it also enters the identities of the files created into the database, both as a record and to allow the processing of all files by other means.

Converting the data to a format that is acceptable to the database

The next script to be run is the 'int_norm_to_db_array_style.pl' script. This script does not need to be run once for each .CEL file that has been entered, as it reads several files at once; it should be run after the administrator has run addFile.pl for all the new data to be incorporated. This script will create a file that can be copied into the database with a simple SQL command. Again, this script is present in the same directory as the addFile.pl script and needs to be called with the whole path. Although the script can be run from anywhere, I recommend not running it from the same directory as where you keep the CEL files, but rather from some directory where you can keep the files produced by the script. The script determines which new data needs to be added by inspecting the database to see which files have not yet had their data added to the data table. This should work completely automatically, but as with all automation, there'll probably be some problems at some stage. The script produces a file called 'data_postgres' which can be copied into the data table of the database (see the next stage).
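
For example (the working directory name here is made up; any directory away from the CEL files will do):

mkdir -p /home/chipdb/db_load      # hypothetical working directory for the output
cd /home/chipdb/db_load
/home/chipdb/affymetrix/Format_converters/int_norm_to_db_array_style.pl
ls -l data_postgres                # the file to be copied into the database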

Copying the data to the database

In order to copy the data to the expression database it is necessary to connect to the database and issue a couple of SQL commands. This could quite easily be automated, but doing it manually is more informative. Basically, connect to the database using the 'psql' programme as the chipdb user, then issue the command to copy the data from the file produced in the preceding step. Basically, follow these steps:

First log in as chipdb or do 'su - chipdb' to become the chipdb user, and then issue the following commands:

psql expression
\copy data from /the/path/to/the/file/you/made/data_postgres
\q

Lines 2 and 3 of the above are actually typed into the psql client programme. Be careful with this, as psql will allow you to do all sorts of bad things to your database. Do not repeat the copy line (and yes, it and the third line should start with \), as this will enter the data twice into the database. Things may still work, but doing this may cause trouble further down the line.
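
A simple way of checking that the copy did what you expected is to count the rows in the data table before and after the \copy, e.g.:

psql expression
select count(*) from data;    -- note the number, do the \copy, then count again
\q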

And remember, don't be afraid of the command line or unix in general... it's not as bad as you think. Honestly...