The system consists of three basic units which interact with each other to provide the end user with a view of the expression data. These parts are :
Currently, expression data is only loaded into the database at server startup. While it would be possible to implement ways of dynamically adding data to the server process at run time, this is difficult to do without compromising analysis speeds. Given that adding data to the system is not a frequently occurring event, and that implementing a dynamic adding function well is not straightforward, I haven't yet given it much thought.
Additionally, the server process requires a set of files containing genomic sequence in order to be able to serve this to the client process. These files should contain only sequence, there should be one for each chromosome, and they should be complete and synchronised with the specific version of the genome that has been used to compile gene and probe set coordinates. Normal administrators/users should not modify these files or otherwise change components of the system.
The server process requires that the database server is running, and that the appropriate database is available to the server process once it starts up. Currently the name of the database and the access methods are hardcoded into the program, but the database name can be overridden with the -d command line switch. Don't use this though, unless you know what you are doing (which, if you're reading this, you probably don't know yet). To start the server process do the following :
Steps 1-2 require that postgresql has been properly installed with postgres as the database super user, and that the PATH environment variable has been set to include the appropriate directories (for a default installation of postgresql this will be '/usr/local/pgsql/bin'). They also require that the PGDATA environment variable has been set appropriately and that the dynamic linker knows the location of the pgsql libraries. Note that this is normally done during the installation of the postgresql server, but these locations may vary depending on how the installation has been carried out. I tend to always follow the default instructions given by the postgresql team (everything in /usr/local), but installations which have been performed by Linux distributions may vary.
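For a default installation (everything under /usr/local, as mentioned above) the setup would look something like the following sketch; the exact paths, and whether the ldconfig step is needed at all, depend on how postgresql was installed :

export PATH=$PATH:/usr/local/pgsql/bin    # so that psql and friends can be found
export PGDATA=/usr/local/pgsql/data       # the database cluster directory
echo '/usr/local/pgsql/lib' >> /etc/ld.so.conf    # tell the dynamic linker about the pgsql libraries (as root)
ldconfig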
Steps 4-5 require that the script restart_server.pl is accessible and in the path of the user who previously started the server. This is not normally set up by the installation, as there are many different ways of configuring the system. I recommend that administrators have a look at the script for some ideas as to what needs to be done. As mentioned above, this script only works in the simplest of circumstances, so it is worthwhile to write something more specific for the local administration.
This shouldn't usually be necessary, as the server will stop when the machine is shut down. However, if you wish to stop the server, the easiest way is to issue the command :
ps aux | grep mtserver
If the server is running, this should result in something like :
chipdb    3066  0.1 18.9 713244 696124 ?      S    Nov10   4:23 mtserver_2.10
chipdb    3089  0.0 18.9 713244 696124 ?      S    Nov10   0:01 mtserver_2.10
chipdb    3090  0.0 18.9 713244 696124 ?      S    Nov10   0:47 mtserver_2.10
martin    8775  0.0  0.0   1772    596 pts/6  S    14:09   0:00 grep mtserver
The last line just lists the grep job that you just performed, and the other lines list the different processes which were started with a command including the text mtserver. Choose one of the server processes (in this case PID 3089, the one with a CPU time of 0:01) and kill it by typing :
kill 3089
This has to be done either as the superuser (root) or as the user who started the server (in this case chipdb), so before you can kill the process you have to either log in as, or su to, root or chipdb. It is recommended that you su to chipdb rather than to root, as in general it is better to avoid being root. This is essentially because root is all-powerful and can do anything to your system (including removing everything).
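In practice this boils down to something like the following, using the PID from the example above :

su - chipdb    # become the user who started the server (you will need chipdb's password)
kill 3089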
After you have done this, repeat the ps command above to see if all the mtserver processes have been killed. Note that you cannot start more than one instance of mtserver at the same time unless you use the -p switch to specify the port number. However, you probably don't want to start multiple server processes accessing the same database, as there are no mechanisms for these to deal with updates to the database from each other, and this could potentially result in some bad behaviour.
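When repeating the check, remember that the grep itself shows up in the listing; a common trick to hide it is to put brackets around the first letter of the pattern :

ps aux | grep '[m]tserver'    # matches the mtserver processes but not the grep itself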
As usual on Unix/Linux systems, if in doubt, try something like the man commands for more information, e.g. man su, man ps, man grep, man man, man -k port, and so on.
Currently the process for adding expression data to the system is a cumbersome and somewhat dangerous endeavour. This is primarily because the data is stored in a database, which provides some integrity checking; however, due to the size of the data it is not realistic for the database to provide full integrity checking. It is also cumbersome because it is currently carried out by a series of scripts without any graphical interface. These scripts provide some error checking, but they cannot entirely prevent the user from doing the wrong thing. Since it seems that there will be more users of these programs in the future, this is something which will be improved on in the near future (?), though not before this release. So hang on in there.. for a while.
Currently the process of adding data to the system requires three (well, four perhaps) steps :
These scripts are located in the ~/affy_expression/affymetrix/Format_converters directory. They can only be run by database users with the appropriate privileges. This is basically to try and minimise the risk of corrupting the database. These scripts are not currently in the PATH of chipdb, and so have to be called with the full path (eg. '~/affy_expression/affymetrix/Format_converters/addFile.pl'). This can easily be changed (i.e. either alter the PATH variable to include this path, or create a symlink into ~/bin/).
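Either of those changes would look roughly like this (assuming that ~/bin is already in chipdb's PATH) :

export PATH=$PATH:~/affy_expression/affymetrix/Format_converters

or

ln -s ~/affy_expression/affymetrix/Format_converters/addFile.pl ~/bin/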
The scripts need to be run from the directory where the .CEL files are located. Note that these should not be located on a CDROM or other external media, but should be copied to the specific location where they will be stored. This can be anywhere on the hard-disk. This is important as the current location of the file, including the path from where the script is called, will be entered into the database. The database cannot keep track of where the user puts the files, and so these should not be moved after they have been entered into the system. It is generally a good idea to remove write access to the CEL files to prevent these from being overwritten. This is not important for the intermediate file types which are generated by addFile.pl, as these can be regenerated from the CEL files.
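For example, once the files have been copied to their final location (the directory name here is just an illustration; use wherever you have decided to keep the files permanently) :

cd /data/cel_files/experiment_1    # the final resting place of the .CEL files
chmod a-w *.CEL                    # remove write access so the files cannot be overwritten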
Both scripts work by adding a number of entries to the database that describe the samples and files to be entered, and then calling additional scripts which convert the .CEL formatted data to something that can be copied into the database. The batch script reads in these details from a specially formatted input file (described below), whereas the addFile.pl script prompts the user for the details in an interactive (but not very convenient) manner.
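Putting this together, a typical interactive session is simply started from the directory containing the .CEL files, calling the script with its full path :

~/affy_expression/affymetrix/Format_converters/addFile.pl

after which addFile.pl prompts for each of the fields described below.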
Note, if you realise that you've made some stupid mistake, then you can stop the process before completion by pressing Ctrl-C. You may need to press Ctrl-C several times, as the script calls external programs.
The fields which need to be entered are as follows:
chicken_expression=# select * from chips;
 index |     id     |                       description
-------+------------+---------------------------------------------------------
     1 | MG_U74Av2  | Mouse genome set A, version 2
     2 | MG_U74Bv2  | Mouse genome set B, version 2
     3 | MG_U74Cv2  | Mouse genome set C, version 2
     4 | MOE430A    | Mouse genome set 2 A
     5 | MOE430B    | Mouse genome set 2 B
     6 | Mouse430_2 | MOE430A and MOE430B on one chip. Uses same identifiers
     7 | HG-U133A   | Human genome set 133 A
     8 | HG-U133B   | Human genome set 133 B
     9 | Chicken    | Chicken genome set
(9 rows)

The chips are defined by the index number in this table. It is important that you use the correct number. Both scripts will try to open the .CEL file and determine the chip used, but in order to make sure that you know which data you are entering, both scripts also force you to make this explicit. If your chip is not represented here, let me know and I'll see what I can do. The procedure for adding a new chip to the system is not entirely straightforward, but it is thankfully getting less and less painful.
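You can check the current contents of this table yourself with a one-off psql command (assuming the database is named chicken_expression, as in the listing above) :

psql chicken_expression -c 'select * from chips;'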
The addFile.pl script takes the user through the entry of these fields interactively, in the manner described below.
Once the script has gone through the above stages, it will run some other scripts which should be present in the ~/affy_expression/affymetrix/Format_converters directory. If there are any problems, this is the first place to look. The first of these scripts will read the .CEL file and, using the information present in the appropriate .CDF.index file, create a file containing the expression data listed by probe set identities rather than by X-Y position on the chip. This file will then be processed by a third script which creates a file where these values have been normalised by the median intensity of signals on the chip. As the script progresses it also records the identities of the files created in the database, both as a record and to allow the processing of all files by other means.
The addFileBatch.pl script essentially performs the same functions as the addFile.pl script, but reads its input from a specially formatted file. This format relies heavily on the presence of tabs, and the file must adhere strictly to the format (see the example below and the further explanation that follows).
Comments to the file can be added anywhere except within the areas where new or old experiments are described. Lines containing comments should not begin with the word Experiment (this line is ok as the first word is not Experiment) or with three dashes (---), as these are used to delimit the regions where samples are described.

Experiment new 3.5 7 "Description of the experiment enclosed in quotation marks" "short description"
        1       Name_of_cel_file_A.CEL  /path/to/cel_file
        2       Name_of_cel_file_B.CEL  /path/to/cel_file
        3       Name_of_cel_file_C.CEL  /path/to/cel_file
---
some more comments, about the next experiment. The next experiment (sample) was
defined in the system previously, but data was only added from chip no. 1. This
time data from chip no. 2 is added as well. Note that the term Experiment is now
followed by the keyword 'old' rather than 'new' as in the previous section.
Experiment old 3.6 7 "Description of the file as above" "short description"
        2       Name_of_cel_file_B.CEL  /path/to/cel_file
---
The batch file has only three keywords, 'Experiment', 'new' and 'old', in addition to using tabs and three dashes to signify the end of an entry region. Entry regions begin with the keyword 'Experiment', which should be followed by 'new', signifying that a new experiment description needs to be entered into the database, or 'old', signifying that we are just adding data for an already described experiment. This is then followed by a floating point number (the experiment identifier), which is used to determine the default order of the samples, and the group identifier (as described above). This is then followed by a description and a short description, each enclosed in double quotation marks. The different fields need to be separated by at least one whitespace character (using tabs works fine), but are otherwise fairly simple. All these terms need to be on the same line; you cannot put returns into descriptions. This is just for programming convenience, and if you should wish to be able to do this I encourage you to write some scripts which allow it.
The lines following this tell the program where to find the .CEL files and which chips they represent. Each line should begin with a tab, followed by the id of the chip, the name of the cel file and the path to the cel file (the fields again separated by whitespace). Each line should represent one .CEL file pertaining to the sample described in the header line above (the one beginning with Experiment). Note that you cannot put several files representing the same chip and the same sample here; that's to say, you cannot specify replicates at this stage. Primarily this is because replicate status should not be hard-coded into the system, but should be possible to address dynamically at later stages of the analysis. The system at the moment does not allow this, but it is my intention to allow the dynamic definition of replicate status at later stages (along with dynamic groupings and orderings). The last line for the given sample region should begin with three dashes, and signals to the program that there are no more cel files to add for the given sample.
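Once the batch file has been prepared, a run would look something like the following; I am assuming here that addFileBatch.pl takes the name of the batch file as its argument (have a look at the script itself if it complains), and the file name is just an illustration :

cd /data/cel_files/experiment_1
~/affy_expression/affymetrix/Format_converters/addFileBatch.pl batch_description.txt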
The next script to be run is the 'int_norm_to_db_array_style.pl' script. This script does not need to be run once for each .CEL file that's been entered, as it reads several files at once. It should be run after the administrator has run addFile.pl or addFileBatch.pl for all the new data to be incorporated. This script will create a file that can be copied into the database through a simple SQL command. Again, this script is present in the same directory as the addFile.pl script and needs to be called with the whole path. Although this script can be run from anywhere, I recommend not running it from the same directory as where you keep the CEL files, but rather in some directory where you can keep the files produced by the script. The script determines which new data needs to be added by inspecting the database to determine for which files data has not yet been added to the data table. This should work completely automatically, but as with all automation, there'll probably be some problems at some stage. The script produces a file called 'data_postgres' which can be copied into the data table of the database (see the next stage).
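A typical run would thus be something like the following; the directory name is just an illustration, and I am assuming the script needs no arguments :

cd ~/db_imports    # some directory for keeping the generated files
~/affy_expression/affymetrix/Format_converters/int_norm_to_db_array_style.pl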
In order to copy the data to the expression database it is necessary to connect to the database and issue a couple of SQL commands. This can quite easily be automated, but doing it manually is more informative. Basically, connect to the database using the 'psql' programme as the database owner, then issue the command to copy the data from the file produced in the preceding step. In short, follow these steps..
First log in as the database owner or do 'su - name_of_owner' to become the owner, and then issue the following commands...
psql expression
\copy data from /the/path/to/the/file/you/made/data_postgres
\q
Lines 2 and 3 of the above are actually typed into the psql client programme. Be careful with this, as it will allow you to do all sorts of bad things to your database. Do not repeat the copy line (and yes, it and the third line should start with \), as repeating it will enter the data twice into the database. Things may still work, but it is possible that doing this may cause trouble further down the line.
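If you want some reassurance that the copy worked, one simple sanity check is to count the rows in the data table before and after the copy (bearing in mind that counting a very large table can take a while) :

psql expression -c 'select count(*) from data;'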
And remember, don't be afraid of the command line or unix in general... it's not as bad as you think. Honestly..