Data preparation

Suppose you have a matrix of count data stored in the file data.txt. This tutorial uses the pasilla RNA-seq dataset, which is available through Bioconductor. You can inspect the first few lines of the file with the head command at the terminal:

$ head /path/to/data.txt
treated1fb      treated2fb      treated3fb      untreated1fb    untreated2fb    untreated3fb    untreated4fb
FBgn0000003     0       0       1       0       0       0       0
FBgn0000008     78      46      43      47      89      53      27
FBgn0000014     2       0       0       0       0       1       0
FBgn0000015     1       0       1       0       1       1       2
FBgn0000017     3187    1672    1859    2445    4615    2063    1711
FBgn0000018     369     150     176     288     383     135     174
FBgn0000022     0       0       0       0       1       0       0
FBgn0000024     4       5       3       4       7       1       0
FBgn0000028     0       1       1       0       1       0       0
...

The data consist of 7 libraries (samples) with 14115 features each. The libraries fall into two classes: treated (3 libraries) and untreated (4 libraries).
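These dimensions and class sizes can also be checked from Python. The following is a minimal sketch, not part of the original tutorial: it uses a small inline stand-in for data.txt (the first rows shown above) and assumes that the header line has one field fewer than the data rows, as in the listing, so that pandas treats the first column as the row index. Deriving the class from the column-name prefix is likewise an assumption about the naming scheme.

```python
import io
import pandas as pd

# Stand-in for data.txt: the first rows shown above, tab-separated.
# The header has one field fewer than the data rows, so pandas uses
# the first column (the feature IDs) as the row index.
sample = (
    "treated1fb\ttreated2fb\ttreated3fb\t"
    "untreated1fb\tuntreated2fb\tuntreated3fb\tuntreated4fb\n"
    "FBgn0000003\t0\t0\t1\t0\t0\t0\t0\n"
    "FBgn0000008\t78\t46\t43\t47\t89\t53\t27\n"
    "FBgn0000014\t2\t0\t0\t0\t0\t1\t0\n"
)
counts = pd.read_csv(io.StringIO(sample), sep="\t")

# Rows are features, columns are libraries.
n_features, n_samples = counts.shape

# Assumed naming scheme: the class is the prefix of the column name.
classes = [
    "untreated" if name.startswith("untreated") else "treated"
    for name in counts.columns
]
class_sizes = pd.Series(classes).value_counts().to_dict()

print(n_features, n_samples, class_sizes)
```

On the full file, n_features should match the 14115 features quoted above; the inline sample contains only three rows.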

Let's use IPython to filter the data. Execute ipython at the terminal and, at the subsequent IPython command prompt, type the following:

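import numpy as np
import pandas as pd
import matplotlib.pylab as pl

# Read the tab-delimited count matrix (rows: features, columns: libraries)
counts = pd.read_table('/path/to/data.txt')

# Total count of each feature across all libraries
row_sums = counts.sum(axis=1)

# Keep features whose total exceeds the 40th percentile (roughly the upper 60%)
idxs = row_sums > np.percentile(row_sums, 40)
counts_filt = counts[idxs]
counts_filt.head()    # inspect the data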

The filtered dataset contains 8461 features. The next step is to process your data. Learn how!
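As a side note on the filtering step, the percentile filter can be wrapped in a small helper and tried on toy data. This is a sketch under stated assumptions: the function name filter_low_counts, its pct parameter, and the toy table are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd

def filter_low_counts(counts, pct=40):
    """Keep rows whose total count exceeds the pct-th percentile of all row totals."""
    row_sums = counts.sum(axis=1)
    threshold = np.percentile(row_sums, pct)
    # Strict comparison: rows whose total equals the threshold are dropped.
    return counts[row_sums > threshold]

# Hypothetical toy matrix: five features, two libraries.
toy = pd.DataFrame(
    {"treated1": [0, 78, 2, 3187, 369], "untreated1": [0, 47, 0, 2445, 288]},
    index=["g1", "g2", "g3", "g4", "g5"],
)

kept = filter_low_counts(toy, pct=40)
print(kept.index.tolist())
```

Because the comparison is strict, the retained fraction can fall slightly below 60%, for instance when several rows share the threshold total. This is consistent with the 8461 of 14115 features (just under 60%) reported above.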