Let's assume that you have a matrix of count data stored in the file data.txt. In this tutorial, we
shall use the pasilla RNA-seq dataset, which is available through
Bioconductor.
You can inspect the data using the head command at the terminal:
$ head /path/to/data.txt

             treated1fb  treated2fb  treated3fb  untreated1fb  untreated2fb  untreated3fb  untreated4fb
FBgn0000003           0           0           1             0             0             0             0
FBgn0000008          78          46          43            47            89            53            27
FBgn0000014           2           0           0             0             0             1             0
FBgn0000015           1           0           1             0             1             1             2
FBgn0000017        3187        1672        1859          2445          4615          2063          1711
FBgn0000018         369         150         176           288           383           135           174
FBgn0000022           0           0           0             0             1             0             0
FBgn0000024           4           5           3             4             7             1             0
FBgn0000028           0           1           1             0             1             0             0
...
The data consist of 7 libraries (samples) with 14115 features each. The libraries fall into two classes: treated (3 libraries) and untreated (4 libraries).
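The layout described above can be sketched in pandas. This is a minimal illustration, not part of the original workflow: it rebuilds a few rows copied from the head output by hand, and the names counts_head, treated and untreated are hypothetical. It assumes the class of each library can be read from the prefix of its column name.

```python
import pandas as pd

# Column names as they appear in the header of data.txt
columns = ["treated1fb", "treated2fb", "treated3fb",
           "untreated1fb", "untreated2fb", "untreated3fb", "untreated4fb"]

# A few rows copied from the head output (feature ID -> counts per library)
rows = {
    "FBgn0000003": [0, 0, 1, 0, 0, 0, 0],
    "FBgn0000008": [78, 46, 43, 47, 89, 53, 27],
    "FBgn0000014": [2, 0, 0, 0, 0, 1, 0],
    "FBgn0000017": [3187, 1672, 1859, 2445, 4615, 2063, 1711],
}
counts_head = pd.DataFrame.from_dict(rows, orient="index", columns=columns)

# Assumption: the class label is the prefix of the column name
treated = [c for c in counts_head.columns if c.startswith("treated")]
untreated = [c for c in counts_head.columns if c.startswith("untreated")]

print(counts_head.shape)             # (number of features, number of libraries)
print(len(treated), len(untreated))  # libraries per class
```

Features are in rows and libraries in columns, so summing along axis 1 gives one total per feature, which is what the filtering step relies on.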
Let's use IPython to filter the data. Execute ipython at the terminal and, at the subsequent IPython command prompt, type the following:
import numpy as np
import pandas as pd
import matplotlib.pylab as pl

# Load the count matrix (features in rows, samples in columns)
counts = pd.read_table('/path/to/data.txt')
# Total count of each feature across all samples
row_sums = counts.sum(axis=1)
# Keep features whose total exceeds the 40th percentile (roughly the upper 60% of the data)
idxs = row_sums > np.percentile(row_sums, 40)
counts_filt = counts[idxs]
counts_filt.head()  # inspect the data
The filtered dataset contains 8461 features. The next step is to process your data.
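To see how the percentile filter behaves on a small scale, here is a sketch that applies the same rule to the nine rows shown in the head output. The helper filter_by_row_sum and the variable counts_head are hypothetical names introduced for illustration only, and the table is typed in by hand rather than read from data.txt.

```python
import numpy as np
import pandas as pd

columns = ["treated1fb", "treated2fb", "treated3fb",
           "untreated1fb", "untreated2fb", "untreated3fb", "untreated4fb"]

# The nine rows from the head output (feature ID -> counts per library)
rows = {
    "FBgn0000003": [0, 0, 1, 0, 0, 0, 0],
    "FBgn0000008": [78, 46, 43, 47, 89, 53, 27],
    "FBgn0000014": [2, 0, 0, 0, 0, 1, 0],
    "FBgn0000015": [1, 0, 1, 0, 1, 1, 2],
    "FBgn0000017": [3187, 1672, 1859, 2445, 4615, 2063, 1711],
    "FBgn0000018": [369, 150, 176, 288, 383, 135, 174],
    "FBgn0000022": [0, 0, 0, 0, 1, 0, 0],
    "FBgn0000024": [4, 5, 3, 4, 7, 1, 0],
    "FBgn0000028": [0, 1, 1, 0, 1, 0, 0],
}
counts_head = pd.DataFrame.from_dict(rows, orient="index", columns=columns)


def filter_by_row_sum(counts, pct):
    """Hypothetical helper: keep rows whose total count is strictly above the pct-th percentile."""
    row_sums = counts.sum(axis=1)
    threshold = np.percentile(row_sums, pct)
    return counts[row_sums > threshold]


kept = filter_by_row_sum(counts_head, 40)
print(kept.index.tolist())
```

Here five of the nine rows pass the filter. Because the comparison is strict, features whose total equals the threshold are dropped, so with many tied totals the retained fraction can fall slightly below 60%.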