Determining common insertion sites based on retroviral insertion distribution across tumors

abstract

  • A CIS (common insertion site) is a genome region that is hit by retroviral insertions more frequently than expected by chance. Such regions are strongly associated with cancer gene loci, so they help in detecting cancer genes. An algorithm for detecting CISs should satisfy the following: (1) it does not require any prior knowledge of the underlying insertion distribution; (2) it can resolve the insertion biases caused by hotspots; (3) it can detect CISs of any biological width; (4) it can identify noise resulting from statistical errors and non-CIS insertions; and (5) it can identify the widths of CISs as accurately as possible. We develop a method to resolve these difficulties. We verify a region's significance from two perspectives: distribution width and distribution depth. The former indicates how many insertions fall in a region, while the latter evaluates how the insertions in a region are distributed across the tumors. We compare our method with kernel density estimation and the sliding-window approach on simulated data, showing that our method not only identifies cancer-related insertions effectively but also filters noise correctly. Experiments on real data show that taking the insertion distribution into account can highlight significant CISs. We detect 53 novel CISs, some of which have been confirmed by the biological literature.
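To make "hit more frequently than expected by chance" concrete, the following is a minimal sketch of a sliding-window baseline of the kind the abstract compares against. It is not the authors' method. The function names (`poisson_sf`, `sliding_window_cis`), the `(position, tumor_id)` input layout, the uniform Poisson null model, and the `alpha` threshold are all assumptions made for illustration. A uniform null is exactly the kind of prior assumption that criteria (1) and (2) warn about, which is why this is only a baseline. The sketch also reports how many distinct tumors contribute to each window, a rough stand-in for the across-tumor view described above.

```python
import math
from bisect import bisect_right


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):      # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def sliding_window_cis(insertions, genome_length, window, alpha=0.05):
    """Flag windows that hold more insertions than a uniform null predicts.

    insertions: iterable of (position, tumor_id) pairs (hypothetical layout).
    Each window is anchored at an insertion, so only the *extra* insertions
    in the window are tested against the Poisson expectation. That is a
    design choice of this sketch, and no multiple-testing correction is
    applied.
    """
    ins = sorted(insertions, key=lambda rec: rec[0])
    positions = [pos for pos, _ in ins]
    lam = len(ins) * window / genome_length   # expected insertions per window
    hits = []
    for i, start in enumerate(positions):
        j = bisect_right(positions, start + window - 1)
        count = j - i                          # insertions inside the window
        tumors = {tumor for _, tumor in ins[i:j]}
        p_value = poisson_sf(count - 1, lam)   # chance of this many extras
        if p_value < alpha:
            hits.append((start, start + window - 1, count, len(tumors), p_value))
    return hits


if __name__ == "__main__":
    # Toy data: one tight cluster plus a few isolated insertions.
    data = [(1000, "t1"), (1010, "t2"), (1020, "t3"), (1030, "t4"), (1040, "t1"),
            (100000, "t5"), (300000, "t6"), (500000, "t7")]
    for hit in sliding_window_cis(data, genome_length=1_000_000, window=100):
        print(hit)
```

Each reported tuple holds the window start, window end, insertion count, number of distinct tumors, and p-value. A window whose insertions all come from a single tumor would score the same on count alone, which is the kind of case where looking at the distribution across tumors adds information.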

publication date

  • 2014