Adam P Cribbs

Computational biologist at the University of Oxford

R - Convert an Ensembl gene ID to a gene symbol

Written on May 1, 2020

One of the most frustrating things about working with Ensembl gene IDs is that, as humans, we would rather see gene symbols, so that we can scan through our lists and spot our favourite genes.

I always found converting Ensembl IDs to symbols in R really annoying. However, several packages now make this much more efficient. I will show you two ways of doing it: (1) using org.Hs.eg.db and (2) using annotables.

org.Hs.eg.db

library(org.Hs.eg.db)
infile = read.csv("data/gene_lists.csv")
data = infile[,"ENSEMBL"]

# If the ID column was read in as a factor (the read.csv() default before R 4.0),
# it is stored internally as integers, so we convert it to a plain character
# vector before passing it into the annotation mapping

data = as.character(data)

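# Using org.Hs.eg.db we set up the mapping info - run keytypes(org.Hs.eg.db) and
# columns(org.Hs.eg.db) to see the other ID types you can map from and to.
# The AnnotationDbi:: prefix avoids a clash with dplyr::select() if dplyr is loaded

annots <- AnnotationDbi::select(org.Hs.eg.db, keys=data,
                                columns="SYMBOL", keytype="ENSEMBL")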

result <- merge(infile, annots, by.x="ENSEMBL", by.y="ENSEMBL")

print(head(result))
##           ENSEMBL log2FoldChange   baseMean        padj   SYMBOL
## 1 ENSG00000000419     0.18713483 1487.61562 0.377139110     DPM1
## 2 ENSG00000000457     0.02922775  519.65035 0.927794309    SCYL3
## 3 ENSG00000000460     0.31120415  125.54166 0.380726319 C1orf112
## 4 ENSG00000000971    -0.18479425   79.10841 0.818624222      CFH
## 5 ENSG00000001036     0.39103001   53.04816 0.532827630    FUCA2
## 6 ENSG00000001084    -0.51993067 1065.13861 0.001388325     GCLC
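
Not every Ensembl ID has a symbol, and some map to more than one, so it is worth checking the annots table before trusting the merge. The following is a minimal sketch of those checks in base R. It uses a small made-up table in place of the real select() output, so the IDs and symbols (ENSG_A, GENE1 and so on) are placeholders and not real annotations.

```r
# Toy stand-in for the data frame returned by select(); the IDs and symbols
# are placeholders for illustration, not real annotation lookups.
annots <- data.frame(
  ENSEMBL = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C"),
  SYMBOL  = c("GENE1", "GENE2", "GENE2B", NA),
  stringsAsFactors = FALSE
)

# Ensembl IDs that came back without a symbol
unmapped <- annots$ENSEMBL[is.na(annots$SYMBOL)]

# Ensembl IDs that appear on more than one row (one ID, several symbols)
multi <- unique(annots$ENSEMBL[duplicated(annots$ENSEMBL)])

# Keep only the first row per Ensembl ID so that a later merge is one-to-one
annots_unique <- annots[!duplicated(annots$ENSEMBL), ]

print(unmapped)
print(multi)
print(annots_unique)
```

Keeping the first symbol per ID is only one possible choice. Depending on the analysis you may prefer to keep every mapping or to drop ambiguous IDs altogether.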

annotables

library(annotables)
library(tidyverse)
infile = read.csv("data/gene_lists.csv")

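# Take the 20 most significant rows and join them to the grch38 annotation table.
# inner_join() returns one row per matching pair, so a gene with more than one
# Entrez ID in grch38 (MRPL23 below) is repeated and the result can exceed 20 rows

infile %>%
    dplyr::arrange(padj) %>%
    head(20) %>%
    dplyr::inner_join(grch38, by = c("ENSEMBL" = "ensgene"))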
##    log2FoldChange   baseMean     padj         ENSEMBL    entrez symbol chr
## 1      -0.9193411  3487.4362 1.42e-22 ENSG00000186480      3638 INSIG1   7
## 2      -0.8311274  5338.9306 6.31e-18 ENSG00000137309      3159  HMGA1   6
## 3       1.7218713   307.1204 6.31e-18 ENSG00000214026      6150 MRPL23  11
## 4       1.7218713   307.1204 6.31e-18 ENSG00000214026 107987373 MRPL23  11
## 5       1.7218713   307.1204 6.31e-18 ENSG00000214026      6150 MRPL23  11
## 6       1.7218713   307.1204 6.31e-18 ENSG00000214026 107987373 MRPL23  11
## 7       1.3370923   506.9614 1.74e-17 ENSG00000136286     64005  MYO1G   7
## 8      -0.9048246 26303.5720 4.00e-17 ENSG00000087086      2512    FTL  19
## 9       1.3622491   349.2231 3.38e-16 ENSG00000122122     54440  SASH3   X
## 10      1.0544227  1841.3427 3.29e-14 ENSG00000116824       914    CD2   1
## 11      1.1525886   579.1535 1.18e-13 ENSG00000183735     29110   TBK1  12
## 12      1.1569100   492.1651 3.76e-13 ENSG00000133561    474344 GIMAP6   7
## 13      1.1097412  1768.2973 3.71e-12 ENSG00000171552       598 BCL2L1  20
## 14      0.8687971  2918.0275 2.18e-11 ENSG00000003402      8837  CFLAR   2
## 15      1.5528225   179.4745 3.49e-11 ENSG00000160685     51043 ZBTB7B   1
## 16     -0.8610968  4122.4284 4.47e-11 ENSG00000167996      2495   FTH1  11
## 17     -0.9730979   612.3984 2.05e-09 ENSG00000130766     83667  SESN2   1
## 18      1.1255855   284.6935 2.05e-09 ENSG00000143891    130589   GALM   2
## 19     -0.7757254  1545.5724 3.91e-09 ENSG00000162413      9903 KLHL21   1
## 20      0.9742609   657.1901 4.98e-09 ENSG00000197142     51703  ACSL5  10
## 21      0.8464779  1131.9277 8.26e-09 ENSG00000069493     29121 CLEC2D  12
## 22      1.1942058  1151.1959 1.33e-08 ENSG00000111679      5777  PTPN6  12
##        start       end strand        biotype
## 1  155297776 155310235      1 protein_coding
## 2   34236873  34246231      1 protein_coding
## 3    1947278   1984522      1 protein_coding
## 4    1947278   1984522      1 protein_coding
## 5    1947278   1984522      1 protein_coding
## 6    1947278   1984522      1 protein_coding
## 7   44962662  44979098     -1 protein_coding
## 8   48965301  48966878      1 protein_coding
## 9  129779979 129795201      1 protein_coding
## 10 116754385 116769228      1 protein_coding
## 11  64451880  64502108      1 protein_coding
## 12 150625375 150632648     -1 protein_coding
## 13  31664452  31723989     -1 protein_coding
## 14 201116104 201176687      1 protein_coding
## 15 155002630 155018522      1 protein_coding
## 16  61959718  61967660     -1 protein_coding
## 17  28259527  28282491      1 protein_coding
## 18  38665910  38741237      1 protein_coding
## 19   6590724   6614607     -1 protein_coding
## 20 112374018 112428380      1 protein_coding
## 21   9664969   9699555      1 protein_coding
## 22   6946468   6961316      1 protein_coding
##                                                                             description
## 1                             insulin induced gene 1 [Source:HGNC Symbol;Acc:HGNC:6083]
## 2                      high mobility group AT-hook 1 [Source:HGNC Symbol;Acc:HGNC:5010]
## 3               mitochondrial ribosomal protein L23 [Source:HGNC Symbol;Acc:HGNC:10322]
## 4               mitochondrial ribosomal protein L23 [Source:HGNC Symbol;Acc:HGNC:10322]
## 5               mitochondrial ribosomal protein L23 [Source:HGNC Symbol;Acc:HGNC:10322]
## 6               mitochondrial ribosomal protein L23 [Source:HGNC Symbol;Acc:HGNC:10322]
## 7                                         myosin IG [Source:HGNC Symbol;Acc:HGNC:13880]
## 8                               ferritin light chain [Source:HGNC Symbol;Acc:HGNC:3999]
## 9                   SAM and SH3 domain containing 3 [Source:HGNC Symbol;Acc:HGNC:15975]
## 10                                      CD2 molecule [Source:HGNC Symbol;Acc:HGNC:1639]
## 11                            TANK binding kinase 1 [Source:HGNC Symbol;Acc:HGNC:11584]
## 12                     GTPase, IMAP family member 6 [Source:HGNC Symbol;Acc:HGNC:21918]
## 13                                        BCL2 like 1 [Source:HGNC Symbol;Acc:HGNC:992]
## 14           CASP8 and FADD like apoptosis regulator [Source:HGNC Symbol;Acc:HGNC:1876]
## 15         zinc finger and BTB domain containing 7B [Source:HGNC Symbol;Acc:HGNC:18668]
## 16                            ferritin heavy chain 1 [Source:HGNC Symbol;Acc:HGNC:3976]
## 17                                        sestrin 2 [Source:HGNC Symbol;Acc:HGNC:20746]
## 18                             galactose mutarotase [Source:HGNC Symbol;Acc:HGNC:24063]
## 19                      kelch like family member 21 [Source:HGNC Symbol;Acc:HGNC:29041]
## 20   acyl-CoA synthetase long chain family member 5 [Source:HGNC Symbol;Acc:HGNC:16526]
## 21           C-type lectin domain family 2 member D [Source:HGNC Symbol;Acc:HGNC:14351]
## 22 protein tyrosine phosphatase, non-receptor type 6 [Source:HGNC Symbol;Acc:HGNC:9658]
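
The output above has 22 rows even though head(20) was requested, because ENSG00000214026 is matched to two Entrez IDs. If you only want symbols, one way round this is to reduce the lookup table to one row per Ensembl ID before joining. The sketch below shows the idea in base R with made-up tables: de stands in for the results table and lookup stands in for grch38, and all IDs and values in them are placeholders.

```r
# Toy stand-ins: 'de' mimics the results table and 'lookup' mimics an
# annotation table such as grch38. All IDs and values are made up.
de <- data.frame(
  ENSEMBL = c("ENSG_A", "ENSG_B", "ENSG_C"),
  padj    = c(0.001, 0.02, 0.5),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  ensgene = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C"),
  entrez  = c(11L, 22L, 23L, 33L),
  symbol  = c("GENE1", "GENE2", "GENE2", "GENE3"),
  stringsAsFactors = FALSE
)

# Joining against the full lookup repeats ENSG_B, once per Entrez ID
naive <- merge(de, lookup, by.x = "ENSEMBL", by.y = "ensgene")

# Keep only the columns needed for labelling, then drop repeated rows so
# that each Ensembl ID appears once in the lookup
symbols <- unique(lookup[, c("ensgene", "symbol")])

# Joining against the de-duplicated lookup keeps one row per input gene
annotated <- merge(de, symbols, by.x = "ENSEMBL", by.y = "ensgene")

print(nrow(naive))
print(nrow(annotated))
print(annotated)
```

In a dplyr pipeline the same de-duplication can be written as grch38 %>% dplyr::distinct(ensgene, symbol) before the join. This only collapses rows where the symbol is identical; an ID with two different symbols would still appear twice.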