Using EUtils
I've been working on an updated Python client library for NCBI's EUtils. It's still in alpha. I'm getting it shipshape so a couple of my students can present it.
Here are some examples of it in action.
Load the EUtils library
>>> import EUtils
# list all the databases served by EUtils
>>> dbs = EUtils.dblist()
>>> dbs
['pubmed', 'protein', 'nucleotide', 'structure', 'genome', 'books',
 'cancerchromosomes', 'cdd', 'domains', 'gene', 'genomeprj', 'gensat',
 'geo', 'gds', 'homologene', 'journals', 'mesh', 'ncbisearch',
 'nlmcatalog', 'omia', 'omim', 'pmc', 'popset', 'probe', 'pcassay',
 'pccompound', 'pcsubstance', 'snp', 'taxonomy', 'unigene', 'unists']
>>> # Information about searching PubMed
>>> dbinfo = EUtils.dbinfo("pubmed")
# database name
>>> dbinfo.db
'pubmed'
# name to display in a menu
>>> dbinfo.menu_name
'PubMed'
# short description
>>> dbinfo.description
'PubMed bibliographic record'
# the last time it was updated
>>> dbinfo.last_update
datetime.datetime(2005, 9, 30, 6, 16)
# number of records in the database
>>> dbinfo.count
15830150
# List of searchable fields
>>> for field in dbinfo.field_list:
...     print field.name + ": " + field.description
...
ALL: All terms from all searchable fields
UID: Unique number assigned to publication
FILT: Limits the records
TITL: Words in title of publication
WORD: Free text associated with publication
MESH: Medical Subject Headings assigned to publication
MAJR: MeSH terms of major importance to publication
AUTH: Author(s) of publication
JOUR: Journal abbreviation of publication
AFFL: Author's institutional affiliation and address
ECNO: EC number for enzyme or CAS registry number
SUBS: CAS chemical name or MEDLINE Substance Name
PDAT: Date of publication
EDAT: Date publication first accessible through Entrez
VOL: Volume number of publication
PAGE: Page number(s) of publication
PTYP: Type of publication (e.g., review)
LANG: Language of publication
ISS: Issue number of publication
SUBH: Additional specificity for MeSH term
SI: Cross-reference from publication to other databases
MHDA: Date publication was indexed with MeSH terms
TIAB: Free text associated with Abstract/Title
OTRM: Other terms associated with publication
INVR: Investigator
COLN: Corporate Author of publication
CNTY: Country of publication
PAPX: MeSH pharmacological action pre-explosions
GRNT: NIH Grant Numbers
MDAT: Date of last modification
CDAT: Date of completion
PID: Publisher ID
FAUT: First Author of publication
FULL: Full Author Name(s) of publication
>>> # The information about a field
>>> all_field = dbinfo.field_list[0]
>>> all_field.name, all_field.description, all_field.term_count
('ALL', 'All terms from all searchable fields', 51491726)
>>> all_field.is_date, all_field.is_numerical, all_field.single_token, \
...     all_field.hierarchy
(False, False, False, False)
>>>

I can search; by default it searches PubMed. I'll search for two papers I wrote some years back.
>>> results = EUtils.search("dalke AND vmd")
>>> results
<EUtils.HistoryClient.LiteratureRecordSet object at 0x102abb0>
>>> len(results)
2
>>>

I can get the database identifiers (this does an efetch of uilist under the covers).
>>> results.dbids
DBIds('pubmed', ['9390282', '8744570'])
>>>

The e* methods are the low-level requests for a given format. They return file handles. I'll get the two records in medline format.
>>> f = results.efetch("medline")
>>> print f.read()
PMID- 9390282
OWN - NLM
STAT- MEDLINE
DA  - 19980115
DCOM- 19980115
LR  - 20041117
PUBM- Print
DP  - 1997
TI  - Using Tcl for molecular visualization and analysis.
PG  - 85-96
AB  - Reading and manipulating molecular structure data is a standard task in every molecular visualization and analysis program, but is rarely available in a form readily accessible to the user. Instead, the development of new methods for analysis, display, and interaction is often achieved by writing a new program, rather than building on pre-existing software. We present the Tcl-based script language used in our molecular modeling program, VMD, and show how it can access information about the molecular structure, perform analysis, and graphically display and animate the results. The commands are available to the user and make VMD a useful environment for studying biomolecules.
AD  - Beckman Institute, Urbana, IL 61801, USA.
FAU - Dalke, A
AU  - Dalke A
FAU - Schulten, K
AU  - Schulten K
LA  - eng
GR  - 5 P41 RR05969-04/RR/NCRR
PT  - Journal Article
PL  - SINGAPORE
TA  - Pac Symp Biocomput
JID - 9711271
RN  - 0 (Proteins)
RN  - 9007-49-2 (DNA)
SB  - IM
MH  - *Computer Simulation
MH  - DNA/*chemistry
MH  - Databases, Factual
MH  - *Models, Molecular
MH  - Nucleic Acid Conformation
MH  - *Programming Languages
MH  - Protein Conformation
MH  - Protein Structure, Secondary
MH  - Proteins/*chemistry
MH  - Research Support, U.S. Gov't, Non-P.H.S.
MH  - Research Support, U.S. Gov't, P.H.S.
EDAT- 1997/12/09
MHDA- 1997/12/09 00:01
PST - ppublish
SO  - Pac Symp Biocomput 1997;:85-96.

PMID- 8744570
OWN - NLM
STAT- MEDLINE
DA  - 19961204
DCOM- 19961204
LR  - 20041117
PUBM- Print
IS  - 0263-7855
VI  - 14
IP  - 1
DP  - 1996 Feb
TI  - VMD: visual molecular dynamics.
PG  - 33-8, 27-8
AB  - VMD is a molecular graphics program designed for the display and analysis of molecular assemblies, in particular biopolymers such as proteins and nucleic acids. VMD can simultaneously display any number of structures using a wide variety of rendering styles and coloring methods. Molecules are displayed as one or more "representations," in which each representation embodies a particular rendering method and coloring scheme for a selected subset of atoms. The atoms displayed in each representation are chosen using an extensive atom selection syntax, which includes Boolean operators and regular expressions. VMD provides a complete graphical user interface for program control, as well as a text interface using the Tcl embeddable parser to allow for complex scripts with variable substitution, control loops, and function calls. Full session logging is supported, which produces a VMD command script for later playback. High-resolution raster images of displayed molecules may be produced by generating input scripts for use by a number of photorealistic image-rendering applications. VMD has also been expressly designed with the ability to animate molecular dynamics (MD) simulation trajectories, imported either from files or from a direct connection to a running MD simulation. VMD is the visualization component of MDScope, a set of tools for interactive problem solving in structural biology, which also includes the parallel MD program NAMD, and the MDCOMM software used to connect the visualization and simulation programs. VMD is written in C++, using an object-oriented design; the program, including source code and extensive documentation, is freely available via anonymous ftp and through the World Wide Web.
AD  - Theoretical Biophysics Group, University of Illinois, Urbana 61801, USA.
FAU - Humphrey, W
AU  - Humphrey W
FAU - Dalke, A
AU  - Dalke A
FAU - Schulten, K
AU  - Schulten K
LA  - eng
GR  - 5 P41 RR05969-04/RR/NCRR
PT  - Journal Article
PL  - UNITED STATES
TA  - J Mol Graph
JID - 9014762
RN  - 0 (Nucleic Acids)
RN  - 0 (Proteins)
SB  - IM
MH  - *Computer Graphics
MH  - *Computer Simulation
MH  - Computers
MH  - *Models, Molecular
MH  - Nucleic Acids/chemistry
MH  - Proteins/chemistry
MH  - Research Support, Non-U.S. Gov't
MH  - Research Support, U.S. Gov't, Non-P.H.S.
MH  - Research Support, U.S. Gov't, P.H.S.
MH  - User-Computer Interface
EDAT- 1996/02/01
MHDA- 1996/02/01 00:01
AID - 0263785596000185 [pii]
PST - ppublish
SO  - J Mol Graph 1996 Feb;14(1):33-8, 27-8.

The methods without the 'e' prefix provide a higher-level interface by parsing the result of the request. 'Fetch' on a literature database does an efetch of the XML format and processes the records using ElementTree, which is included with my EUtils package. (Should it be included? Should it use an existing ElementTree/cElementTree if it exists? Should it split the result into multiple ElementTrees or change it to return a single ElementTree? Hmmm....)
>>> trees = results.fetch()
>>> trees
[<Element PubmedArticle at 29a8df0>, <Element PubmedArticle at 29b5198>]
>>> trees[0].find(".//ArticleTitle").text
'Using Tcl for molecular visualization and analysis.'
>>>

In the previous case the XPath expression ".//ArticleTitle" means "find any descendant element with the tag 'ArticleTitle'"; find() returns the first one it reaches.
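
To see what that path expression does without touching the network, here is a minimal sketch using the ElementTree module from the modern Python standard library. The XML is a hypothetical, heavily trimmed stand-in for an efetch response (real records carry many more elements), and none of the variable names come from the EUtils package.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a PubMed efetch XML response.
SAMPLE_XML = """\
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>9390282</PMID>
      <Article>
        <ArticleTitle>Using Tcl for molecular visualization and analysis.</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>8744570</PMID>
      <Article>
        <ArticleTitle>VMD: visual molecular dynamics.</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

root = ET.fromstring(SAMPLE_XML)

# One element per record, similar in spirit to the list of trees above.
articles = root.findall("PubmedArticle")

# ".//ArticleTitle" searches all descendants; find() returns the first match.
first_title = articles[0].find(".//ArticleTitle").text

# findall() with the same path returns every match below the root.
all_titles = [elem.text for elem in root.findall(".//ArticleTitle")]

print(first_title)
print(all_titles)
```

The leading ".//" is what lets the search skip over the intermediate MedlineCitation and Article levels, so the caller does not need to know the exact nesting.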
The EUtils search string is the Entrez search string, which is quite powerful. (Additional PubMed-specific help.)
# Search for publications in 1997
>>> len(EUtils.search("dalke AND 1997[PDAT]"))
3
# Authors with name "J. Smith"; note the use of double quotes
>>> len(EUtils.search('"smith j"[AU]'))
2017
>>>
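
Because the search term is passed through to Entrez unchanged, the quotes and square brackets have to be URL-encoded somewhere along the way. The sketch below shows one way such a term could be turned into a request URL for NCBI's ESearch service. It only builds the URL and never sends it. The helper name esearch_url is hypothetical, and this is an assumption about the general shape of the request, not a description of what the EUtils package does internally.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# NCBI's public ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="pubmed"):
    """Hypothetical helper: build (but do not send) an ESearch request URL."""
    # urlencode percent-escapes the quotes and brackets in the Entrez term.
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


url = esearch_url('"smith j"[AU]')
print(url)

# Decoding the query string recovers the original Entrez term unchanged.
query = parse_qs(urlsplit(url).query)
print(query["term"][0])
```

Keeping the Entrez syntax intact and leaving the escaping to urlencode means any query that works in the Entrez web interface can be reused as-is.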
I'll search GenPept for bacteriorhodopsin records.
>>> proteins = EUtils.search("bacteriorhodopsin", "protein")
>>> len(proteins)
459
# That's a few too many; get the first 20 records
>>> proteins = proteins[:20]
>>> proteins
<EUtils.HistoryClient.SequenceRecordSet object at 0x29bac50>
>>> len(proteins)
20
# Fetch the 'summary' format for those 20 records
>>> print proteins.efetch("summary").read()
1: XP_712857
hypothetical protein CaO19.12001 [Candida albicans SC5314]
gi|68486496|ref|XP_712857.1|[68486496]

2: XP_712951
hypothetical protein CaO19.4526 [Candida albicans SC5314]
gi|68486305|ref|XP_712951.1|[68486305]

3: XP_713724
hypothetical protein CaO19.11148 [Candida albicans SC5314]
gi|68484666|ref|XP_713724.1|[68484666]

4: XP_713758
hypothetical protein CaO19.3664 [Candida albicans SC5314]
gi|68484597|ref|XP_713758.1|[68484597]

5: XP_660965
hypothetical protein AN3361.2 [Aspergillus nidulans FGSC A4]
gi|67525807|ref|XP_660965.1|[67525807]

6: P69052
Archaerhodopsin 1 precursor (AR 1) (Bacterio-opsin)
gi|60391839|sp|P69052|BACR1_HALSS[60391839]

7: P69051
Archaerhodopsin 1 precursor (AR 1)
gi|60391838|sp|P69051|BACR1_HALS1[60391838]

8: XP_448732
unnamed protein product [Candida glabrata]
gi|50292599|ref|XP_448732.1|[50292599]

9: XP_448541
unnamed protein product [Candida glabrata]
gi|50292217|ref|XP_448541.1|[50292217]

10: XP_447235
unnamed protein product [Candida glabrata]
gi|50289607|ref|XP_447235.1|[50289607]

11: Q9HPU8
Putative bacterio-opsin activator
gi|47115564|sp|Q9HPU8|BAT_HALSA[47115564]

12: Q9F7P4
Green-light absorbing proteorhodopsin precursor (GPR)
gi|32699616|sp|Q9F7P4|PRRG_PRB01[32699616]

13: Q9AFF7
Blue-light absorbing proteorhodopsin precursor (BPR)
gi|32699602|sp|Q9AFF7|PRRB_PRB02[32699602]

14: O93743
Sensory rhodopsin (SR)
gi|14194476|sp|O93743|BACS_HALSD[14194476]

15: O93742
Halorhodopsin (HR)
gi|14194475|sp|O93742|BACH_HALSD[14194475]

16: O93741
Halorhodopsin (HR)
gi|14194474|sp|O93741|BACH_HALS4[14194474]

17: O93740
Bacteriorhodopsin (BR)
gi|14194473|sp|O93740|BACR_HALS4[14194473]

18: NP_010316
Protein that localizes primarily to the plasma membrane, also found at the nuclear envelope; has similarity to Hsp30p and Yro2p, which are induced during heat shock; Mrh1p [Saccharomyces cerevisiae]
gi|6320236|ref|NP_010316.1|[6320236]

19: NP_009950
Hydrophobic plasma membrane localized, stress-responsive protein that negatively regulates the H(+)-ATPase Pma1p; induced by heat shock, ethanol treatment, weak organic acid, glucose limitation, and entry into stationary phase; Hsp30p [Saccharomyces cerevisiae]
gi|6319869|ref|NP_009950.1|[6319869]

20: NP_009610
Putative plasma membrane protein of unknown function, transcriptionally regulated by Haa1p; green fluorescent protein (GFP)-fusion protein localizes to the cell periphery and bud; Yro2p [Saccharomyces cerevisiae]
gi|6319528|ref|NP_009610.1|[6319528]

# Fetch the 'fasta' format for those 20 records
>>> print proteins.efetch("fasta").read()
>gi|68486496|ref|XP_712857.1| hypothetical protein CaO19.12001 [Candida albicans SC5314]
MSAAVSTLSDIIKRNDAVNVNPPNPIIDLHITEHGSDWLWAVFSVFALFAIVHGFIYSFTDVRKSGLKRA
LLTIPLFNSAVFAFAYYTYASNLGYTWILTEFNHAGTGFRQIFYAKFVAWFLGWPLVLAIFQIITNTSFT
TTEDESDLLKKFISLFEALFTRVLAIEVFVLGLLIGALIESTYKWGYFTFAVVFQLFAIYLVINDVVVSF
GSSSHSVFGNALILAFVIVWILYPVAWGLSEGGNVIQPDSEAVFYGILDLITFGVIPIILTWIAINNVDE
EFFTKIWHFHLKPENEHAPTATEDVEKAVGETPRHSGDTAVAPSGVPDTGVAQAQAEAEERI

>gi|68486305|ref|XP_712951.1| hypothetical protein CaO19.4526 [Candida albicans SC5314]
MSAAVSTLSDIIKRNDAVNVNPPNPIIDLHITEHGSDWLWAVFSVFALFAIVHGFIYSFTDVRKSGLKRA
LLTIPLFNSAVFAFAYYTYASNLGYTWILTEFNHAGTGFRQIFYAKFVAWFLGWPLVLAIFQIITNTSFT
TTEDESDLLKKFISLFEALFTRVLAIEVFVLGLLIGALIESTYKWGYFTFAVVFQLFAIYLVINDVVVSF
GSSSHSVFGNALILAFVIVWILYPVAWGLSEGGNVIQPDSEAVFYGILDLITFGVIPIILTWIAINNVDE
EFFTKIWHFHLKPENEHAPTATEDVEKAVGETPRHSGDTAVAPSGVPDTGVAQAQAEAEERI

>gi|68484666|ref|XP_713724.1| hypothetical protein CaO19.11148 [Candida albicans SC5314]
MAVASTFIHNNLEVMNRNTATKVNPTNSLVNMHITDHGSDWLWAAFSVFLLLTIIHLLLFLYGNFRKPGV
KNSLLVIPLFTNAVFSVFYFTYASNLGYAWQAVEFQHAGTGLRQIFYAKFVAWFVGWPAVLALFEIVTST
VLDRIEENPNIFKKFFLIFQTWLVKFIFVEIYVLGLLIGSIIFSTYKFGYFTFAVFFQLLLMVWVGRDLH
RSFKSPSHSNIANFFLIFFYLVWILYPVAWGLSEGGNVIQPDSEAVFYGILDLITFGLMPTILIFFAIKG
CDEEFFSKLWQYHVKSEAESIHENEKAVAETPSTEAGVVDAEVDNEPQAQV

>gi|68484597|ref|XP_713758.1| hypothetical protein CaO19.3664 [Candida albicans SC5314]
MAVASTFIHNNLEVMNRNTATKVNPTNSLVNMHITDHGSDWLWAAFSVFLLLTIIHLLLFLYGNFRKPGV
KNSLLVIPLFTNAVFSVFYFTYASNLGYAWQAVEFQHAGTGLRQIFYAKFVAWFVGWPAVLALFEIVTST
VLDRIEENPNIFKKFFLIFQTWLVKFIFVEIYVLGLLIGSIIFSTYKFGYFTFAVFFQLLLMVWVGRDLH
RSFKSPSHSNIANFFLIFFYLVWILYPVAWGLSEGGNVIQPDSEAVFYGILDLITFGLMPTILIFFAIKG
CDEEFFSKLWQYHVKSEAESIHENEKAVAETPSTEAGVVDAEVDNEPQAQV

>gi|67525807|ref|XP_660965.1| hypothetical protein AN3361.2 [Aspergillus nidulans FGSC A4]
MIEDALKKTVTVTQTLTETVTKAVPSHDPTSSWTTTTSVAPIPTVIPDHPTFQAVDTAAKRTLWVVTVLM
ALSSLVFYILSNRVQLPKRVIHYLVATATTVSFIIYLALATGQGMDWKYDTYNHKHKHVPDTEYGIVRQV
LWLRYVNWFLTGPLILASLTLLSGLPGASLFAAIVADWVMLGTGLFGTYAPNTSRKWIWFALSAIAFITL
IYHIGIKGTRAANNRDSHTRRLFSAIASVALLAKALYPITLAAGPLSLKLGLTGETILFAIHDIVIQGIL
GYWLVIANDAATGTNLYVDGFWSSGLGNEGAIRINEEEGA

>gi|60391839|sp|P69052|BACR1_HALSS Archaerhodopsin 1 precursor (AR 1) (Bacterio-opsin)
MDPIALTAAVGADLLGDGRPETLWLGIGTLLMLIGTFYFIVKGWGVTDKEAREYYSITILVPGIASAAYL
SMFFGIGLTEVQVGSEMLDIYYARYADWLFTTPLLLLDLALLAKVDRVSIGTLVGVDALMIVTGLVGALS
HTPLARYTWWLFSTICMIVVLYFLATSLRAAAKERGPEVASTFNTLTALVLVLWTAYPILWIIGTEGAGV
VGLGIETLLFMVLDVTAKVGFGFILLRSRAILGDTEAPEPSAGAEASAAD

>gi|60391838|sp|P69051|BACR1_HALS1 Archaerhodopsin 1 precursor (AR 1)
MDPIALTAAVGADLLGDGRPETLWLGIGTLLMLIGTFYFIVKGWGVTDKEAREYYSITILVPGIASAAYL
SMFFGIGLTEVQVGSEMLDIYYARYADWLFTTPLLLLDLALLAKVDRVSIGTLVGVDALMIVTGLVGALS
HTPLARYTWWLFSTICMIVVLYFLATSLRAAAKERGPEVASTFNTLTALVLVLWTAYPILWIIGTEGAGV
VGLGIETLLFMVLDVTAKVGFGFILLRSRAILGDTEAPEPSAGAEASAAD

>gi|50292599|ref|XP_448732.1| unnamed protein product [Candida glabrata]
MSYVDLYKRGGNEAVKINPPTGADFHITSRGSDWAWAVFCVMFFCAIVMVLLMFRKTANDRLAYYTAIAP
CVFMGIAYFTIASNLGWIPVRAKYNHVRTSTQQQHPGVRQIFYARYVGWFMALPWPVIQASLMGKTPIWQ
IAFNIAMTEVFTVCFLIAACVHSTYKWGYMTIGCGGAIVAMISVMTTTRHLVRAKKDGELWKGFNIYFGL
VMFFWAIYPICFGITDGGNVLQPDSALIFYGILDIILYAFLPCLWVPIASYIGISNMGYNFSDAEAGMTT
SNTMATVASPAMSPTPKTPKTPKTPATGKKAKKSMA

>gi|50292217|ref|XP_448541.1| unnamed protein product [Candida glabrata]
MVDIFTDVIQNKGGNRAISVNPPHDLDFHITKRGSDWLWAVFACFGLLMVVYIFLFFIAELKGSRITRYA
IAPAFLIAMFEFFGYFTYASNLGWTGVQAEFNHVTVDTPVTGLVPGVRQIFYSKFCAWFLSWPCLLFLIE
LAGIGTTLNPGEEISALDLIHSLLVQMTGTLYWVVALLVGALIHSTYKWGYFTFGAAAMLVVQGIQVRRQ
FFVLKTRGFTACLLILSMLIVWAYFICWGVSEGGNKIQPDSEAIFYGILDLCIFGILPAYLVFITNHYGL
WPSFKLTKSGEQEMYPEKVEDPESVRASGETAI

>gi|50289607|ref|XP_447235.1| unnamed protein product [Candida glabrata]
MSTFVDLYKRGGNEAVKINPPTGADFHITSRGSDWLWAAFCVFLLLAIVFVLLMFRKPVNNRFVYLTAIA
PNVFMAIAYFSIASNLGWIPVRAKYNHVRTSTQQQHPGVRQIFYARYVGWFMALPWPLIQASILGKTPVW
HVAFNCTMGCVFSVCFLIAACVHSTYKWGYFTIGCGAGIVSIISLMTTTYTLIKKCGDKEIKRCFLIYVC
PIIVLYLVAWPVCFGITDGGNVLQPDSEAIFYGIIDLLLLGIFPALYVPMASHVGYENITYGIFDSAIGG
AAPGGMAHSASMDIEKSPMSATSSPTPVSPTPKAGIKKPKLKLKK

>gi|47115564|sp|Q9HPU8|BAT_HALSA Putative bacterio-opsin activator
MTSVQNTESETAAGATTIGVLFAGSDPETGPAACDLDEDGRFDVTQIRDFVAARDRVDDPDIDCVVAVHE
PDGFDGVAFLEAVRQTHAEFPVVVVPTAVDEDVARRAVDADATGLVPAVSEDATAAIADRIEQSAPAHSE
DTETRMPISDLTVESERRLKEQALDEAPIGITISDATDPEEPIIYINDSFEDITGYSPDEVVGANHRFLQ
GPKTNEDRVAEFWTAITEDHDTQVVLRNYRRDGSLFWNQVDISPIYDEDGTVSHYVGFQMDVSERMAAQQ
ELQGERQSLDRLLDRVNGLMNDVTSALVRAADREEIETRITDRIGTGGEYAGAWFGRYDATEDTITVAEA
AGDCEGCDGDVFDLASAGEAVALLQDVVEQREALVSTDADGVSGTADGDACVLVPVTYRSTTYGVLAVST
AEHRIDDREQVLLRSLGRTTGASINDALTRRTIATDTVLNIGVELSDTALFLVELAGATDTTFEQEATIA
DSQTQGVLMLVTTPHDDPQAVVDTALGYDAVQDAEVIVSTDDESVVQFDLSSSPLVDVLSECGSRVIRMH
ADRTTLELDVRVGTEGAARRVLSTLRDKYADVELVAYHEDDPEQTPHGFREELRNDLTDRQLTALQKAYV
SGYFEWPRRAEGKQLAESMDIVPSTYHQHLQAAKQKLVGAFFEE

>gi|32699616|sp|Q9F7P4|PRRG_PRB01 Green-light absorbing proteorhodopsin precursor (GPR)
MKLLLILGSVIALPTFAAGGGDLDASDYTGVSFWLVTAALLASTVFFFVERDRVSAKWKTSLTVSGLVTG
IAFWHYMYMRGVWIETGDSPTVFRYIDWLLTVPLLICEFYLILAAATNVAGSLFKKLLVGSLVMLVFGYM
GEAGIMAAWPAFIIGCLAWVYMIYELWAGEGKSACNTASPAVQSAYNTMMYIIIFGWAIYPVGYFTGYLM
GDGGSALNLNLIYNLADFVNKILFGLIIWNVAVKESSNA

>gi|32699602|sp|Q9AFF7|PRRB_PRB02 Blue-light absorbing proteorhodopsin precursor (BPR)
MGKLLLILGSAIALPSFAAAGGDLDISDTVGVSFWLVTAGMLAATVFFFVERDQVSAKWKTSLTVSGLIT
GIAFWHYLYMRGVWIDTGDTPTVFRYIDWLLTVPLQVVEFYLILAACTSVAASLFKKLLAGSLVMLGAGF
AGEAGLAPVLPAFIIGMAGWLYMIYELYMGEGKAAVSTASPAVNSAYNAMMMIIVVGWAIYPAGYAAGYL
MGGEGVYASNLNLIYNLADFVNKILFGLIIWNVAVKESSNA

>gi|14194476|sp|O93743|BACS_HALSD Sensory rhodopsin (SR)
MTGAVTSAYWLAAVAFLIGVGITAALYAKLEGSRARTRLAALAVIPGFAGLSYVGMALGIGTVTVNGAEL
VGLRYVDWVVTTPLLVGFIGYNAGASRRAIAGVMIADALMIVFGAAAVVSGGTLKWALFGVSALFHVSLF
AYLYVIFPGGIPDDPMQRGLFSLLKNHVGLLWLAYPFVWLMGPAGIGFTGAVGAALTYAFLDVLAKVPYV
YFFYARRQAFIDVTDSRAAAKGDGPAVGGEAPVATGDDAPTAAD

>gi|14194475|sp|O93742|BACH_HALSD Halorhodopsin (HR)
MMETAADALASGTVPLEMTQTQIFEAIQGDTLLASSLWINIALAGLSILLFVYMGRNLEDPRAQLIFVAT
LMVPLVSISSYTGLVSGLTVSFLEMPAGHALAGQEVLTPWGRYLTWALSTPMILVALGLLAGSNATKLFT
AVTADIGMCVTGLAAALTTSSYLLRWVWYVISCAFFVVVLYVLLAEWAEDAEVAGTAEIFNTLKLLTVVL
WLGYPIFWALGAEGLAVLDVAVTSWAYSGMDIVAKYLFAFLLLRWVVDNERTVAGMAAGLGAPLARCAPA
DD

>gi|14194474|sp|O93741|BACH_HALS4 Halorhodopsin (HR)
MRSRTYHDQSVCGPYGSQRTDCDRDTDAGSDTDVHGAQVATQIRTDTLLHSSLWVNIALAGLSILVFLYM
ARTVRANRARLIVGATLMIPLVSLSSYLGLVTGLTAGPIEMPAAHALAGEDVLSQWGRYLTWTLSTPMIL
LALGWLAEVDTADLFVVIAADIGMCLTGLAAALTTSSYAFRWAFYLVSTAFFVVVLYALLAKWPTNAEAA
GTGDIFGTLRWLTVILWLGYPILWALGVEGFALVDSVGLTSWGYSLLDIGAKYLFAALLLRWVANNERTI
AVGQRSGRGAIGDPVED

>gi|14194473|sp|O93740|BACR_HALS4 Bacteriorhodopsin (BR)
MCCAALAPPMAATVGPESIWLWIGTIGMTLGTLYFVGRGRGVRDRKMQEFYIITIFITTIAAAMYFAMAT
GFGVTEVMVGDEALTIYWARYADWLFTTPLLLLDLSLLAGANRNTIATLIGLDVFMIGTGAIAALSSTPG
TRIAWWAISTGALLALLYVLVGTLSENARNRAPEVASLFGRLRNLVIALWFLYPVVWILGTEGTFGILPL
YWETAAFMVLDLSAKVGFGVILLQSRSVLERVATPTAAPT

>gi|6320236|ref|NP_010316.1| Protein that localizes primarily to the plasma membrane, also found at the nuclear envelope; has similarity to Hsp30p and Yro2p, which are induced during heat shock; Mrh1p [Saccharomyces cerevisiae]
MSTFETLIKRGGNEAIKINPPTGADFHITSRGSDWFWTCFCCYLLFGLILTFLMFRKPVNDRFFYLTGIA
PNFFMCIAYFTMASNLGWIPVKAKYNHVQTSTQKEHPGYRQIFYSRFVGWFLALPWPIIQICMLAGTPFW
QMAFNVCITEFFTVCWLIAACVHSTYKWGYYTIGLGAAIVVSISVMTTSYNLVKQRDNDIRLTFLVFFSI
IMFLWIIAYPTCFGITDGGNVLQPDSAGIFYGIIDLILMCFIPTLLVPIANHFGADKLGYHFGPSDAEAV
MAPKAPVASPRPAATPNLSKDKKKKSKKSKKSKKSKKSEE

>gi|6319869|ref|NP_009950.1| Hydrophobic plasma membrane localized, stress-responsive protein that negatively regulates the H(+)-ATPase Pma1p; induced by heat shock, ethanol treatment, weak organic acid, glucose limitation, and entry into stationary phase; Hsp30p [Saccharomyces cerevisiae]
MNDTLSSFLNRNEALGLNPPHGLDMHITKRGSDWLWAVFAVFGFILLCYVVMFFIAENKGSRLTRYALAP
AFLITFFEFFAFFTYASDLGWTGVQAEFNHVKVSKSITGEVPGIRQIFYSKYIAWFLSWPCLLFLIELAA
STTGENDDISALDMVHSLLIQIVGTLFWVVSLLVGSLIKSTYKWGYYTIGAVAMLVTQGVICQRQFFNLK
TRGFNALMLCTCMVIVWLYFICWGLSDGGNRIQPDGEAIFYGVLDLCVFAIYPCYLLIAVSRDGKLPRLS
LTGGFSHHHATDDVEDAAPETKEAVPESPRASGETAIHEPEPEAEQAVEDTA

>gi|6319528|ref|NP_009610.1| Putative plasma membrane protein of unknown function, transcriptionally regulated by Haa1p; green fluorescent protein (GFP)-fusion protein localizes to the cell periphery and bud; Yro2p [Saccharomyces cerevisiae]
MSDYVELLKRGGNEAIKINPPTGADFHITSRGSDWLFTVFCVNLLFGVILVPLMFRKPVKDRFVYYTAIA
PNLFMSIAYFTMASNLGWIPVRAKYNHVQTSTQKEHPGYRQIFYARYVGWFLAFPWPIIQMSLLGGTPLW
QIAFNVGMTEIFTVCWLIAACVHSTYKWGYYTIGIGAAIVVCISLMTTTFNLVKARGKDVSNVFITFMSV
IMFLWLIAYPTCFGITDGGNVLQPDSATIFYGIIDLLILSILPVLFMPLANYLGIERLGLIFDEEPAEHV
GPVAEKKMPSPASFKSSDSDSSIKEKLKLKKKHKKDKKKAKKAKKAKKAKKAQEEEEDVATDSE

Looks like I'm getting some yeast sequences too. I only want the ones from halobacteria. That's done using the ORGN ("organism") field.
>>> proteins = EUtils.search(
...     "bacteriorhodopsin AND halobacteria[ORGN]", "protein")
>>> len(proteins)
224
>>>

I can get the records in GenBank format, which I'll parse using Biopython. Here I'll show the gi and definition fields for each one. Err, since that's too long, how about the first 10?
>>> from Bio import GenBank
>>> for record in GenBank.Iterator(proteins[:10].efetch("genbank"),
...                                GenBank.RecordParser()):
...     print record.gi, record.definition
...
60391839 Archaerhodopsin 1 precursor (AR 1) (Bacterio-opsin).
60391838 Archaerhodopsin 1 precursor (AR 1).
47115564 Putative bacterio-opsin activator.
14194476 Sensory rhodopsin (SR).
14194475 Halorhodopsin (HR).
14194474 Halorhodopsin (HR).
14194473 Bacteriorhodopsin (BR).
3023375 Archaerhodopsin 3 precursor (AR 3).
2829812 Cruxrhodopsin-3 (COP-3) (CR-3).
2829811 Cruxhalorhodopsin-3 precursor (CHR-3).
>>>

(If you fetch everything you'll find that Biopython doesn't handle one of the records, which has a division of "ENV". I've added support for that in the CVS version of Biopython.)
In doing this I noticed one of the records is
1633466 Crystal Structure Of Bacteriorhodopsin In Purple Membrane.

I can get more information about it directly.
>>> dbids = EUtils.DBIds("protein", ["1633466"])
>>> print EUtils.efetch(dbids, "fasta").read()
>gi|1633466|pdb|2BRD| Crystal Structure Of Bacteriorhodopsin In Purple Membrane
XAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVPAIAFTMYLSMLLGYGLTMVP
FGGEQNPIYWARYADWLFTTPLLLLDLALLVDADQGTILALVGADGIMIGTGLVGALTKVYSYRFVWWAI
STAAMLYILYVLFFGFTSKAESMRPEVASTFKVLRNVTVVLWSAYPVVWLIGSEGAGIVPLNIETLLFMV
LDVSAKVGFGLILLRSRAIFGEAEAPEPSADGAAATS
>>>

Okay, enough for now. I still need to talk about links, searches with history, and a bit more.
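
One footnote on the FASTA output: since the e* methods return plain file handles, the text is easy to split into records with a few lines of standard-library Python. This is only a minimal sketch. The read_fasta function is a hypothetical helper, not part of EUtils or Biopython, and the input is a tiny made-up stand-in for a real handle.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy data standing in for the handle an efetch call would return.
toy_handle = io.StringIO(
    ">demo1 first toy record\n"
    "MSAAV\n"
    "STLSD\n"
    "\n"
    ">demo2 second toy record\n"
    "MAVAS\n"
)

records = list(read_fasta(toy_handle))
for header, sequence in records:
    print(header, len(sequence))
```

For real work a maintained parser such as Biopython's is the safer choice, since it also copes with the odd cases this sketch ignores.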
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB