August / Sept Goals + Dailies

Goals

As a recap – July goals were:

✅ Finish clam chapter intro + discussion, send to committee
✅ Get gene lists for all bivalve species
Get consensus tree (or ✅ get UCEs mapped for ~~at least 2~~ ALL species)

August / September goals are then to:

Get consensus tree
Make gene trees to run Robinson-Foulds and identify genes of interest

Which should then put me in a good place for mapping genes of interest onto the tree and having results (or at the very least, solid tentative results) by start of the fall quarter, keeping me right on schedule with the plan I presented at my committee meeting :D

Dailies

Aug 1

Previewed some of my outputs to see if they looked as expected. The 2 bait files I was using each had 10,000 sequences that are ultraconserved throughout Mollusca (bait v1) and Bivalvia (bait v2), yet some of the species had very few hits, and some (zebra mussel) had 0 total hits. Not sure why this is happening, so I spent some time reading previous issues on the phyluce GitHub page and troubleshooting to see why some UCEs weren't found in some genomes. Another issue was that the samples were losing sequences when converting from lastz to fasta (see below), so I spent some time troubleshooting that as well. Luke and I are going to meet to talk about it, but we're both out of the office for the next ~two weeks so it'll have to wait.

Aug 19

Back in town! Met with Luke this morning to review outputs and troubleshoot together. It seems that the reduction from lastz to fasta sequences is expected as the conversion only keeps the UCE sequence (DB) that the genome sequence (query) had the strongest match to.
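To make the "only the strongest match is kept" idea concrete, here is a minimal sketch of best-hit filtering. This is not phyluce's actual code: the `Hit` record, its field names, and the toy scores are all made up for illustration. It only shows why the number of fasta records ends up smaller than the number of lastz rows.

```python
from collections import namedtuple

# Hypothetical record for one lastz row (field names are assumptions).
Hit = namedtuple("Hit", ["query", "uce", "score"])


def keep_best_hit_per_query(hits):
    """Return {query: hit}, keeping only the highest-scoring hit per query."""
    best = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.score > current.score:
            best[hit.query] = hit
    return best


# Toy data: one genome region matches two UCEs, another matches one.
hits = [
    Hit("scaffold_1:100-220", "uce-12", 95.0),
    Hit("scaffold_1:100-220", "uce-907", 81.5),
    Hit("scaffold_7:40-160", "uce-33", 88.0),
]

best = keep_best_hit_per_query(hits)
print(len(hits), "lastz rows ->", len(best), "fasta records")
```

With the toy data, three lastz rows collapse to two records, which is the kind of reduction I was seeing.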

The issue with some of the samples not having any hits seems to come from phyluce's default significance or identity match specification. There was an omitted argument in the phyluce alignment (it was omitted in the tutorial I was following), but I was able to find it in a different phyluce tutorial and reran the alignment with --identity 50 and --identity 80 with much more success. The average number of UCE matches for 50 identity was ~860 (v1) and ~740 (v2), and for 80 identity was ~500 (v1) and ~550 (v2). (v1 is the bait for Mollusca more generally, and v2 is the bait for Bivalvia more specifically.) This is on par with the ~850 average UCE matches found by Li et al., 2024 (who provided the bait sequences used).

Oh and here’s the repo. Phyluce stuff is in data/phyluce/

Aug 21

Now that I had my UCEs successfully harvested, I had to switch to a different phyluce tutorial. Eventually I want to have folders with fastas of UCEs that map to 50%, 75%, and 95% of the taxa, and matrices in the proper format representing these so that they can be input into IQ-TREE for the species consensus tree I want to build. The species consensus tree will then be what I use to run Robinson-Foulds distance for the (repro) gene trees to identify reproductive genes of interest.

the general workflow for all this is then:

harvest UCEs (identifying which UCEs are present in each of the 17 bivalves) (Tutorial III) –>
create two data matrices – one with complete representation (UCEs in all samples) and one with incomplete (UCEs in at least 3 samples). treat the UCE outputs from tutorial III as contigs (daily use phyluce tutorial) –>
use these matrices to then extract, align, and trim UCEs of interest and generate fasta files for them (tutorial I, starting with the aligning UCE loci)

so today I took the outputs from tutorial III and started working on the data matrices to process UCEs.

Aug 22

finished with the daily use tutorial and got UCE fastas for all the UCEs found within my bivalves. next step is to continue onto tutorial I and align and trim them to prep for IQ-TREE.

Aug 25

raven down…womp womp

Started reviewing some of the comments on my proposal draft from like 9 months ago lol. Realized a common theme was that I needed more sources and a bit more explanation regarding 1) how gametogenesis in bivalves works, and 2) why it matters. So I started some lit review for that, mostly reading chapters 1, 2, 4, and 6 in “Reproduction in Aquatic Animals” (Yoshida M, Asturiano JF (eds) (2020) Reproduction in aquatic animals, 2020th ed. Springer, Singapore, Singapore.), which covered spermatogenesis, oogenesis, fertilization, etc.

Aug 26

Raven was still down in the morning so I continued with some of my lit review, looking at parental effects on offspring (specifically looking for paternal impacts since the manila clam stuff gave a decent dive into maternal influence).

Then started work on tutorial I

Aug 27

Continued and concluded work on tutorial I. Aligned and trimmed the UCE loci using MAFFT alignment and internal trimming (due to potential evolutionary distance of some of the bivalve species) and cleaned alignments. Then was able to generate data folders of 50%, 75%, and 95% representation (each folder contained the fasta files of UCEs present for that % of taxa, stating the UCE and corresponding taxa sequence) and matrices that concatenated these files. Then converted the concatenated files into PHYLIP format for use in IQ-TREE.
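For future me, the 50% / 75% / 95% representation idea boils down to a simple filter on how many taxa each locus was recovered from. Below is a minimal sketch, not phyluce's implementation: the function name, the toy locus names, and the round-up rule for the minimum taxon count are all assumptions (phyluce may round differently).

```python
import math


def loci_with_min_taxa(locus_to_taxa, n_taxa, fraction):
    """Return sorted loci found in at least `fraction` of the `n_taxa` taxa.

    Assumption: the minimum count is rounded up, so 95% of 4 taxa means 4.
    """
    min_taxa = math.ceil(fraction * n_taxa)
    return sorted(
        locus for locus, taxa in locus_to_taxa.items() if len(taxa) >= min_taxa
    )


# Toy data: 4 taxa (A-D) and which of them each UCE locus was found in.
locus_to_taxa = {
    "uce-1": {"A", "B", "C", "D"},
    "uce-2": {"A", "B", "C"},
    "uce-3": {"A", "B"},
    "uce-4": {"A"},
}

for fraction in (0.50, 0.75, 0.95):
    kept = loci_with_min_taxa(locus_to_taxa, n_taxa=4, fraction=fraction)
    print(f"{int(fraction * 100)}% matrix: {len(kept)} loci")
```

Stricter thresholds keep fewer loci but leave less missing data in the concatenated matrix, which is the trade-off between the three folders.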

Aug 28

Created the IQ-TREE trees and visualized with FigTree. Noticed an error fairly upstream in the UCE/phyluce code so went back and edited that and reran all downstream analyses. Wrote up methods to date for whole consensus tree process. Will meet with Luke after long weekend to review.

Sept 2

Reviewed consensus tree outputs with Luke. There is still some disagreement among certain relationships between trees. Need to rerun the trees and add outgroups this time to help resolve.

Sept 3

Added outgroups and it seemed to not resolve some of the issues with disagreement among trees. Changing to just 1 outgroup to more firmly root the tree. (The process of rerunning takes about 6 hours, so this is pretty much all I did today and yesterday.)

Sept 4

Single outgroup trees still disagreeing, but with aid from the literature, could find support for one of the trees (50p coverage of v1 baits). See methods for more details.

Going out of town again for the next two weeks with limited access to wifi or cell service.

Sept 10

Created a list of all non-duplicate SPIDs found across all of the genomes and then used a loop to go through and pull the associated sequence ID for each taxon from its blast output, and then extract its sequence for that gene and add it to a fasta file. Resulted in ~3200 unique SPIDs, and I now have a fasta file for each that I can use to create gene trees.
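The logic of that loop can be sketched roughly as below. This is a simplified stand-in, not my actual script: the dictionary layouts, function names, and toy IDs and sequences are all hypothetical, and real blast outputs and genome fastas would need to be parsed into these structures first.

```python
from collections import defaultdict


def group_sequences_by_spid(blast_hits, sequences):
    """Collect one (taxon, sequence) record per taxon for every SPID.

    blast_hits: {taxon: {spid: sequence_id}}   (assumed layout)
    sequences:  {taxon: {sequence_id: sequence}}   (assumed layout)
    """
    by_spid = defaultdict(list)
    for taxon in sorted(blast_hits):
        for spid, seq_id in blast_hits[taxon].items():
            by_spid[spid].append((taxon, sequences[taxon][seq_id]))
    return dict(by_spid)


def to_fasta(records):
    """Format (taxon, sequence) records as fasta text, one entry per taxon."""
    return "".join(f">{taxon}\n{seq}\n" for taxon, seq in records)


# Toy data: two taxa, two SPIDs, one of which is only found in one taxon.
blast_hits = {
    "clam": {"SPID_1": "clam_g1", "SPID_2": "clam_g9"},
    "mussel": {"SPID_1": "mussel_g4"},
}
sequences = {
    "clam": {"clam_g1": "ATGAAA", "clam_g9": "ATGCCC"},
    "mussel": {"mussel_g4": "ATGAAG"},
}

by_spid = group_sequences_by_spid(blast_hits, sequences)
for spid, records in sorted(by_spid.items()):
    print(spid, "->", len(records), "taxa")
    print(to_fasta(records), end="")
```

Keeping the number of taxa per SPID around (here just `len(records)`) is handy later for spotting genes with too few taxa to build a tree.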

Tonight is also the first night of the GRFP workshop that I’m mentoring for

Sept 17

still “out of office” but attended GRFP workshop virtually

Sept 22

First day back on campus (though classes haven’t started yet). Did the CPR training since mine expires this month and it’s required for TAing.

When I initially ran IQ-TREE for the consensus trees, I did it locally just to make sure it worked before going through the process of adding it to Raven and doing all the mamba stuff again. Today I went in and added it to Raven and updated the code accordingly. Finally, I went in and checked the outputs from my gene treefile loops. All looking good.

Sept 23

Doing some final edits on the consensus trees (I want to make sure everything is good in that department before I run the RF distance loop). Since I settled on using one outgroup, I needed to go in and adjust the retrieve genomes file to only retrieve the one outgroup (instead of the initial 3) and then adjust the taxa parameters to account for using a single outgroup. (ICYW: need to just use 1 outgroup as that is all IQ-TREE will accept. Use of multiple outgroups will result in unresolved nodes in the tree.) Used ggtree to visualize the tree (not as pretty as FigTree, but able to do it in a more reproducible way, and visuals will come with time I’m sure).
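As a reminder of what the RF distance actually measures: it counts the splits (bipartitions of the taxa) that appear in one tree but not the other. Here is a minimal sketch on toy trees written as nested tuples. This is not the loop I will actually run; the tree encoding and function names are assumptions, and it requires both trees to have exactly the same taxa (so a gene tree with missing taxa would need the consensus tree pruned to match first).

```python
def leaf_set(tree):
    """All leaf names under a nested-tuple tree (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits, each stored as the side without a reference leaf."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        # Normalize so the same split is stored the same way in every tree.
        side = all_leaves - leaves if reference in leaves else leaves
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return leaves

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


species_tree = ((("A", "B"), "C"), ("D", "E"))
gene_tree = ((("A", "C"), "B"), ("D", "E"))

print(rf_distance(species_tree, species_tree))  # identical topologies
print(rf_distance(species_tree, gene_tree))     # B and C swapped
```

Gene trees with a large distance from the consensus tree are the ones worth a closer look as candidate genes of interest.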

Sept 24

First TA meeting for 311 & first day of class

First day of class for the natural resource policy course I’m taking, and had a few chapters of reading due before so most of today was spent in class or reading.

GRFP workshop

Sept 25-29

Out of town for a family commitment, last excursion for a while thank goodness. Really only had time to keep up with homework while out.

Sept 30

The treefiles created last week look good, but there was a difference in the number of output files vs the number of input ‘entries’ – 30 genes unaccounted for. IQ-TREE will not generate a tree if there are fewer than 3 species represented (which is self explanatory as the tree would just look like [ ). But I want to double check nonetheless. So I used the big SPID list I created a couple weeks ago (which I had the foresight to give a ‘count’ column when I created it) to determine how many genes had fewer than 3 taxa representatives. Sure enough, 30. To check if these were the same 30, I saved their names to a dataframe, then used a loop to identify which directories in the genefiles output subdir had no .treefile, and saved those to a dataframe. Then compared using all_equal() and they were! Gene trees lookin good to proceed to next step! Just want to do one final check on consensus first….
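The same sanity check can be written as a set comparison. Below is a minimal sketch rather than the code I ran: the function names, gene names, and directory layout (one subdirectory per gene, with a file ending in `.treefile` when a tree was built) are assumptions, and the toy output folder is built in a temporary directory just so the example runs on its own.

```python
import tempfile
from pathlib import Path


def genes_below_min_taxa(taxa_counts, min_taxa=3):
    """Genes whose taxon count is too low for a tree to be built."""
    return {gene for gene, count in taxa_counts.items() if count < min_taxa}


def genes_without_treefile(genefiles_dir):
    """Gene subdirectories that contain no file ending in .treefile."""
    return {
        subdir.name
        for subdir in Path(genefiles_dir).iterdir()
        if subdir.is_dir() and not any(subdir.glob("*.treefile"))
    }


# Toy 'count' column: gene name -> number of taxa with that gene.
taxa_counts = {"gene_a": 17, "gene_b": 2, "gene_c": 5, "gene_d": 1}

with tempfile.TemporaryDirectory() as tmp:
    # Mimic the output folder: a treefile only exists when count >= 3.
    for gene, count in taxa_counts.items():
        gene_dir = Path(tmp) / gene
        gene_dir.mkdir()
        (gene_dir / f"{gene}.fasta").write_text(">placeholder\nACGT\n")
        if count >= 3:
            (gene_dir / f"{gene}.fasta.treefile").write_text("(toy);\n")

    too_few = genes_below_min_taxa(taxa_counts)
    no_tree = genes_without_treefile(tmp)

print("same genes:", too_few == no_tree)
```

Comparing sets rather than just counts matters here: matching totals alone would not rule out a gene with enough taxa silently failing to produce a tree.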