16th April 2021 • Mabel Lee
This article aims to equip coding noobs who have just graduated from CS1010, or who have an equivalent level of coding literacy (i.e. you can write while loops and code mathematical operations). It was born from my own painful experience of being a coding noob, and I hope to help my fellow friends out there start dipping their toes into applying their coding knowledge! (P.S. novices and experts, feel free to share useful resources in the comments section below for those of us in the baby pool!)
First up, know where to find information. There is sooo! much! jargon! that I used a Google Sheet to keep track of it all, because it really was not easy for me to remember and recall every term each time I read a new article.
For learning new concepts, other than Coursera and similar online learning platforms, YouTube and towardsdatascience.com were the coding equivalents of the Bible for me. On Reddit, r/datascience, r/learnmachinelearning, and r/learndatascience are great subreddits where you can get nuggets of information daily. Also, this link was highly recommended by Joel (our AY20/21 President woohoo!!) for learning ML; he says he finds it reliable and practical as it has coding practices.
TowardsDataScience is, for me, the best platform for helping you understand what various concepts are and how useful they are in real life. Furthermore, many articles even have step-by-step code you can follow along with from start to finish. The articles are well-written, jargon is properly explained, and the language tends to be easily digestible, as I think their audience is made up of people of varying skill levels. There is limited free access to articles, but it is such a great resource that I think it's totally worth spending USD 5 a month on it (just drink 2 fewer bubble teas la). You can also hack the system and view some articles in Incognito Mode (not 100% foolproof for me, I still couldn’t access some articles). Huehue, don’t tell Medium I said that.
YouTube is a good place to find crash courses, where you code along with the YouTuber. It is a better resource for introductions to languages, where some YouTubers run through code to help you understand how to write your own. It will teach you the basics, like writing for/while loops, loading libraries, and understanding dictionaries. I used YouTube to supplement my learning when I took CS1010S, and I picked up pandas (a library) on YouTube as well! It really was useful to be able to hear them explain the logic while coding out an algorithm. However, at higher levels, where you need to understand concepts (e.g. Machine Learning), I believe TowardsDataScience explains these ideas better.
Here are some common terms you will come across, with my explanations of what they are:
1) GitHub: It’s like an open platform where people upload their code. It can also be used to display your coding achievements: you upload your code there after your competitions and projects are completed, etc.
2) Libraries:
Just like how a library has many books, a library in coding is a bunch of functions in a package that you can load into your application (e.g. RStudio/IDLE/Jupyter). For example, to draw graphs in R, you load the library named “ggplot2” and then call the functions within it, as in Figure 1.
Figure 1: Example of loading a library, and then calling it
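For a rough Python counterpart to the R example described above, here is a minimal sketch of the same two steps: load a library, then call a function inside it. It uses Python's built-in `statistics` library instead of ggplot2, and the `scores` list is made-up data, purely for illustration.

```python
# Step 1: load (import) the library so its functions become available
import statistics

# Made-up example data, purely for illustration
scores = [70, 85, 90, 65]

# Step 2: call a function that lives inside the library,
# written as library_name.function_name(...)
average = statistics.mean(scores)
print(average)
```

The pattern is the same in most languages: load once at the top, then call as many of the library's functions as you need.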
3) Environments
I like to think of environments as different swimming pools. What you do to the swimming pool at Jurong will not disturb the one at Bedok. Basically, it keeps the 2 environments separate. So, the libraries you download/load into environment A will not be present in environment B.
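To make the swimming pool idea a bit more concrete, here is a small sketch using Python's built-in `venv` library, which is one way (not the only way) to create environments. The folder names `env_a` and `env_b` are made up for this example, and the environments are created in a throwaway temporary folder.

```python
import tempfile
import venv
from pathlib import Path

# A throwaway folder to hold our two "swimming pools"
base = Path(tempfile.mkdtemp())
pool_a = base / "env_a"  # hypothetical name
pool_b = base / "env_b"  # hypothetical name

# with_pip=False keeps this quick; a real environment would usually include pip
venv.create(pool_a, with_pip=False)
venv.create(pool_b, with_pip=False)

# Each environment is its own folder with its own config file,
# so libraries installed into one do not appear in the other
print((pool_a / "pyvenv.cfg").exists())
print((pool_b / "pyvenv.cfg").exists())
```

In day-to-day use you would more likely create environments from the terminal or through a tool like Anaconda, but the separation works the same way.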
4) Train-test split (specifically for Machine Learning):
Not Thomas and trains. When working with datasets, it is important to split the data into 2 sets, a training set and a test set (usually 70%-30%, but it can be more extreme, like 99%-1% for larger datasets). This is so you can define and refine a model on the training dataset, and then make sure it works well on a dataset it has never seen before.
Imagine doing the 10-year series repeatedly: it doesn’t show how good you are until you take your actual examinations. Scoring well on practice papers you have already seen does not guarantee that your results will generalise to the national exam. In the same way, even if your model works well on your training dataset, you only know how good it is when you test it on the test set you set aside.
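Here is a minimal sketch of a 70%-30% split in plain Python, assuming a made-up dataset of 10 rows. The variable names (`data`, `train_set`, `test_set`) and the seed value are just illustrative choices.

```python
import random

# Made-up dataset of 10 rows, purely for illustration
data = list(range(10))

# Shuffle first so the split is random; the seed makes it repeatable
random.seed(42)
random.shuffle(data)

# 70% of the rows go to training, the remaining 30% to testing
split_point = len(data) * 7 // 10
train_set = data[:split_point]
test_set = data[split_point:]

print(len(train_set), len(test_set))
```

The model only ever gets to "study" `train_set`; `test_set` is the exam paper it has never seen. In practice, libraries such as scikit-learn provide a ready-made `train_test_split` function that does this for you.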
5) Structured/Unstructured Data
This just refers to data that is rigid and clearly defined (e.g. tables with numerical values) vs data with no fixed structure (e.g. sound). The table below summarises the differences well. By identifying the nature of the data, you know how to approach it effectively.
Figure 2: Summary of Structured vs Unstructured data
Here are some coding tricks I learnt:
Put ‘?’ or ‘??’ in front of a function to find its documentation (information). In a Jupyter notebook, ‘?’ shows less detail and ‘??’ shows the entire code. In R, you can use ‘?’ to look up the function itself, and ‘??’ to search for keywords that may not be present in the function name (e.g. ??plot). I find it useful for seeing what the inputs of a function should be, but annoyingly, the explanations are very jargon-y, so you might have to use Google to work out what they are referring to.
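If you are in plain Python rather than Jupyter or R, a similar lookup can be sketched with the built-in `inspect` library (the built-in `help()` function is another option). The choice of `statistics.mean` as the function to look up is just for illustration.

```python
import inspect
import statistics

# Similar to '?': fetch the documentation written for the function
doc = inspect.getdoc(statistics.mean)
print(doc)

# Similar to '??': fetch the actual code behind the function
# (this works for functions written in Python, like this one)
source = inspect.getsource(statistics.mean)
print(source)
```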
For RStudio, always set the working directory to wherever you want to retrieve your data from (Session > Set Working Directory). You can set it to wherever your R code file is, but if you have a problem loading your data, just set it to the folder where your data is. Imagine bringing the claw of a claw machine right over the basket your prize is in: it makes retrieval easy.
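The same idea applies in Python. Here is a small sketch, assuming a made-up file called `data.csv` sitting in a throwaway temporary folder (both are created by the example itself, so nothing on your computer is touched).

```python
import os
import tempfile
from pathlib import Path

# A throwaway folder standing in for "the folder where your data is"
data_folder = Path(tempfile.mkdtemp())
(data_folder / "data.csv").write_text("x,y\n1,2\n")  # made-up data file

# Move the "claw" over that folder by changing the working directory
os.chdir(data_folder)

# Now the file can be opened using just its name
contents = Path("data.csv").read_text()
print(contents)
```

Without the `os.chdir` step, Python would look for `data.csv` in whatever folder it was started from, which is the usual cause of "file not found" errors.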
Learn all the keyboard shortcuts (do a quick Google search). This really helped me save so much time because I find it quite annoying to take my hand off the keyboard to use my mouse. For example, in Jupyter notebook (Python), there is a list of keyboard shortcuts for creating new cells, etc.
After all this learning of concepts and playing with code, it is time to put what you have learnt into practice! Perhaps you want to start dipping your toes in by doing some projects, or you want some workbooks to test your knowledge. There are websites like LeetCode or Edabit, where you can work on coding and algorithm questions and have your code “marked”, almost like Coursemology.
Additionally, Kaggle is a very popular coding community, where people post datasets (both real and fake) that you can download and play with. There are also many competitions you can join, including past competitions. Try attempting these competitions; most of the elementary ones are well-documented, and a Google search is enough to find complete workbooks of code that you can check yours against.
Lastly, something I learnt: definitely don’t be shy to ask for help! I was very blessed that the first teammates I worked with were very patient with me, answering my many queries (though they were probs so elementary for them LOL). I also met a lot of humble enthusiasts in NUS Statistics Society, who were genuinely excited not just to share their passion, but also to extend help to anyone in need. If you dare to ask, I'm sure they are more than willing to help.
I hope these tips give you a better direction to head in! Don’t be daunted by the steep learning curve; it is like learning a new language after all. After the teething stage, you will start to appreciate coding and be more comfortable with it! If you are ready to venture into the big boy concepts like SQL or even Natural Language Processing (NLP), you can click here for the past workshops we held, and keep a lookout as we prep more amazing workshops for you next AY!
Editor's Note: There are a lot of links and resources provided in this article, so I've compiled them below, including some that I've used in my limited time doing similar work. Thanks for the article, Mabs!
Subreddits
Articles/Information
Practice Questions/Challenges
Datasets