This is where Python comes in. Python is great for running computations where real-time performance doesn’t really matter. For example, if you’re trying to analyse and process large amounts of data (sounds, images, video or text), interact with data from various sources (e.g. websites), produce graphs, or train machine learning models, it can take a lot less time to get it working if you do it in Python. So with this in mind, we’re going to learn how to program in Python.
Getting started
Python is a high-level programming language that is designed to simplify certain types of programming approaches. Python can do almost anything. Mostly, it interacts with native (C) code under the hood, which vastly simplifies many complex operations, specifically handling large amounts of data.
However, pure Python code is not very fast, and it is a poor fit for real-time, interactive applications.
When should I use Python?
It’s a very productive way to create web applications; it’s one of the easiest ways to begin to understand Data Science and Machine Learning, because it makes managing data much easier; it’s a great way to connect things together to build prototypes without requiring large amounts of programming skill.
What is a Python ‘Environment’?
Because Python has a huge number of libraries and packages, sometimes they conflict with or break each other. We can have different Python environments, each with different versions of software in them. Anaconda (and conda) can take care of this for us. pip installs into whichever Python installation it belongs to, which is not always the environment you think is active, and mixing pip and conda installs in the same environment can break it. The easiest way to get around this is to not use ‘pip’ to install packages until you know what you are doing, and to use ‘conda’ instead.
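If you are ever unsure which environment your code is running in, Python can tell you. This is a minimal sketch using only the standard library; the function name describe_environment is made up for illustration.

```python
import sys


def describe_environment():
    """Return details about the Python interpreter that is running this code."""
    return {
        "executable": sys.executable,        # path to the python binary in use
        "prefix": sys.prefix,                # root folder of the active environment
        "version": sys.version.split()[0],   # e.g. "3.11.4"
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

Run it before and after activating a conda environment and the prefix should change to that environment’s folder.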
What is Python2? Should I care?
Python is annoying because you will still find Python2 code and tutorials around, even though Python2 reached end of life in January 2020 and should no longer be used. Some people find Python2 a bit easier to learn than Python3, but luckily the differences are not actually that hard to get your head round. Some of the most useful objects have been renamed: xrange is now called range (it generates a range of numbers, e.g. range(1, 101) gives 1 to 100).
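As a quick illustration of range in Python3 (a small sketch, with variable names chosen just for this example):

```python
# In Python 3, range is lazy: it describes the numbers without building a list.
numbers = range(1, 101)          # 1 to 100; the end value (101) is excluded

total = sum(numbers)             # add up every number in the range
first_five = list(numbers[:5])   # slice the range, then turn it into a list

# print is a function in Python 3, so the brackets are required.
print(total)
print(first_five)
```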
String concatenation and substitution have changed a bit and there are several styles in circulation. You will see people using different approaches:
x = "awesome"
You can use the .format method and this always works.
e.g. print("{}, it's {}".format("hey", "ok"))
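The different approaches you are likely to come across can be sketched side by side. All four lines below build the same string; the f-string form needs Python 3.6 or newer.

```python
x = "awesome"

concatenated = "Python is " + x            # plain concatenation with +
percent_style = "Python is %s" % x         # old %-style substitution
format_method = "Python is {}".format(x)   # the .format method
f_string = f"Python is {x}"                # f-string (Python 3.6+)

for text in (concatenated, percent_style, format_method, f_string):
    print(text)
```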
Variables don’t need to be declared.
Four different built-in collection types (Python’s equivalents of arrays):
- List is a collection which is ordered and changeable. Allows duplicate members.
- Tuple is a collection which is ordered and unchangeable. Allows duplicate members.
- Set is a collection which is unordered and unindexed. No duplicate members.
- Dictionary is a collection of key-value pairs which is changeable and, since Python 3.7, keeps the order in which items were added. No duplicate keys.
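The four types above can be sketched in a few lines. The variable names and values are invented for illustration.

```python
# List: ordered, changeable, duplicates allowed.
shopping_list = ["eggs", "milk", "eggs"]
shopping_list.append("bread")

# Tuple: ordered, unchangeable.
point = (3, 4)
try:
    point[0] = 10                      # tuples cannot be modified
except TypeError as error:
    tuple_error = str(error)

# Set: no duplicates, no indexing.
unique_items = set(shopping_list)

# Dictionary: key-value pairs, keys must be unique.
prices = {"eggs": 2.5, "milk": 1.2}
prices["bread"] = 1.8                  # add a new key

print(shopping_list, point, unique_items, prices)
```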
Core libraries
matplotlib
Matplotlib is a library for plotting data using an approach similar to that which can be found in the popular research software Matlab. It is designed to allow you to create plots that are publication quality, but in general, it’s just a great tool for seeing what you are doing.
# this is a bit weird and easy to forget.
There are lots of important core plotting features, including bar charts, pie charts, scatter plots etc. Take a look:
https://matplotlib.org/gallery/index.html
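A minimal line plot might look like the sketch below. It assumes you want to save the figure to a file rather than open a window, so it selects the non-interactive "Agg" backend; the data and the filename first_plot.png are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: draw to a file, no display needed
import matplotlib.pyplot as plt  # the conventional import name

x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]  # square each x value

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x squared")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A first plot")
ax.legend()
fig.savefig("first_plot.png")    # hypothetical output filename
```

If you are working in a notebook or want a pop-up window, drop the matplotlib.use line and call plt.show() instead of saving.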
Numpy
Numpy is one of the most powerful and important Python packages. It is excellent for handling multidimensional arrays - e.g. large blocks of data - and has some impressive built-in functions for doing vector processing and linear algebra. In general, if you want to process large blocks of numbers, you should be using Numpy.
Numpy arrays are much more powerful than Python lists. They allow you to create and manipulate arrays of information, such as large blocks of image data, and process them quickly.
Quick intro to Numpy:
import numpy as np
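To give a flavour of what working with Numpy arrays looks like, here is a small sketch with invented data. It builds a 3 by 4 array and then works on the whole block at once, with no loops.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns, values 0 to 11
doubled = grid * 2                   # multiplies every element at once
column_means = grid.mean(axis=0)     # mean of each column

vector = np.array([1, 0, 0, 1])
product = grid @ vector              # matrix-vector product (linear algebra)

print(grid.shape)
print(column_means)
print(product)
```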
pandas
To be honest, the main reason people use pandas is that it can read in Microsoft Excel files and CSV files. This makes it handy for people who naturally use Excel to collect and organise data.
There’s a good tutorial on how to import and use Excel documents in Python here:
https://www.dataquest.io/blog/excel-and-pandas/
And this cheatsheet is pretty great.
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
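Here is a minimal sketch of reading CSV data with pandas. To keep it self-contained, the CSV lives in a string with invented names and scores; with a real file you would pass its path to pd.read_csv, and pd.read_excel works in a similar way for Excel files (it needs an extra package such as openpyxl installed).

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the data is invented for illustration.
csv_text = """name,score
Ada,90
Grace,85
Linus,70
"""

df = pd.read_csv(io.StringIO(csv_text))   # load the CSV into a DataFrame
high_scores = df[df["score"] > 80]        # keep only rows with score above 80
mean_score = df["score"].mean()           # average of the score column

print(df)
print(high_scores)
print(mean_score)
```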
urllib
https://pythonspot.com/urllib-tutorial-python-3/
This is a really essential library for Python that you are going to use a lot. You can do lots of cool things that make scraping data much easier, including specifying your user agent, which basically means pretending to be any browser that you like.
It’s super easy to use urllib to grab a webpage:
import urllib.request
The ‘html’ variable / object in the above example now has all the data from the web page in it. But parsing HTML is not easy to do at all. Wouldn’t it be great if there was some kind of library for parsing HTML easily? That would just be amazing.
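The user agent trick mentioned earlier can be sketched as follows. The helper names build_request and fetch_html, the user agent string and the example.com URL are all assumptions for illustration. The network call itself is left commented out so that the snippet runs offline.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-scraper)"):
    """Create a request that announces itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Download a page and return its contents as text."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


request = build_request("https://example.com/")
print(request.get_header("User-agent"))   # urllib stores the key as "User-agent"

# html = fetch_html("https://example.com/")   # uncomment to fetch the real page
```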
bs4
bs4, or “Beautiful Soup”, is a great HTML parser, and the basis of a very large number of web scraping tools. If you’re building a scraper, you should start with bs4. Here’s an example of a script that grabs some webpage data and iterates through it using bs4.
# Get all the links from reddit world news.
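As a self-contained sketch of the same idea, the snippet below parses a small, made-up HTML string instead of a live page and pulls out every link. With a real page you would pass in the HTML you downloaded instead.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<html><body>
  <h1>World news</h1>
  <a href="https://example.com/story-1">First story</a>
  <a href="https://example.com/story-2">Second story</a>
  <a>No link here</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")   # the parser built into Python
heading = soup.h1.get_text()

# Collect (text, url) pairs for every <a> tag that actually has an href.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]

print(heading)
for text, url in links:
    print(text, url)
```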
bokeh
Bokeh is a great way of creating interactive plots. matplotlib isn’t designed for interactive plot generation - it’s for generating plots for books and academic papers. Bokeh on the other hand makes it super easy to make a plot that you can interact with on a webpage. Like this:
from bokeh.plotting import figure, output_file, show
https://docs.bokeh.org/en/latest/docs/user_guide/quickstart.html#userguide-quickstart
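A minimal interactive line plot might look like the sketch below. The data, labels and the filename lines.html are invented for illustration. It uses save to write the HTML file; swapping in show would also open the page in your browser.

```python
from bokeh.plotting import figure, output_file, save

x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

output_file("lines.html", title="Line example")   # hypothetical output filename

plot = figure(title="Simple line example", x_axis_label="x", y_axis_label="y")
plot.line(x, y, legend_label="Temp.", line_width=2)

saved_path = save(plot)   # writes the HTML page and returns its path
print(saved_path)
```

Open lines.html in a browser and you should be able to pan and zoom the plot.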
gensim
https://radimrehurek.com/gensim/
Gensim is incredibly powerful. It is a general purpose topic modelling and natural language processing library with cutting edge features, including auto-summarisation, sentiment analysis, word vectors, and lots of very useful topic modelling toolkits, such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA - https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation).
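For example, if you want to summarise a document, it’s a single line of code in Gensim versions before 4.0 (the summarization module was removed in 4.0, so this import fails on newer versions).
from gensim.summarization import summarize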
Of all the libraries we’ve just looked at, Gensim has possibly the most meaningful and fun set of tutorials and examples, available on the Gensim site linked above.
About this Post
This post is written by Siqi Shu, licensed under CC BY-NC 4.0.