January 22, 2021

Coding13 - Getting Started Using Python

This is where Python comes in. Python is great for running computations where real-time performance doesn’t really matter. For example, if you’re trying to analyse and process large amounts of data (sounds, images, video or text), interact with data from various sources (e.g. websites), produce graphs, or train machine learning models, it can take a lot less time to write if you do it in Python. So with this in mind, we’re going to learn how to program in Python.

Getting started

Python is a high-level programming language designed to simplify certain kinds of programming. Python can do almost anything; for the heavy lifting it mostly calls native (C) code under the hood, which means it vastly simplifies many complex operations, especially handling large amounts of data.

However, pure Python code is not very fast, so it is a poor fit for real-time, interactive applications.

When should I use Python?

It’s a popular way to create web applications; it’s one of the easiest ways to begin to understand Data Science and Machine Learning, because it makes managing data much easier; and it’s a great way to connect things together to build prototypes without requiring large amounts of programming skill.

What is a Python ‘Environment’?

Because Python has a huge number of libraries and packages, sometimes they conflict with or break each other. We can have different Python environments, each with different versions of software in them. Anaconda (and conda) can take care of this for us. pip installs into whichever Python installation it belongs to, so if you run it without the right environment activated, the package can end up somewhere you didn’t intend, and mixing pip and conda packages in one environment can cause conflicts. The easiest way to get around this is to not use ‘pip’ to install packages until you know what you are doing, and to use ‘conda’ instead.
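A quick way to check which environment you are actually in is to ask Python itself. This is only a minimal sketch using the standard library; the helper name describe_environment is made up for illustration.

```python
import sys


def describe_environment():
    """Return where the running interpreter lives (hypothetical helper)."""
    return {
        "executable": sys.executable,        # path to the python program in use
        "prefix": sys.prefix,                # root folder of the active environment
        "version": sys.version.split()[0],   # e.g. "3.11.4"
    }


for name, value in describe_environment().items():
    print(name, "->", value)
```

If the prefix is not the folder of the environment you meant to use, it is worth activating the right one before installing anything.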

What is Python2? Should I care?

Python is annoying because Python2 is arguably a bit easier to learn than Python3, but Python2 reached end of life in January 2020 and should no longer be used. Luckily the differences are not actually that hard to get your head round. Some of the most useful objects have been renamed: xrange is now called range (it generates a range of numbers, e.g. range(1, 101) gives 1 to 100).
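Here is a small sketch of how range behaves in Python3. The variable names are just for illustration.

```python
# range(start, stop) counts from start up to, but not including, stop.
numbers = range(1, 101)          # the whole numbers 1 to 100

print(numbers[0], numbers[-1])   # first and last value
print(len(numbers))              # how many values there are

# A range is lazy: it only produces values when asked.
# Wrap it in list() if you want to see them all at once.
first_five = list(range(1, 6))
print(first_five)
```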

String concatenation and substitution have changed a bit and are not really standardized. You will see people using different approaches:

x = "awesome"
print("Python is " + x) # very similar to JS

You can use the .format method and this always works.

e.g. print("{}, it's {}".format("hey", "ok"))
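To compare the approaches side by side, here is a short sketch. The third one, the f-string, is not mentioned above; it is added here as another style you are likely to run into, and it needs Python 3.6 or later.

```python
x = "awesome"

# 1. Concatenation with + (note the space inside the first string)
a = "Python is " + x

# 2. The .format method: each {} is filled in order
b = "Python is {}".format(x)

# 3. An f-string (Python 3.6 and later): the variable goes inside the braces
c = f"Python is {x}"

print(a)
print(b)
print(c)
```

All three lines print the same sentence, so which one you use is mostly a matter of taste and of what the code around you already does.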

Variables don’t need to be declared.
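In practice this means you just assign a value to a name, and the same name can hold a different kind of value later on. A tiny sketch:

```python
count = 10                     # no type or declaration needed; this is an int
print(type(count).__name__)

count = "ten"                  # the same name can later hold a string
print(type(count).__name__)
```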

Four different types of arrays:
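The four types are not spelled out here, so the sketch below assumes they are Python’s four built-in collection types: list, tuple, set and dict. That is a guess, and all the values are made up for illustration.

```python
# list: ordered, changeable, allows duplicates
colours = ["red", "green", "blue"]
colours.append("red")

# tuple: ordered, cannot be changed after it is created
point = (3, 4)

# set: unordered, no duplicates
unique_colours = set(colours)

# dict: maps keys to values
ages = {"ada": 36, "alan": 41}
ages["grace"] = 85

print(colours)
print(point)
print(len(unique_colours))
print(ages["grace"])
```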

Core libraries

matplotlib

Matplotlib is a library for plotting data using an approach similar to that which can be found in the popular research software Matlab. It is designed to allow you to create plots that are publication quality, but in general, it’s just a great tool for seeing what you are doing.

# this is a bit weird and easy to forget.
# here we are importing the pyplot functions as plt.
# we are also importing math so we can do some trig.

import matplotlib.pyplot as plt
import math

x = range(100)
y = []

for value in x:
    y.append(math.sin(value * 0.1))

plt.plot(y)
plt.show()

There are lots of important core plotting features, including bar charts, pie charts, scatter plots etc. Take a look:

https://matplotlib.org/gallery/index.html
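As a rough sketch of two of those plot types, the example below draws a scatter plot and a bar chart side by side. It assumes you want to save the figure rather than open a window, so it picks the off-screen "Agg" backend; the bar values are made up.

```python
import io
import math

import matplotlib
matplotlib.use("Agg")  # draw off-screen so this also runs without a display
import matplotlib.pyplot as plt

x = [value * 0.1 for value in range(100)]
y = [math.sin(value) for value in x]

# one figure with two plots side by side
fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

left.scatter(x, y, s=5)                 # scatter plot of the sine values
left.set_title("scatter")

right.bar(["a", "b", "c"], [3, 7, 5])   # bar chart of three made-up values
right.set_title("bar")

# save the figure as PNG bytes in memory (pass a filename to write a file)
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)
```

If you are working interactively, you can leave out the matplotlib.use line and call plt.show() instead of saving.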

Numpy

https://numpy.org/

Numpy is one of the most powerful and important Python packages. It is excellent for handling multidimensional arrays - e.g. large blocks of data - and has some impressive built-in functions for doing vector processing and linear algebra. In general, if you want to process large blocks of numbers, you should be using Numpy.

Numpy arrays are much more powerful than Python lists. They allow you to create and manipulate arrays of information, such as large blocks of image data, and process them quickly.
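To make the difference from a list concrete, here is a minimal sketch that doubles every number, first with a plain list and then with a Numpy array. The values are made up.

```python
import numpy as np

values = [1, 2, 3, 4]

# With a plain list you loop over the elements yourself.
doubled_list = [v * 2 for v in values]

# With a Numpy array the operation applies to every element at once.
array = np.array(values)
doubled_array = array * 2

# Arrays also come with built-in maths, e.g. a dot product.
dot = np.dot(array, array)   # 1*1 + 2*2 + 3*3 + 4*4

print(doubled_list)
print(doubled_array)
print(dot)
```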

Quick intro to Numpy:

import numpy as np

# creates a 1D array of 100 zeros.
i = np.zeros([100])

# creates a 3D array of zeros with 5 * 5 * 5 elements
x = np.zeros([5, 5, 5])

# [2, 2]*3 is the list [2, 2, 2, 2, 2, 2], so this creates
# a 6D array of zeros with 2 elements along each axis
y = np.zeros([2, 2]*3)

print("the shape of this array is ", np.shape(i))
print(i)

print("the shape of this array is ", np.shape(x))
print(x)

print("the shape of this array is ", np.shape(y))
print(y)

# the numbers 0 to 99, arranged as 2 blocks of 5 rows and 10 columns
z = np.arange(100).reshape(2, 5, 10)
print(z)

pandas

To be honest, the main reason many people use pandas is that it can read in Microsoft Excel files and CSV files. This makes it handy for people who naturally use Excel to collect and organise data.

There’s a good tutorial on how to import and use Excel documents in Python here:

https://www.dataquest.io/blog/excel-and-pandas/

And this cheatsheet is pretty great.

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
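As a small sketch of the CSV reading mentioned above: the example below reads a few rows with pandas. To keep it self-contained it reads from a string instead of a real file, and the data is made up.

```python
import io

import pandas as pd

# Stand-in for a .csv file on disk (made-up data).
csv_text = """name,score
ada,90
alan,85
grace,95
"""

# read_csv accepts a filename or any file-like object.
table = pd.read_csv(io.StringIO(csv_text))

print(table)
print(table["score"].mean())
```

With a real file you would pass its path to pd.read_csv instead of the StringIO object.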

urllib

https://pythonspot.com/urllib-tutorial-python-3/

This is a really essential library for Python that you are going to use a lot. You can do lots of cool things that make scraping data much easier, including specifying your user agent, which basically means pretending to be any browser that you like.
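Setting the user agent can be sketched like this. The helper name build_request and the user agent string are made up for illustration, and the actual download is left as a comment so the example runs without a network connection.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a request that announces itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical user agent string; any text can go here.
request = build_request("https://www.arts.ac.uk", "my-scraper/0.1")
print(request.get_header("User-agent"))

# To actually fetch the page (this needs a network connection):
# html = urllib.request.urlopen(request).read()
```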

It’s super easy to use urllib to grab a webpage:

import urllib.request

html = urllib.request.urlopen('https://www.arts.ac.uk').read()
print(html)

The ‘html’ variable / object in the above example now holds all the data from the web page (as raw bytes). But parsing HTML is not easy to do at all. Wouldn’t it be great if there was some kind of library for parsing HTML easily? That would just be amazing.

bs4

bs4, or “Beautiful Soup”, is a great HTML parser, and the basis of a very large number of web scraping tools. If you’re building a scraper, you should start with bs4. Here’s an example of a script that grabs some webpage data and iterates through it using bs4.

# Get all the links from reddit world news.
# Can you spider those links?

from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen('http://www.reddit.com/r/worldnews').read()
soup = BeautifulSoup(html, 'html.parser')

# just get all the links. Links are 'a' (as in <a href = "">)

for link in soup.find_all('a'):
    print(link.get('href'))
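If you want to try bs4 without hitting a live website, here is a minimal sketch that parses a tiny made-up page held in a string. It names Python’s built-in "html.parser" explicitly, which is an assumption about which parser you want to use.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, so this runs without a network connection.
html = """
<html><body>
  <a href="https://example.com/one">One</a>
  <a href="/two">Two</a>
  <a>No link here</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# collect the href of every <a> tag, skipping tags that have none
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
print(links)
```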

bokeh

Bokeh is a great way of creating interactive plots. matplotlib is mainly aimed at static plots, such as those for books and academic papers. Bokeh, on the other hand, makes it super easy to make a plot that you can interact with on a webpage. Like this:

from bokeh.plotting import figure, output_file, show

# prepare some data
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

# output to static HTML file
output_file("lines.html")

# create a new plot with a title and axis labels
p = figure(title = "simple line example", x_axis_label = 'x', y_axis_label= 'y')

# add a line renderer with legend and line thickness
p.line(x, y, legend_label="Temp.", line_width=2)

# show the results
show(p)

https://docs.bokeh.org/en/latest/docs/user_guide/quickstart.html#userguide-quickstart

gensim

https://radimrehurek.com/gensim/

Gensim is incredibly powerful. It is a general-purpose topic modelling and natural language processing library with cutting-edge features, including auto-summarisation, sentiment analysis, word vectors, and lots of very useful topic modelling toolkits, such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA - https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation).

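For example, if you want to summarise a document, it’s a single line of code (this uses the gensim.summarization module, which was removed in Gensim 4.0, so it needs an earlier release).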

from gensim.summarization import summarize

# 'text' is the document you want to summarise, as one long string
my_summary = summarize(text, word_count=150)
print(my_summary)

Of all the libraries we’ve just looked at, Gensim has possibly the most meaningful and fun set of tutorials and examples, available here:

https://radimrehurek.com/gensim/auto_examples/index.html

About this Post

This post is written by Siqi Shu, licensed under CC BY-NC 4.0.