This blog post describes a complementary (or alternative) approach – how to use BigQuery ML to create a (simpler) regression model using a SQL-like syntax. The advantage of this approach is that training is very fast and that it is very easy to use for batch-based predictions, also with SQL-like syntax. The blog assumes that you have a table with data in BigQuery (see the previous blog post for how to do that).
1. Test the query that is the core part of the model – the training data
Most of the data needed is already in the table, but in addition we need to create the label to predict using the LEAD() function. Since BigQuery ML requires non-NULL data (e.g. no NaN values), the label is set to 0 with IFNULL when LEAD() returns NULL.
SELECT
  IFNULL(LEAD(close) OVER (ORDER BY formatted_timestamp), 0) AS label,
  *
FROM
  `bitcoindata.bcdata`
Figure 1 – BigQuery SQL query generating training data
2. Generate the regression model for predicting Bitcoin price
#standardSQL
CREATE MODEL `bitcoindata.bitcoinlinpred`
OPTIONS(model_type='linear_reg') AS
SELECT
  IFNULL(LEAD(close) OVER (ORDER BY formatted_timestamp), 0) AS label,
  *
FROM
  `bitcoindata.bcdata`
3. Evaluate model – get predicted_label as new value
SELECT ABS(predicted_label - label) AS delta FROM ML.PREDICT(MODEL `predicting.bitcoindata.bitcoinlinpred`, (
  SELECT
    IFNULL(LEAD(close) OVER (ORDER BY formatted_timestamp), 0) AS label,
    *
  FROM
    `bitcoindata.bcdata`
))
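The same evaluation query can also be run programmatically. Here is a minimal sketch using the google-cloud-bigquery Python client – the explicit client/project setup is an assumption (the project name is taken from the model path above), not necessarily how it was run in this post:

# Minimal sketch: running the evaluation query from Python
from google.cloud import bigquery

client = bigquery.Client(project="predicting")  # project name assumed from the model path
query = """
SELECT ABS(predicted_label - label) AS delta
FROM ML.PREDICT(MODEL `predicting.bitcoindata.bitcoinlinpred`, (
  SELECT IFNULL(LEAD(close) OVER (ORDER BY formatted_timestamp), 0) AS label, *
  FROM `bitcoindata.bcdata`))
"""
for row in client.query(query).result():
    print(row.delta)  # absolute prediction error per row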
Conclusion
I have shown how to use BigQuery ML regression on a Bitcoin dataset to predict the Bitcoin price. Given how easy this is to use even at large scale (e.g. several hundred billion rows), it can be a good starting point when doing predictions on tabular data.
Applying Artificial Intelligence (AI) frequently requires a surprising amount of (tedious) manual work. Tools for automation can make AI available to more people and help rapidly solve many more important challenges. This blog post tests such a tool – AutoML Tables.
Figure 1 – Bitcoin price prediction – is it going up, down or sideways?
This blog post generates a data set from an API and applies automated AI – AutoML Tables for regression to predict numbers – in this case the Bitcoin closing price for the next hour based on data from the current hour.
1. Introduction
AutoML Tables can be used on tabular data (e.g. from databases or spreadsheets) for either: classification (e.g. classify whether it is a Valyrian steel sword or not – as shown in Figure 2a) or regression (predict a particular number, e.g. the reach of a Scorpion cannon aiming at dragons, as shown in Figure 2b)
Figure 2a – Classification Example – Is Arya’s Needle a Sword of Valyrian Steel?
Figure 2b – Regression Example – reach of Euron’s arrow
2. Choosing and Building Dataset
However, I didn’t find data sets that I could use for Valyrian steel classification or Scorpion arrow reach regression (let me know if such data exists), but instead found a free API to get Bitcoin-related data over time. I assume Bitcoin is completely unrelated to Valyrian steel and the Scorpion (however, I might be wrong about that, given that Valyrian steel furnaces might compete with Bitcoin for energy – perhaps a potential confounding variable to explain a potential relationship between prices of Valyrian swords and Bitcoin?).
Scientific selection of Bitcoin data API: since I am not an expert in cryptocurrency, I just searched for free bitcoin api (or something in that direction) and found/selected cryptocompare.com.
2.1 Python code to fetch and prepare API data
Materials & Methods
I used a colab (colab.research.google.com) to fetch and prepare API data, in combination with the AutoML web UI and a few Google Cloud command-line tools (gsutil and gcloud). I also used BigQuery for storing results, and AutoML stored some evaluation-related output in BigQuery.
Imports and authentication (Google Cloud)
CRYPTOCOMPARE_API_KEY = ""  # get your own on cryptocompare.com
import requests  # for API requests to cryptocompare.com
from datetime import datetime as dt
from google.colab import auth, drive
import json
# storing Bitcoin API data in BigQuery
from google.cloud import bigquery
from google.api_core.exceptions import BadRequest
auth.authenticate_user()  # authentication on Google Cloud
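The actual API fetch can then be done with requests. Below is a sketch of fetching hourly history – the endpoint and parameters are assumptions based on cryptocompare.com’s public hourly-history API, not necessarily exactly what was used here:

# Hypothetical fetch sketch - endpoint/parameters assumed from cryptocompare's
# hourly-history API documentation
url = "https://min-api.cryptocompare.com/data/histohour"
params = {"fsym": "BTC", "tsym": "USD", "limit": 2000,
          "api_key": CRYPTOCOMPARE_API_KEY}
rows = requests.get(url, params=params).json()["Data"]
# write newline-delimited JSON, the format the bq load step below expects
with open("bitcoindata.json", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")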
Create a BigQuery schema based on the API data fetched
Note: bigquery-schema-generator was a nice tool, but I had to change INTEGER to FLOAT in the generated schema, in addition to preparing the data (ref. the Perl one-liner).
!bq mk --table --expiration 3600 --description "This is my table" --label predicting:bitcoindata bitcoindata.bctrainingdata bitcoindata.schema
Load API data into the (new) BigQuery table
!bq load --source_format NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  bitcoindata.bctrainingdata \
  gs://predicting/bitcoindata.json \
  bitcoindata.schema
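As an alternative to the bq CLI, the same load could be done with the google-cloud-bigquery client imported earlier; a sketch reusing the bucket and table names above (the explicit client/project setup and autodetect are assumptions):

# Alternative load sketch with the google-cloud-bigquery client
client = bigquery.Client(project="predicting")  # project name assumed
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.ignore_unknown_values = True
job_config.autodetect = True  # or pass the generated schema explicitly
load_job = client.load_table_from_uri(
    "gs://predicting/bitcoindata.json",
    "predicting.bitcoindata.bctrainingdata",
    job_config=job_config)
load_job.result()  # wait for the load to complete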
Check that the table exists and query it
!bq show predicting:bitcoindata
%%bigquery --project predicting bitcoindata
SELECT * FROM bitcoindata.bctrainingdata
Figure 3 – output from select query towards Bitcoin data in Bigquery
We have input features (x), but not a feature to predict (y)!
A column to predict can be created by time-shifting an existing column, i.e. for time t=0 a particular row needs the t=1 value as its label – the feature we want to predict is the Bitcoin close price the next hour (so not exactly quant/high-frequency trading, but a more soothing once-per-hour experience; if it works out ok it can be automated – for the risk-taking?). The label can be generated either in BigQuery with a SELECT and the LEAD() function, or with a Python Pandas DataFrame shift – both approaches are shown underneath.
%%bigquery --project predicting bitcoindata
SELECT LEAD(close) OVER (ORDER BY formatted_timestamp) AS NEXTCLOSE, * FROM bitcoindata.bctrainingdata ORDER BY formatted_timestamp DESC
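The equivalent with a Pandas DataFrame shift (the second approach mentioned above) is a one-liner – a minimal sketch, assuming the query result was stored in the bitcoindata DataFrame by the %%bigquery cell:

# Pandas alternative: each row gets the next hour's close as its label
# (assumes the bitcoindata DataFrame from the %%bigquery cell above)
bitcoindata = bitcoindata.sort_values("formatted_timestamp")
bitcoindata["NEXTCLOSE"] = bitcoindata["close"].shift(-1)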
Now the data is ready for AutoML. (Note that the BigQuery step could have been avoided in this case, but it could also be used in the other direction, since AutoML can import directly from BigQuery.) Underneath you can see an example of a created dataset in the AutoML Console.
Figure 4 – AutoML Console – with an example data set named bitcoindata
Creating a new dataset
Figure 5 – create new AutoML Tables dataset
Importing data from Google Cloud Bucket
Figure 6 – Import data to AutoML from Google Cloud Storage
Figure 7 – Importing data
Set target variable (NEXTCLOSE) and look at statistics of features
Figure 8 – select target column and data split (train/validation/testing)
Figure 9 – inspect correlation with target variable
Train Model with AutoML
Figure 10 – select budget for resources to generate model
Look at core metrics regarding accuracy
Figure 11 – Metrics from training (MAE, RMSE, R^2, MAPE) and important features
Deploy and Use the AutoML model – Batch Mode
Figure 12 – Batch based prediction
Figure 13 – online prediction
Conclusion
I have shown an example of using AutoML – the main part was about getting data from an API and preparing it for use (section 2), and then using it in AutoML to train a model and look into the evaluation. This model aims to predict the next-hour Bitcoin closing price based on data from the current hour, but it can probably be extended in several ways – how would you extend it?
My impression is that the software engineer profession is slowly transitioning from academic to craft (closer to being e.g. a carpenter or electrician), perhaps due to changes in online learning opportunities, cloud computing and industry needs. What do you think?
This blog post looks into methods and technologies that can potentially lead to the replacement of coders in the future; some are of a futuristic nature, but some are more “low-hanging” wrt automation of (parts of) coding.
Also related to this is (Tesla AI director) Andrej Karpathy’s article Software 2.0, where he looks back at how AI (primarily Deep Learning) has replaced custom methods for e.g. image and speech recognition and machine translation (++), and generalizes how Deep Learning can further replace a lot of other software in the years to come (note: exemplified by Google’s recent paper The Case for Learned Index Structures).
1. FACT: Programming environments (IDEs) have barely changed the last 30 years
One of the primary purposes of programming is to provide efficient automation; however, programming itself is still a highly manual and labour-intensive process – except for refactoring, the difference between modern IDEs and e.g. Turbo Pascal in 1989 is surprisingly small? (Turbo Pascal came out more than 30 years ago and improved gradually towards 1989.)
2. FACT: For (close to) all FUNCTIONS written there exists one or several TESTS for it
For any function already written (or to be written) in any of the most popular languages currently used in programming, there already exists a test for it – in the same language or in a similar language (e.g. a C# test for a Java function). The obvious (big) data source for this is all private and public repositories on Github (100M pull requests merged so far in 2017).
So why are most developers still writing unit tests instead of having an IDE/service find and translate the appropriate tests to the functions they write? (e.g. something along the lines of IntelliTest)
3. FACT: as in 2 – For (close to) all TESTS there exists a (set of) FUNCTIONS they test
Assume e.g. Test Driven Development (TDD) – where you roughly write the (test of the new) API first – and then alternate between writing (just enough) code to fulfill the API.
This seems to have two potential ways of being increasingly automated on the function-writing part:
Search for a set of functions that matches a set of tests – instead of writing the functions, just write the tests
Automatically generate the code (fragments) to fulfill the test; this can potentially be done in many ways, e.g.
Brute force with a lot of computers (e.g. a few thousand CPUs in the cloud should be more than capable of quickly generating and selecting the best of the maybe 30-50 increments needed per test-writing iteration; this resource could be shared by a large set of programmers). See also the science of brute force.
4. FACT: (Many?) Data Structures can now be replaced by AI (e.g. Bloom Filters & B-Trees)
Google – with Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean and Neoklis Polyzotis – published an interesting paper, The Case for Learned Index Structures, where they showed that traditional and popular data structures such as the B-Tree, Hash Index and Bloom Filter can with advantage be replaced by AI-trained (learned) index structures. This is, from my perspective, pretty groundbreaking research, and it will be interesting to see what the use cases can be towards program synthesis (e.g. for other categories of data structures, and also for logic operating on data structures). Key result figures from their paper:
5. FACT: Formal Methods are starting to work and can be used to support automation of code
A few years back I participated in a research project – Envisage (Engineering Virtualized Services) – where formal methods were used to prove that the TimSort algorithm was buggy in a lot of languages (see the Hacker News story below). Formal methods can potentially be used together with code-generating methods to ensure that what is generated is correct.
Conclusion
I have presented various technologies that might play a role in automating coding going forward. A potentially interesting direction on the generative side for coding is to use sequence-to-sequence deep learning in combination with GANs for synthesis, see e.g. Text2Action: Generative Adversarial Synthesis from Language to Action – for program synthesis this looks like a promising direction.
Long overdue update of new publications in Deep Learning Publication Navigator (ai.amundtveit.com) – for now the easiest way to discover new publications is probably to compare the number of papers per category in the before and after update screenshots below.
Examples of keywords (from publication titles) with (several) new Deep Learning publications are:
This blog post shows a basic example of a Serverless Thrift API with Python for AWS Lambda and AWS API Gateway.
1. Serverless Computing for Thrift APIs?
Serverless computing – also called Cloud Functions or Functions as a Service (FaaS) – is an interesting type of cloud service due to its simplicity. An interpretation of serverless computing is that you (with relatively low effort):
Deploy only the function needed to do the work
Only pay per request to the function
With the notable exception of other cloud resources used, e.g. storage
Get some security setup automation/support (e.g. SSL and API keys)
Get support for request throttling (e.g. QPS) and quotas (e.g. per month)
Get (reasonably) low latency – when the serverless function is kept warm
Get support for easily setting up caching
Get support for setting up custom domain name
Lower direct (cloud costs) and indirect (management) costs?
These are characteristics that in my mind make Serverless computing an interesting infrastructure to develop and deploy Thrift APIs (or other types of APIs) for.
Perhaps over time even Serverless will be preferred over (more complex) container (Kubernetes/Docker) or virtual machine based (IaaS) or PaaS solutions for APIs?
2. Example Cloud Vendors providing Serverless Services
Since Python is a key language in my team, for this initial test I chose the AWS option, also since I am most familiar with AWS and the open source tooling for AWS was best wrt Python (runner-up was Microsoft Azure Functions).
3. Thrift (over HTTPS) on AWS Lambda and API Gateway with Python
This shows an example of the (classic) Apache Thrift tutorial Calculator API running on AWS Lambda and API Gateway; the service requires 2 thrift files: tutorial.thrift and shared.thrift (from the Thrift tutorial).
The tool used for deployment in this blog post is Zappa. I recommend using Zappa together with Docker for Python 3.6 as described in this blog post, with a slight change of the Dockerfile if you want to build and compile the Apache Thrift Python library yourself – here is the altered Dockerfile. There have been no official releases of Apache Thrift since 0.10.0 (January 6th, 2017), and there have been important improvements related to its Python support since the last release – in particular the fix for supporting recursive thrift structs in Python.
a. Dockerfile – creates a Zappashell (same as the Lambda runtime) and builds Thrift
# build this with command:
# docker build -t myzappa .
FROM lambci/lambda:build-python3.6
WORKDIR /var/task
# Fancy prompt to remind you are in zappashell
RUN echo 'export PS1="\[\e[36m\]zappashell>\[\e[m\] "' >> /root/.bashrc
# Build Apache thrift Python library
RUN yum clean all && \
    yum -y install emacs boost* gcc
RUN git clone https://github.com/apache/thrift.git && \
    cd thrift && \
    ./bootstrap.sh && \
    ./configure && \
    make && make install && \
    cd lib/py && python setup.py install && \
    python setup.py sdist  # Builds a thriftSomeVersion.tar.gz
CMD ["bash"]
After building this Dockerfile (see the command at the top of the file), add zappashell to your .bash_profile like this (source: the above-mentioned blog post):
alias zappashell='docker run -ti -e AWS_PROFILE=zappa -v $(pwd):/var/task -v ~/.aws/:/root/.aws --rm myzappa'
alias zappashell >> ~/.bash_profile
You can then start your serverless deployment environment with the command zappashell (inside a new, empty directory on your host platform, e.g. a Mac), which gives something like this:
username@MyMac$ mkdir my_thrift_app
username@MyMac$ cd my_thrift_app
username@MyMac$ zappashell
zappashell> pwd
/var/task
zappashell> ls
zappashell>
Install virtualenv and create/activate an environment (assuming you installed thrift as shown in the Dockerfile above).
Copy the generated thrift library – note: thrift itself, not the tutorial code (ref. the thriftSomeVersion.tar.gz generated by python setup.py sdist in the Dockerfile) – to the same directory and add it to requirements.txt.
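The service itself can be a small WSGI app that dispatches Thrift-over-HTTP calls to the tutorial Calculator handler. Below is a minimal sketch – it assumes Flask and the thrift-generated tutorial module (from thrift -gen py tutorial.thrift) are in the deployment package, and is not necessarily identical to the app deployed in this post:

# Hypothetical sketch of a Thrift-over-HTTP service for Zappa/Lambda
from flask import Flask, Response, request
from thrift.protocol import TBinaryProtocol
from thrift.transport import TTransport
from tutorial import Calculator  # generated by: thrift -gen py tutorial.thrift

class CalculatorHandler:
    def ping(self):
        print('ping()')
    def add(self, n1, n2):
        return n1 + n2

app = Flask(__name__)
processor = Calculator.Processor(CalculatorHandler())

@app.route('/', methods=['POST'])
def thrift_endpoint():
    # the POST body is a serialized thrift call; process it and return the reply
    itrans = TTransport.TMemoryBuffer(request.get_data())
    otrans = TTransport.TMemoryBuffer()
    processor.process(TBinaryProtocol.TBinaryProtocol(itrans),
                      TBinaryProtocol.TBinaryProtocol(otrans))
    return Response(otrans.getvalue(), content_type='application/x-thrift')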
But wait, something is missing: this API is reachable by anyone. Let us add an API key (and update the client with the x-api-key). This can be done through the AWS Console (and perhaps with Zappa itself through automation soon?) with the following steps:
Go to Amazon API Gateway Console and click on the generated API (perhaps named task-beta due to the Docker file path and the selected stage during zappa init)
Create a Usage Plan and associate it with the API (e.g. task-beta), then create an API Key (on the left side menu) and attach the API Key to the Usage Plan
Do a zappa update dev, and uncomment/update the transport.setCustomHeaders call with the x-api-key in the Python client above to get authentication and throttling in place.
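For reference, a client along those lines might look like the following sketch – the endpoint URL and API key are placeholders, and the setCustomHeaders call mirrors the step described above:

# Hypothetical Thrift client sketch; URL and key are placeholders
from thrift.transport import THttpClient
from thrift.protocol import TBinaryProtocol
from tutorial import Calculator

transport = THttpClient.THttpClient('https://<api-id>.execute-api.<region>.amazonaws.com/dev')
transport.setCustomHeaders({'x-api-key': '<YOUR_API_KEY>'})  # API Gateway key
client = Calculator.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()
print(client.add(1, 2))  # expect 3 from the Calculator service
transport.close()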
4. Conclusion
I have shown an example of getting a Thrift API running on serverless infrastructure; the setup can relatively easily be automated, and once the API is initially created it takes very little effort to update it (e.g. through continuous deployment).
A final note on round-trip time performance: based on a few rough tests, the round-trip time for calls to the API is around 300-400 milliseconds (with the test client based in Trondheim, Norway, accessing API Gateway and AWS Lambda in Germany), which is quite good. I believe that with an AWS Route53 routing policy one could have automatic selection of the closest AWS API Gateway/Lambda to get the lowest latency (note that one of the selections in zappa init was to deploy globally, but the default was one availability zone).
I personally believe that Serverless computing has a strong future ahead wrt API development, and I look forward to the new features cloud vendors’ software engineers/product managers will add; my wish list is:
Strong Python support
Built-in Thrift support and service discovery, as well as support for other RPC systems, e.g. gRPC, MessagePack, ++
Improved software tooling for automation (e.g. simplified SSL/domain name setup/maintenance handling – get deep partnerships with letsencrypt.com for SSL certificates?)
Zedge summer interns developed a very cool app using ARKit and CoreML (on iOS 11). As part of their journey they published 2 blog posts on the Zedge corporate web site related to:
How to develop and run Generative Adversarial Networks (GAN) for Creative AI on the iPhone using Apple’s CoreML tools – check out their blog post about it.
Deep Learning models (e.g. for GANs) can take a lot of space on a mobile device (tens of megabytes to perhaps even gigabytes); in order to keep the initial app download size relatively low, it can be useful to dynamically load only the models you need. Check out their blog post about various approaches for hotswapping CoreML models.
This blog post contains my notes from project 3 in term 1 of the Udacity Nanodegree in Self-Driving Cars. The project is about developing and training a convolutional neural network on camera input (3 different camera angles) from a simulated car.
Added normalization in the model itself (ref. Lambda(lambda x: x/255.0 - 0.5, input_shape=img_input_shape)), since it is likely to be faster than doing it in pure Python.
Added Max Pooling after the first convolution layer, i.e. making the model a more “traditional” conv.net wrt being capable of detecting low-level features such as edges (similar to classic networks such as LeNet).
Added Batch Normalization in early layers to be more robust wrt different learning rates
Used he_normal initialization (truncated normal distribution), since this type of initialization has mattered a lot in my earlier TensorFlow experiments
Made the model (much) smaller by reducing the fully connected layers (I had problems running the larger model on a 1070 card, but in retrospect it was not the model size but my misunderstandings of Keras 2 that caused this trouble)
Used selu (ref: the paper “Self-Normalizing Neural Networks”, https://arxiv.org/abs/1706.02515) instead of relu as the rectifier function in later (fully connected) layers, since previous experience (with traffic sign classification and TensorFlow) showed that selu gave faster convergence rates (though not a better final result).
Used dropout in later layers to avoid overfitting
Used l1 regularization on the final layer, since I’ve seen that it is good for regression problems (better than l2) – a combined sketch of these elements is shown below
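To make the list above concrete, here is a minimal Keras 2 sketch combining these elements – the layer sizes and cropping values are illustrative assumptions, not the exact project architecture:

# Illustrative sketch only - combines the listed elements, not the exact model
from keras.models import Sequential
from keras.layers import (Lambda, Cropping2D, Conv2D, MaxPooling2D,
                          BatchNormalization, Flatten, Dense, Dropout)
from keras.regularizers import l1

img_input_shape = (160, 320, 3)  # simulator camera frames

model = Sequential()
model.add(Lambda(lambda x: x / 255.0 - 0.5, input_shape=img_input_shape))  # in-model normalization
model.add(Cropping2D(cropping=((70, 25), (0, 0))))  # crop sky and car hood (values assumed)
model.add(Conv2D(24, (5, 5), activation='relu', kernel_initializer='he_normal'))
model.add(MaxPooling2D())  # LeNet-style pooling after the first convolution
model.add(BatchNormalization())  # robustness wrt different learning rates
model.add(Conv2D(36, (5, 5), activation='relu', kernel_initializer='he_normal'))
model.add(MaxPooling2D())
model.add(Flatten())
model.add(Dense(50, activation='selu'))  # selu in the fully connected layers
model.add(Dropout(0.5))  # dropout against overfitting
model.add(Dense(1, kernel_regularizer=l1(1e-5)))  # l1-regularized steering output
model.compile(optimizer='adam', loss='mse')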
The model was tested by running it through the simulator and ensuring that the vehicle could stay on the track. See modelthatworked.mp4 file in this github repository.
#### 3. Model parameter tuning
The model used an adam optimizer, so the learning rate was not tuned manually
#### 4. Appropriate training data
Used the training data that was provided as part of the project, and in addition added two runs of data to avoid problems (e.g. a curve without a lane line on the right side – until the bridge started – and also a separate training set driving on the bridge). Data is available at https://amundtveit.com/DATA0.tgz.
### Model Architecture and Training Strategy
#### 1. Solution Design Approach
The overall strategy for deriving a model architecture was to use a conv.net. I first tried the one I previously used for traffic sign detection (based on LeNet), but it didn’t work (probably because the input images were too big), and then started from the Nvidia model (see above for details about my changes to it).
In order to gauge how well the model was working, I split my image and steering angle data into a training and a validation set. The primary finding was that the numerical performance of the models I tried was not a good predictor of how well they would perform on actual driving in the simulator. Perhaps overfitting could be good for this task (i.e. memorizing the track), but I attempted to get a correctly trained model without overfitting (ref. dropout/selu and batch normalization). There were many failed runs before the car could actually drive around the first track.
#### 2. Creation of the Training Set & Training Process
I re-drove and captured training data for the sections that were problematic (as mentioned: the curve without lane lines on the right, and the bridge and the part just before the bridge). Regarding center-driving, I didn’t get much success from adding data for that, but perhaps my rebalancing (ref. the generator below) actually was counter-productive?
For each example line in the training data I generated 6 variants (for data augmentation): the images from the 3 different cameras (left, center and right) with adjustments to the steering angle, plus a version of each flipped along the center vertical axis.
After the collection process, I had 10485 lines in driving_log.csv, i.e. number of data points = 62910 (6*10485). Preprocessing was used to flip images, convert images to numpy arrays and also (as part of the Keras model) to scale values. Cropping of the image was also done as part of the model. I finally randomly shuffled the data set and put 20% of the data into a validation set, see the generator for details. Examples of images (before the in-model cropping) are shown below:
Example of center camera image
Example of flipped center camera image
Example of left camera image
Example of right camera image
generator
# dealing with unbalanced data with class_weight in Keras
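A sketch of such a generator, along the lines described above (6 augmented variants per driving_log.csv line, shuffled batches), could look like the following – the steering correction value and column handling are assumptions, not necessarily the exact code used:

# Illustrative generator sketch; CORRECTION and csv column layout are assumptions
import numpy as np
import cv2
import sklearn.utils

CORRECTION = 0.2  # steering adjustment for the left/right cameras

def generator(samples, batch_size=32):
    while True:  # loop forever; Keras' fit_generator pulls batches as needed
        samples = sklearn.utils.shuffle(samples)
        for offset in range(0, len(samples), batch_size):
            images, angles = [], []
            for line in samples[offset:offset + batch_size]:
                angle = float(line[3])  # steering angle column in driving_log.csv
                # columns 0-2: center, left, right camera image paths
                for col, corr in ((0, 0.0), (1, CORRECTION), (2, -CORRECTION)):
                    img = cv2.cvtColor(cv2.imread(line[col].strip()), cv2.COLOR_BGR2RGB)
                    images.append(img)
                    angles.append(angle + corr)
                    # flipped variant along the center vertical axis -> 6 variants per line
                    images.append(cv2.flip(img, 1))
                    angles.append(-(angle + corr))
            yield sklearn.utils.shuffle(np.array(images), np.array(angles))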
I used this training data for training the model. The validation set helped determine if the model was over- or underfitting. The ideal number of epochs was 5, as evidenced by the quick flattening of loss and validation loss (to around 0.03); in earlier runs the validation loss increased above the training loss when running more epochs. I used an adam optimizer, so manually tuning the learning rate wasn’t necessary.
Epoch 1/5
178s - loss: 0.2916 - val_loss: 0.0603
Epoch 2/5
180s - loss: 0.0530 - val_loss: 0.0463
Epoch 3/5
181s - loss: 0.0398 - val_loss: 0.0330
Epoch 4/5
179s - loss: 0.0309 - val_loss: 0.0326
Epoch 5/5
178s - loss: 0.0302 - val_loss: 0.0312
3. Challenges
Challenges along the way: I found this to be a very hard task, since model loss and validation loss weren’t good predictors of actual driving performance; I also had cases where adding more nice driving data (at the center and far from the edges) actually gave worse results and made the car drive off the road. Other challenges were Keras 2 related: the differing semantics of parameters between Keras 1 and Keras 2 fooled me a bit (ref. steps_per_epoch). I also had issues with the progress bar not working with Keras 2 in Jupyter notebook, so I had to use the 3rd-party library https://pypi.python.org/pypi/keras-tqdm/2.0.1
This blog post has recent publications about use of Deep Learning in Energy Production context (wind, gas and oil), e.g. wind power prediction, turbine risk assessment, reservoir discovery and price forecasting.
This blog post presents a (basic) approach for how to potentially use OpenCV for lane finding for self-driving cars (i.e. finding the yellow and white stripes along the road) – I did this as one of the projects in term 1 of Udacity’s self-driving car nanodegree (a highly recommended online education!).
Disclaimer: the approach presented in this blog post is way too simple to use for an actual self-driving car, but it was a good way (for me) to learn more about (non-deep-learning-based) computer vision and the lane finding problem.
8. Did a hough (transform) image creation. I also modified the draw_lines function (see GitHub link above) by calculating the average slope and b value (i.e. fitting y = ax + b for all the hough lines to find a and b, and then averaging over them).
(side note: I believe it could perhaps have been smarter to use hough line center points instead of hough lines, since their directions sometimes seem a bit unstable, and then use the average of the derivatives between center points instead)
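A sketch of that averaging idea from the modified draw_lines – the function name and the y-range parameters are illustrative assumptions:

# Illustrative sketch: average slope/intercept over Hough segments, then draw
# one extrapolated lane line between y_bottom and y_top
import numpy as np
import cv2

def draw_average_line(img, lines, y_bottom, y_top, color=(255, 0, 0), thickness=10):
    slopes, intercepts = [], []
    for x1, y1, x2, y2 in (seg for line in lines for seg in line):
        if x1 == x2:
            continue  # skip vertical segments (undefined slope)
        a = (y2 - y1) / (x2 - x1)
        slopes.append(a)
        intercepts.append(y1 - a * x1)
    a, b = np.mean(slopes), np.mean(intercepts)
    # invert y = a*x + b to get x at the bottom/top of the region of interest
    cv2.line(img, (int((y_bottom - b) / a), y_bottom),
                  (int((y_top - b) / a), y_top), color, thickness)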
Python
rho = 1  # distance resolution in pixels of the Hough grid
theta = np.pi / 180  # angular resolution in radians of the Hough grid
threshold = 1  # minimum number of votes (intersections in a Hough grid cell)
min_line_len = 3  # minimum number of pixels making up a line
max_line_gap = 1  # maximum gap in pixels between connectable line segments
line_image = np.copy(image) * 0  # creating a blank to draw lines on
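For context, here is a sketch of how these parameters feed into OpenCV’s probabilistic Hough transform; it continues the block above, and masked_edges (the Canny output after region masking) is an assumed variable from the earlier pipeline steps:

# Sketch: detect line segments on the masked Canny edges with the parameters above
lines = cv2.HoughLinesP(masked_edges, rho, theta, threshold, np.array([]),
                        minLineLength=min_line_len, maxLineGap=max_line_gap)
for line in lines:
    for x1, y1, x2, y2 in line:
        cv2.line(line_image, (x1, y1), (x2, y2), (255, 0, 0), 10)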