GSoC谷歌编程之夏: Google带你玩转开源

今年夏天,很荣幸入选GSoC 2019, 和 SymPy 社区一起度过了一个超酷的暑假。刚刚也收到了纪念衬衫,回忆一下这个暑假的经历做个总结,也是推广一下这个活动,希望感兴趣的同学来年报名参加!

全文一共 127497894787915679字,预计阅读需要 N分钟。

I. 什么是 GSoC?

GSoC的全称是 Google Summer of Code, 中文翻译为编程之夏,是由Google组织的每年一度的开源程序设计项目。最早开始于2005年。该项目旨在鼓励在校学生(申请的必要条件之一就是身份为注册的在校学生,学历没有限制,从本科生到博士都可以)参与到开源社区中,在导师的指导下通过完成项目。活动一般由3月初开始申请,5月份出示结果并且正式开始做为期3个月的项目。项目整体分为三个小阶段,每个阶段一个月左右。阶段结束后会由导师评价是否完成阶段任务并继续下一阶段,直至项目完成。

长话短说,我总结了这个项目的几大特点,希望能概括地回答“为什么要参加”这个问题。

  • 可选择的开源社区非常丰富

根据我自己的不完全统计,2019年参与GSoC的社区数已经超过200个,涵盖的领域包括Operating systems(GNU, Fedora, etc),Prgramming Language(Python, R, JAVA, etc.), Science(CERN-HSF, Python, etc. ) 等等各种类型。我所参与的SymPy社区是一个Python的符号计算库。 偷偷地说,我们的目标,是要对标NumPy的哈哈哈。客观地说,相比于国内近几年也兴起的一些编程之夏项目,例如阿里巴巴编程之夏,GSoC的号召力是其暂时难以比拟的。所以只要热爱编程,我相信在GSoC一定能找到你的一亩三分地!

  • 项目的选择有极高的自由度

作为整个项目的主体,如何定义这三个月的编程任务显得格外重要。一方面,各个社区会在报名临近时在自己的论坛、官网、GSoC首页上放出一个项目列表,以供学生选择。另一方面,作为极高自由度的体现,GSoC允许学生自己提出项目并和社区中的前辈一起讨论并定义属于你自己的项目。100%私人定制,用得舒心,做得放心。所以只要你有心,一定能找到适合自己的项目。

  • 项目全程都有导师提供指导

为了避免放养式项目导致的烂尾、拖延,GSoC要求社区为每一个入选的项目分配至少一个导师(mentor),全程为学生提供帮助。导师不仅仅起到在遇到问题时提供建议的作用,一定程度上也会督促同学抓紧完成。同时,和导师的交流能让自己学到很多,不仅仅是写代码的工程能力,还有如何在社区中发现问题、参与讨论、解决问题也是一个学习的过程。另外,社区中卧虎藏龙,有一些导师可能就是行业的大牛。比如今年aimacode社区 (致力于《Artificial Intelligence: A Modern Approach》这本书的代码实现),有一个导师是该书的原作者Peter Norvig。我记得那个项目的申请人数直接爆炸了。不过出于好奇我刚刚确认了一下,Peter Norvig并没有作为mentor出现在今年确定的项目列表中,应该是项目没有入选。如果这是一个 aimacode 的营销的话,那真是十分成功了哈哈哈!

  • 丰富自己的简历

尽管这个项目还很年轻,但是从我在准备期间和几位前辈的交流来看,GSoC还是认可度很高的一个项目,知名度也在随着Google年复一年的推广而增长。撇去项目本身的光环,单单看这样一次参与到开源社区的经历,也是很为自己的简历添彩的。暑假期间,与其窝在家里当一个悠闲的快乐肥宅,不如拿起键盘敲敲代码,既能学到东西又能赚钱,何乐而不为?另外 今年 在项目结束之后,Google给了一个Connect账户,作为一个记录。在申请职位、内推的时候会突出自己的GSoC经历作为加分项。不过我可能太菜了,投的简历到现在还没有回音哈哈哈哈。

  • Google 提供工资

鉴于项目在暑假期间进行,很多同学会由于一定的经济压力不得不打暑假工。为了免去参与项目的学生的后顾之忧,GSoC为每个同学提供了一笔相当可观的工资。值得一提的是,工资的金额是以学生注册所在国家的生活消费水平为基准定义的。比如国内的同学,今年的奖金在3000$(金额是会发生变化的,据说往年更高)。因为我身在法国,基于法国消费水平,工资则为5400$。也算是薅了一把资本主义的羊毛哈哈哈哈。另外,奖金分为三个阶段以30%, 30%, 40%的比例分为三阶段发放,与上面提到的三个阶段对应。所以还是得好好干活,要不到手的钞票也有可能飞走!

II. GSoC 时间线

聊完了为什么要参加,是不是有些心动?那就来看看如何行动。我在下面贴出了2019年的时间线,鉴于每年的开始时间都会变动,仅仅作为一个大概时间阶段的参考。我将在下一章节,按照时间线的顺序,总结一下自己在今年申请过程。

III. 一些体会和建议

这一章节将结合我自己的经历和了解到的消息写成一个自我总结,希望能给来年要申请的同学一些经验和教训

  • 第一步:选好目标

正如时间线体现的,社区会比学生们先行一步,向GSoC报名是否参加该年的项目。注意这是一个双向的选择,一方面社区选择是否参加,另一方面GSoC也会筛选是否收录该社区。所以要选择感兴趣的社区之前最好了解一下社区有没有意向参与GSoC。最直接的方法就是在社区的论坛或者mailing list中询问一下社区是否参加。我建议也可以查阅一下往年参与的情况,如果一个社区连续几年都有参加的话,应该大概率会继续参加而且会被选中。选择社区不应贪多,合适的方向和数量最为重要。

  • 第二步:混个脸熟

尽管正式申请是在3月份左右才开始,但是准备工作应该早早就进行。可以从上一年年底或者该年年初就开始在社区里逛一逛,提一提issue,甚至有pull request被merged那就最好。一方面让社区里的人认识你,让导师有一个初步的印象,另一方面绝大部分社区都有一个申请的必要条件就是为社区做过contribution。这样也能保证你对社区有一个基本的了解。很多社区都会在issues里会为新人创建一个Label, 比如“Easy to fix” “Novice” 之类的,便于新成员通过fix 他们来熟悉开发的环境、流程。

然而我自己是在今年的3月份才了解到GSoC并且决定参加,所以当时确实太过仓促。了解整个项目,选择社区,做出contribution,都是赶鸭子上架,熬了好多夜才赶上了末班车。所以希望明年想要报名的同学越早开始越好。

  • 第三步:确定方向

到了2月底的时候,被选中的社区就会官宣了。同时也会有idea list放出来,即社区自己需提出的一些项目的建议。这时候就需要及时确定下来自己想要做什么方向了。值得注意的是每个同学只能提交最多三个申请,所以还是以准确、适合为主,不宜贪多。

  • 第四步:写好Proposal

Proposal是申请的最为关键的一步!!最为关键!!最为关键!!直接决定了被选中的概率。实际上,社区给的idea也往往只有一个大概的方向,接下来就需要学生通过自己对社区的了解,结合自己的能力,对于如何解决这个问题制定出一个完整的计划。这就是proposal的目的。如何写好一份proposal,网上有很多前辈总结了经验,我就不班门弄斧了。我的建议就是一定要落到实处,关于如何解决问题,空泛地吹牛是很难获得肯定的,而是要周密地分析,提出具体地方法。如果能做出一个小demo是特别加分的。

  • 第五步:多改改Proposal

写完第一稿Proposal可不能就拍拍手万事大吉,最好能发给项目的导师、社区里的伙伴,让他们提出建议、帮忙修改。另外如果在写proposal的时候遇到困难,比如有问题没有理解或者自己无法独立解决,最好是和导师沟通并请求他为你解释或者提供相关材料。我相信在多次修改之后,能看到自己的proposal会越来越具体,越来越可行,这就是被选中的好兆头。

  • 第六步:被选中的孩子们应该如何继续

提交完propsal之后,就是静候佳音啦。今年是5月初宣布的。天知道那天宣布被选中的时候我有多高兴哈哈哈哈哈!其实被选中才是GSoC的过程中最难的一步。我们社区的入选淘汰比也达到了8:1。如果是一些老牌社区,比例可能会更高。虽然之后每个阶段还有评估,但是只要踏踏实实做基本不会有问题。所以过了最难的一关,接下来是跟社区的bonding period。更深地了解社区,更仔细地阅读源码,加深自己对项目的了解,都是在为项目正式开始做准备工作

  • 第七步:每个阶段的评估

项目进行中,和导师的沟通很重要。当时的我遇上了时间安排的问题。GSoC的时间线是偏英美的,而身在法国的我在项目的第一阶段正好是期末的一个月。学业上的压力让我分身乏术。我在申请之前就跟导师有过说明,他表示前一个月可以允许我适当减少工作量,将来放假了补上即可。这让我成功度过了那魔鬼一个月。之后就是记得和导师沟通遇到的问题,以及一些项目内容的调整。当然后来,我又要同时准备托福考试,又一次分身乏术让我在项目的尾声做得不是特别满意,留下了一些遗憾。

  • 第八步:项目收尾

项目的收尾也很重要。如何把自己做的工作提交上去,如果写好report以供后来的人参考。社区和导师都会提供帮助,多多询问多多讨论即可。

到此,整个项目就完成啦!之后会收到一系列的恭喜邮件、最后一笔工资以及纪念衬衫。一个充实的暑假也就过完了。

IV. 总结

整体而言,这个项目给我带来了很多收获,第一次开源社区的参与经历,和导师的交流和讨论让我写代码的能力有所长进,项目期间解决问题完成任务也是很好的学习过程。GSoC在印度的知名度远高于国内,根据今年的情况,有一半的项目都被印度同学拿到了!可以看到印度学生确实在工程能力上表现优秀。希望国内有更多的同学能够了解并且参与到这个项目中,有所收获有所增益!

最后晒一下纪念衬衫哈哈哈!希望感兴趣的同学明年申请顺利!

[GSoC 2019] Week 10!

A lot of modifications have been made to the PR #17308 so that the functionality can be well implemented and the code could be clean and efficient.

The PR is merged to the master branch.

[GSoC 2019] WEEK 9!

I am so happy to pass the second evaluation!

This week is dedicated to the implementation of new iteration functionalities in Array module. Since the behaviour of __getitem__ is changed last week, we need to find a way to replace the old way of iterating arrays.

For this purpose, a generator is implemented to enable the iteration over each element, which was the old way in SymPy to iterate the array. This functionality is equivalent to flattening the array and then visiting one by one these elements, so the new class is named as Flatten. The advantage of generator will contribute to less memeory cose while iterating. This implementation corresponds as well the lazy-evaluation that I planned to implement in my proposal.

So the implementation is in #17308

[GSoC 2019] Week 8!

This is the week of second evaluation.

I have been working on a new issue, which is about modifying the behavior of Array module to follow the example of NumPy. As I discussed with my mentor at the very beginning of the project, the Array module is newly created and has fewer users at present, so it is still possible to standardize the behaviour of this module. Following a mature library such as NumPy will be a good idea because it would not only increase the consistency between Python’s libraries but also increase the efficiency of code in many cases.

The first step is modifying __getitem__ this function. What we had before is to access the value with a absolute index in the flatten array, while the standard in NumPy is that a index that is not complete will be added slice of `:` to extract all available values.

In order to do so, an new PR is opened and merged: #17226

[GSoC 2019] WEEK 7!

This week is also dedicated to the implementation of new format for sparse array. #17160 I have figured out the way to dynamically handle the dimension problem for different formats.

The most difficult part is for Compressed Sparse Row (CSR) format. This format is used by Matlab and Julia as a default format. (In fact the exact foramt is Compressed Sparse Column, which is transposed storage for CSR format.) CSR format used 3 list for storage: a flattened list for all non-zero value, a list of column index for each non-zero value and a list of pointer for first non-zero value in each row. In a two-dimension matrix, the notion of row is straight forward. But how to define a row in a multidimensional array?

In order solve this problem, I have defined the last dimension of each array as a row and flattened the first n-1 dimensions, with n as the rank of the array. I this way, a sparse array can be stored in 3 lists as the way CSR proposes.

This format is said to have a very good performance in item accessing and slicing, as well as vector product. Unfortunately, I haven’t figured out how to implement the __mul__ function to make good use of this format.

By the end of this week, 3 formats are basically implemented: List Of List(LIL), Coordiante list(COO) and Compressed Sparse Row(CSR).

[GSoC 2019] WEEK 6!

Hello everyone, this week I have started some preparation work for new formats of sparse array. The default format for sparse array in SymPy is Dictionary of keys(DOK) which means that the non-zero values are stored in a dictionary with the index as key and non-zero value as value. But this format has some disadvantages, i.e. very slow iterations due to random order of keys. So in order to improve the performance for some Array operations, I started thinking about infroducing new formats to sparse array module.

The format are widely used in matrix but not in array. Staying in tow dimension matrix makes some special format straight forward: Coordinate list (COO) for example, has one list for non-zero values and two more for columns and rows. But how to deal with the case of Array where the dimension(or rank) is variant is one obstacle.

So this week I have been mostly searching for solutions in order to dynamically handle the dimension for each format.

A new PR is opened to firstly implement the algorithm of each format. Link And a basic implementation is done to show the structure of LIL format.

As a finalization of last phase, most of sparse array casting into dense array issues have been resolved.

[GSoC 2019] WEEK 4 and 5!

Hello, the first phase is ended and I am happy to pass the first evaluation. I was struggling with my academic projects and final exams during the last two weeks. After talking about my difficulty of spending time contributing on my project with my mentors, Francesco allowed me to have a one-week break in condition that I should make up one week in the next phases. The goal is to have 40 hours work per week on average by the end of this program.

Thanks to the comprehension of my mentor, I could successfully pass the exams. I am going to work more over the second phase in order to have more contributions to the community. 🙂

[GSoC 2019] WEEK 3!

Hello, the third week ends! I am a little bit too busy to keep contributing on my GSoC project as the first two weeks. Fortunately, thank to the work of the past 2 weeks and the help of my mentors, I become more productive which compensates, more or less, the lack of time.

  • Overview

The task about sparse arrays nearly comes to an end which is the main goal for the first phase. I still have one case to be fixed but most of them are merged in to Sympy’s master and are ready to be used. I started as well preparing for the next task: implementing more data structure for sparse arrays. This would be started in the next week and is supposed to be ended by the begining of phase 2.

  • PRs

Several PR are merged this week for sparse arrays operations. #16937 is finally merged with a code style much simpler than its first version. #16994 about __mul__ and __rmul__ is closed and replaced by #17014 due to a mistake about rebase. #17026 about __div__ and __neg__ operators for sparse arrays is opened and merged. #17035 about permutedims and transpose for sparse arrays is opened and reviewed by my mentor.

  • Issues

No specific issue opened. The list in #16941 is continously being updated.

  • Reviews

No review due to the lack of time.

  • Related works

Pratice of rebasing and merging for a specific case: updated local repository from the upstream master after opening a PR. Even though I have broken the PR #16994, I am still happy to learn some new tricks that I didn’t know before.

Preparing a blog about presenting sparse array associated with my work in SymPy so that my contribution can indeed help people when they need to use this class but don’t really know how to do it and what the advantage would be .

[GSoC 2019] WEEK 2!

Hello everyone! I am glad to present some of my progress during this past week.

  • Overview

I was mainly working on the sparse aray issue over this week. A sparse array is, by a brief definition, an array in which most of the elements are zero. In order to save the memory cost, sparse array is distinguished from the dense array(which means most of elements are non-zero) and then is stored and manipulated in a different way. The default data format in SymPy for sparse array is called Dictionary of keys (DOK), where the non-zero value is stored in a dictionary.

However, the sparse array is cast to dense array in various ways. So what I am trying to do is to free these sparse arrays and nake good use of them for cases like large scale sparse array operations.

  • PRs

The PR about derive_by_array #16937 is not yet merged. Thank to the guidance of my mentor Francesco, the PR is being ameliorated and now has satisfactory result. Besides the operation of derive_by_array, I have updated as well the way Array module perform a equality test and apply function. To show the result of these changes, let’s see what the behavior of derive_by_array was:

>>> a = MutableSparseNDimArray.zeros(10000, 20000)
>>> a[1, 1] = i
>>> a[1, 2] = j
>>> d = derive_by_array(a, i)
MemoryError
>>> d = derive_by_array(a, (i, j))
MemoryError

The sparse array is cast to a list which leads to the MemoryError. Now, we can have:

>>> a = MutableSparseNDimArray.zeros(10000, 20000)
>>> a[1, 1] = i
>>> derive_by_array(a, i) = ImmutableSparseNDimArray({20001:1},(10000, 20000))
True
>>> a[1, 2] = j
>>> derive_by_array(a, (i, j)) == ImmutableSparseNDimArray({20001: 1, 200020002: 1},(2, 10000, 20000))
True

The operation is much more rapid and can save a lot of spaces of memory.

Francesco also helped to improve the quality of codes. For example, to perform the deriative of a sparse array, he suggested me to use a dictionary comprehension instead of a for loop, so an extra function is replaced by one single line of code.

A new PR #16994 about __mul__ and__rmul__ is opened, which solves the problem of casting in these operators. Same as another new PR #17000 for tensorproduct. However, these PRs cannot pass the test for now because the issue about equality test is not yet fixed and merged into master of SymPy.

A new PR #16969 about adding functions and tests for ArrayComprehension is merged, which added functions like .tomatrix(), .tolist() for this class.

  • Issues

I have opened a issue #16984 about equality test of Dict and dict objects. I ran into this case while comparing the dictionaries of sparse array. Fortunately, this issue is recently fixed.

  • Reviews

No reviews for other contributors’ PR. I have participated in the review of my own PRs since there are some errors have been found after the creation of PR.

  • Related works

I have participated in the discussion about the nature of ArrayComprehension in the community this week. Comprehension is a new class that I implemented last week(See last post for more details) This discussion helped me better understand the difference between a multipledimensional list and an array: an array is suppposed to perform vectorized operations while list is mosyly element-wise.