蓄水池抽样

int32位 posted @ Mar 22, 2016 04:03:09 PM in algorithm , 2376 阅读
转载请注明:http://krystism.is-programmer.com/若有错误,请多多指正,谢谢!

什么是蓄水池抽样,它能解决什么问题?

从一次面试说起

百度面试以算法为主啊,手动写代码。第一道题是实现c语言库函数strcpy,这个原理很简单,但要注意以下这几点:

  • 空指针检查(包括src和dest)
  • 内存重叠,要检查指针是否重叠
  • 最后拷贝时,别忘了在dest追加字符串终结符号0
  • 如何保证dest已分配足够内存
  • 为什么从后向前拷贝?

第二道题是写一个类,实现堆的操作。说实话,虽然堆的操作不难,但要真正实现它并不容易。以下需要注意的点:

  • 泛型参数(泛化编程)
  • 类型必需可比较或者提供比较器
  • 不能实例泛型类型,即不能new T[]或者new T(),只能使用Object数组
  • 内存动态分配,内存拷贝(数组拷贝)

重点是第三道题目:给定一个文件,从中随机取出n行,并计算时间复杂度。我上来便是这样的代码:

import random
def sample(filename, n = 3):
    assert(filename is not None)
    result = []
    with open(filename) as f:
        lines = f.readlines()
    while len(result) < n :
        line = random.choice(lines)
        if line not in result:
            result.append(line.strip())
    return result

这种方法

  • 必须把文件内容全读入内存,如果文件很大怎么办?
  • 设N为文件行数,n和N接近时,越到后面,冲突越大,效率极低。
  • 有人说,先随机生成n个数构成一个集合A(还是有冲突哦),读取文件,当行数属于A时,取出该行,问题不就解决了? 问题是文件总行数是多少?必须知道文件总行数才能知道随机数的取值范围啊。必须先扫描一遍文件?文件很大时这样的效率如何?

后来面试官问我知不知道分治法,我说了解,但这有关系么?文件分割建索引?全错!!!

蓄水池抽样

蓄水池抽样(Reservoir Sampling )是一个很有趣的问题,它能够在o(n)时间内对n个数据进行等概率随机抽取, 例如:从1000个数据中等概率随机抽取出100个。另外,如果数据集合的量特别大或者还在增长(相当于未知数据集合总量), 该算法依然可以等概率抽样。

以上摘自handspeaker博客

算法步骤为:

  1. 从文件中取前n行,结果集result
  2. 令i表示当前行数,c为当前行内容,m = random(1, i),即m为1~i的一个随机数。
  3. 若m <= n, 则令result[m] = c,即替换第m行内容,否则舍弃c。
  4. 若已到文件结尾,退出算法,返回result,否则转2

利用归纳法可以证明每行被取的概率是相等的,证明过程 见handspeaker博客

回到问题

问题解决了,毫无悬念!

import random
def sample(filename, n = 5):
    assert(filename is not None)
    result = []
    with open(filename, mode = "r") as f:
        for i in range(n):
            line = f.readline()
            if line == '':
                return result
            else:
                result.append(line.strip())
        i = n
        for line in f:
            k = random.randint(0, i)
            if k < n:
                result[k] = line.strip()
            i += 1
    return result

 

转载请注明:http://krystism.is-programmer.com/若有错误,请多多指正,谢谢!
  • 无匹配
  • 无匹配
Haryana Board Model 说:
2022年9月03日 18:01

Haryana Board Model Paper 2023 Class 4 Pdf Download with Answers for English Medium, Hindi Medium, Urdu Medium & Students for Small Answers, Long Answer, Very Long Answer Questions, and Essay Type Questions to Term1 & Term2 Exams at official website. Haryana Board Model Paper Class 4 New Exam Scheme or Question Pattern for Sammittive Assignment Exams (SA1 & SA2): Very Long Answer (VLA), Long Answer (LA), Small Answer (SA), Very Small Answer (VSA), Single Answer, Multiple Choice and etc.

SAAD 说:
2022年12月27日 18:19

Great post I would like to thank you for the efforts you have made in writing this interesting and knowledgeable article. buy shrooms online

Bushra 说:
2022年12月29日 14:03

very interesting post.this is my first time visit here.i found so mmany interesting stuff in your blog especially its discussion..thanks for the post! wildlife removal Newmarket

SAAD 说:
2022年12月30日 18:10 Wonderful article, thanks for putting this together! This is obviously one great post. Thanks for the valuable information and insights you have so provided here. Meri Saas Bhoot Hai
SAAD 说:
2023年1月06日 19:09

Wow what a Great Information about World Day its very nice informative post. thanks for the post. kelowna movers

milan 说:
2023年1月10日 22:15

Reservoir sampling is a technique for randomly selecting a subset of items from a larger population. It is often used when it is impractical to sample the entire population, or when best online engagement rings store the population is too large to store in memory. To perform reservoir sampling, we first select an item at random from the population and add it to the reservoir. Then, for each subsequent item in the population, we select it with probability proportional to its size. This ensures that each item in the population has an equal chance of being selected.


登录 *


loading captcha image...
(输入验证码)
or Ctrl+Enter