蓄水池抽样

int32位 posted @ Mar 22, 2016 04:03:09 PM in algorithm , 2478 阅读
转载请注明:http://krystism.is-programmer.com/若有错误,请多多指正,谢谢!

什么是蓄水池抽样,它能解决什么问题?

从一次面试说起

百度面试以算法为主啊,手动写代码。第一道题是实现c语言库函数strcpy,这个原理很简单,但要注意以下这几点:

  • 空指针检查(包括src和dest)
  • 内存重叠,要检查指针是否重叠
  • 最后拷贝时,别忘了在dest追加字符串终结符号0
  • 如何保证dest已分配足够内存
  • 为什么从后向前拷贝?

第二道题是写一个类,实现堆的操作。说实话,虽然堆的操作不难,但要真正实现它并不容易。以下需要注意的点:

  • 泛型参数(泛化编程)
  • 类型必需可比较或者提供比较器
  • 不能实例泛型类型,即不能new T[]或者new T(),只能使用Object数组
  • 内存动态分配,内存拷贝(数组拷贝)

重点是第三道题目:给定一个文件,从中随机取出n行,并计算时间复杂度。我上来便是这样的代码:

import random
def sample(filename, n = 3):
    assert(filename is not None)
    result = []
    with open(filename) as f:
        lines = f.readlines()
    while len(result) < n :
        line = random.choice(lines)
        if line not in result:
            result.append(line.strip())
    return result

这种方法

  • 必须把文件内容全读入内存,如果文件很大怎么办?
  • 设N为文件行数,n和N接近时,越到后面,冲突越大,效率极低。
  • 有人说,先随机生成n个数构成一个集合A(还是有冲突哦),读取文件,当行数属于A时,取出该行,问题不就解决了? 问题是文件总行数是多少?必须知道文件总行数才能知道随机数的取值范围啊。必须先扫描一遍文件?文件很大时这样的效率如何?

后来面试官问我知不知道分治法,我说了解,但这有关系么?文件分割建索引?全错!!!

蓄水池抽样

蓄水池抽样(Reservoir Sampling )是一个很有趣的问题,它能够在o(n)时间内对n个数据进行等概率随机抽取, 例如:从1000个数据中等概率随机抽取出100个。另外,如果数据集合的量特别大或者还在增长(相当于未知数据集合总量), 该算法依然可以等概率抽样。

以上摘自handspeaker博客

算法步骤为:

  1. 从文件中取前n行,结果集result
  2. 令i表示当前行数,c为当前行内容,m = random(1, i),即m为1~i的一个随机数。
  3. 若m <= n, 则令result[m] = c,即替换第m行内容,否则舍弃c。
  4. 若已到文件结尾,退出算法,返回result,否则转2

利用归纳法可以证明每行被取的概率是相等的,证明过程 见handspeaker博客

回到问题

问题解决了,毫无悬念!

import random
def sample(filename, n = 5):
    assert(filename is not None)
    result = []
    with open(filename, mode = "r") as f:
        for i in range(n):
            line = f.readline()
            if line == '':
                return result
            else:
                result.append(line.strip())
        i = n
        for line in f:
            k = random.randint(0, i)
            if k < n:
                result[k] = line.strip()
            i += 1
    return result

 

转载请注明:http://krystism.is-programmer.com/若有错误,请多多指正,谢谢!
  • 无匹配
  • 无匹配
Haryana Board Model 说:
2022年9月03日 18:01

Haryana Board Model Paper 2023 Class 4 Pdf Download with Answers for English Medium, Hindi Medium, Urdu Medium & Students for Small Answers, Long Answer, Very Long Answer Questions, and Essay Type Questions to Term1 & Term2 Exams at official website. Haryana Board Model Paper Class 4 New Exam Scheme or Question Pattern for Sammittive Assignment Exams (SA1 & SA2): Very Long Answer (VLA), Long Answer (LA), Small Answer (SA), Very Small Answer (VSA), Single Answer, Multiple Choice and etc.

SAAD 说:
2022年12月27日 18:19

Great post I would like to thank you for the efforts you have made in writing this interesting and knowledgeable article. buy shrooms online

Bushra 说:
2022年12月29日 14:03

very interesting post.this is my first time visit here.i found so mmany interesting stuff in your blog especially its discussion..thanks for the post! wildlife removal Newmarket

SAAD 说:
2022年12月30日 18:10 Wonderful article, thanks for putting this together! This is obviously one great post. Thanks for the valuable information and insights you have so provided here. Meri Saas Bhoot Hai
SAAD 说:
2023年1月06日 19:09

Wow what a Great Information about World Day its very nice informative post. thanks for the post. kelowna movers

milan 说:
2023年1月10日 22:15

Reservoir sampling is a technique for randomly selecting a subset of items from a larger population. It is often used when it is impractical to sample the entire population, or when best online engagement rings store the population is too large to store in memory. To perform reservoir sampling, we first select an item at random from the population and add it to the reservoir. Then, for each subsequent item in the population, we select it with probability proportional to its size. This ensures that each item in the population has an equal chance of being selected.

Bushra sheikh 说:
2023年7月28日 14:01 thank you for a great post. Learn more
Bushra sheikh 说:
2023年7月28日 19:33

Great post I would like to thank you for the efforts you have made in writing this interesting and knowledgeable article. Pressure cleaning Sunshine Coast

Bushra sheikh 说:
2023年7月28日 22:09

Thanks for your information, it was really very helpfull.. Toowoomba Concreting Solutions QLD

Bushra batool 说:
2023年8月12日 16:10

Thanks for the post and great tips..even I also think that hard work is the most important aspect of getting success.. Arif Bhalwani

Bushra batool 说:
2023年8月12日 21:11

thank you for a great post. challenger freight

Bushra batool 说:
2023年8月13日 00:44

Thanks for the post and great tips..even I also think that hard work is the most important aspect of getting success.. IRS Tax Attorney

Bushra batool 说:
2023年8月13日 04:23

I really appreciate this wonderful post that you have provided for us. I assure this would be beneficial for most of the people. party rentals

Bushra batool 说:
2023年9月06日 14:48

Editing software like Adobe Lightroom and Photoshop allows photographers to enhance and refine their photos. Adjustments to exposure, color, and sharpness can transform an ordinary image into an extraordinary one. most famous photographers

Bushra batool 说:
2023年9月07日 15:55 Measure the backset, which is the distance from the door's edge to the center of the borehole. Common backset measurements are 2⅜ inches or 2¾ inches. Locksmith Whitstable
Bushra batool 说:
2023年9月08日 16:40

I really appreciate the kind of topics you post here. Thanks for sharing us a great information that is actually helpful. Good day! website link

shaikhseo 说:
2023年9月10日 13:12

thanks this is good blog. Law tutors in London

Naveed 说:
2023年9月17日 17:04

I know your expertise on this. I must say we should have an online discussion on this. Writing only comments will close the discussion straight away! And will restrict the benefits from this information. here are the findings

Naveed 说:
2023年9月17日 19:25

This is a great inspiring article.I am pretty much pleased with your good work.You put really very helpful information... More Bonuses

Bushra batool 说:
2023年9月22日 22:41

Photography has evolved significantly over time, from the early days of black and white film to the digital age, where images can be instantly captured, edited, and shared online. christmas photos calgary


登录 *


loading captcha image...
(输入验证码)
or Ctrl+Enter